In my previous post, I presented interesting trip patterns of NYC taxis. Here, I focus my analysis on Brooklyn. Brooklyn is a very popular place for both tourists and locals to eat, drink, and have fun. I usually take the L or G subway train to visit Brooklyn. However, the L train is scheduled to shut down in 2019 for 1.5 years [1], which may severely affect Brooklyn’s businesses. Taxi trips, and shared taxi trips in particular, may become a mainstream means of transportation in Brooklyn.

To understand which trips could be shared, I grouped pickup and dropoff locations by zip code, neighborhood, and county. To do this, I first converted green taxi coordinates to addresses using the Geocoder API. Since the API is slow, I used single days (Wednesday, Saturday, Sunday) to represent weekdays and weekends. I also included Uber data in this analysis, which includes the pickup neighborhood information. Check my first post on data wrangling here.

Trips within Brooklyn

I first looked at trips starting and ending within Brooklyn. I define these as local trips; they are usually short, about 2 miles and 10 minutes of travel time.

Weekday (Wednesday) pickups in Brooklyn peaked at 9 pm. Saturday pickups started to increase in the late afternoon, peaked at midnight, and lasted into Sunday morning. Sunday, however, had fewer late-night pickups than even Wednesday.
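
The hourly pickup curves described above can be reproduced with a simple count per hour of day. This is a minimal sketch, not my original plotting code: the toy timestamps and the helper name `hourly_pickups` are made up for illustration, and only the `pickup_datetime` column name comes from the data used in this post.

```python
import pandas as pd

# Hypothetical toy records; the real data has one row per green taxi trip.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-06-01 08:15", "2016-06-01 21:05", "2016-06-01 21:40",  # a Wednesday
        "2016-06-04 00:10", "2016-06-04 00:45", "2016-06-04 23:30",  # a Saturday
    ]),
})

def hourly_pickups(df):
    """Count pickups per hour of day (0-23), filling empty hours with 0."""
    hours = df["pickup_datetime"].dt.hour
    return hours.value_counts().reindex(range(24), fill_value=0)

# dayofweek counts from Monday = 0, so Wednesday is 2
wed = trips[trips["pickup_datetime"].dt.dayofweek == 2]
wed_counts = hourly_pickups(wed)
peak_hour = int(wed_counts.idxmax())  # hour of day with the most pickups
```

Reindexing to all 24 hours keeps quiet hours in the series as zeros, which makes the curves for different days directly comparable.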

Where did taxis pick up and drop off passengers?

Animated pickup map by hour

Saturday Pickup in Brooklyn

The spatiotemporal pattern revealed that evenings and late weekend nights were the busiest times for certain Brooklyn areas.

Top 10 zip codes for pickups are 11231, 11205, 11216, 11222, 11238, 11215, 11249, 11217, 11211, 11201.

Top 10 zip codes for dropoffs are 11201, 11211, 11217, 11249, 11215, 11238, 11222, 11216, 11205, 11231.
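
A ranking like the ones above can be obtained by counting how often each zip code appears. The sketch below is an illustration on made-up rows; the helper `top_zipcodes` is a hypothetical name, and only the column names `pickup_zipcode` and `dropoff_zipcode` match the data used later in this post.

```python
import pandas as pd

def top_zipcodes(df, column, n=10):
    """Return the n most frequent values of `column`, most frequent first."""
    return df[column].value_counts().head(n).index.tolist()

# Hypothetical toy trips, not the real counts
toy = pd.DataFrame({
    "pickup_zipcode": ["11211", "11211", "11211", "11201", "11201", "11222"],
    "dropoff_zipcode": ["11222", "11211", "11201", "11201", "11217", "11211"],
})
top_pickups = top_zipcodes(toy, "pickup_zipcode", n=2)
```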

Top Neighborhoods

(image url https://en.wikipedia.org/wiki/File:Brooklyn_neighborhoods_map.png)

When I grouped trips by neighborhood names, the popular spots became familiar. Williamsburg, Greenpoint, Park Slope, and Clinton Hill were hot pickup and dropoff spots. In particular, Williamsburg had about twice as many trips as the second most popular spot.

Uber is also popular in these neighborhoods

I looked at Uber’s pickup records on the same days.

Similar to green taxis, Uber trips in Brooklyn had a morning peak and an evening peak on the weekday, yet trips dropped in the middle of the day. Overnight into Saturday (Friday night) and overnight into Sunday, there were more trips, similar to green taxis.

Top neighborhoods for Uber, again, included Williamsburg, Park Slope, and Greenpoint.

Popular Routes

Now that we know most trips start and end at similar hot spots, I investigated popular routes in Brooklyn. If some routes are very popular, it may be possible to redesign bus/shuttle routes. Also, if passengers share the same routes, they may also share taxi rides, which would not only help passengers save money but also increase the efficiency of taxi rides.

I started with Green taxi Saturday data and grouped trips by pickup and dropoff zip codes.

# Count trips and average the distance for each (pickup, dropoff) zip code pair.
# Selecting the column before aggregating avoids averaging non-numeric columns.
route_keys = ['pickup_zipcode','dropoff_zipcode']
sat_route_zipcode = sat_bk_bk.groupby(route_keys,as_index=False)['pickup_datetime'].count()
sat_bk_bk_distance = sat_bk_bk.groupby(route_keys,as_index=False)['trip_distance'].mean()
sat_routes = sat_route_zipcode.merge(sat_bk_bk_distance, on=route_keys)
sat_routes.sort_values('pickup_datetime',ascending=False,inplace=True)
sat_routes.rename(columns = {'pickup_datetime':'trip number'}, inplace=True)
sat_routes

	pickup_zipcode	dropoff_zipcode	trip number	trip_distance
1	11201	11201	542	0.989410
254	11211	11211	523	0.934589
265	11211	11222	517	1.243926
937	11249	11211	415	1.054024
540	11222	11211	353	1.300425
249	11211	11206	326	1.619724
948	11249	11222	326	1.163926
29	11201	11231	322	1.530280
16	11201	11217	320	1.289594
352	11215	11215	316	1.091582
279	11211	11249	286	1.002483
423	11217	11215	248	1.354758
962	11249	11249	234	0.743547
410	11217	11201	223	1.339327
4	11201	11205	221	1.522489
14	11201	11215	212	2.279104
354	11215	11217	197	1.170964
549	11222	11222	196	0.937908
264	11211	11221	194	2.662938
36	11201	11238	192	2.123646
563	11222	11249	187	1.270321
445	11217	11238	178	1.214775
932	11249	11206	176	2.181420
719	11231	11231	166	1.214699
10	11201	11211	165	3.591879
694	11231	11201	160	1.537688
246	11211	11201	156	3.810256
107	11206	11211	153	1.360523
425	11217	11217	144	0.840000
64	11205	11201	142	1.552254
...	...	...	...	...
486	11220	11214	1	3.120000
485	11220	11211	1	9.320000
482	11220	11205	1	8.690000
480	11220	11201	1	4.920000
478	11219	11230	1	3.300000
476	11219	11218	1	4.300000
554	11222	11228	1	11.690000
566	11223	11214	1	1.100000
625	11225	11234	1	4.030000
567	11223	11219	1	1.930000
618	11225	11224	1	2.050000
615	11225	11220	1	4.470000
606	11225	11209	1	7.450000
599	11224	11239	1	9.500000
597	11224	11236	1	8.080000
592	11224	11230	1	4.020000
585	11224	11221	1	3.430000
584	11224	11220	1	5.740000
582	11224	11212	1	10.200000
581	11224	11211	1	15.680000
580	11224	11210	1	6.100000
578	11224	11207	1	10.640000
577	11224	11205	1	15.100000
576	11224	11204	1	4.090000
575	11224	11201	1	12.580000
573	11223	11232	1	8.400000
572	11223	11230	1	2.050000
570	11223	11228	1	3.870000
569	11223	11226	1	4.350000
976	11417	11212	1	1.600000

From the list of popular routes above, I noticed that the most popular routes had the same or neighboring pickup and dropoff zip codes and short trip distances, mostly under 2 miles.

I defined shareable trips as trips with the same pickup and dropoff zip codes whose pickup times fall within the same 5-minute window. If a route has more than one trip in a window, the number of shareable trips = total passenger number / 6 (rounded up), where 6 is the capacity of each taxi. Otherwise, the route contributes 1 non-shareable trip for that time window. Then, for each pickup zip code, I counted the total number of shareable and non-shareable trips. Shared ride efficiency for each zip code = #shareable trips / (#shareable trips + #non-shareable trips). Hourly shared ride efficiency is the mean efficiency across all zip codes.
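
To make the definition concrete, here is a small worked sketch on made-up trips. It is one possible reading of the definition, not necessarily the exact pipeline I ran: it assumes the 5-minute windows are fixed clock-aligned bins, and the names `shared_ride_efficiency`, `TAXI_CAPACITY`, and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

TAXI_CAPACITY = 6  # passengers per taxi, as in the definition above

def shared_ride_efficiency(df, window="5min"):
    """Per-pickup-zip-code share of trips that could be pooled."""
    df = df.copy()
    # Assumption: windows are fixed 5-minute bins aligned to the clock
    df["window"] = df["pickup_datetime"].dt.floor(window)
    keys = ["window", "pickup_zipcode", "dropoff_zipcode"]
    routes = df.groupby(keys, as_index=False).agg(
        trips=("pickup_datetime", "count"),
        passengers=("passenger_count", "sum"),
    )
    lone = routes["trips"] == 1  # a single trip on a route cannot be shared
    routes["shareable"] = np.where(
        lone, 0, np.ceil(routes["passengers"] / TAXI_CAPACITY)
    )
    routes["non_shareable"] = np.where(lone, 1, 0)
    per_zip = routes.groupby("pickup_zipcode")[["shareable", "non_shareable"]].sum()
    per_zip["efficiency"] = per_zip["shareable"] / (
        per_zip["shareable"] + per_zip["non_shareable"]
    )
    return per_zip

# Hypothetical toy trips, all picked up in 11211
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-06-04 20:01", "2016-06-04 20:03", "2016-06-04 20:04", "2016-06-04 20:12",
    ]),
    "pickup_zipcode": ["11211", "11211", "11211", "11211"],
    "dropoff_zipcode": ["11222", "11222", "11201", "11222"],
    "passenger_count": [1, 2, 1, 1],
})
result = shared_ride_efficiency(toy)
```

In this toy case the two 11211 to 11222 trips at 20:01 and 20:03 fall in the same window and carry 3 passengers, so they collapse into 1 shareable trip. The other two trips are alone in their route and window, giving 2 non-shareable trips and an efficiency of 1 / (1 + 2) for zip code 11211.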

1. Group by pickup and dropoff zip code (namely, route), and compute the total trip number, total passenger number, aggr (shareable) trip number, and non aggr (non-shareable) trip number. If the total trip number of a route is 1, aggr is 0. Otherwise, aggr is the total number of passengers divided by 6, rounded up, given that 6 is the capacity of a taxi. Non aggr is 0 when there is an aggr trip; otherwise, it is 1.

import numpy as np

def aggr_trip(df):
    # Group trips by route, i.e. (pickup zip code, dropoff zip code)
    df_gp = df.groupby(['pickup_zipcode','dropoff_zipcode'],as_index=False)
    df_trip_num = df_gp['pickup_datetime'].count()
    df_passenger_num = df_gp['passenger_count'].sum()
    df_eff = df_trip_num.merge(df_passenger_num)
    df_eff.rename(columns = {'pickup_datetime':'trip number'}, inplace=True)
    # A route with a single trip cannot be shared; otherwise fill taxis of 6 passengers
    df_eff['aggr trip number'] = np.where(df_eff['trip number'] == 1,0,np.ceil(1.*df_eff['passenger_count']/6))
    df_eff['non aggr trip number'] = np.where(df_eff['trip number'] == 1,1,0)
    return df_eff

2. Calculate the shared ride percentage for each pickup area (zip code)

def aggr_eff(df):
    df_eff = aggr_trip(df)
    # Sum the per-route counts for each pickup zip code
    cols = ['passenger_count','aggr trip number','non aggr trip number']
    df2 = df_eff.groupby('pickup_zipcode',as_index=False)[cols].sum()
    # Efficiency = shareable trips / (shareable + non-shareable trips)
    df2['aggr eff'] = 1.*df2['aggr trip number'] / (df2['aggr trip number'] + df2['non aggr trip number'])
    return df2

3. Average across all pickup zip codes

def aggr_eff_all(df):
    if df.empty:
        return 0
    else:
        df2 = aggr_eff(df)
        return df2['aggr eff'].mean()

4. ..

Overall, I defined hourly shared ride efficiency.

By my calculation, over 15% of trips from Saturday 4 pm to Sunday 4 am were shareable. Over 10% of trips during the Wednesday morning rush hour were shareable.

Using the same analysis, I also analyzed trips between Brooklyn and Manhattan.

Total Hourly Trips and Shared Ride Efficiency

From my definition and calculation of hourly shared ride efficiency, it seems that shared ride efficiency should be correlated with the total number of trips. The more trips there are, the more likely it is that shared routes exist.

I plotted hourly shared ride efficiency and total hourly trip number for each day and found that, in general, shared ride efficiency changes as the total hourly trip number changes. There are several peaks in hourly shared ride efficiency (red): Saturday 1 am, 2 am, and 8 pm; Sunday 1-2 am (a prolonged peak around midnight); and Wednesday 3 am and 9 am. These peaks may suggest that at certain times, passengers are more likely to travel in similar directions.

There are fewer trips from Brooklyn to Manhattan than within Brooklyn, and shared ride efficiency is quite low on Wednesday. On the weekend, shared ride efficiency is above 0.06 only from 7 pm Saturday to early Sunday. There is also a peak for trips to Manhattan on Sunday afternoon.

Linearly Correlated?

To understand the correlation between total hourly trips and hourly shared ride efficiency, I first calculated the Pearson correlation coefficient over 3 days (Sat, Sun, Wed) and 2 directions (bk-bk, bk-man). The result was 0.899934, suggesting they are highly correlated. Again, we can see that there are fewer Brooklyn to Manhattan trips.
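
For readers who want to see what the Pearson coefficient measures, here is a minimal sketch that computes it by hand. The helper `pearson_r` is a hypothetical name. The inputs are only the four Saturday within-Brooklyn sample rows shown in the table later in this post, so the result will not match the 0.899934 obtained from all 144 hourly samples.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Four sample rows only (Saturday, within Brooklyn, hours 0-3)
total_hourly_trip = [1694, 1545, 1310, 1045]
hourly_aggr_efficiency = [0.183372, 0.192233, 0.134911, 0.149407]

r = pearson_r(total_hourly_trip, hourly_aggr_efficiency)
# np.corrcoef returns the full correlation matrix; [0, 1] is the x-y entry
r_numpy = np.corrcoef(total_hourly_trip, hourly_aggr_efficiency)[0, 1]
```

The hand-written version and `np.corrcoef` should agree up to floating-point error, which is a quick sanity check on the formula.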

From the above analysis, I realized that hourly shared ride efficiency is related to the day of the week, the hour of the day, the destination (bk or man), and the total hourly trip number. I then built a linear regression model to fit the data and tried to predict the hourly shared ride efficiency at any given time.

def add_field(df,col1,col2):
    df['to'] = col1
    df['day'] = col2

# use numerical codes for destination ('to': 0 = Brooklyn, 1 = Manhattan)
# and day ('day': 1 = Sat, 2 = Sun, 3 = Wed)
add_field(sat_bk_bk_df,0,1)
add_field(sun_bk_bk_df,0,2)
add_field(wed_bk_bk_df,0,3)
add_field(sat_bk_man_df,1,1)
add_field(sun_bk_man_df,1,2)
add_field(wed_bk_man_df,1,3)

Sample size: 144 (3 days, 24 hours a day, 2 destinations)

Features are ‘hour’, ‘to’, ‘day’

The label is ‘hourly_aggr_efficiency’

	hourly_aggr_efficiency	total_hourly_trip	hour	to	day
0	0.183372	1694	0	0	1
1	0.192233	1545	1	0	1
2	0.134911	1310	2	0	1
3	0.149407	1045	3	0	1
…	…	…	…	…	…
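
The model code that follows uses `X` and `y` without showing how they are built. One plausible way to assemble them from the table above is sketched here. `hourly_df` and `feature_cols` are hypothetical names, and only the four rows shown above are included rather than the full 144 samples.

```python
import pandas as pd

# The four sample rows shown above; `hourly_df` stands in for the full table
hourly_df = pd.DataFrame({
    "hourly_aggr_efficiency": [0.183372, 0.192233, 0.134911, 0.149407],
    "total_hourly_trip": [1694, 1545, 1310, 1045],
    "hour": [0, 1, 2, 3],
    "to": [0, 0, 0, 0],
    "day": [1, 1, 1, 1],
})

feature_cols = ["hour", "to", "day"]    # the features listed above
X = hourly_df[feature_cols]             # feature matrix, one row per hourly sample
y = hourly_df["hourly_aggr_efficiency"] # the label
```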

from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=42)
reg = linear_model.LinearRegression()
reg.fit(X_train,y_train)

Using reg.score(X_test,y_test), I obtained an R^2 score of 0.82, suggesting a good fit to the data.
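
For a linear regression, `reg.score` returns the coefficient of determination R^2, which compares the model's squared error with that of always predicting the mean. The sketch below spells out that formula with a hypothetical helper `r_squared` and made-up values. It is only meant to show what the score means, not to reproduce the 0.82.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()         # error left after the model
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up efficiency values for illustration
y_true = np.array([0.18, 0.19, 0.13, 0.15])

perfect = r_squared(y_true, y_true)                                # exact predictions
baseline = r_squared(y_true, np.full_like(y_true, y_true.mean()))  # always the mean
```

A perfect model scores 1 and the predict-the-mean baseline scores 0, so 0.82 means the model removes most, but not all, of the variance in the held-out efficiencies.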

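import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: reg is refit on each of 5 folds of the test set
y_pred = cross_val_predict(reg, X_test, y_test,cv=5)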

fig, ax = plt.subplots()
ax.scatter(y_test, y_pred,alpha=0.5)
ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

Last but not least, I used cross-validation to compare predicted values against actual test values. As we can see, at lower shared ride efficiency the model seems to overestimate, while at higher shared ride efficiency it seems to underestimate. Overall, the regression model gives us a rough idea of the shareability of Brooklyn trips.

References:

[1] L train service between Brooklyn and Manhattan to be shut down for 18 months starting in 2019, http://www.nydailynews.com/new-york/train-shut-18-months-starting-2019-article-1.2725118

Ju Yang

Ph.D. / Machine Learning Practitioner in New York

Shared Ride Efficiency in Brooklyn