Shared Ride Efficiency: Data Wrangling

With the ever increasing amount of big data available and the development of dynamic route optimization algorithm, low cost shared ride becomes more and more popular. Uber and Lyft both offer shared ride service, UberPool and Lyft Line, in order to best capture the benefit of shareconomy. Via, an on-demand shared ride start up, offers flat-rate shuttle service in urban areas and has recently expanded its business to Brooklyn.

Here, I explored NYC taxi dataset of year 2015 from Google BigQuery,  started with a big picture analysis of NYC taxi services, analyzed the features of NYC taxi trips, and then focused on Brooklyn local trips. I built a simplified model (rather than a sophisticated dynamic TSP algorithm) to assess shared ride efficiency in Brooklyn. I discovered that over 15% trips within Brooklyn are shareable on late weekend night. Shared ride efficiency largely depends on the total number of trips, emphasizing the importance of scale in shareconomy.

Codes can be found here.

NYC taxi data from Google BigQuery

To get NYC taxi data in 2015, I use SQL to query taxi records from Google BigQuery (dataset: `bigquery-public-data.new_york.tlc_green_trips_2015` and `bigquery-public-data.new_york.tlc_yellow_trips_2015`)

I set the coordinate boundaries, as indicated in the map below, and consider trips traveling within NYC. Without setting the boundary, I observed taxi pickup and dropoff in the middle west, which were irrelevant in this analysis. I include trips with positive trip distance and trip time, and make sure that a trip’s pickup coordinate is not the same as dropoff coordinate. I round coordinates to 4 digits with spatial accuracy of 11 meters, which allow me to aggregate by coordinates to visualize trip density on map.

        SECOND) trip_time,
ROUND(pickup_longitude, 4) as pickup_long,ROUND(pickup_latitude, 4) as pickup_lat, 
ROUND(dropoff_longitude, 4) as dropoff_long,ROUND(dropoff_latitude, 4) as dropoff_lat,
FROM `bigquery-public-data.new_york.tlc_green_trips_2015`
trip_distance > 0
AND dropoff_datetime > pickup_datetime
AND passenger_count IS NOT NULL
AND NOT ((pickup_longitude = dropoff_longitude) AND (dropoff_latitude = dropoff_latitude))
AND pickup_longitude BETWEEN -74.2563 AND -73.6543
AND pickup_latitude BETWEEN 40.4941 AND 40.9313
AND dropoff_longitude BETWEEN -74.2563 AND -73.6543
AND dropoff_latitude BETWEEN 40.4941 AND 40.9313
SELECT pickup_long,pickup_lat, count(*) as pickup_num
FROM [bigquery-159705:nyc_taxi_2015.2015_green_datetime_output]
GROUP BY pickup_long, pickup_lat



I use SQL to extract detailed time information from datetime: “hour”, “dayofmonth”, “dayofyear”, “dayofweek”, “month”, which are used later for aggregation of trip count by time.

    pickup_datetime) hour,
    pickup_datetime) dayofmonth, 
    pickup_datetime) dayofyear,  
    pickup_datetime) dayofweek,  
    pickup_datetime) month
FROM `bigquery-159705.nyc_taxi_2015.2015_green_output`
SELECT hour,dayofweek,day,month, count(*) as pickup_num
FROM [bigquery-159705:nyc_taxi_2015.2015_green_datetime_output]
GROUP BY hour,dayofweek,day,month

To extract data sample from a single day of interest, I use “dayofyear” as the selection condition.

SELECT pickup_datetime,trip_time,trip_distance,pickup_long,pickup_lat,dropoff_long,dropoff_lat, passenger_count,hour,dayofweek	
FROM [bigquery-159705:nyc_taxi_2015.2015_green_time_output] 
WHERE dayofyear=102

I then stored data in Google Cloud and download the data as csv.

Uber Data

Another dataset that I used is Uber data (uber-raw-data-janjune-15.csv), which is available from fivethirtyeight.

Data Cleaning

I use Python Pandas library to perform data wrangling, and audit data quality for both taxi data and Uber data (Jupyter Notebook can be found here).

A few modifications to mention:

1.  I use Python Pandas library to load and process csv data.

import pandas as pd
import numpy as np
green_time_df = pd.read_csv('green%2F2015_green_by_time_02.csv')

2.  There are no missing values in my data.


3.  SQL dayofweek by default sets Sunday as 1 and Saturday as 5, while Python’s dayofweek sets Monday as 0 and Sunday as 6. To make the format consistent, I set Monday as 1 and Sunday as 7 for all of my data.

green_time_df['dayofweek']= np.where(green_time_df['dayofweek']== 0,7,green_time_df['dayofweek'])

4.  To get weekly data, I extract the week number “weekofyear” for 2015.

def get_week_of_year(day):
 new_day = day + 2
 return int(new_day*1./7)+1
green_time_df['weekofyear'] = green_time_df['dayofyear'].apply(get_week_of_year)

5.  I group data by month, week, dayofweek, hour to understand the timely total or average for yellow and green taxis.

green_by_month = green_time_df.groupby(['month'],as_index=False).sum()[['month','trip_num']]

6. In later analysis of shared ride efficiency, I focus on the week 15 in 2015, which ranges from 2015-04-06 to 2015-04-12, and use Taxi and Uber data in this week. The reason why I use week 15 is because from my initial data exploration, week 15 appears to be representative. In addition, I use Green Taxi data and Uber data for the analysis of Brooklyn trips. I will elaborate the difference between Green and Yellow Taxis trips in the following session.

mask = (df['Pickup_date'] > '2015-04-05') & (df['Pickup_date'] <= '2015-04-12')
df2 = df.loc[mask].copy()

7. In order to understand local trip density, I decide to convert coordinates to county, neighborhood, and zip code. I use Geocoder API to get the address information and parse it into zip code, county, neighborhood. API query is quite slow (2 seconds per query), I decide to focus on single days in week 15 from Green Taxi data: I use Wednesday to represent a weekday, and Saturday, Sunday to represent weekends.  Still, it takes a few days to get all data processed.

from geopy.geocoders import Nominatim
import re

df['pickup_long'] = df['pickup_long'].astype(str)
df['pickup_lat'] = df['pickup_lat'].astype(str)
df['dropoff_long'] = df['dropoff_long'].astype(str)
df['dropoff_lat'] = df['dropoff_lat'].astype(str)
df['pickup_coord']=df[['pickup_lat', 'pickup_long']].apply(lambda x: ",".join(x), axis=1)
df['dropoff_coord']=df[['dropoff_lat', 'dropoff_long']].apply(lambda x: ",".join(x), axis=1)

geolocator = Nominatim()

def get_geocode(coord):
    addr = geolocator.reverse(coord, timeout=60).address
    if addr == None:
        zipcode = 'NaN'
        zone = 'NaN'
        county = 'NaN'
        strs = [x.strip() for x in addr.split(',')]
        num = str(re.findall(r"\D(\d{5})\D", addr))
        num = str(filter(str.isdigit, num))[:5]
        zipcode = num
        county = strs[-5]
        if strs[-6]=='Manhattan' or strs[-6]=='BK':
            zone = strs[-7]
            zone = strs[-6]
    return zipcode+','+zone + ',' + county

def get_zipcode(zone):
    return zone.split(',')[0]
def get_neighborhood(zone):
    return zone.split(',')[1]
def get_county(zone):
    return zone.split(',')[2]

def add_fields(df):
    df['pickup_zipcode'] = df['pickup_zone'].apply(get_zipcode)
    df['pickup_neighborhood'] = df['pickup_zone'].apply(get_neighborhood)
    df['pickup_county'] = df['pickup_zone'].apply(get_county)
    df['dropoff_zipcode'] = df['dropoff_zone'].apply(get_zipcode)
    df['dropoff_neighborhood'] = df['dropoff_zone'].apply(get_neighborhood)
    df['dropoff_county'] = df['dropoff_zone'].apply(get_county)


8. Uber trip records do not show pickup coordinates, so I merge the trip table with lookup table with county and neighborhood information.

df_zone = df2.merge(zone,left_on='locationID',right_on='LocationID',how='left')

Data Exploration and Visualization

I use Tableau, Pandas, and matlibplot to perform data exploration and visualization. Check my next post.


[1] Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance,

[2] The Data Science of NYC Taxi Trips: An Analysis & Visualization,

[3] Measuring accuracy of latitude and longitude?

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.