Explore Bike Share Data - John Bailey

For this project, we are using R Studio, and our goal is to ask and answer three questions about the available bikeshare data from Washington, Chicago, and New York.

In [237]:
ny = read.csv('new_york_city.csv')
wash = read.csv('washington.csv')
chi = read.csv('chicago.csv')
In [238]:
head(ny)
X | Start.Time | End.Time | Trip.Duration | Start.Station | End.Station | User.Type | Gender | Birth.Year
5688089 | 2017-06-11 14:55:05 | 2017-06-11 15:08:21 | 795 | Suffolk St & Stanton St | W Broadway & Spring St | Subscriber | Male | 1998
4096714 | 2017-05-11 15:30:11 | 2017-05-11 15:41:43 | 692 | Lexington Ave & E 63 St | 1 Ave & E 78 St | Subscriber | Male | 1981
2173887 | 2017-03-29 13:26:26 | 2017-03-29 13:48:31 | 1325 | 1 Pl & Clinton St | Henry St & Degraw St | Subscriber | Male | 1987
3945638 | 2017-05-08 19:47:18 | 2017-05-08 19:59:01 | 703 | Barrow St & Hudson St | W 20 St & 8 Ave | Subscriber | Female | 1986
6208972 | 2017-06-21 07:49:16 | 2017-06-21 07:54:46 | 329 | 1 Ave & E 44 St | E 53 St & 3 Ave | Subscriber | Male | 1992
1285652 | 2017-02-22 18:55:24 | 2017-02-22 19:12:03 | 998 | State St & Smith St | Bond St & Fulton St | Subscriber | Male | 1986
In [239]:
head(wash)
X | Start.Time | End.Time | Trip.Duration | Start.Station | End.Station | User.Type
1621326 | 2017-06-21 08:36:34 | 2017-06-21 08:44:43 | 489.066 | 14th & Belmont St NW | 15th & K St NW | Subscriber
482740 | 2017-03-11 10:40:00 | 2017-03-11 10:46:00 | 402.549 | Yuma St & Tenley Circle NW | Connecticut Ave & Yuma St NW | Subscriber
1330037 | 2017-05-30 01:02:59 | 2017-05-30 01:13:37 | 637.251 | 17th St & Massachusetts Ave NW | 5th & K St NW | Subscriber
665458 | 2017-04-02 07:48:35 | 2017-04-02 08:19:03 | 1827.341 | Constitution Ave & 2nd St NW/DOL | M St & Pennsylvania Ave NW | Customer
1481135 | 2017-06-10 08:36:28 | 2017-06-10 09:02:17 | 1549.427 | Henry Bacon Dr & Lincoln Memorial Circle NW | Maine Ave & 7th St SW | Subscriber
1148202 | 2017-05-14 07:18:18 | 2017-05-14 07:24:56 | 398.000 | 1st & K St SE | Eastern Market Metro / Pennsylvania Ave & 7th St SE | Subscriber
In [240]:
head(chi)
X | Start.Time | End.Time | Trip.Duration | Start.Station | End.Station | User.Type | Gender | Birth.Year
1423854 | 2017-06-23 15:09:32 | 2017-06-23 15:14:53 | 321 | Wood St & Hubbard St | Damen Ave & Chicago Ave | Subscriber | Male | 1992
955915 | 2017-05-25 18:19:03 | 2017-05-25 18:45:53 | 1610 | Theater on the Lake | Sheffield Ave & Waveland Ave | Subscriber | Female | 1992
9031 | 2017-01-04 08:27:49 | 2017-01-04 08:34:45 | 416 | May St & Taylor St | Wood St & Taylor St | Subscriber | Male | 1981
304487 | 2017-03-06 13:49:38 | 2017-03-06 13:55:28 | 350 | Christiana Ave & Lawrence Ave | St. Louis Ave & Balmoral Ave | Subscriber | Male | 1986
45207 | 2017-01-17 14:53:07 | 2017-01-17 15:02:01 | 534 | Clark St & Randolph St | Desplaines St & Jackson Blvd | Subscriber | Male | 1975
1473887 | 2017-06-26 09:01:20 | 2017-06-26 09:11:06 | 586 | Clinton St & Washington Blvd | Canal St & Taylor St | Subscriber | Male | 1990

Question 1

Popular times of travel - What is the most common month for traveling users?

In [241]:
library(ggplot2)

First, we will need to extract the month from the Start.Time field in each of our 3 datasets.
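
Since the same extraction is repeated for each city, the step could also be wrapped in a small helper. The sketch below is one possible way to do that, not the approach used in this notebook: `add_month` is a hypothetical name, and the two-row data frame is toy data standing in for the real CSV files.

```r
# Hypothetical helper: adds a two-digit month column derived from Start.Time.
add_month <- function(df) {
  # as.Date() parses the leading "YYYY-MM-DD" part; the time of day is ignored
  df$month <- format(as.Date(df$Start.Time, format = "%Y-%m-%d"), "%m")
  df
}

# Toy data standing in for ny, wash, and chi
sample_rides <- data.frame(
  Start.Time = c("2017-06-11 14:55:05", "2017-02-22 18:55:24")
)
sample_rides <- add_month(sample_rides)
print(sample_rides$month)
```

With the real data this would be called once per city, for example `ny <- add_month(ny)`.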

New York
In [242]:
ny$month <- format(as.Date(ny$Start.Time, format="%Y-%m-%d"),"%m") ##creating new column in dataframe and extracting month from Start.Time
In [243]:
head(ny) ##confirming new column created and extracted month
X | Start.Time | End.Time | Trip.Duration | Start.Station | End.Station | User.Type | Gender | Birth.Year | month
5688089 | 2017-06-11 14:55:05 | 2017-06-11 15:08:21 | 795 | Suffolk St & Stanton St | W Broadway & Spring St | Subscriber | Male | 1998 | 06
4096714 | 2017-05-11 15:30:11 | 2017-05-11 15:41:43 | 692 | Lexington Ave & E 63 St | 1 Ave & E 78 St | Subscriber | Male | 1981 | 05
2173887 | 2017-03-29 13:26:26 | 2017-03-29 13:48:31 | 1325 | 1 Pl & Clinton St | Henry St & Degraw St | Subscriber | Male | 1987 | 03
3945638 | 2017-05-08 19:47:18 | 2017-05-08 19:59:01 | 703 | Barrow St & Hudson St | W 20 St & 8 Ave | Subscriber | Female | 1986 | 05
6208972 | 2017-06-21 07:49:16 | 2017-06-21 07:54:46 | 329 | 1 Ave & E 44 St | E 53 St & 3 Ave | Subscriber | Male | 1992 | 06
1285652 | 2017-02-22 18:55:24 | 2017-02-22 19:12:03 | 998 | State St & Smith St | Bond St & Fulton St | Subscriber | Male | 1986 | 02
In [244]:
qplot(x = month, data = ny, color = I('black'), fill = I('#F79420'), xlab = 'Month', ylab = 'Number of Rides') #plotting Month
In [245]:
table(ny$month) ##using table to get summary info
   01    02    03    04    05    06 
 5745  6364  5820 10661 12180 14000 
Washington
In [246]:
wash$month <- format(as.Date(wash$Start.Time, format="%Y-%m-%d"),"%m") #extracting month for wash dataframe
In [247]:
head(wash) #verifying extraction of month and new column
X | Start.Time | End.Time | Trip.Duration | Start.Station | End.Station | User.Type | month
1621326 | 2017-06-21 08:36:34 | 2017-06-21 08:44:43 | 489.066 | 14th & Belmont St NW | 15th & K St NW | Subscriber | 06
482740 | 2017-03-11 10:40:00 | 2017-03-11 10:46:00 | 402.549 | Yuma St & Tenley Circle NW | Connecticut Ave & Yuma St NW | Subscriber | 03
1330037 | 2017-05-30 01:02:59 | 2017-05-30 01:13:37 | 637.251 | 17th St & Massachusetts Ave NW | 5th & K St NW | Subscriber | 05
665458 | 2017-04-02 07:48:35 | 2017-04-02 08:19:03 | 1827.341 | Constitution Ave & 2nd St NW/DOL | M St & Pennsylvania Ave NW | Customer | 04
1481135 | 2017-06-10 08:36:28 | 2017-06-10 09:02:17 | 1549.427 | Henry Bacon Dr & Lincoln Memorial Circle NW | Maine Ave & 7th St SW | Subscriber | 06
1148202 | 2017-05-14 07:18:18 | 2017-05-14 07:24:56 | 398.000 | 1st & K St SE | Eastern Market Metro / Pennsylvania Ave & 7th St SE | Subscriber | 05
In [248]:
qplot(x = month, data = wash, color = I('black'), fill = I('#0B6623'), xlab = 'Month', ylab = 'Number of Rides') #plotting wash dataframe for month of start.time
In [249]:
qplot(x = month, data = subset(wash, !is.na(month)), color = I('black'), fill = I('#0B6623'), xlab = 'Month', ylab = 'Number of Rides') ##Removing NA column from above
In [250]:
table(wash$month)
   01    02    03    04    05    06 
 8946 11563 12612 18522 17072 20335 
Chicago
In [251]:
chi$month <- format(as.Date(chi$Start.Time, format="%Y-%m-%d"),"%m") #extracting month for chi dataframe
In [252]:
head(chi) #verifying new column & month extracted for chi dataframe
X | Start.Time | End.Time | Trip.Duration | Start.Station | End.Station | User.Type | Gender | Birth.Year | month
1423854 | 2017-06-23 15:09:32 | 2017-06-23 15:14:53 | 321 | Wood St & Hubbard St | Damen Ave & Chicago Ave | Subscriber | Male | 1992 | 06
955915 | 2017-05-25 18:19:03 | 2017-05-25 18:45:53 | 1610 | Theater on the Lake | Sheffield Ave & Waveland Ave | Subscriber | Female | 1992 | 05
9031 | 2017-01-04 08:27:49 | 2017-01-04 08:34:45 | 416 | May St & Taylor St | Wood St & Taylor St | Subscriber | Male | 1981 | 01
304487 | 2017-03-06 13:49:38 | 2017-03-06 13:55:28 | 350 | Christiana Ave & Lawrence Ave | St. Louis Ave & Balmoral Ave | Subscriber | Male | 1986 | 03
45207 | 2017-01-17 14:53:07 | 2017-01-17 15:02:01 | 534 | Clark St & Randolph St | Desplaines St & Jackson Blvd | Subscriber | Male | 1975 | 01
1473887 | 2017-06-26 09:01:20 | 2017-06-26 09:11:06 | 586 | Clinton St & Washington Blvd | Canal St & Taylor St | Subscriber | Male | 1990 | 06
In [253]:
qplot(x = month, data = chi, color = I('black'), fill = I('#D30000'), xlab = 'Month', ylab = 'Number of Rides') #plotting month of Start.Time for chi dataframe
In [254]:
table(chi$month)
  01   02   03   04   05   06 
 650  930  803 1526 1905 2816 

Question 1 Summary

In our first question, we wanted to see which month was the most popular for bikeshare rides across the cities of New York, Washington, and Chicago.

In New York, the most popular month was June (month 06), with a total of 14,000 rides. That is 1,820 more rides than the next closest month, May, which had 12,180 rides.

In Washington, the most popular month was also June, with a total of 20,335 rides. That is 1,813 more rides than the next closest month, April, which had 18,522 rides.

In Chicago, the most popular month was also June, with a total of 2,816 rides. The next closest month was May with 1,905 rides. One thing to note is that Chicago's dataset shows a much lower volume of bikeshare rides overall.

Overall, the datasets for all 3 cities (New York, Washington, & Chicago) indicate that the most popular month for bikeshare rides is June. Note that the data only covers January through June, so June is the most popular month of that period rather than of a full year. In all 3 cities, ride counts are higher in April through June than in January through March, possibly due to more favorable weather conditions.
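
The peak month and its margin over the runner-up can be double-checked directly from the `table()` counts. This is only a sketch for verifying the arithmetic: the counts are copied from the outputs above, and `monthly_rides` and `peak_summary` are names made up for this example.

```r
# Monthly ride counts copied from the table() outputs for each city
monthly_rides <- list(
  new_york   = c("01" = 5745, "02" = 6364,  "03" = 5820,
                 "04" = 10661, "05" = 12180, "06" = 14000),
  washington = c("01" = 8946, "02" = 11563, "03" = 12612,
                 "04" = 18522, "05" = 17072, "06" = 20335),
  chicago    = c("01" = 650,  "02" = 930,   "03" = 803,
                 "04" = 1526, "05" = 1905,  "06" = 2816)
)

# Hypothetical helper: busiest month, runner-up, and the gap between them
peak_summary <- function(counts) {
  ranked <- sort(counts, decreasing = TRUE)
  data.frame(
    peak_month = names(ranked)[1],
    peak_rides = ranked[[1]],
    runner_up  = names(ranked)[2],
    margin     = ranked[[1]] - ranked[[2]],
    stringsAsFactors = FALSE
  )
}

result <- do.call(rbind, lapply(monthly_rides, peak_summary))
print(result)
```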

Question 2

User Info - What is the most common User Type across the 3 cities?

New York
In [255]:
table(ny$User.Type) ##Getting some info on the User.Type column, looks like we have some blanks
             Customer Subscriber 
       119       5558      49093 
In [256]:
ny["User.Type"][ny["User.Type"] == ''] <- NA ##converting the blanks in the User.Type column to NA
In [257]:
table(ny$User.Type) #checking to see if conversion worked
             Customer Subscriber 
         0       5558      49093 
In [258]:
ggplot(data=subset(ny, !is.na(User.Type)), aes(x=User.Type)) + 
geom_bar(color = 'black', fill = '#F79420')
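
As an alternative to replacing the blanks after loading, they could be turned into NA when the file is read. The sketch below assumes the blanks are truly empty strings in the CSV; `csv_text` is made-up toy data, not rows from the real files.

```r
# Toy CSV with one blank User.Type and one blank Gender
csv_text <- "User.Type,Gender
Subscriber,Male
,Female
Customer,
"

# na.strings tells read.csv which raw values to treat as missing
rides <- read.csv(text = csv_text,
                  na.strings = c("", "NA"),
                  stringsAsFactors = TRUE)

# useNA = "ifany" shows the NA count next to the real categories
print(table(rides$User.Type, useNA = "ifany"))
```

Read this way, no empty factor level is created, so the zero-count blank column seen in the `table()` output would not appear.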
Washington
In [259]:
table(wash$User.Type) ##Getting some info on the User.Type column
             Customer Subscriber 
         1      23450      65600 
In [260]:
wash["User.Type"][wash["User.Type"] == ''] <- NA ##converting any blanks in the User.Type column to NA
In [261]:
table(wash$User.Type) #checking to see if conversion worked
             Customer Subscriber 
         0      23450      65600 
In [262]:
ggplot(data=subset(wash, !is.na(User.Type)), aes(x=User.Type)) + 
geom_bar(color = 'black', fill = '#0B6623')
Chicago
In [263]:
table(chi$User.Type) ##Getting some info on the User.Type column
             Customer Subscriber 
         1       1746       6883 
In [264]:
chi["User.Type"][chi["User.Type"] == ''] <- NA ##converting the blanks in the User.Type column to NA
In [265]:
table(chi$User.Type) #checking to see if conversion worked
             Customer Subscriber 
         0       1746       6883 
In [266]:
ggplot(data=subset(chi, !is.na(User.Type)), aes(x=User.Type)) + 
geom_bar(color = 'black', fill = '#D30000')

Question 2 Summary

In our second question, we wanted to see which User Type was the most common for bikeshare rides across the cities of New York, Washington, and Chicago.

In New York, the most common user type was the Subscriber. The subscriber user type had a total of 49,093 rides while the customer user type had a total of 5,558, a difference of 43,535. About 90% of bikeshare rides in New York are by subscribers.

In Washington, the most common user type was also the Subscriber. Here, the subscriber user type had a total of 65,600 rides while the customer user type had a total of 23,450, a difference of 42,150. While Washington has a much larger number of customer rides, the gap between the two user types is similar to that of New York. About 74% of bikeshare rides in Washington are by subscribers.

In Chicago, the most common user type was also the Subscriber. Here, the subscriber user type had a total of 6,883 rides while the customer user type had a total of 1,746, a difference of 5,137. Again, we note that Chicago's dataset shows a much lower volume of bikeshare rides overall. Nonetheless, about 80% of Chicago's bikeshare rides are by subscribers.

Overall, the datasets for all 3 cities (New York, Washington, & Chicago) indicate that a large majority of bikeshare rides are taken by the 'Subscriber' user type. This is reinforced by the data, which shows each city having at least 74% of its total bikeshare rides come from subscribers.
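
The percentages and gaps quoted in this summary can be reproduced with `prop.table()`. This is a quick check rather than part of the original analysis: the counts are copied from the `table()` outputs above, with the blank rows left out, and `user_counts` is a name made up for this example.

```r
# Rides per user type, copied from the table() outputs (blanks excluded)
user_counts <- rbind(
  new_york   = c(Customer = 5558,  Subscriber = 49093),
  washington = c(Customer = 23450, Subscriber = 65600),
  chicago    = c(Customer = 1746,  Subscriber = 6883)
)

# Row-wise shares in percent, rounded to whole numbers
shares <- round(100 * prop.table(user_counts, margin = 1))

# Absolute gap between subscribers and customers in each city
gap <- user_counts[, "Subscriber"] - user_counts[, "Customer"]

print(shares)
print(gap)
```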

Question 3

User Info - What gender type is the most common for the bikeshare data across the cities of New York & Chicago?

  • Note: Gender data is not available for the city of Washington
In [267]:
names(ny)
  1. 'X'
  2. 'Start.Time'
  3. 'End.Time'
  4. 'Trip.Duration'
  5. 'Start.Station'
  6. 'End.Station'
  7. 'User.Type'
  8. 'Gender'
  9. 'Birth.Year'
  10. 'month'
In [268]:
names(wash) ##showing NO 'Gender' column in wash dataset
  1. 'X'
  2. 'Start.Time'
  3. 'End.Time'
  4. 'Trip.Duration'
  5. 'Start.Station'
  6. 'End.Station'
  7. 'User.Type'
  8. 'month'
In [269]:
names(chi)
  1. 'X'
  2. 'Start.Time'
  3. 'End.Time'
  4. 'Trip.Duration'
  5. 'Start.Station'
  6. 'End.Station'
  7. 'User.Type'
  8. 'Gender'
  9. 'Birth.Year'
  10. 'month'
New York
In [270]:
by(ny$User.Type, ny$Gender, summary)
ny$Gender: 
             Customer Subscriber       NA's 
         0       4743        664          3 
------------------------------------------------------------ 
ny$Gender: Female
             Customer Subscriber       NA's 
         0        324      11804         31 
------------------------------------------------------------ 
ny$Gender: Male
             Customer Subscriber       NA's 
         0        491      36625         85 
In [271]:
ny["Gender"][ny["Gender"] == ''] <- NA ##converting the blanks in the Gender column to NA
In [272]:
by(ny$User.Type, ny$Gender, summary)
ny$Gender: 
NULL
------------------------------------------------------------ 
ny$Gender: Female
             Customer Subscriber       NA's 
         0        324      11804         31 
------------------------------------------------------------ 
ny$Gender: Male
             Customer Subscriber       NA's 
         0        491      36625         85 
In [273]:
ny <- na.omit(ny) ##Omit rows that have an NA in any column, not only Gender
In [274]:
by(ny$User.Type, ny$Gender, summary)
ny$Gender: 
NULL
------------------------------------------------------------ 
ny$Gender: Female
             Customer Subscriber 
         0        324      11803 
------------------------------------------------------------ 
ny$Gender: Male
             Customer Subscriber 
         0        491      36625 
In [283]:
qplot(x = User.Type, data = ny, color = I('black'), fill = I('#F79420')) +
    facet_grid(Gender~.)
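
One thing to keep in mind is that `na.omit()` drops a row if any column is NA, which can remove rides that are only missing a value in a column we are not analyzing. A more targeted filter is sketched below on toy data; the `rides` data frame and its values are made up for illustration.

```r
# Toy data: row 2 is only missing Birth.Year, rows 3 and 4 miss a key column
rides <- data.frame(
  User.Type  = c("Subscriber", "Subscriber", "Customer", NA),
  Gender     = c("Male", "Female", NA, "Male"),
  Birth.Year = c(1990, NA, 1985, 1979)
)

# na.omit() removes every row with at least one NA in any column
dropped_any <- na.omit(rides)

# complete.cases() on selected columns keeps rows that are complete
# for just the columns being analyzed
keep <- complete.cases(rides[, c("User.Type", "Gender")])
targeted <- rides[keep, ]

print(nrow(dropped_any))
print(nrow(targeted))
```

Here the targeted filter keeps the row that is missing only Birth.Year, while `na.omit()` drops it.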
Chicago
In [284]:
by(chi$User.Type, chi$Gender, summary)
chi$Gender: 
             Customer Subscriber       NA's 
         0       1746          1          1 
------------------------------------------------------------ 
chi$Gender: Female
             Customer Subscriber 
         0          0       1723 
------------------------------------------------------------ 
chi$Gender: Male
             Customer Subscriber 
         0          0       5159 
In [285]:
chi["Gender"][chi["Gender"] == ''] <- NA ##converting the blanks in the Gender column to NA
In [286]:
by(chi$User.Type, chi$Gender, summary)
chi$Gender: 
NULL
------------------------------------------------------------ 
chi$Gender: Female
             Customer Subscriber 
         0          0       1723 
------------------------------------------------------------ 
chi$Gender: Male
             Customer Subscriber 
         0          0       5159 
In [287]:
chi <- na.omit(chi) ##Omit rows that have an NA in any column, not only Gender
In [288]:
by(chi$User.Type, chi$Gender, summary)
chi$Gender: 
NULL
------------------------------------------------------------ 
chi$Gender: Female
             Customer Subscriber 
         0          0       1723 
------------------------------------------------------------ 
chi$Gender: Male
             Customer Subscriber 
         0          0       5159 
In [289]:
qplot(x = User.Type, data = chi, color = I('black'), fill = I('#D30000')) +
    facet_grid(Gender~.)

Question 3 Summary

In our third and final question, we wanted to see the gender makeup of each user type for bikeshare rides across the cities of New York and Chicago. The Washington dataset did not contain a gender column for us to analyze.

In New York, the Subscriber user type was made up of mostly males. Male subscribers (36,625) outnumbered female subscribers (11,803) by about 3 to 1. The customer user type was more evenly balanced, with 491 males and 324 females (about 1.5 to 1), although most customer rides had a blank gender and could not be included.

In Chicago, the Subscriber user type was also made up of mostly males. Male subscribers (5,159) outnumbered female subscribers (1,723) by about 3 to 1 in this dataset. Unfortunately, we were unable to analyze gender for the customer user type, as all of the customers in the Chicago dataset had a blank gender column. The assumption here is that gender information is collected for subscribers but not for customers.

Overall, the datasets for New York and Chicago indicate that a large majority of 'Subscriber' rides, roughly three out of four, are by riders whose gender is 'Male'. Based on the data we have, the customer user type seemed more balanced when it came to gender, but this could only be checked for New York, and the sample size was much smaller for that user type.
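
The male-to-female ratios can be checked from the counts in the final `by()` outputs. The snippet below is just that arithmetic, with the counts copied from the outputs above; `gender_counts` is a name made up for this example.

```r
# Counts copied from the by() outputs after the NAs were omitted
gender_counts <- rbind(
  ny_subscriber  = c(Male = 36625, Female = 11803),
  ny_customer    = c(Male = 491,   Female = 324),
  chi_subscriber = c(Male = 5159,  Female = 1723)
)

# Male-to-female ratio per group, rounded to one decimal place
male_to_female <- round(gender_counts[, "Male"] / gender_counts[, "Female"], 1)
print(male_to_female)
```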

Finishing Up

In [290]:
system('python -m nbconvert Explore_bikeshare_data.ipynb')