For this project, we use RStudio to ask and answer three questions about the available bikeshare data from Washington, Chicago, and New York.
ny = read.csv('new_york_city.csv')
wash = read.csv('washington.csv')
chi = read.csv('chicago.csv')
head(ny)
head(wash)
head(chi)
Popular times of travel - What is the most common month for bikeshare rides?
library(ggplot2)
First, we will need to extract the month from the Start.Time field in each of our 3 datasets.
ny$month <- format(as.Date(ny$Start.Time, format="%Y-%m-%d"),"%m") ##creating new column in dataframe and extracting month from Start.Time
head(ny) ##confirming new column created and extracted month
qplot(x = month, data = ny, color = I('black'), fill = I('#F79420'), xlab = 'Month', ylab = 'Number of Rides') #plotting Month
table(ny$month) ##using table to get summary info
wash$month <- format(as.Date(wash$Start.Time, format="%Y-%m-%d"),"%m") #extracting month for wash dataframe
head(wash) #verifying extraction of month and new column
qplot(x = month, data = wash, color = I('black'), fill = I('#0B6623'), xlab = 'Month', ylab = 'Number of Rides') #plotting wash dataframe for month of start.time
qplot(x = month, data = subset(wash, !is.na(month)), color = I('black'), fill = I('#0B6623'), xlab = 'Month', ylab = 'Number of Rides') ##Removing NA column from above
table(wash$month)
chi$month <- format(as.Date(chi$Start.Time, format="%Y-%m-%d"),"%m") #extracting month for chi dataframe
head(chi) #verifying new column & month extracted for chi dataframe
qplot(x = month, data = chi, color = I('black'), fill = I('#D30000'), xlab = 'Month', ylab = 'Number of Rides') #plotting month of Start.Time for chi dataframe
table(chi$month)
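The month extraction above is repeated once per city. A minimal sketch of the same step as a reusable helper, assuming each data frame has a `Start.Time` column formatted like `2017-06-11 14:55:05`. The helper name `add_month` and the toy data frame are hypothetical and not part of the original analysis:

```r
# Hypothetical helper: adds a two-digit month column derived from Start.Time.
add_month <- function(df) {
  df$month <- format(as.Date(df$Start.Time, format = "%Y-%m-%d"), "%m")
  df
}

# Toy stand-in for one city's data (values invented for illustration only).
toy <- data.frame(
  Start.Time = c("2017-06-11 14:55:05", "2017-05-02 08:10:00", "2017-06-30 19:01:44"),
  stringsAsFactors = FALSE
)
toy <- add_month(toy)

# Count rides per month and pick the month with the highest count.
month_counts <- table(toy$month)
top_month <- names(month_counts)[which.max(month_counts)]
print(top_month)
```

With the real data, the same helper could be applied as `ny <- add_month(ny)`, and likewise for `wash` and `chi`.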
In our first question, we wanted to see which month was the most popular for bikeshare rides in New York, Washington, and Chicago.
In New York, the most popular month was June, the 6th month of the year. June had a total of 14,000 rides, 1,820 more than the next closest month, May, with 12,180 rides.
In Washington, the most popular month was also June, with a total of 20,335 rides. That is 1,813 more than the next closest month, April, with 18,522 rides.
In Chicago, the most popular month was also June, with a total of 2,816 rides. The next closest month was May with 1,905 rides. Note that Chicago's dataset shows a much lower volume of bikeshare rides overall.
Overall, the datasets for all 3 cities (New York, Washington, and Chicago) indicate that the most popular month for bikeshare rides is June. This suggests that ridership peaks in the summer month of June, possibly due to more favorable weather conditions in these 3 cities.
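The lead of the busiest month over the runner-up can be checked directly from the monthly counts. A small sketch using only the two New York figures quoted above; the variable names are illustrative, and with the real data `table(ny$month)` would supply the full set of counts:

```r
# Counts for New York's two busiest months, taken from the summary above.
ny_counts <- c("05" = 12180, "06" = 14000)

# Rank months from busiest to least busy.
ranked <- sort(ny_counts, decreasing = TRUE)
top_month <- names(ranked)[1]             # busiest month
margin <- unname(ranked[1] - ranked[2])   # lead over the runner-up

print(top_month)
print(margin)
```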
User Info - What is the most common User Type across the 3 cities?
table(ny$User.Type) ##Getting some info on the User.Type column, looks like we have some blanks
ny["User.Type"][ny["User.Type"] == ''] <- NA ##converting the blanks in the User.Type column to NA
table(ny$User.Type) #checking to see if conversion worked
ggplot(data=subset(ny, !is.na(User.Type)), aes(x=User.Type)) +
geom_bar(color = 'black', fill = '#F79420')
table(wash$User.Type) ##Getting some info on the User.Type column
wash["User.Type"][wash["User.Type"] == ''] <- NA ##converting any blanks in the User.Type column to NA
table(wash$User.Type) #checking to see if conversion worked
ggplot(data=subset(wash, !is.na(User.Type)), aes(x=User.Type)) +
geom_bar(color = 'black', fill = '#0B6623')
table(chi$User.Type) ##Getting some info on the User.Type column
chi["User.Type"][chi["User.Type"] == ''] <- NA ##converting the blanks in the User.Type column to NA
table(chi$User.Type) #checking to see if conversion worked
ggplot(data=subset(chi, !is.na(User.Type)), aes(x=User.Type)) +
geom_bar(color = 'black', fill = '#D30000')
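The blank-to-NA conversion above happens after loading. As an alternative sketch, `read.csv` accepts an `na.strings` argument that turns empty fields into `NA` at read time. The inline CSV text below is invented for illustration. With the real files, the same argument would be passed alongside the file name, and it would apply to every column, not only `User.Type`:

```r
# Invented sample rows: one blank User.Type and one blank Gender.
csv_text <- "User.Type,Gender
Subscriber,Male
,Female
Customer,
"

# Treat empty fields (and the literal string NA) as missing while reading.
toy <- read.csv(text = csv_text, na.strings = c("", "NA"), stringsAsFactors = FALSE)

# table() ignores NA by default, so only the known user types are counted.
user_counts <- table(toy$User.Type)
print(sum(is.na(toy$User.Type)))
print(user_counts)
```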
In our second question, we wanted to see which user type was the most common for bikeshare rides in New York, Washington, and Chicago.
In New York, the most common user type was Subscriber, with a total of 49,093 rides compared with 5,558 for Customer, a difference of 43,535. About 90% of bikeshare rides in New York are taken by subscribers.
In Washington, the most common user type was also Subscriber, with a total of 65,600 rides compared with 23,450 for Customer, a difference of 42,150. Although Washington has more customer rides overall, the gap between user types is similar to New York's. About 74% of bikeshare rides in Washington are taken by subscribers.
In Chicago, the most common user type was also Subscriber, with a total of 6,883 rides compared with 1,746 for Customer, a difference of 5,137. Again, Chicago's dataset shows a much lower volume of bikeshare rides overall. Nonetheless, about 80% of Chicago's bikeshare rides are taken by subscribers.
Overall, the datasets for all 3 cities (New York, Washington, and Chicago) indicate that a large majority of bikeshare rides are taken by the 'Subscriber' user type. In each city, at least 74% of bikeshare rides come from subscribers.
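The subscriber percentages quoted above follow from the per-city counts. A short sketch that recomputes them with `prop.table`, using the counts from the summary; the list layout and variable names are illustrative:

```r
# Subscriber and Customer ride counts per city, taken from the summary above.
user_counts <- list(
  new_york   = c(Subscriber = 49093, Customer = 5558),
  washington = c(Subscriber = 65600, Customer = 23450),
  chicago    = c(Subscriber = 6883,  Customer = 1746)
)

# Share of rides by subscribers, as a whole-number percentage per city.
subscriber_share <- sapply(
  user_counts,
  function(x) round(100 * prop.table(x)[["Subscriber"]])
)
print(subscriber_share)
```

On the real data frames, `prop.table(table(ny$User.Type))` would give the same shares without typing the counts by hand.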
User Info - Which gender is most common for each user type in the bikeshare data for New York and Chicago?
names(ny)
names(wash) ##showing NO 'Gender' column in wash dataset
names(chi)
by(ny$User.Type, ny$Gender, summary)
ny["Gender"][ny["Gender"] == ''] <- NA ##converting the blanks in the Gender column to NA
by(ny$User.Type, ny$Gender, summary)
ny <- na.omit(ny) ##Omit rows with an NA in any column, not only Gender
by(ny$User.Type, ny$Gender, summary)
qplot(x = User.Type, data = ny, color = I('black'), fill = I('#F79420')) +
facet_grid(Gender~.)
by(chi$User.Type, chi$Gender, summary)
chi["Gender"][chi["Gender"] == ''] <- NA ##converting the blanks in the Gender column to NA
by(chi$User.Type, chi$Gender, summary)
chi <- na.omit(chi) ##Omit rows with an NA in any column, not only Gender
by(chi$User.Type, chi$Gender, summary)
qplot(x = User.Type, data = chi, color = I('black'), fill = I('#D30000')) +
facet_grid(Gender~.)
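The gender breakdown by user type can also be read from a two-way `table`. The sketch below uses an invented toy data frame, and its `Birth.Year` column is an assumed stand-in for any other column that may contain missing values. It also contrasts filtering on `Gender` alone with `na.omit`, which drops a row when any column is `NA`:

```r
# Toy data (invented): one blank Gender and two missing Birth.Year values.
toy <- data.frame(
  User.Type  = c("Subscriber", "Subscriber", "Customer", "Subscriber", "Customer"),
  Gender     = c("Male", "Female", "", "Male", "Female"),
  Birth.Year = c(1985, NA, NA, 1990, 1978),
  stringsAsFactors = FALSE
)
toy$Gender[toy$Gender == ""] <- NA

# Keep rows with a known Gender, regardless of NAs in other columns.
known_gender <- toy[!is.na(toy$Gender), ]

# For comparison: na.omit() drops rows with an NA in any column.
complete_rows <- na.omit(toy)

# Rows are Gender, columns are User.Type.
gender_by_type <- table(known_gender$Gender, known_gender$User.Type)
print(gender_by_type)
print(nrow(known_gender))
print(nrow(complete_rows))
```

Whether the two filters give different counts on the real files depends on which other columns contain missing values, which this sketch does not establish.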
In our third and final question, we wanted to see the gender makeup of each user type for bikeshare rides in New York and Chicago. The Washington dataset did not contain a gender column for us to analyze.
In New York, the Subscriber user type was mostly male. Male subscribers outnumbered female subscribers by about 3 to 1. The Customer user type, on the other hand, was more evenly balanced, with roughly equal numbers of females and males.
In Chicago, the Subscriber user type was also mostly male. Male subscribers outnumbered female subscribers by about 5 to 1 in this dataset. Unfortunately, we could not analyze gender for the Customer user type, because every Customer record in the Chicago dataset appears to have an NA, NULL, or blank value in the Gender column. The assumption here is that gender information is collected for subscribers but not for customers.
Overall, the datasets for New York and Chicago indicate that a large majority of 'Subscriber' bikeshare riders are male. Based on the data we have, the Customer user type seemed more balanced in gender, but the sample size was much smaller for that user type.
system('python -m nbconvert Explore_bikeshare_data.ipynb')