Introduction

This case analysis study was my choice of project towards my final requirement for the Google Data Analytics Professional Certificate. The case study involves a bike share company’s data of its customer’s trip details over a 12 month period (July 2020 - June 2021).

Scenario and Objective

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, the aim of this analysis is to understand how casual riders and annual members use Cyclistic bikes differently. My goal is to discover some data driven insights that the marketing department can use to help convert casual riders into annual members.

Data source

The case study involves a bikeshare company’s data of its customer’s trip details over a 12 month period (July 2020 - June 2021). The data has been made available by Motivate International Inc. under this license.

Program of choice and installed packages

I chose to use R programming language wit RStudio since the data sets were too large to work with Google Sheets. We installed the following packages for RStudio tidyverse, dplyr & janitor

Data Preparation

The data for each month’s trip details was given in a .csv format. To import into R we used the following command read.csv for each month’s data. In order to combine all data points into one data set we used the rbind function along with head to see the top 6 rows of our newly imported data, str to see the structure of our new database and glympse to get a quick general glimpse on the data.

df1 <- read.csv("TripData/202008.csv")
df2 <- read.csv("TripData/202009.csv")
df3 <- read.csv("TripData/202010.csv")
df4 <- read.csv("TripData/202011.csv")
df5 <- read.csv("TripData/202012.csv")
df6 <- read.csv("TripData/202101.csv")
df7 <- read.csv("TripData/202102.csv")
df8 <- read.csv("TripData/202103.csv")
df9 <- read.csv("TripData/202104.csv")
df10 <- read.csv("TripData/202105.csv")
df11 <- read.csv("TripData/202106.csv")
df12<- read.csv("TripData/202107.csv")
cyclistic_df <- rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12)
head(cyclistic_df)
##            ride_id rideable_type          started_at            ended_at
## 1 322BD23D287743ED   docked_bike 2020-08-20 18:08:14 2020-08-20 18:17:51
## 2 2A3AEF1AB9054D8B electric_bike 2020-08-27 18:46:04 2020-08-27 19:54:51
## 3 67DC1D133E8B5816 electric_bike 2020-08-26 19:44:14 2020-08-26 21:53:07
## 4 C79FBBD412E578A7 electric_bike 2020-08-27 12:05:41 2020-08-27 12:53:45
## 5 13814D3D661ECADB electric_bike 2020-08-27 16:49:02 2020-08-27 16:59:49
## 6 56349A5A42F0AE51 electric_bike 2020-08-27 17:26:23 2020-08-27 18:07:50
##              start_station_name start_station_id         end_station_name
## 1 Lake Shore Dr & Diversey Pkwy              329   Clark St & Lincoln Ave
## 2        Michigan Ave & 14th St              168   Michigan Ave & 14th St
## 3     Columbus Dr & Randolph St              195   State St & Randolph St
## 4            Daley Center Plaza               81     State St & Kinzie St
## 5      Leavitt St & Division St              658 Leavitt St & Division St
## 6      Leavitt St & Division St              658 Leavitt St & Division St
##   end_station_id start_lat start_lng  end_lat   end_lng member_casual
## 1            141  41.93259 -87.63643 41.91569 -87.63460        member
## 2            168  41.86438 -87.62368 41.86422 -87.62344        casual
## 3             44  41.88464 -87.61955 41.88497 -87.62757        casual
## 4             47  41.88409 -87.62964 41.88958 -87.62754        casual
## 5            658  41.90299 -87.68377 41.90300 -87.68384        casual
## 6            658  41.90302 -87.68373 41.90309 -87.68363        casual
str(cyclistic_df)
## 'data.frame':    4731081 obs. of  13 variables:
##  $ ride_id           : chr  "322BD23D287743ED" "2A3AEF1AB9054D8B" "67DC1D133E8B5816" "C79FBBD412E578A7" ...
##  $ rideable_type     : chr  "docked_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2020-08-20 18:08:14" "2020-08-27 18:46:04" "2020-08-26 19:44:14" "2020-08-27 12:05:41" ...
##  $ ended_at          : chr  "2020-08-20 18:17:51" "2020-08-27 19:54:51" "2020-08-26 21:53:07" "2020-08-27 12:53:45" ...
##  $ start_station_name: chr  "Lake Shore Dr & Diversey Pkwy" "Michigan Ave & 14th St" "Columbus Dr & Randolph St" "Daley Center Plaza" ...
##  $ start_station_id  : chr  "329" "168" "195" "81" ...
##  $ end_station_name  : chr  "Clark St & Lincoln Ave" "Michigan Ave & 14th St" "State St & Randolph St" "State St & Kinzie St" ...
##  $ end_station_id    : chr  "141" "168" "44" "47" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ member_casual     : chr  "member" "casual" "casual" "casual" ...
glimpse(cyclistic_df)
## Rows: 4,731,081
## Columns: 13
## $ ride_id            <chr> "322BD23D287743ED", "2A3AEF1AB9054D8B", "67DC1D133E…
## $ rideable_type      <chr> "docked_bike", "electric_bike", "electric_bike", "e…
## $ started_at         <chr> "2020-08-20 18:08:14", "2020-08-27 18:46:04", "2020…
## $ ended_at           <chr> "2020-08-20 18:17:51", "2020-08-27 19:54:51", "2020…
## $ start_station_name <chr> "Lake Shore Dr & Diversey Pkwy", "Michigan Ave & 14…
## $ start_station_id   <chr> "329", "168", "195", "81", "658", "658", "196", "67…
## $ end_station_name   <chr> "Clark St & Lincoln Ave", "Michigan Ave & 14th St",…
## $ end_station_id     <chr> "141", "168", "44", "47", "658", "658", "49", "229"…
## $ start_lat          <dbl> 41.93259, 41.86438, 41.88464, 41.88409, 41.90299, 4…
## $ start_lng          <dbl> -87.63643, -87.62368, -87.61955, -87.62964, -87.683…
## $ end_lat            <dbl> 41.91569, 41.86422, 41.88497, 41.88958, 41.90300, 4…
## $ end_lng            <dbl> -87.63460, -87.62344, -87.62757, -87.62754, -87.683…
## $ member_casual      <chr> "member", "casual", "casual", "casual", "casual", "…

Check for mising data?

It is always a good practice to see if any columns or row are missing data since that can altered our analysis. To determine whether any data was missing from the data frame we used any(in.na)

any(is.na(cyclistic_df)) # detect anywhere in df
## [1] TRUE

Cleaning Data

We know there is data missing based on the answer we got from the command above and so we use the following functions from janitor library

cyclistic_df <- janitor::remove_empty(cyclistic_df,which = c("rows"))
cyclistic_df <- janitor::remove_empty(cyclistic_df,which = c("cols"))

We also know, from looking at the structure of our data frame, we won’t be needing latitude and longitude columns for our analysis and so we use the following code to drop/delete those unnecessary columns

cyclistic_df <- select(cyclistic_df,-c(start_lat,start_lng,end_lat,end_lng))

Convert Date character data into Date/Time data

After looking at the structure of our data frame using the function str we realized the datetime data on columns started_at and ended_at was of character class type and thus we used the following ymd_hms function from lubridate to convert it to datetime class type

cyclistic_df$started_at <- lubridate::ymd_hms(cyclistic_df$started_at)
cyclistic_df$ended_at <- lubridate::ymd_hms(cyclistic_df$ended_at)

Data Analysis

Our data is now clean and ready for analysis! Since we are interested in understanding how differently the “Casual” and “Members” group use the platform, the first thing we like to see is whether there is a difference between the length of rides from each member type. We added a new column ride_length by subtracting ended_at from started_at using the function difftime.

cyclistic_df$ride_length <- difftime(cyclistic_df$ended_at,cyclistic_df$started_at, units = "mins")

We check for rows with ride_lengths equal to or below 0 to see if they exist and if so to erase

nrow(subset(cyclistic_df, ride_length <= 0))
## [1] 8619

More Cleaning

We know we wont be interested in data where the ride_length is equal to 0 due to tests or errors and so we use the following code to erase those rows

cyclistic_df <- cyclistic_df[cyclistic_df$ride_length > 0,]

We check again to see if rows have been deleted

nrow(subset(cyclistic_df, ride_length <= 0))
## [1] 0

Analysis continues

We are interested in seeing whether there is a difference in days of the week and months of the year the platform is used by different groups. To do so, we added a new column day_of__week using the function weekday and month_year using the function format

cyclistic_df$day_of_week <- weekdays(cyclistic_df$started_at)
cyclistic_df$month_year <- format(as.Date(cyclistic_df$started_at), "%b_%y")

Lastly we changed one of the column names

The column name member_casual can be a bit confusing so let’s call it group_type since it basically categorize the users of the platform as either Members or Casual(non-members)

cyclistic_df <- rename(cyclistic_df,group_type = member_casual)

Time to ask our data some questions!

Before we ask some questions lets take a final look at our data to make sure everything is as it should be. Lets look at the top 6 rows:

head(cyclistic_df)
##            ride_id rideable_type          started_at            ended_at
## 1 322BD23D287743ED   docked_bike 2020-08-20 18:08:14 2020-08-20 18:17:51
## 2 2A3AEF1AB9054D8B electric_bike 2020-08-27 18:46:04 2020-08-27 19:54:51
## 3 67DC1D133E8B5816 electric_bike 2020-08-26 19:44:14 2020-08-26 21:53:07
## 4 C79FBBD412E578A7 electric_bike 2020-08-27 12:05:41 2020-08-27 12:53:45
## 5 13814D3D661ECADB electric_bike 2020-08-27 16:49:02 2020-08-27 16:59:49
## 6 56349A5A42F0AE51 electric_bike 2020-08-27 17:26:23 2020-08-27 18:07:50
##              start_station_name start_station_id         end_station_name
## 1 Lake Shore Dr & Diversey Pkwy              329   Clark St & Lincoln Ave
## 2        Michigan Ave & 14th St              168   Michigan Ave & 14th St
## 3     Columbus Dr & Randolph St              195   State St & Randolph St
## 4            Daley Center Plaza               81     State St & Kinzie St
## 5      Leavitt St & Division St              658 Leavitt St & Division St
## 6      Leavitt St & Division St              658 Leavitt St & Division St
##   end_station_id group_type     ride_length day_of_week month_year
## 1            141     member   9.616667 mins    Thursday     Aug_20
## 2            168     casual  68.783333 mins    Thursday     Aug_20
## 3             44     casual 128.883333 mins   Wednesday     Aug_20
## 4             47     casual  48.066667 mins    Thursday     Aug_20
## 5            658     casual  10.783333 mins    Thursday     Aug_20
## 6            658     casual  41.450000 mins    Thursday     Aug_20

Everything looks good!

What is the ride count by group type?

ride_length_by_group <- cyclistic_df %>% 
                          group_by(group_type) %>% 
                          count(group_type)
ride_length_by_group
## # A tibble: 2 × 2
## # Groups:   group_type [2]
##   group_type       n
##   <chr>        <int>
## 1 casual     2099442
## 2 member     2623020

Let’s have a visual of that!

ggplot(ride_length_by_group, aes(x=group_type, y = n, fill = group_type)) + geom_col() + scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) + labs(title="Ride Count Per Group",x = "Group Type", y="Count")

1st Conclusion

We can clearly see that the members group use the platform more times than the casual group

What is the total trip duration by group members?

  total_trip_duration <- cyclistic_df %>%
  group_by(group_type) %>%
  summarise(total_trip_duration = sum(ride_length))

Let’s have a visual of that!

 ggplot(total_trip_duration, aes(x=group_type, y = total_trip_duration, fill = group_type)) + geom_col() + scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) + labs(title="Total Trip Duration Per Group",x = "Group Type", y="Count")

Insight #1

Interesting enough, although the member group use the platform more, the casual group’s trip duration is clearly longer. Perhaps the casual group uses the platform for more leisure type rides

What is the average trip duration by group members?

  average_trip_duration <- cyclistic_df %>%
  group_by(group_type) %>%
  summarise(average_trip_duration = mean(ride_length))
 ggplot(average_trip_duration, aes(x=group_type, y = average_trip_duration, fill = group_type)) + geom_col() + scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) + labs(title="Average Trip Duration Per Group",x = "Group Type", y="Count")

Conclusion #2

Casual group’s trip duration a little over 40 min while the average ride duration for member’s group is about 15 min. Big difference!

What is the start day of the week rental by group members?

  start_day_by_group <- cyclistic_df %>%
  group_by(group_type, day_of_week) %>%
  summarise(number_of_rides = n(), average_trip_duration = mean(ride_length)) %>%
  arrange(group_type, desc(number_of_rides))
## `summarise()` has grouped output by 'group_type'. You can override using the `.groups` argument.
start_day_by_group
## # A tibble: 14 × 4
## # Groups:   group_type [2]
##    group_type day_of_week number_of_rides average_trip_duration
##    <chr>      <chr>                 <int> <drtn>               
##  1 casual     Saturday             488365 39.62124 mins        
##  2 casual     Sunday               392472 42.56039 mins        
##  3 casual     Friday               302783 34.85937 mins        
##  4 casual     Thursday             231877 32.12703 mins        
##  5 casual     Wednesday            231184 32.54648 mins        
##  6 casual     Monday               229626 36.19306 mins        
##  7 casual     Tuesday              223135 32.66356 mins        
##  8 member     Wednesday            398766 14.01149 mins        
##  9 member     Saturday             396899 16.22152 mins        
## 10 member     Friday               388560 14.47307 mins        
## 11 member     Tuesday              380080 13.88980 mins        
## 12 member     Thursday             379881 13.72749 mins        
## 13 member     Monday               348733 14.27652 mins        
## 14 member     Sunday               330101 16.76377 mins
ggplot(start_day_by_group, aes(x=day_of_week, y = number_of_rides, fill = group_type)) + geom_col(width=0.5, position = position_dodge(width=0.5)) + scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) + labs(title="Number of rides per day of the week, per group type",x = "Week Days", y="Count")