Google Data Analytics: Cyclistic

TL;DR?

You can see a sample business deck summarising the key findings and recommendations based on insights I generated from the dataset described in this page here.

Foreword

I recently completed my Google Data Analytics Professional certificate and at the end of the course was provided with this set of real world transportation data from a bike-sharing company which has been anonymysed as belonging to a fictional company called ‘Cyclistic’, and were given instructions to run analytics on it. The followings are the fictional plots that have been provided alongside the public data:

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Characters and Teams

Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

My analytics workflow

From the background given, the finance analyst of Cyclistic has reported that annual member generates more profit for the company than casual riders, and the director has instructed for a new marketing campaign aiming to convert existing casual riders into annual members. To provide a good recommendation, we will dive into the available data to understand the preferences and profiles of riders based on their membership status to come up with the most effective market strategy to convert casual into annual members.

Problem statement: How to convert Cyclistic’s casual riders into annual members?

Downloading the dataset

The instruction brought us to a AWS S3 bucket index page here which provides the links for all the data to be downloaded. However, I wasn’t able to bulk download the data directly through AWS command line interface; for some reason my access was always restricted.

Instead of clicking each link and downloading them manually, I utilised the Ubuntu CLI to scrape the URL links from the index page, then loop a simple wget command to download all files in one command.

First off, save the index page https://divvy-tripdata.s3.amazonaws.com/index.html as a text file. I then open the text file in a Ubuntu terminal and used grep, sed and awk to filter only the working URLs to each of the sample data.

cat divvy_text.txt | \ # read file
grep https | \ # filter only line with https pattern 
awk -F "\t" '{print $1}' | \ # print only the first variable in line, separated by a tab
sed 's/<//' | \ # remove the character < before each https link
sed 's/>//' | \ # remove the character > after each https link
sed '$ d' \ # remove last line 
> divvy_url.txt # save as new file

cat divvy_text.txt | grep https | awk -F "\t" '{print $1}' | sed 's/<//' | sed 's/>//' | sed '$ d' > divvy_url.txt

Afterwards, we can simply bulk download all the sample files using the following wget command

wget -i divvy_url.txt

There are a lot of zipped files here! I’m not sure if I am supposed to only use one of them or all of them at this stage, but to be safe, I unzip all of them in the terminals using the following for-loop:

for line in `ls | sed 's/.zip//'`; do unzip ${line}.zip -d ${line}; done # for loop to unzip all files into their individual folders
rm *zip       # delete zip files
mv */*csv .   # move only csv file into main directory
rm -rf `ls -d */` # remove folders after copying the csv files

Now all the raw data is in a single directory. Let’s view them:

##  [1] "202004-divvy-tripdata.csv"       "202005-divvy-tripdata.csv"      
##  [3] "202006-divvy-tripdata.csv"       "202007-divvy-tripdata.csv"      
##  [5] "202008-divvy-tripdata.csv"       "202009-divvy-tripdata.csv"      
##  [7] "202010-divvy-tripdata.csv"       "202011-divvy-tripdata.csv"      
##  [9] "202012-divvy-tripdata.csv"       "202101-divvy-tripdata.csv"      
## [11] "202102-divvy-tripdata.csv"       "202103-divvy-tripdata.csv"      
## [13] "202104-divvy-tripdata.csv"       "202105-divvy-tripdata.csv"      
## [15] "202106-divvy-tripdata.csv"       "202107-divvy-tripdata.csv"      
## [17] "202108-divvy-tripdata.csv"       "202109-divvy-tripdata.csv"      
## [19] "202110-divvy-tripdata.csv"       "202111-divvy-tripdata.csv"      
## [21] "202112-divvy-tripdata.csv"       "202201-divvy-tripdata.csv"      
## [23] "202202-divvy-tripdata.csv"       "202203-divvy-tripdata.csv"      
## [25] "202204-divvy-tripdata.csv"       "202205-divvy-tripdata.csv"      
## [27] "202206-divvy-tripdata.csv"       "202207-divvy-tripdata.csv"      
## [29] "202208-divvy-tripdata.csv"       "202209-divvy-publictripdata.csv"
## [31] "202210-divvy-tripdata.csv"       "202211-divvy-tripdata.csv"      
## [33] "Divvy_Stations_2015.csv"         "Divvy_Stations_2016_Q3.csv"     
## [35] "Divvy_Stations_2016_Q4.csv"      "Divvy_Stations_2017_Q1Q2.csv"   
## [37] "Divvy_Stations_2017_Q3Q4.csv"    "Divvy_Trips_2014_Q1Q2.csv"      
## [39] "Divvy_Trips_2015-Q1.csv"         "Divvy_Trips_2015-Q2.csv"        
## [41] "Divvy_Trips_2015_07.csv"         "Divvy_Trips_2015_08.csv"        
## [43] "Divvy_Trips_2015_09.csv"         "Divvy_Trips_2015_Q4.csv"        
## [45] "Divvy_Trips_2016_Q3.csv"         "Divvy_Trips_2016_Q4.csv"        
## [47] "Divvy_Trips_2017_Q1.csv"         "Divvy_Trips_2017_Q2.csv"        
## [49] "Divvy_Trips_2017_Q3.csv"         "Divvy_Trips_2017_Q4.csv"        
## [51] "Divvy_Trips_2018_Q1.csv"         "Divvy_Trips_2018_Q2.csv"        
## [53] "Divvy_Trips_2018_Q3.csv"         "Divvy_Trips_2018_Q4.csv"        
## [55] "Divvy_Trips_2019_Q1.csv"         "Divvy_Trips_2019_Q2.csv"        
## [57] "Divvy_Trips_2019_Q3.csv"         "Divvy_Trips_2019_Q4.csv"        
## [59] "Divvy_Trips_2020_Q1.csv"

We will utilise data from 2020-2022 to get the trend of the last 3 years. Since each month’s trip data is saved to a different file, we will merge them together to facilitate a thorough analysis.

library(tidyverse)

tripdatalist = list.files(path = 'input/raw', pattern = 'tripdata.csv', full.names = T)

tripdataall = 
  lapply(tripdatalist, function(x) {
  out = 
    read_csv(x) %>% 
    mutate(start_station_id = as.character(start_station_id),
           end_station_id = as.character(end_station_id))      # to prevent error when merging due to incompatible variable type
}) 

tripdata = dplyr::bind_rows(tripdataall)

rm(tripdataall) # remove the list to save memory

## # A tibble: 6 × 13
##   ride_id rideable_type started_at          ended_at            start_station_n…
##   <chr>   <chr>         <dttm>              <dttm>              <chr>           
## 1 A847FA… docked_bike   2020-04-26 17:45:14 2020-04-26 18:12:03 Eckhart Park    
## 2 5405B8… docked_bike   2020-04-17 17:08:54 2020-04-17 17:17:03 Drake Ave & Ful…
## 3 5DD24A… docked_bike   2020-04-01 17:54:13 2020-04-01 18:08:36 McClurg Ct & Er…
## 4 2A59BB… docked_bike   2020-04-07 12:50:19 2020-04-07 13:02:31 California Ave …
## 5 27AD30… docked_bike   2020-04-18 10:22:59 2020-04-18 11:15:54 Rush St & Hubba…
## 6 356216… docked_bike   2020-04-30 17:55:47 2020-04-30 18:01:11 Mies van der Ro…
## # … with 8 more variables: start_station_id <chr>, end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>

I have successfully merged the trip data files together, with a total of 14195770 rows and 13 columns. There are several variables that we can look at, including trip day and time, ride duration, routes and geolocations. Ride ID was also provided but was not referring to unique membership profile, hence we will not use it in our analysis.

Members ride shorter trip than casual riders

We can calculate the trip duration since we have the start and end time for each trip, as well as the distance covered since we have the station geolatitude data. Are there any difference in ride duration between members and casual riders? As the duration data was extremely left skewed, I log-transformed the data to achieve a better distribution before plotting them.

Members had shorter ride duration than non-members (Wilcoxon test, p<0.05). To see whether this is true all the time, I subset the analysis to each individual months to see whether the trend persisted.

The trend seems to be observed throughout the months and years.

Members and casual riders dominated different routes

Next up, we will evaluate whether casual riders or members are more likely to belong to be concentrated to certain are or whether they are uniformly distributed across Chicago. We can infer this by observing the stations most commonly used by casual riders or members. Since there are a lot of different stations to comfortably visualise, let’s statistically determine which station had significantly different usage between casual riders and members, and visualise those with significantly different membership proportion.

We observed 13 starting stations where the rate of casual riders were significantly higher than members. This is especially true for Streeter Driver & Grand Avenue, as well as Lake Shore Drive & Monroe Street and Millenium Park, with casual riders outnumbering members several folds.

Now let’s run the same analysis for destination station.

Interestingly, 6 stations were dominated by casual riders both as the starting and ending station. These areas might be within leisure zones, resulting in people to be more likely to come sporadically instead of being a member, although this likely needs to be discussed with a subject matter expert.

What if we considered the overall route taken by casual vs members? Let’s find out.

Investigating the route confirmed our hypothesis, where stations dominated by casual riders mainly started and ended at the same station. Although we need more data to support this claim, but the much higher proportion of casual riders in these routes might be indicative of their propensity to be used for recreational activity (hence few regular commuters), although this will need to be confirmed by the subject matter expert. Unfortunately, we also do not have information on the number of unique riders per station, which could support or deny this hypothesis.

Nevertheless, knowing that these stations are dominated by casual riders, let’s further investigate whether this trend is true throughout the year.

Among stations with >50 rides recorded, there doesn’t seem to be a time effect on casual/member riders on the popular routes, with each routes dominated by either member of casual riders throughout the year.

What if we ran the analysis on origin and destination station?

Similar to routes, the distribution of casual and member rides across stations seem to be similar throughout the years.

Significant but weak correlation between membership and geolatitude

We managed to identify a few distinction between casual and member riders so far, but let’s analyse this statistically to provide some numerical backings to our findings. We will do it by running a correlation analysis between membership and the numerical variables from our dataset (duration, latitude, longitude).

The correlation analyses returned some significant features associated with membership status. We have already discussed the correlation between membership and duration previously, so we’ll skip this. The correlation between starting and ending latitudes and longitudes likely relates to the common routes being taken.

One interesting this is the weak negative correlation observed between membership status and start/ending longitude. I will investigate this lead further to see if we can get something.

Although the statistics observed significant correlation between membership status and longitudinal coordinate, this is not immediately visible from the observed graph. Remember from our previous observation where we observed a preference for member/non-members towards certain stations/routes? This correlations could be a reflection of the membership preference, although basing it on latitude/longitude alone would be significantly confounded with noise since some member/non-member dominant stations might belong to a similar latitude but different longitude and vice cersa.

Members have a higher usage of classic bikes, with casual riders frequently utilising docked and electric bikes

There seems to be different proportion of memberships utilising each of three bike types. Members seem to utilise classic bikes more often than casual riders (54.17% of members ride classics vs 42.14% of casuals), while casuals exhibited almost twice the proportion of docked bike rides (31.94%) against members (20.16%), while electric bike usage was similar across membership status (member vs casual, 25.67% vs 25.91%).

Let’s see if this is true throughout the week:

Interestingly, casual riders outnumbered members in all three ride types on the weekend, with member rides being more frequent than casual riders throughout the weekday. Nevertheless, the difference between members and casual riders were much higher in classic bikes compared to the other two types.

Is there a specific stations which are more popular on the weekends or weekdays?

A lot of stations indeed have a higher number of total rides in the weekend compared to the weekday due to the higher count of casual riders. Frequency of member riders were more station-dependent, with some stations seeing a reduction in members riding on the weekends while others see a relatively constant number.

On a side note, this observation also supports our earlier findings where certain routes are dominated heavily by either members or non-members.

Let’s check the proportion of bike types across members among the popular stations.

We observe that bike preference was very route-dependent. Generally, casual riders were more likely to ride docked and electric bikes compared to members, who preferred to stick to classic bike. This likely requires more data, for example classic bike likely costs cheaper to ride than docked or electric bikes, hence regular users might prefer the cheaper option compared to the presumed sporadic ride habits of casual members. More data on unique rider ID and cost of rides is needed to ascertain this hypothesis.

Members were more likely to utilise bikes throughout the year then non-members, whose bike usage drops drastically during the beginning and end of year

Since we observed the weekly trend of casual and member riders throughout the week, let’s further look at this by analysing the number of rides taken across membership throughout the week, month, and year.

We first explore whether there are any variation in the number of rides throughout the year.

There is much lower number of rides between November and February. Here, we also see that the Cylictic usage peaks at similar time every year, always peaking at either July or August.

But how do membership status differs across this time?

Much higher casual riders were generally observed during the mid-year, around May to August, where there were similar number of casual and member riders. Member riders however outnumbered casual riders during the beginning and end of year, especially at year end.

Looking at the daily breakdown of ride frequency, we observed that casual riders outnumbered members on weekends (Saturday and Sunday), with members being a more frequent rider on the other days, as we have observed previously. Let’s see if this is true across the year.

We observed that the higher frequency of casual riders during the weekend was only true during peak months as observed earlier, and were much higher in 2021 compared to 2022 and 2020. There was much less casual riders compared to members during November and December throughout the 3 year period.

Compiling our findings

Let’s review our findings:

Members ride shorter trip than casual riders
Members and casuals dominated different routes
Members preferred classic bikes whereas casuals preferred docked and electric bikes
Members dominated weekdays; casuals on weekends
Rides peaked during yearly Q2-Q3, with Q1&Q4 dominated by members

We can generate some hypotheses based on our observation.

Members likely utilised bike rides for regular commutes, with casuals more for recreational and weekend activities. This is supported by the higher frequency of casuals on the weekends and the more regular ride frequency by members throughout the year. Longer ride duration by casuals could reflect a less time-contrained trip due to the leisure nature of weekend activities. Also, casuals dominated stations which are likely near recreational areas such as lake shore drive, which runs along the Chicago coastline and beaches.
Members prefer classic bikes to other bike types despite options provided in their stations. Classic bikes might be more cost effective, or are more ubiquitous.

Based on our findings and hypotheses, we provide the following data-driven recommendations for the Director of Cyclistic:

A subscription-based method which costs less than what the typical rider pays for multiple one-time passes may be persuasive enough to convert these regular casuals into members. This approach is suitable if the company’s goal is to capture more members and is willing to cut revenue in the short trm, since the director has identified members to be more profitable in the long run.
Casual->Member conversion strategy targeted to areas with high proportion of casual riders such as Streeter Drive and Lake Shore Drive.
Weekend-only membership plan for regular casual riders who frequently used the bike on the weekends, but are not members due to the lack of usage on weekdays, hence deterring a subscription plan (which could be more expensive than purchasing day passes if only used on the weekends).
Limit the usage of non-classic bikes as a privilege to members only, which could entice casuals to convert their membership. Warning: This option should be carefully evaluated with the market demands of Cyclistic bikes, as it could backfire and lead to reduced usage of bikes without any membership conversion.

And some recommended data to collect to validate our recommendations:

Unique rider ID: Informs how frequently/regularly members and casuals utilized the bike rides
Ride cost and availability for different bike types across different locations: Informs motivation for bike type choice preference
Ride availability at different locations: Informs bike type choice preference
Expert input on which stations and routes constitute hot commute or vacation area: Validates commuting/recreational hypothesis