You can see a sample business deck summarising the key findings and recommendations based on insights I generated from the dataset described in this page here.
I recently completed my Google Data Analytics Professional certificate and at the end of the course was provided with this set of real world transportation data from a bike-sharing company which has been anonymysed as belonging to a fictional company called ‘Cyclistic’, and were given instructions to run analytics on it. The followings are the fictional plots that have been provided alongside the public data:
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
From the background given, the finance analyst of Cyclistic has reported that annual member generates more profit for the company than casual riders, and the director has instructed for a new marketing campaign aiming to convert existing casual riders into annual members. To provide a good recommendation, we will dive into the available data to understand the preferences and profiles of riders based on their membership status to come up with the most effective market strategy to convert casual into annual members.
Problem statement: How to convert Cyclistic’s casual riders into annual members?
The instruction brought us to a AWS S3 bucket index page here which provides the links for all the data to be downloaded. However, I wasn’t able to bulk download the data directly through AWS command line interface; for some reason my access was always restricted.
Instead of clicking each link and downloading them manually, I utilised the Ubuntu CLI to scrape the URL links from the index page, then loop a simple wget command to download all files in one command.
First off, save the index page https://divvy-tripdata.s3.amazonaws.com/index.html as a text file. I then open the text file in a Ubuntu terminal and used grep, sed and awk to filter only the working URLs to each of the sample data.
cat divvy_text.txt | \ # read file
grep https | \ # filter only line with https pattern
awk -F "\t" '{print $1}' | \ # print only the first variable in line, separated by a tab
sed 's/<//' | \ # remove the character < before each https link
sed 's/>//' | \ # remove the character > after each https link
sed '$ d' \ # remove last line
> divvy_url.txt # save as new file
cat divvy_text.txt | grep https | awk -F "\t" '{print $1}' | sed 's/<//' | sed 's/>//' | sed '$ d' > divvy_url.txt
Afterwards, we can simply bulk download all the sample files using the following wget command
wget -i divvy_url.txt
There are a lot of zipped files here! I’m not sure if I am supposed to only use one of them or all of them at this stage, but to be safe, I unzip all of them in the terminals using the following for-loop:
for line in `ls | sed 's/.zip//'`; do unzip ${line}.zip -d ${line}; done # for loop to unzip all files into their individual folders
rm *zip # delete zip files
mv */*csv . # move only csv file into main directory
rm -rf `ls -d */` # remove folders after copying the csv files
Now all the raw data is in a single directory. Let’s view them:
## [1] "202004-divvy-tripdata.csv" "202005-divvy-tripdata.csv"
## [3] "202006-divvy-tripdata.csv" "202007-divvy-tripdata.csv"
## [5] "202008-divvy-tripdata.csv" "202009-divvy-tripdata.csv"
## [7] "202010-divvy-tripdata.csv" "202011-divvy-tripdata.csv"
## [9] "202012-divvy-tripdata.csv" "202101-divvy-tripdata.csv"
## [11] "202102-divvy-tripdata.csv" "202103-divvy-tripdata.csv"
## [13] "202104-divvy-tripdata.csv" "202105-divvy-tripdata.csv"
## [15] "202106-divvy-tripdata.csv" "202107-divvy-tripdata.csv"
## [17] "202108-divvy-tripdata.csv" "202109-divvy-tripdata.csv"
## [19] "202110-divvy-tripdata.csv" "202111-divvy-tripdata.csv"
## [21] "202112-divvy-tripdata.csv" "202201-divvy-tripdata.csv"
## [23] "202202-divvy-tripdata.csv" "202203-divvy-tripdata.csv"
## [25] "202204-divvy-tripdata.csv" "202205-divvy-tripdata.csv"
## [27] "202206-divvy-tripdata.csv" "202207-divvy-tripdata.csv"
## [29] "202208-divvy-tripdata.csv" "202209-divvy-publictripdata.csv"
## [31] "202210-divvy-tripdata.csv" "202211-divvy-tripdata.csv"
## [33] "Divvy_Stations_2015.csv" "Divvy_Stations_2016_Q3.csv"
## [35] "Divvy_Stations_2016_Q4.csv" "Divvy_Stations_2017_Q1Q2.csv"
## [37] "Divvy_Stations_2017_Q3Q4.csv" "Divvy_Trips_2014_Q1Q2.csv"
## [39] "Divvy_Trips_2015-Q1.csv" "Divvy_Trips_2015-Q2.csv"
## [41] "Divvy_Trips_2015_07.csv" "Divvy_Trips_2015_08.csv"
## [43] "Divvy_Trips_2015_09.csv" "Divvy_Trips_2015_Q4.csv"
## [45] "Divvy_Trips_2016_Q3.csv" "Divvy_Trips_2016_Q4.csv"
## [47] "Divvy_Trips_2017_Q1.csv" "Divvy_Trips_2017_Q2.csv"
## [49] "Divvy_Trips_2017_Q3.csv" "Divvy_Trips_2017_Q4.csv"
## [51] "Divvy_Trips_2018_Q1.csv" "Divvy_Trips_2018_Q2.csv"
## [53] "Divvy_Trips_2018_Q3.csv" "Divvy_Trips_2018_Q4.csv"
## [55] "Divvy_Trips_2019_Q1.csv" "Divvy_Trips_2019_Q2.csv"
## [57] "Divvy_Trips_2019_Q3.csv" "Divvy_Trips_2019_Q4.csv"
## [59] "Divvy_Trips_2020_Q1.csv"
We will utilise data from 2020-2022 to get the trend of the last 3 years. Since each month’s trip data is saved to a different file, we will merge them together to facilitate a thorough analysis.
library(tidyverse)
tripdatalist = list.files(path = 'input/raw', pattern = 'tripdata.csv', full.names = T)
tripdataall =
lapply(tripdatalist, function(x) {
out =
read_csv(x) %>%
mutate(start_station_id = as.character(start_station_id),
end_station_id = as.character(end_station_id)) # to prevent error when merging due to incompatible variable type
})
tripdata = dplyr::bind_rows(tripdataall)
rm(tripdataall) # remove the list to save memory
## # A tibble: 6 × 13
## ride_id rideable_type started_at ended_at start_station_n…
## <chr> <chr> <dttm> <dttm> <chr>
## 1 A847FA… docked_bike 2020-04-26 17:45:14 2020-04-26 18:12:03 Eckhart Park
## 2 5405B8… docked_bike 2020-04-17 17:08:54 2020-04-17 17:17:03 Drake Ave & Ful…
## 3 5DD24A… docked_bike 2020-04-01 17:54:13 2020-04-01 18:08:36 McClurg Ct & Er…
## 4 2A59BB… docked_bike 2020-04-07 12:50:19 2020-04-07 13:02:31 California Ave …
## 5 27AD30… docked_bike 2020-04-18 10:22:59 2020-04-18 11:15:54 Rush St & Hubba…
## 6 356216… docked_bike 2020-04-30 17:55:47 2020-04-30 18:01:11 Mies van der Ro…
## # … with 8 more variables: start_station_id <chr>, end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>
I have successfully merged the trip data files together, with a total of 14195770 rows and 13 columns. There are several variables that we can look at, including trip day and time, ride duration, routes and geolocations. Ride ID was also provided but was not referring to unique membership profile, hence we will not use it in our analysis.
We can calculate the trip duration since we have the start and end time for each trip, as well as the distance covered since we have the station geolatitude data. Are there any difference in ride duration between members and casual riders? As the duration data was extremely left skewed, I log-transformed the data to achieve a better distribution before plotting them.
Members had shorter ride duration than non-members (Wilcoxon test, p<0.05). To see whether this is true all the time, I subset the analysis to each individual months to see whether the trend persisted.
The trend seems to be observed throughout the months and years.
Next up, we will evaluate whether casual riders or members are more likely to belong to be concentrated to certain are or whether they are uniformly distributed across Chicago. We can infer this by observing the stations most commonly used by casual riders or members. Since there are a lot of different stations to comfortably visualise, let’s statistically determine which station had significantly different usage between casual riders and members, and visualise those with significantly different membership proportion.
We observed 13 starting stations where the rate of casual riders were significantly higher than members. This is especially true for Streeter Driver & Grand Avenue, as well as Lake Shore Drive & Monroe Street and Millenium Park, with casual riders outnumbering members several folds.
Now let’s run the same analysis for destination station.
Interestingly, 6 stations were dominated by casual riders both as the starting and ending station. These areas might be within leisure zones, resulting in people to be more likely to come sporadically instead of being a member, although this likely needs to be discussed with a subject matter expert.
What if we considered the overall route taken by casual vs members? Let’s find out.
Investigating the route confirmed our hypothesis, where stations dominated by casual riders mainly started and ended at the same station. Although we need more data to support this claim, but the much higher proportion of casual riders in these routes might be indicative of their propensity to be used for recreational activity (hence few regular commuters), although this will need to be confirmed by the subject matter expert. Unfortunately, we also do not have information on the number of unique riders per station, which could support or deny this hypothesis.
Nevertheless, knowing that these stations are dominated by casual riders, let’s further investigate whether this trend is true throughout the year.
Among stations with >50 rides recorded, there doesn’t seem to be a time effect on casual/member riders on the popular routes, with each routes dominated by either member of casual riders throughout the year.
What if we ran the analysis on origin and destination station?
Similar to routes, the distribution of casual and member rides across stations seem to be similar throughout the years.
We managed to identify a few distinction between casual and member
riders so far, but let’s analyse this statistically to provide some
numerical backings to our findings. We will do it by running a
correlation analysis between membership and the numerical variables from
our dataset (duration, latitude, longitude).
The correlation analyses returned some significant features associated with membership status. We have already discussed the correlation between membership and duration previously, so we’ll skip this. The correlation between starting and ending latitudes and longitudes likely relates to the common routes being taken.
One interesting this is the weak negative correlation observed between membership status and start/ending longitude. I will investigate this lead further to see if we can get something.
Although the statistics observed significant correlation between membership status and longitudinal coordinate, this is not immediately visible from the observed graph. Remember from our previous observation where we observed a preference for member/non-members towards certain stations/routes? This correlations could be a reflection of the membership preference, although basing it on latitude/longitude alone would be significantly confounded with noise since some member/non-member dominant stations might belong to a similar latitude but different longitude and vice cersa.
There seems to be different proportion of memberships utilising each of three bike types. Members seem to utilise classic bikes more often than casual riders (54.17% of members ride classics vs 42.14% of casuals), while casuals exhibited almost twice the proportion of docked bike rides (31.94%) against members (20.16%), while electric bike usage was similar across membership status (member vs casual, 25.67% vs 25.91%).
Let’s see if this is true throughout the week:
Interestingly, casual riders outnumbered members in all three ride types on the weekend, with member rides being more frequent than casual riders throughout the weekday. Nevertheless, the difference between members and casual riders were much higher in classic bikes compared to the other two types.
Is there a specific stations which are more popular on the weekends or weekdays?
A lot of stations indeed have a higher number of total rides in the weekend compared to the weekday due to the higher count of casual riders. Frequency of member riders were more station-dependent, with some stations seeing a reduction in members riding on the weekends while others see a relatively constant number.
On a side note, this observation also supports our earlier findings where certain routes are dominated heavily by either members or non-members.
Let’s check the proportion of bike types across members among the popular stations.
We observe that bike preference was very route-dependent. Generally, casual riders were more likely to ride docked and electric bikes compared to members, who preferred to stick to classic bike. This likely requires more data, for example classic bike likely costs cheaper to ride than docked or electric bikes, hence regular users might prefer the cheaper option compared to the presumed sporadic ride habits of casual members. More data on unique rider ID and cost of rides is needed to ascertain this hypothesis.
Since we observed the weekly trend of casual and member riders throughout the week, let’s further look at this by analysing the number of rides taken across membership throughout the week, month, and year.
We first explore whether there are any variation in the number of rides throughout the year.
There is much lower number of rides between November and February. Here, we also see that the Cylictic usage peaks at similar time every year, always peaking at either July or August.
But how do membership status differs across this time?
Much higher casual riders were generally observed during the mid-year, around May to August, where there were similar number of casual and member riders. Member riders however outnumbered casual riders during the beginning and end of year, especially at year end.
Looking at the daily breakdown of ride frequency, we observed that casual riders outnumbered members on weekends (Saturday and Sunday), with members being a more frequent rider on the other days, as we have observed previously. Let’s see if this is true across the year.
We observed that the higher frequency of casual riders during the weekend was only true during peak months as observed earlier, and were much higher in 2021 compared to 2022 and 2020. There was much less casual riders compared to members during November and December throughout the 3 year period.
Let’s review our findings:
We can generate some hypotheses based on our observation.
Based on our findings and hypotheses, we provide the following data-driven recommendations for the Director of Cyclistic:
And some recommended data to collect to validate our recommendations: