---

title: "Cyclistic Bike-Share Data Analysis"

author: "Daniel Draney"

date: "2024-09-18"

output:

  html_document: default

  pdf_document: default

---


```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = TRUE)

```


## Background on this project and the company

This project was part of my Google Data Analytics course. I worked as a junior data analyst on the marketing analyst team at Cyclistic, a fictional bike-share company in Chicago. The company wants to understand how it can maximize the number of annual memberships. To do that, I needed to analyze how casual riders and annual members use Cyclistic bikes differently. 


In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. 


This use case presented a unique opportunity to analyze different types of data and further hone some of my newest skills in data analysis and the R programming language. Work on this project followed my typical data analysis process: *Ask*, *Prepare*, *Process*, *Analyze*, *Share*, *Act*, and *Reflect*. You can read more about my approach to analysis [here](https://sites.google.com/u.boisestate.edu/danieldraney/my-analysis-approach?authuser=0). Continue reading for the details of what I learned in each step of the process, along with the skills and thought process I applied.



## Ask

* Three questions are asked of the team for this project:

   1. How do annual members and casual riders use Cyclistic bikes differently?

   2. Why would casual riders buy Cyclistic annual memberships?

   3. How can Cyclistic use digital media to influence casual riders to become members?

* I have a great team available to help with this project. In addition to the marketing analyst team, Moreno, the director of marketing, has provided clear guidance and support on what is needed to achieve success.

* **The Business Task:** I will focus on the first question, looking at how different categories of riders use the bikes differently. Understanding this data will help inform the rest of the team's work on the remaining questions.


## Prepare

* The timeline for this project is short; the work should be completed in just a few days. 

* The data necessary for this work is available in a public dataset, which can be found [here](https://divvy-tripdata.s3.amazonaws.com/index.html).

* I'll begin by downloading the data for the last 12 months and preparing it for analysis.

* The initial analysis will include:

   1. Determining how the data is organized

   2. Sorting and filtering the data

   3. Assessing the credibility of the data

* **Data Sources Used:** All data sources used for this project were provided by the Cyclistic team and are publicly available for download through the link above.
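As a quick illustration of those initial checks, here is a minimal sketch in base R. The data frame below is a tiny made-up stand-in for the real Divvy trip files (the column names and every value in it are hypothetical, chosen only to demonstrate the checks):

```r
# Toy sample standing in for the real Divvy trip data -- values are hypothetical
trips <- data.frame(
  trip_id    = 1:4,
  usertype   = c("Subscriber", "Customer", "Customer", "Subscriber"),
  start_time = as.POSIXct(c("2019-01-05 08:00", "2019-01-05 08:30",
                            "2019-01-06 14:00", "2019-01-07 09:15")),
  end_time   = as.POSIXct(c("2019-01-05 08:20", "2019-01-05 09:10",
                            "2019-01-06 13:55", "2019-01-07 09:45"))
)

# 1. How is the data organized? Inspect columns and types
str(trips)

# 2. Sort and filter: longest trips first, then subscribers only
durations <- as.numeric(difftime(trips$end_time, trips$start_time, units = "mins"))
trips[order(-durations), ]
subset(trips, usertype == "Subscriber")

# 3. Credibility checks: date range covered, and impossible (negative) durations
range(trips$start_time)
sum(durations < 0)  # one bad record in this toy sample
```

The same three checks scale directly to the full CSV files once they are loaded in the chunk below.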


```{r import the data to prepare for processing, include=FALSE}

# install.packages("tidyverse") # install once outside the knit, if needed

library(tidyverse)  # helps wrangle data

# Use the conflicted package to manage conflicts

library(conflicted)


# Set dplyr::filter and dplyr::lag as the default choices

conflict_prefer("filter", "dplyr")

conflict_prefer("lag", "dplyr")


#=====================

# STEP 1: COLLECT DATA

#=====================

# Upload Divvy datasets (csv files) here

q1_2019 <- read_csv("Divvy_Trips_2019_Q1.csv")

q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")

```

   

## Process

* Now that I've downloaded the data and done the initial preparatory analysis, I need to process the data to get from raw to clean.

* I will be using RStudio for the primary work on this project, with Excel for some of the initial analysis and data verification.

* My entire cleaning, filtering, transforming, and bias-checking of the data will be done within R so that I can maintain a record of the work, document any cleaning of the data, and keep the process repeatable.

* **Data Cleaning and Manipulation:** See the R Markdown file for the specifics and the code used, but there were four primary steps:

   1. Normalize the fields in the 2019 and 2020 datasets so that the data could be aggregated.

   2. Calculate the length of time for each ride segment.

   3. Calculate the date parts of the date fields for easier summary.

   4. Remove data with negative ride times and quality control records.

   

```{r process and combine the data to prepare for cleansing, include=FALSE}

#====================================================

# STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE

#====================================================

# Compare the column names of each file

# While the names don't have to be in the same order, they DO need to match perfectly before we can use a command to join them into one file

colnames(q1_2019)

colnames(q1_2020)


# Rename columns to make them consistent with q1_2020 (assumed to be the going-forward table design for Divvy)


(q1_2019 <- rename(q1_2019

                   ,ride_id = trip_id

                   ,rideable_type = bikeid

                   ,started_at = start_time

                   ,ended_at = end_time

                   ,start_station_name = from_station_name

                   ,start_station_id = from_station_id

                   ,end_station_name = to_station_name

                   ,end_station_id = to_station_id

                   ,member_casual = usertype

))


# Inspect the dataframes and look for incongruencies

str(q1_2019)

str(q1_2020)


# Convert ride_id and rideable_type to character so that they can stack correctly

q1_2019 <-  mutate(q1_2019, ride_id = as.character(ride_id)

                   ,rideable_type = as.character(rideable_type)) 


# Stack individual quarter's data frames into one big data frame

all_trips <- bind_rows(q1_2019, q1_2020)


# Remove lat, long, birthyear, and gender fields as this data was dropped beginning in 2020

all_trips <- all_trips %>%  

  select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender,  "tripduration"))


```




```{r clean and manipulate the data to prepare for analysis, include=FALSE}

#======================================================

# STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS

#======================================================

# Inspect the new table that has been created (the items below are commented out so their output doesn't break the knit)

# colnames(all_trips)  #List of column names

# nrow(all_trips)  #How many rows are in data frame?

# dim(all_trips)  #Dimensions of the data frame?

# head(all_trips)  #See the first 6 rows of data frame.  Also tail(all_trips)

# str(all_trips)  #See list of columns and data types (numeric, character, etc)

# summary(all_trips)  #Statistical summary of data. Mainly for numerics


# There are a few problems we will need to fix:

# (1) In the "member_casual" column, there are two names for members ("member" and "Subscriber") and two names for casual riders ("Customer" and "casual"). We will need to consolidate that from four to two labels.

# (2) The data can only be aggregated at the ride-level, which is too granular. We will want to add some additional columns of data -- such as day, month, year -- that provide additional opportunities to aggregate the data.

# (3) We will want to add a calculated field for length of ride since the 2020Q1 data did not have the "tripduration" column. We will add "ride_length" to the entire dataframe for consistency.

# (4) There are some rides where tripduration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.


# In the "member_casual" column, replace "Subscriber" with "member" and "Customer" with "casual"

# Before 2020, Divvy used different labels for these two types of riders ... we will want to make our dataframe consistent with their current nomenclature

# N.B.: "Level" is a special property of a column that is retained even if a subset does not contain any values from a specific level

# Begin by seeing how many observations fall under each usertype

# table(all_trips$member_casual) #commenting out so the view doesn't break the Knit


# Reassign to the desired values (we will go with the current 2020 labels)

all_trips <-  all_trips %>% 

  mutate(member_casual = recode(member_casual, "Subscriber" = "member", "Customer" = "casual"))


# Check to make sure the proper number of observations were reassigned

# table(all_trips$member_casual) # commenting out so the view doesn't break the Knit


# Add columns that list the date, month, day, and year of each ride

# This will allow us to aggregate ride data for each month, day, or year ... before completing these operations we could only aggregate at the ride level

# More on date formats in R: https://www.statmethods.net/input/dates.html

all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd

all_trips$month <- format(as.Date(all_trips$date), "%m")

all_trips$day <- format(as.Date(all_trips$date), "%d")

all_trips$year <- format(as.Date(all_trips$date), "%Y")

all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")


# Add a "ride_length" calculation to all_trips (in seconds)

# https://stat.ethz.ch/R-manual/R-devel/library/base/html/difftime.html

all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs")


# Add a field for ride length in days, to flag rides longer than one day for further analysis

all_trips <- all_trips %>% 

  mutate(ride_length_days = as.numeric(ride_length) / 60 / 60 / 24) # seconds -> minutes -> hours -> days


# Inspect the structure of the columns

# str(all_trips) # commenting out so the view doesn't break the Knit


# Convert "ride_length" from Factor to numeric so we can run calculations on the data

# is.factor(all_trips$ride_length) # commenting out so the view doesn't break the Knit

all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))

# is.numeric(all_trips$ride_length)# commenting out so the view doesn't break the Knit


# Remove "bad" data

# The dataframe includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative

# We will create a new version of the dataframe (v2) since data is being removed

# https://www.datasciencemadesimple.com/delete-or-drop-rows-in-r-with-conditions-2/

all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]

all_trips_v2 <- all_trips_v2 %>% arrange(-ride_length)

```



## Analyze

* We've asked the questions and prepared and processed the data; now we need to analyze it so we can make recommendations.

* **Summary of Analysis:**

   * Casual riders take longer rides on average than members

   * Casual riders take fewer rides than members

   * Casual riders take roughly 50% more rides longer than one day than members

```{r analyze the data, include=FALSE}

#=====================================

# STEP 4: CONDUCT DESCRIPTIVE ANALYSIS

#=====================================

# Descriptive analysis on ride_length (all figures in seconds)

mean(all_trips_v2$ride_length) #straight average (total ride length / rides)

median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengths

max(all_trips_v2$ride_length) #longest ride 

min(all_trips_v2$ride_length) #shortest ride


# You can condense the four lines above to one line using summary() on the specific attribute

summary(all_trips_v2$ride_length)


# Compare members and casual users

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)


# See the average ride time by each day for members vs casual users

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)


# Notice that the days of the week are out of order. Let's fix that.

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))


# Now, let's run the average ride time by each day for members vs casual users

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)


# analyze ridership data by type and weekday

all_trips_v2 %>% 

  mutate(weekday = wday(started_at, label = TRUE)) %>%  #creates weekday field using wday()

  group_by(member_casual, weekday) %>%  #groups by usertype and weekday

  summarise(number_of_rides = n() #calculates the number of rides and average duration 

            ,average_duration = mean(ride_length)) %>% # calculates the average duration

  arrange(member_casual, weekday) # sorts

```



## Share

* **Let's take a look at a few of the key findings**

* A visualization of the average ride length for casual riders compared to member riders

* Casual riders take longer rides on average

```{r plot of ride length by rider type, echo=FALSE, message = FALSE}

# Let's create a visualization for average duration, for rides on only a single day

all_trips_v2 %>% 

  filter(ride_length_days <= 1) %>%

  mutate(weekday = wday(started_at, label = TRUE)) %>% 

  group_by(member_casual, weekday) %>% 

  summarise(number_of_rides = n()

            ,average_duration = mean(ride_length)/60) %>% #divide by 60 to show in minutes

  arrange(member_casual, weekday)  %>% 

  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +

  geom_col(position = "dodge") +

  labs(title = "Average Ride Duration by Rider Type (Rides < 1 Day)") +

  xlab("Day of Week") + 

  ylab("Minutes per Ride")

```


* A visualization of the number of rides for casual riders compared to member riders

* Casual riders take fewer rides on average

```{r plot of number of rides by rider type, echo=FALSE, message = FALSE}

# Let's visualize the number of rides by rider type

all_trips_v2 %>% 

  mutate(weekday = wday(started_at, label = TRUE)) %>% 

  group_by(member_casual, weekday) %>% 

  summarise(number_of_rides = n()

            ,average_duration = mean(ride_length)/60) %>% #divide by 60 to show in minutes

  arrange(member_casual, weekday)  %>% 

  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +

  geom_col(position = "dodge") +

  labs(title = "Number of Rides by Rider Type") +

  xlab("Day of Week") + ylab("Number of Rides")

```


* A visualization of the number of rides longer than one day, per rider type

* Casual riders take 50% more long rides than members

```{r plot of number of rides more than one day long by rider type, echo=FALSE, message = FALSE}

# Let's create a visualization for number of rides greater than one day

all_trips_v2 %>% 

  filter(ride_length_days > 1) %>%

  group_by(member_casual) %>% 

  summarise(number_of_rides = n()) %>% 

  ggplot(aes(x = member_casual, y = number_of_rides, fill = member_casual)) +

  geom_col(position = "dodge")+

  labs(title = "Number of Rides More than One Day") +

  xlab("Member Type") + ylab("Rides More than One Day")

```


## Act

* **Recommendations**

   1. On average, casual riders take longer rides. Since casual riders pay by the length of the ride, marketing the message that a membership lets them ride more frequently at a better price per ride could attract more members.

   2. Casual riders take fewer rides. Marketing that the membership option pays for itself with as few as x rides could interest them in a membership rather than paying for individual rides.

   3. Casual riders take more rides longer than one day than members do, paying for multiple day passes for a single ride. If we position membership prices the right way, casual riders will see the benefit of a membership over paying for multiple day passes for one ride.
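To make the break-even argument in these recommendations concrete, here is a tiny sketch of the calculation. The prices are made-up placeholders for illustration only, not actual Cyclistic rates:

```r
# Hypothetical prices for illustration only -- not actual Cyclistic rates
single_ride_price <- 3.30   # assumed cost of one casual ride
annual_membership <- 99.00  # assumed cost of an annual membership

# Number of single rides per year at which the membership pays for itself
breakeven_rides <- ceiling(annual_membership / single_ride_price)
breakeven_rides  # 30 with these example prices
```

With real pricing plugged in, this calculation supplies the specific ride count to use in the marketing message.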


## Reflect

* **With the project complete, we're now at the phase of reflection.**

* What have I learned? What can I take away from this project to use in the future?

* This was my first case study to complete as part of my Data Analysis course. I was able to synthesize a lot of information and use the skills I learned for this compact project.

* Using my Data Analysis Checklist from the previous course helped me to align a plan for what steps needed to happen.

* I can now repeat this process with a template and continue to grow my skill set!

* I also learned a bit about the limitations of free online tools like RStudio with regard to how much storage can be used.

* Additionally, I was able to learn more about using R programming to accomplish data visualization.

* Lastly, I was reminded of the importance of managing time and preparing for the unplanned, so that when tools don't work as expected there is still time remaining to complete the work.