# Zoom Through the New Normal Taxi Trips

#### 13 March 2023

Data Cleaning and Exploratory Data Analysis (EDA) on The Last One Year of Chicago Taxi Trips Dataset

As the pandemic slowly recedes, people are starting to travel more. They are revisiting their families and holiday destinations, and with businesses reopening, they are resuming their weekday commutes. Due to those reasons, it will be interesting to see the current mobility patterns: travel patterns after the pandemic.

For this project, taxi rides were chosen as the object of observation. Taxis are modes of transport that are publicly accessible. However, unlike public transportation, there is no schedule and route constraint –a higher possibility that the data possesses information on the trips through day and night in all neighborhoods.

Though it will be interesting to look at the effects of the pandemic on the taxi industry, this project is limited to the last one-year’s taxi trips only. There will be no data analysis for the taxi trips before and during the pandemic (when the stay-at-home order was active).

We will see the current mobility pattern by doing two processes in this project: data cleaning to improve data quality and exploratory data analysis (EDA) to describe the characteristics of the trips.

## Dataset

This project will study the taxi trips in the Chicago Taxi Trips Dataset, one of the datasets available on the Chicago Data Portal. While working on this project, the dataset contains taxi trip data from 2013 to 2023.

We can potentially study when the trip happened, the duration and distance, the pickup and dropoff location, the amount paid, the payment method, and the taxi company from each record of this dataset. Below is the description for all variables arranged into eight broad groups.

- ID

Two IDs in the dataset,*Trip.ID*and*Taxi.ID*, are the unique identifiers for the trip and the taxi. - timestamp
*Trip.Start.TimeStamp*and*Trip.End.Timestamp*show when the trip started and ended. The time is rounded to the nearest 15-minute interval to avoid passenger reidentification. - duration of the trip
*Trip.Seconds*shows trip duration in seconds. - distance of the trip
*Trip.Miles*marks travel distance in miles. - location
*Pickup.Census.Tract*,*Pickup.Community.Area*,*Pickup.Centroid.Latitude*,*Pickup.Centroid.Longitude*, and*Pickup.Centroid.Location*(point of both latitude and longitude) depict the pickup location. At the same time, the*Dropoff.Census.Tract*,*Dropoff.Community.Area*,*Dropoff.Centroid.Latitude*,*Dropoff.Centroid.Longitude*, and*Dropoff.Centroid..Location*(point of both latitude and longitude) represent the dropoff location.

These locations are imprecise to protect the passengers’ privacy. Moreover, the record was blank for locations outside Chicago. - payment amount

The trip costs are in*Fare*,*Tips*,*Tolls*,*Extras*, and*Trip.Total*columns. - payment type
*Payment.Type*variable tells the payment method. - taxi company

The*Company*column has the taxi company information.

As a public dataset, Chicago Taxi Trips Dataset is available for anyone to access and use.

## Prepare

### Select the relevant data

This project aims to clean and perform exploratory data analysis (EDA) for the last one-year data of the Chicago Taxi Trips Dataset. Since I conducted this project in March, the latest one-year data was from March 2022 to February 2023.

After downloading 12 months of records from the Taxi Trip page in the Chicago Data Portal, I managed to get more than 6 million rows of raw data.

### Identify problems in the dataset

To accurately convey the insight from the data, having a clean one is crucial. Not only ensure the reliability of the prospective statistical analysis, early identified and resolved issues will simplify it.

The methods to pinpoint the issues from the data are listed below.

- Confirm the dataset completeness
- Look for duplicate
- Determine irrelevant column
- Locate missing value
- Spot any outliers
- Review datatypes

#### Confirm the dataset completeness

To reasonably illustrate the trip patterns throughout the year (12 months), I must ensure the data covers virtually all the days in each month. For this project, I defined a *month* as complete if it has: (1) the first day of the month, (2) the last day of the month, and (3) 90% of the total days (26 out of 28 days, 27 out of 30 days, or 28 out of 31 days).

```
# Sort dataset based on the time the trips were started,
# then and print (1) the the first and last row, and (2) number of unique days in each month
for (i in 1:12) {
df <- eval(parse(text = paste0("raw_df_", i) )) %>% arrange(Trip.Start.Timestamp)
n_days <- n_distinct(substr(df$Trip.Start.Timestamp, 1, 10)) - 1
print(paste0("first: ", min(df[1, 3]), " - last: ", max(df[nrow(df), 3]), " - n_days: ", n_days))
rm(df)
}
```

The image above shows that, by my definition, the datasets are complete.

#### Look for duplicate

Chicago Taxi Trips Dataset owns a unique primary key, *Trip.ID*, that avoids data duplication. Still, I performed this duplicate finding process to confirm that supposition.

```
for (j in 1:12) {
df <- eval(parse(text = paste0("raw_df_", j)))
if (dim(df %>% distinct())[1] == dim(df)[1]) {
print(paste0("month ", j, ": no duplicate"))
} else {
print(paste0("month ", j, ": has duplicate"))
}
rm(df)
}
```

From the result above, we can conclude that no duplicated row exists in the dataset.

#### Determine irrelevant column

Unique identifications will be unnecessary for later analysis. Thus, I will remove *Trip.ID *and *Taxi.ID*. Furthermore, I will only keep the *Pickup.Centroid.Location* and *Dropoff. Centroid..Location* among all location variables since they already have latitude and longitude information for the census tract or community area.

#### Locate missing value

```
table <- data.frame(raw_df_1)
missing <- table %>% sapply(function(x) sum(is.na(x) | x == ""))
rm(table)
for (i in 2:12) {
table <- eval(parse(text = paste0("raw_df_", i)))
missing <- missing %>% rbind(table %>% sapply(function(x) sum(is.na(x) | x == "" )))
rm(table)
}
```

Two forms of missing values, the no value available (NA) and the blank space (“”), were in this raw dataset.

#### Spot any outliers

Outliers are extreme values in data that were far from the other values. These values can skew the data and affect the accuracy of statistical analyses.

We can spot the outliers using a box plot. Unlike previous identification, I combined datasets from all 12 months to spot outliers.

```
## List of numeric columns
num_cols = list("Trip.Seconds", "Trip.Miles", "Fare", "Tips", "Tolls", "Extras", "Trip.Total")
## Box Plot with outliers marked in red
myplot <- function(col){
raw_df %>% na.omit() %>%
ggplot(aes(y= eval(parse(text = col)))) +
geom_boxplot(outlier.colour = "red", outlier.shape = 1)+
labs (x = col, y = "") +
theme(axis.text.x = element_blank())
}
## Plot all numeric columns
plot_list <- lapply(num_cols, myplot)
ggarrange(plotlist = plot_list,
nrow = 5, ncol = ceiling(length(plot_list)/5))
```

Boxplot shows that all numeric columns in raw dataset contain outliers (the red circles).

#### Review datatype

From the figure above, we can see that some columns have wrong datatypes.

- The
*Trip.Start.Timestamp*and*Trip.End.Timestamp*should have a date-time type instead of a character. - The
*Pickup.Centroid.Location*and*Dropoff.Centroid..Location*must be converted to points; hence we can position them on a map.

## Data Cleaning

After identifying all problems in the data, the next step is to fix them.

### Remove irrelevant columns

This procedure removed ten of all the variables in the data: *Trip.ID*, *Taxi.ID*, *Pickup.Census.Tract*, *Dropoff.Census.Tract*, *Pickup.Community.Area*, *Dropoff.Community.Area*, *Pickup.Centroid.Latitude*, *Pickup.Centroid.Longitude*, *Dropoff.Centroid.Latitude*, and *Dropoff.Centroid.Longitude*.

```
df <- df %>%
subset(select = -c(Trip.ID, Taxi.ID, Pickup.Census.Tract, Dropoff.Census.Tract, Pickup.Community.Area, Dropoff.Community.Area, Pickup.Centroid.Latitude, Pickup.Centroid.Longitude, Dropoff.Centroid.Latitude, Dropoff.Centroid.Longitude))
```

### Handle missing values

I address the missing values problem by deleting the rows containing them. This procedure is doable since the taxi trip datasets have many rows.

```
# Drop missing values
df[df == ""] <- NA
df[df == ""] <- NA
df <- df %>% drop_na()
```

This process removed 896,438 rows (13.64%) of data.

### Handle the outliers

Capping their value is one way to handle outliers: replacing the outliers’ values with the upper bound and lower bound.

This technique utilizes the Interquartile Range (IQR) of the data. The IQR is a measure of the spread of the data. It is the difference between the data’s third quartile (Q3) and the first quartile (Q1). Any data point that falls below *the lower bound *(Q1 – 1.5 * IQR) or above *the upper bound *(Q3 + 1.5 * IQR) is an outlier, and the lower or upper bound can replace their value, respectively.

```
for (col in num_cols) {
Q <- quantile(df[[col]], probs = c(.25, .75))
iqr <- IQR(df[[col]])
upper_bound <- Q[2] + 1.5*iqr
lower_bound <- Q[1] - 1.5*iqr
df[[col]] <- ifelse(df[[col]] > upper_bound, upper_bound, df[[col]])
df[[col]] <- ifelse(df[[col]] < lower_bound, lower_bound, df[[col]])
}
```

No red circles on the boxplots indicate no outliers —their values have been substituted with *lower* or *upper bound* values.

### Change columns datatypes

#### timestamp

The datatype of *Trip.Start.Timestamp* and *Trip.End.Timestamp* were converted from a string into a date-time datatype using* the lubridate* library.

```
df$Trip.Start.Timestamp <- df$Trip.Start.Timestamp %>% parse_date_time("%m/%d/%y %I:%M:%S %p", tz = "America/Chicago")
df$Trip.End.Timestamp <- df$Trip.End.Timestamp %>% parse_date_time("%m/%d/%y %I:%M:%S %p", tz = "America/Chicago")
```

The new datatypes for Trip.End.Timestamp and Trip.End.Timestamp are *POSIXct*.

#### location

The string on *Pickup.Centroid.Location* and *Dropoff.Centroid..Location* were converted to latitude-longitude points by utilizing the *simple feature* library.

```
df <- df %>% st_as_sf(wkt = "Pickup.Centroid.Location")
df <- df %>% data.frame() %>% st_as_sf(wkt = "Dropoff.Centroid..Location")
```

The new datatypes for ‘Pickup.Centroid.Location’ and ‘Dropoff.Centroid..Location’ are *sfc_POINT*.

## Exploratory data analysis (EDA)

### Summary statistics

The median, mean, and boxplot show that all the variables have right skew distribution except Tolls that only have value of zero.

##### Trip.Seconds (TRAVEL TIME)

The average taxi trip took almost 18 minutes, and nearly half were less than 13 minutes.

##### Trip.Miles (TRAVEL DISTANCE)

The average travel distance was around 5.3 miles, and half were less than 2^{1}/_{4} miles. Less than ^{1}/_{4} of trips were farther than 9.5 miles.

##### Fare

On average, one trip by taxi costs $19.3. Half of the trips cost between $7.5 to $30.

##### Tips

On average, passengers gave $2.2 tips, about $1 more than the median.

##### Tolls

The passengers paid no toll.

##### Extras

On seventy-five percent of the trips, passengers paid $1 or fewer extra payments.

##### Trip.Total (TOTAL Payment)

The median of total payment was almost $15, nearly $3 more than the median fare. About a quarter of the trips, passengers paid more than $32.

### Distribution of values in a *single variable*

#### Date and time

- Each bar on the graph above covers one week, and the height indicates the number of trips in each time range. The histogram starts during the week of March 1, 2022, and ends during the week of February 28, 2023. The week starts on Sunday and ends on Saturday.
- The first week and the last off were incomplete, hence the lack of trips compared to the closest week.
- Interestingly, compared to the week before and after, there were fewer taxi trips during the holidays: Easter, Memorial Day, 4t of July, Labor Day, Thanksgiving, and Christmas.
- There was an unknown reason for the decreasing number of trips in the second week of January 2023.

#### Travel time

- The histogram is right-skewed.
- The most considerable frequency of trips is in the 7-9 minutes range.
- More than half of the trip was less than 15 minutes.
- The data cleaning process substituted outliers’ values with the upper bound. The increased number of trips at the end of the histogram means that the histogram will have quite a long tail without the substitution.

#### Travel distance

- The taxis were unlikely to travel more than 20 miles.
- The highest number of taxi trips is in the 0.5 to 1.5 miles range.

#### Fare

- The highest number of trips is in the $8.5 to $9.5 range, with a longer tail to the right than the left. There are also smaller peaks at $29.5 to $30.5 and the $43.5 to $44.5 range. The occurrence of several peaks in the histogram, known as multimodal, needs further investigation beyond the exploratory data analysis.
- In this graph, the increase in the histogram’s end due to substituting outliers’ values with the upper bound was also observable.

#### Tips

- On all trips, the passenger likely gave no tips to the taxi driver.

#### Extras

- In most trips, taxi riders paid no extra charge.

#### Total Payment

- Similar to the fare distribution, the total payment distribution has three peaks. The comparison makes it more apparent that three groups of unknown categories differentiate the fare. In the first group, the total payment (peak in the $10.5 to $11.5 range) is noticeably higher than the fare (peak in the $8.5 to $9.5 range); in the second group, the total payment (peak in $29.5 to $30.5 range) is almost equal to the fare (peak in $29.5 to $30.5 range), while in the last group, the total payment is remarkably higher than the fare (peak in $43.5 to $44.5 range).

#### Payment.Type

The most popular payment method was credit cards, while the least was prepaid.

#### Company

The two most popular taxi companies were Taxi Affiliation Services and Flash Cab.

#### Pickup.Centroid.Location

#### Dropoff.Centroid..Location

### Relationship between *two variables*

The interaction between the variables is in the figure above.

- The two variables with the highest correlation (0.97) were fare and total payment —meaning any other costs, e.g., tips, toll, and extras, insignificantly affected the whole amount.

- The distance (corr
_{distance.fare}= 0.85, corr_{distance.total}= 0.83) slightly steered the travel cost compared to the duration of the trip (corr_{duration.fare}= 0.82, corr_{duration.total}= 0.79).

- The weakest correlation between numeric variables was between the trip duration and tips.

The correlation mentioned above will be discussed more in the next section.

#### Fare vs. total cost

- Fare and total cost were significantly related. The observations condensed around the linear regression line.

- The blank space in the right below quadrant indicates that all total costs were equal or more pricey than the fare.

#### Fare vs. travel distance and duration

- The fare had a positive correlation with both travel distance and duration.
- The fare rose more constantly as the travel distance expanded compared to the travel duration extended. There was more variation in the fare if a passenger traveled longer in time.

The traffic probably causes it. If a passenger wants to reach the same destination, heavier traffic will slow down the trip than usual; hence same fare but longer travel time.

#### Travel duration vs. distance

#### Trip.Seconds vs Tips

## References and Cleaned Data

### References

avoid passenger reidentification

Exploratory data analysis (EDA, full code)