Chapter 3 Data transformation
3.1 Food delivery data
Note: Given that we have multiple types of data obtained through different sources, the workload of the data cleaning phase is significant.
For food delivery data, we removed unrelated columns and selected rows based on multiple factor levels - “Online Food Delivery” for Market, and “Revenue” and “Users by Segment” for Chart. We create a new vector called “YEAR”, from the information stored in the specifying columns from 3 to 7. Then, we complete the tidying by pivoting “Chart” and “value” to make “Revenue” and “Users by Segment” columns.
## # A tibble: 6 x 3
## region all_food_delivery online_food_delivery
## <chr> <dbl> <dbl>
## 1 United States 0.95 1.23
## 2 Russia 0.89 1.17
## 3 Canada 0.69 1.13
## 4 Australia 0.61 0.61
## 5 Brazil 0.51 0.68
## 6 Japan 0.43 0.68
## # A tibble: 6 x 2
## Year food_delivery_share_in_food_service
## <chr> <dbl>
## 1 2013 0.061
## 2 2014 0.064
## 3 2015 0.067
## 4 2016 0.069
## 5 2017 0.072
## 6 2018 0.076
## # A tibble: 6 x 4
## Name YEAR Revenue `Users by Segment`
## <chr> <chr> <dbl> <dbl>
## 1 Total 2017 23.4 81.2
## 2 Total 2018 26.5 89.6
## 3 Total 2019 31.0 101.
## 4 Total 2020 44.1 132.
## 5 Total 2021 56.9 150.
## 6 Platform-to-Consumer Delivery 2017 9.26 40.1
3.2 Supply data
3.2.1 Education
Due to the changes in the information collection process, some data were treated as missing data. For the breakdown of education levels in the dataset, it will be a large and unnecessary workload in future comparative analysis, so we divided the education level into 9 major categories. After this, we created the “YEAR” variable as we did in the previous sections and assign the summation of value to it.
## # A tibble: 6 x 3
## Year degree value
## <chr> <chr> <dbl>
## 1 2010 Associate's degree, academic 9725
## 2 2010 Associate's degree, vocational 8533
## 3 2010 Bachelor's degree1 30162
## 4 2010 College, no degree 33662
## 5 2010 Elementary or High school, GED 62457
## 6 2010 Elementary or High school, no diploma 25712
3.2.2 Unemployment
In this section, the data we collected is on monthly basis. To maintain consistency of the previous data, we calculated the average unemployment rate by year.
3.2.4 Working earning
- Combine unemployment, restaurant number and working earning
## # A tibble: 6 x 4
## Year working_earning restaurant_number avg_unemployment
## <dbl> <dbl> <dbl> <dbl>
## 1 2010 11.5 555688 9.62
## 2 2011 11.7 565632 8.95
## 3 2012 11.9 577281 8.07
## 4 2013 12.1 584843 7.38
## 5 2014 12.5 593346 6.17
## 6 2015 12.9 602294 5.29
3.3 Demand data
3.3.1 Smartphone ownership
Removed irrelevant rows
Changed column names and date time format into desired data type
## # A tibble: 6 x 2
## Date proportion
## <date> <dbl>
## 1 2011-04-01 0.35
## 2 2012-01-01 0.45
## 3 2012-03-01 0.46
## 4 2012-08-01 0.45
## 5 2012-10-01 0.47
## 6 2012-11-01 0.45
3.3.2 Food expenditures
Removed irrelevant columns
## # A tibble: 6 x 3
## Year `Food at home` `Food away from home`
## <dbl> <dbl> <dbl>
## 1 1997 353664. 265069.
## 2 1998 364084. 278004.
## 3 1999 386318. 292034.
## 4 2000 404815. 313060.
## 5 2001 423349. 326063.
## 6 2002 428442. 333839.
3.3.4 CPI
After removing the irrelevant columns, we converted the multiple monthly data of “CPI for all” and “CPI for food” to the new column. Then, we adjusted the format of the time to “Year” and “Month”. Since the two datasets now have a consistent scale and format, we merge the two data frames together by time.
- Combine CPI and working hours
## # A tibble: 6 x 6
## Year Month food_cpi date all_cpi week_working_hour
## <dbl> <chr> <dbl> <date> <dbl> <dbl>
## 1 2012 Jan 0.6 2012-01-01 0.4 34.5
## 2 2012 Feb 0 2012-02-01 0.4 34.5
## 3 2012 Mar 0.1 2012-03-01 0.8 34.4
## 4 2012 Apr 0.2 2012-04-01 0.3 34.5
## 5 2012 May 0.1 2012-05-01 -0.1 34.4
## 6 2012 Jun 0.1 2012-06-01 -0.1 34.4