Chapter 3 Data transformation

3.1 Food delivery data

Note: Given that we have multiple types of data obtained through different sources, the workload of the data cleaning phase is significant.

For food delivery data, we removed unrelated columns and selected rows based on multiple factor levels - “Online Food Delivery” for Market, and “Revenue” and “Users by Segment” for Chart. We create a new vector called “YEAR”, from the information stored in the specifying columns from 3 to 7. Then, we complete the tidying by pivoting “Chart” and “value” to make “Revenue” and “Users by Segment” columns.

## # A tibble: 6 x 3
##   region        all_food_delivery online_food_delivery
##   <chr>                     <dbl>                <dbl>
## 1 United States              0.95                 1.23
## 2 Russia                     0.89                 1.17
## 3 Canada                     0.69                 1.13
## 4 Australia                  0.61                 0.61
## 5 Brazil                     0.51                 0.68
## 6 Japan                      0.43                 0.68
## # A tibble: 6 x 2
##   Year  food_delivery_share_in_food_service
##   <chr>                               <dbl>
## 1 2013                                0.061
## 2 2014                                0.064
## 3 2015                                0.067
## 4 2016                                0.069
## 5 2017                                0.072
## 6 2018                                0.076
## # A tibble: 6 x 4
##   Name                          YEAR  Revenue `Users by Segment`
##   <chr>                         <chr>   <dbl>              <dbl>
## 1 Total                         2017    23.4                81.2
## 2 Total                         2018    26.5                89.6
## 3 Total                         2019    31.0               101. 
## 4 Total                         2020    44.1               132. 
## 5 Total                         2021    56.9               150. 
## 6 Platform-to-Consumer Delivery 2017     9.26               40.1

3.2 Supply data

3.2.1 Education

Due to the changes in the information collection process, some data were treated as missing data. For the breakdown of education levels in the dataset, it will be a large and unnecessary workload in future comparative analysis, so we divided the education level into 9 major categories. After this, we created the “YEAR” variable as we did in the previous sections and assign the summation of value to it.

## # A tibble: 6 x 3
##   Year  degree                                value
##   <chr> <chr>                                 <dbl>
## 1 2010  Associate's degree, academic           9725
## 2 2010  Associate's degree, vocational         8533
## 3 2010  Bachelor's degree1                    30162
## 4 2010  College, no degree                    33662
## 5 2010  Elementary or High school, GED        62457
## 6 2010  Elementary or High school, no diploma 25712

3.2.2 Unemployment

In this section, the data we collected is on monthly basis. To maintain consistency of the previous data, we calculated the average unemployment rate by year.

3.2.3 Restaurant number

Removing irrelevant columns.

3.2.4 Working earning

  • Combine unemployment, restaurant number and working earning
## # A tibble: 6 x 4
##    Year working_earning restaurant_number avg_unemployment
##   <dbl>           <dbl>             <dbl>            <dbl>
## 1  2010            11.5            555688             9.62
## 2  2011            11.7            565632             8.95
## 3  2012            11.9            577281             8.07
## 4  2013            12.1            584843             7.38
## 5  2014            12.5            593346             6.17
## 6  2015            12.9            602294             5.29

3.3 Demand data

3.3.1 Smartphone ownership

Removed irrelevant rows

Changed column names and date time format into desired data type

## # A tibble: 6 x 2
##   Date       proportion
##   <date>          <dbl>
## 1 2011-04-01       0.35
## 2 2012-01-01       0.45
## 3 2012-03-01       0.46
## 4 2012-08-01       0.45
## 5 2012-10-01       0.47
## 6 2012-11-01       0.45

3.3.2 Food expenditures

Removed irrelevant columns

## # A tibble: 6 x 3
##    Year `Food at home` `Food away from home`
##   <dbl>          <dbl>                 <dbl>
## 1  1997        353664.               265069.
## 2  1998        364084.               278004.
## 3  1999        386318.               292034.
## 4  2000        404815.               313060.
## 5  2001        423349.               326063.
## 6  2002        428442.               333839.

3.3.3 Working hour

Changed the date time format into desired date type

Removed irrelevant Columns

3.3.4 CPI

After removing the irrelevant columns, we converted the multiple monthly data of “CPI for all” and “CPI for food” to the new column. Then, we adjusted the format of the time to “Year” and “Month”. Since the two datasets now have a consistent scale and format, we merge the two data frames together by time.

  • Combine CPI and working hours
## # A tibble: 6 x 6
##    Year Month food_cpi date       all_cpi week_working_hour
##   <dbl> <chr>    <dbl> <date>       <dbl>             <dbl>
## 1  2012 Jan        0.6 2012-01-01     0.4              34.5
## 2  2012 Feb        0   2012-02-01     0.4              34.5
## 3  2012 Mar        0.1 2012-03-01     0.8              34.4
## 4  2012 Apr        0.2 2012-04-01     0.3              34.5
## 5  2012 May        0.1 2012-05-01    -0.1              34.4
## 6  2012 Jun        0.1 2012-06-01    -0.1              34.4