Final Group Project: AirBnB analytics
In your final group assignment you will analyse data on Airbnb listings and fit a model to predict the total cost for two people staying four nights in an Airbnb in a given city. You can download Airbnb data from insideairbnb.com; it was originally scraped from airbnb.com.
Each study group should enter three choices in the AirBnB Analytics Project googlesheet as 1 (first choice), 2 (second choice), and 3 (third choice). I will announce which groups are assigned to which city at the beginning of session 9.
The listings come as a GZ file, i.e., an archive compressed with the standard GNU zip (gzip) algorithm. You can download, save, and extract the file if you want, but vroom::vroom() or readr::read_csv() can read and extract this kind of file directly. Prefer vroom(), as it is faster; if vroom is blocked by a firewall, use read_csv() instead.
As an example, if you wanted to get the listings for Munich, you just type
library(vroom)
library(janitor)   # for clean_names()

listings <- vroom("http://data.insideairbnb.com/germany/bv/munich/2020-06-20/data/listings.csv.gz") %>%
  clean_names()
The dataframe contains many variables; here is a quick description of some of them, with cost data typically expressed in US$:

- price: cost per night
- cleaning_fee: cleaning fee
- extra_people: charge for having more than 1 person
- property_type: type of accommodation (House, Apartment, etc.)
- room_type:
  - Entire home/apt (guests have the entire place to themselves)
  - Private room (guests have a private room to sleep in; all other rooms are shared)
  - Shared room (guests sleep in a room shared with others)
- number_of_reviews: total number of reviews for the listing
- review_scores_rating: average review score (0 - 100)
- longitude, latitude: geographical coordinates to help us locate the listing
- neighbourhood*: three variables on a few major neighbourhoods in each city
Exploratory Data Analysis (EDA)
In the R4DS Exploratory Data Analysis chapter, the authors state:
“Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation…EDA is fundamentally a creative process. And like most creative processes, the key to asking quality questions is to generate a large quantity of questions.”
Conduct a thorough EDA. Recall that an EDA involves three things:

- Looking at the raw values, with dplyr::glimpse()
- Computing summary statistics of the variables of interest, or finding NAs, with mosaic::favstats() and skimr::skim()
- Creating informative visualizations with ggplot2::ggplot(): geom_histogram() or geom_density() for numeric continuous variables, geom_bar() or geom_col() for categorical variables, and GGally::ggpairs() for a scatterplot/correlation matrix. Note that you can add transparency to points/density plots in the aes call, for example aes(colour = gender, alpha = 0.4).
You may wish to have a level 1 header (#) for your EDA, then use level 2 sub-headers (##) to make sure you cover all three EDA bases. At a minimum you should address these questions:
- How many variables/columns? How many rows/observations?
- Which variables are numbers?
- Which are categorical or factor variables (numeric or character variables that have a fixed and known set of possible values)?
- What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?
At this stage, you may also find you want to use filter, mutate, arrange, select, or count. Let your questions lead you!
In all cases, please think about the message your plot is conveying. Don’t just say “This is my X-axis, this is my Y-axis”, but rather what’s the so what of the plot. Tell some sort of story and speculate about the differences in the patterns in no more than a paragraph.
Data wrangling
Once you load the data, it’s always a good idea to use glimpse to see what kind of variables you have and what data type (chr, num, logical, date, etc) they are.
Notice that some of the price data (price, cleaning_fee, extra_people) is given as a character string, e.g., “$176.00”
Since price is a quantitative variable, we need to make sure it is stored as numeric (num) in the dataframe. To do so, we will use readr::parse_number(), which drops any non-numeric characters before or after the first number.
listings <- listings %>%
mutate(price = parse_number(price))
Use typeof(listings$price) to confirm that price is now stored as a number.
Handling missing values (NAs)
Use the skimr::skim() function to view a summary of the cleaning_fee data. This variable is also stored as a character, so you have to turn it into a number, as discussed earlier.

- How many observations have missing values for cleaning_fee?
- What do you think is the most likely reason for the missing observations of cleaning_fee? In other words, what does a missing value of cleaning_fee indicate?

cleaning_fee is an example of data that is missing not at random, since there is a specific pattern/explanation to the missing data.
Fill in the code below to impute the missing values of cleaning_fee with an appropriate numeric value. Then use skimr::skim() function to confirm that there are no longer any missing values of cleaning_fee.
listings <- listings %>%
mutate(cleaning_fee = case_when(
is.na(cleaning_fee) ~ ______,
TRUE ~ cleaning_fee
))
Next, we look at the variable property_type. We can use the count function to determine how many categories there are and their frequency. What are the top 4 most common property types? What proportion of the total listings do they make up?
Since the vast majority of the observations in the data are one of the top four property types, we would like to create a simplified version of the property_type variable with 5 categories: the top four and Other. Fill in the code below to create prop_type_simplified.
listings <- listings %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Apartment","______", "______","______") ~ property_type,
TRUE ~ "Other"
))
Use the code below to check that prop_type_simplified was correctly made.
listings %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n))
Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:
- What are the most common values for the variable minimum_nights?
- Is there any value among the common values that stands out?
- What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?
Filter the Airbnb data so that it only includes observations with minimum_nights <= 4.
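A minimal sketch of this filter, assuming listings is the cleaned dataframe created above:

listings <- listings %>%
  filter(minimum_nights <= 4)   # keep only listings intended for travel purposes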
Mapping
Visualisations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. While we do not have time to go into all the wonderful geospatial visualisations one can do with R, you can use the following code to start with a map of your city and overlay all Airbnb coordinates to get an overview of the spatial distribution of Airbnb rentals. For this visualisation we use the leaflet package, which includes a variety of tools for interactive maps, so you can easily zoom in and out, click on a point to get the actual Airbnb listing for that specific point, etc.

The following code, having created a dataframe listings with all Airbnb listings in Munich, will plot on the map all Airbnbs where minimum_nights is less than or equal to four (4). You can learn more about leaflet by following the relevant DataCamp course on mapping with leaflet.
library(leaflet)

leaflet(data = filter(listings, minimum_nights <= 4)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)
Regression Analysis
For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.
Create a new variable called price_4_nights that uses price, cleaning_fee, guests_included, and extra_people to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
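One possible construction is sketched below. It assumes extra_people is a per-night charge for each guest beyond guests_included, that cleaning_fee is charged once per stay, and that all three price variables have already been parsed to numbers; check these assumptions against your data before using it.

listings <- listings %>%
  mutate(
    # per-night charge for the second guest, if not already included
    extra_charge = if_else(guests_included >= 2, 0, extra_people),
    # 4 nights for 2 people, plus a one-off cleaning fee
    price_4_nights = 4 * (price + extra_charge) + cleaning_fee
  )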
Use histograms or density plots to examine the distributions of price_4_nights and log(price_4_nights). Which variable should you use for the regression model? Why?
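To compare the two candidate distributions side by side, you could sketch something like:

ggplot(listings, aes(x = price_4_nights)) +
  geom_density() +
  labs(title = "Distribution of price_4_nights")

ggplot(listings, aes(x = log(price_4_nights))) +
  geom_density() +
  labs(title = "Distribution of log(price_4_nights)")

Skewed cost data often looks far more symmetric on the log scale, which is one common argument for modelling the logged outcome.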
Fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
- Interpret the coefficient of review_scores_rating in terms of price_4_nights.
- Interpret the coefficients of prop_type_simplified in terms of price_4_nights.
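A sketch of model1, shown here with log(price_4_nights) as the outcome; use whichever form of the outcome your EDA supported:

model1 <- lm(log(price_4_nights) ~ prop_type_simplified +
               number_of_reviews + review_scores_rating,
             data = listings)
summary(model1)

With a logged outcome, a coefficient b on review_scores_rating can be read as an approximate 100*b% change in price_4_nights for a one-point increase in the rating, holding the other predictors constant.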
We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.
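model2 can then be fitted by adding room_type to the formula for model1 (again assuming a logged outcome):

model2 <- lm(log(price_4_nights) ~ prop_type_simplified +
               number_of_reviews + review_scores_rating + room_type,
             data = listings)
summary(model2)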
Further variables/questions to explore on our own
Our dataset has many more variables, so here are some ideas on how you can extend your analysis
- Are the number of bathrooms, bedrooms, beds, or the size of the house (accommodates) significant predictors of price_4_nights?
- Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
- Most owners advertise the exact location of their listing (is_location_exact == TRUE), while a non-trivial proportion don't. After controlling for other variables, is a listing's exact location a significant predictor of price_4_nights?
- For all cities, there are 3 variables that relate to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn't make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings fall in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variable neighbourhood_simplified and determine whether location is a predictor of price_4_nights.
- What is the effect of cancellation_policy on price_4_nights, after we control for other variables?
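The neighbourhood grouping can be sketched with case_when(); the neighbourhood names below are purely hypothetical placeholders, so replace them with the actual values found in your city's neighbourhood_cleansed column:

listings <- listings %>%
  mutate(neighbourhood_simplified = case_when(
    # hypothetical groupings -- substitute your city's real neighbourhood names
    neighbourhood_cleansed %in% c("Altstadt", "Maxvorstadt")  ~ "Centre",
    neighbourhood_cleansed %in% c("Schwabing", "Haidhausen") ~ "Inner ring",
    TRUE                                                     ~ "Other"
  ))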
Diagnostics, collinearity, summary tables
As you keep building your models, it makes sense to:
- Check the residuals, using autoplot(model_x).
- As you start building models with more explanatory variables, make sure you use car::vif(model_x) to calculate the Variance Inflation Factor (VIF) for your predictors and determine whether you have collinear variables. A general guideline is that a VIF larger than 5 or 10 is large, and your model may suffer from collinearity. Remove the variable in question and run your model again without it.
- Create a summary table, using huxtable (https://mam2021.netlify.app/example/modelling_side_by_side_tables/), that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.
- Finally, you must use the best model you came up with for prediction. Suppose you are planning to visit the city you have been assigned to over reading week, and you want to stay in an Airbnb. Find Airbnbs that are apartments with a private room, have at least 10 reviews, and an average rating of at least 90. Use your best model to predict the total cost to stay at such an Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights.
  - If you used a log(price_4_nights) model, make sure you anti-log to convert the value into $.
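A sketch of the prediction step, assuming your chosen model is stored in a (hypothetical) object called best_model and was fitted on log(price_4_nights):

# filter to the scenario described above
candidates <- listings %>%
  filter(prop_type_simplified == "Apartment",
         room_type == "Private room",
         number_of_reviews >= 10,
         review_scores_rating >= 90)

# point prediction and 95% prediction interval, on the log scale
pred <- predict(best_model, newdata = candidates,
                interval = "prediction", level = 0.95)

# anti-log to convert the fit and interval back to dollars
exp(head(pred))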
Deliverables
By midnight on Tuesday 21 Sep 2021, you must upload on Canvas a short presentation (max 4-5 slides) with your findings, as some groups will be asked to present in class. You should present your Exploratory Data Analysis, as well as your best model. In addition, you must upload on Canvas your final report, written using R Markdown to introduce, frame, and describe your story and findings. You should include the following in the report:
- Executive summary
- Background information and summary of the data
- Explanation, description, and code for each individual plot/table/regression model
- Summary of the process of analysis
- Comparison of various models, using huxtable::huxreg()
- Rationale for the final model; significance and some diagnostics
- Applying your model to predict price_4_nights
Remember to follow R Markdown etiquette rules and style; don’t have the Rmd output extraneous messages or warnings, include summary tables in nice tables (use kableExtra), and remove any placeholder texts from past Rmd templates.
Performing peer review
This group assignment will be graded on a rubric. Part of your grade (5/30) will be the outcome of a peer review as discussed below. As part of this assignment you will be reviewing, commenting on, and marking other students’ assignments. Specifically, you will email the Rmd of your work to the group who worked on the same city, and you will have to assess the Rmd they send you.
Deadlines for peer review
- By midnight on Thursday 17 Sep 2020, you must email your Rmd to another group indicated by Kostis. The other group are likely to have worked on the same city.
- By midnight on Friday, 18 Sep 2020, you must email Kostis your peer review rubric with your assessment and comments.
The allocation of study groups (SG) for peer review is as follows:
| Group | Reviewed by group |
|---|---|
| 1 | 10 |
| 10 | 14 |
| 14 | 1 |
| 2 | 12 |
| 12 | 13 |
| 13 | 2 |
| 3 | 8 |
| 8 | 11 |
| 11 | 3 |
| 4 | 5 |
| 5 | 15 |
| 15 | 4 |
| 6 | 7 |
| 7 | 9 |
| 9 | 6 |
This means that Group 1 should send their work to Group 10, group 10 to Group 14, etc.
How to do peer review well
- Give thoughtful, constructive and considerate comments.
- Be specific and concise.
- Use the rubric for ideas about criteria to evaluate and comment on.
- Try to learn something new and, if you succeed, point that out.
- If you can’t find anything to praise or that you found helpful, then at least offer some suggestions in a kind way.
To ensure reproducibility, you might find it useful to attempt to run your classmates’ Rmd. If you cannot execute them, then the code is not reproducible. Also be aware your classmates will hold you to a similar standard.
How to do peer review badly
- Your review is so generic that it’s hard to determine which assignment you’re reviewing.
- Your review is mean and nasty.
- You can’t find anything to praise/learn and yet you don’t offer any suggestions either.
Performing good peer review is difficult! It’s easy to criticise and tear down others’ work and find flaws. We need to be better at this and not just criticise, but highlight good aspects and suggest how to improve the work.
Here is a rubric you can use, but you must also provide your own constructive comments:
| Category | Needs Work | Satisfactory | Excellent |
|---|---|---|---|
| Code | Code is poorly written and not documented. | Easy to follow (both the code, its documentation, and the output). | Code is well-documented (both self-documented and with additional comments as necessary). |
| Data Visualization | Limited attempts to visualize the data. | Visualizations are straightforward and provide some insight into the data. | Visualizations are informative, insightful, and visually appealing. |
| Analysis | Missing either a comparison of different models or an analysis of the performance of the model. | Some comparison of different models and some analysis of the performance of the model. | Solid comparison of several models and thorough analysis of the performance of the final model. |
| Correctness of Results | Many flaws in the analysis; results are unreliable. | Results are correct with no flaws (or only trivial flaws). | Results are not only correct, but the analyses were also technically challenging. |
| Submission Organization | Explicit setwd(); Rmd does not knit | Rmd knits, but with warnings and messages shown in the final report | Used an RStudio project and the here package to organise work; Rmd knits with no issues and contains no messages or warnings |
Remarks:
- Elaborate on above, especially for “needs work.”
- Some specific praise?
- Something I learned?
- Specific constructive criticism?
- Something I know and that you, my peer, might like to know because it is relevant to something you struggled with.
Acknowledgements
- The data from this lab is from insideairbnb.com
- The material on peer review is derived in part from UBC STAT 545, licensed under the CC BY-NC 3.0 Creative Commons License.