Data Analysis of NYC Airbnb

Project Report

By: Fullwin Liang

1. Can we predict the rental prices grouped by room type for an Airbnb?

This project analyzes public data from Airbnb and is aimed at consumers and professionals alike. We address several questions, such as “What type of rooms do people rent out the most?”, “Where are the most expensive or least expensive rooms located?”, and “What are the average prices of rooms in each borough of NYC?” In addition to answering these, we compare prices across room types and boroughs. We answer these questions using Exploratory Data Analysis (EDA), and we then construct a predictive model using machine learning tools. The purpose of this model is to predict the rental price of an Airbnb by room type.

2. Data Description:

The Airbnb data set under study was obtained from Kaggle and covers only Airbnb listings in NYC. It contains 48,895 observations and 16 variables. The variables include User “ID”, “Name”, “Host ID”, “Host Name”, “Neighborhood Group” (i.e. borough), coordinate location (“latitude” and “longitude”), “Room Type” (i.e. entire home/apartment, private room, or shared room), “Price”, “Host Listing Counts”, “Yearly Availability of the Room”, “Minimum Nights” that one could stay, “Number of Reviews”, and “Reviews per Month.”

After obtaining the dataset, we ran into several issues that could affect our analysis: input errors, null values, and extraneous data. To address these issues, we had to clean the data. This involved identifying and distinguishing the types of errors (i.e. input errors and null values) and determining whether any of the affected data was salvageable. In the case of price, we found input errors such as a price equal to zero; we deleted those observations since they could not meaningfully contribute to our analysis.

A portion of the data was also irrelevant to our questions, so we eliminated those variables. They most likely existed only to ease previous internal analysis of the data or to help catalog user information. Among the eliminated variables were the User ID and Host Name. Since these offered no meaningful insight, we reduced the data set so that we would not have to consider factors that did not contribute to our analysis. The variable “reviews_per_month” has missing values in about 20.56% of the observations. We assumed that these values are null because the listing has received 0 reviews, and therefore set them to 0, which removes the null values from the dataset. As noted above, we also removed the observations where the price is equal to 0; these are most likely errors, since we would not expect any free Airbnbs in NYC.
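
The cleaning steps described above can be sketched in a few lines. The report's analysis was carried out in R; the Python below is only an illustrative sketch on a hypothetical three-row sample. The column names (`id`, `host_name`, `price`, `reviews_per_month`) are assumed to match the Kaggle CSV header, and the helper `clean` is a made-up name.

```python
# Hypothetical mini-sample; the values are invented for illustration.
listings = [
    {"id": 1, "host_name": "A", "price": 150, "reviews_per_month": 0.21},
    {"id": 2, "host_name": "B", "price": 0, "reviews_per_month": 0.10},
    {"id": 3, "host_name": "C", "price": 80, "reviews_per_month": None},
]

# Variables the report drops as irrelevant to the analysis.
DROPPED_COLUMNS = {"id", "host_name"}


def clean(rows):
    """Drop zero-price rows, drop irrelevant columns, and zero-fill missing review rates."""
    cleaned = []
    for row in rows:
        if row["price"] <= 0:
            continue  # a price of 0 is treated as an input error
        kept = {k: v for k, v in row.items() if k not in DROPPED_COLUMNS}
        if kept["reviews_per_month"] is None:
            kept["reviews_per_month"] = 0.0  # assumption: null means no reviews
        cleaned.append(kept)
    return cleaned


cleaned = clean(listings)
missing_share = sum(r["reviews_per_month"] is None for r in listings) / len(listings)
print(f"{len(cleaned)} rows kept; {missing_share:.2%} of the raw rows lacked reviews_per_month")
```

Setting missing values to 0 is only reasonable under the stated assumption that a null rate means the listing has no reviews; it would be worth checking this against “Number of Reviews” before relying on it.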

3. Exploratory Data Analysis (Discrete Variables)

Figure 1.1

Figure 1.1 gives a rough picture of where Airbnbs are located. Each borough is shown in a distinct color. Staten Island, colored in purple, has greater distances between its Airbnbs than the other boroughs. This could be due to Staten Island having the lowest population density of the five boroughs, and therefore fewer Airbnb listings.

Figure 1.2

Figure 1.2 presents a bar chart of the number of Airbnbs located in Manhattan, Brooklyn, Queens, the Bronx, and Staten Island. The counts are 21,661, 20,104, 5,666, 1,091, and 373, respectively.

Figure 1.3

Figure 1.3 shows the count of rooms located in Manhattan, Brooklyn, Queens, the Bronx, and Staten Island, colored by room type: “entire home/apartment”, “private room”, and “shared room.” Based on these charts, we conclude that Manhattan is the borough with the most Airbnbs and that the most popular room type is “entire home/apartment”.

4. Continued EDA on Continuous Variables

Now that we have answered some pertinent questions regarding our discrete variables, such as the quantity of each room type and the number of Airbnbs in different boroughs, we shift our focus to our continuous variable, price. We now ask which borough has the most expensive and the least expensive Airbnb rooms. We can answer this by creating a boxplot that compares the price ranges of the boroughs.

Figure 1.4

Figure 1.4 indicates that Manhattan has the highest median, maximum, minimum, and interquartile range for price out of all boroughs. Therefore, we can say that Airbnbs located in Manhattan generally tend to cost more. The borough with the second highest median and interquartile range is Brooklyn. The remaining boroughs vary in how they rank by median and interquartile range.
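
The statistics a boxplot displays (median, interquartile range, minimum, maximum) can also be computed directly per borough. The following is a minimal Python sketch, not the report's R code. The prices are hypothetical, the key names `neighbourhood_group` and `price` are assumed to follow the Kaggle CSV, and `price_summary_by_group` is an illustrative helper.

```python
from collections import defaultdict
from statistics import median, quantiles


def price_summary_by_group(rows, group_key="neighbourhood_group", value_key="price"):
    """Return median, IQR, min and max of value_key for each value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    summary = {}
    for name, values in groups.items():
        # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        summary[name] = {
            "median": median(values),
            "iqr": q3 - q1,
            "min": min(values),
            "max": max(values),
        }
    return summary


# Hypothetical prices, not taken from the data set.
sample = (
    [{"neighbourhood_group": "Manhattan", "price": p} for p in (100, 150, 200, 250, 300)]
    + [{"neighbourhood_group": "Bronx", "price": p} for p in (50, 60, 70, 80, 90)]
)
summary = price_summary_by_group(sample)
for borough, stats in summary.items():
    print(borough, stats)
```

Comparing the `median` and `iqr` entries across boroughs gives the same ranking one would read off the boxplot.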

Figure 1.5

Figure 1.6

Figure 1.5 shows a skewed distribution: the majority of the prices lie on the left, with a long tail to the right (i.e. the distribution is positively skewed). We can therefore apply a log transformation to price to obtain more meaningful information. Notice the change in distribution in Figure 1.6, which suggests that price is approximately log-normally distributed. Note that we scaled by log(x) rather than ln(x). Below we calculate some summary statistics from the data.

Price of Airbnb ($)
Min.        10.0
1st Qu.     69.0
Median     106.0
Mean       152.8
3rd Qu.    175.0
Max.     10000.0

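From this we can see that, on average, an Airbnb costs $152.80 regardless of borough or room type. The mean is well above the median of $106.00, which is consistent with the positive skew noted above.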
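
The effect of the log transformation on a right-skewed price distribution can be checked numerically with a skewness coefficient. This is a hedged Python sketch rather than the report's R code: the prices are hypothetical, base 10 is assumed for the logarithm, and `skewness` is an illustrative helper computing the third standardized moment.

```python
import math
from statistics import fmean, pstdev


def skewness(values):
    """Third standardized moment; positive values indicate a long right tail."""
    mu = fmean(values)
    sigma = pstdev(values)
    return fmean([((v - mu) / sigma) ** 3 for v in values])


# Hypothetical prices with one extreme listing, mimicking a long right tail.
prices = [10, 45, 69, 80, 106, 120, 175, 250, 900, 10000]
log_prices = [math.log10(p) for p in prices]  # assumption: base-10 logarithm

print(f"skewness of price:        {skewness(prices):.2f}")
print(f"skewness of log10(price): {skewness(log_prices):.2f}")
```

A skewness closer to zero after the transformation is what one would expect if price is roughly log-normal. Zero prices must be removed first, since the logarithm of 0 is undefined.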

Figure 1.7

Figure 1.7 shows our data after a log transformation. We again plotted price, this time to get an idea of the distribution of rental prices by borough rather than as a whole. Similarly to Figure 1.6, the price of Airbnbs in each borough appears to follow a log-normal distribution. As one might intuitively expect, Manhattan has the highest average price for an Airbnb, at about $196.88. Next is Brooklyn at an average of $124.44, followed by Staten Island with a mean of $114.81 and Queens with a mean of $99.52. Lastly, the Bronx has a mean price of $87.58.

5. Predictive Model

The first step in the development of our model was to identify a variable of interest. Since we wanted our analysis to be of interest to consumers and producers alike, we decided to build a model that predicts rental prices per room type given the other variables as predictors. We accomplished this with supervised learning techniques.

Once we decided that price would be our response variable, we had to build a model. The first step was determining which predictors to use. To find the best model we had to choose a subset selection algorithm, and we used both greedy forward selection and backward elimination. We ran these algorithms in R and had it select the model with the lowest AIC.
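
To make the selection procedure concrete, here is a minimal, self-contained Python sketch of greedy forward selection by AIC for a linear model. It is not the report's R code. It uses synthetic data, covers only the forward direction, and takes the AIC as n·ln(RSS/n) + 2k, which is the Gaussian AIC up to an additive constant. All function and variable names are illustrative.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(rows, y))


def aic(columns, y):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    n = len(y)
    k = len(columns) + 2  # slopes + intercept + error variance
    return n * math.log(ols_rss(columns, y) / n) + 2 * k


def forward_select(candidates, y):
    """Add the predictor that lowers AIC the most until no addition helps."""
    selected, best = [], aic([], y)
    while True:
        trials = {name: aic([candidates[s] for s in selected + [name]], y)
                  for name in candidates if name not in selected}
        if not trials:
            break
        name = min(trials, key=trials.get)
        if trials[name] >= best:
            break
        selected.append(name)
        best = trials[name]
    return selected, best


# Synthetic data: y depends on "signal" only; "noise" is unrelated.
rng = random.Random(0)
signal = [rng.gauss(0, 1) for _ in range(200)]
noise = [rng.gauss(0, 1) for _ in range(200)]
y = [3.0 * s + rng.gauss(0, 0.5) for s in signal]

selected, best_aic = forward_select({"signal": signal, "noise": noise}, y)
print(selected, round(best_aic, 1))
```

Backward elimination follows the same pattern in reverse: start from the full model and repeatedly drop the predictor whose removal lowers the AIC the most.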

Afterwards, we wanted to consider possible interaction effects between our predictors. We considered that longitude and latitude could be subject to an interaction effect. We therefore attempted to predict price using room type, neighbourhood group, latitude and longitude (with an interaction effect between the two), number of reviews, yearly availability, reviews per month, calculated host listings count, and minimum nights. However, as the earlier distribution plots showed, price is skewed, so a log transformation on price again seemed appropriate, and we settled on the following regression equation:

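\[
\begin{aligned}
\text{price} ={}& B_1\,\text{room type} + B_2\,\text{neighbourhood group} + B_3\,(\text{latitude} \times \text{longitude}) + B_4\,\text{number of reviews} \\
&+ B_5\,\text{availability365} + B_6\,\text{reviews\_per\_month} + B_7\,\text{calculated host listings count} + B_8\,\text{minimum\_nights}
\end{aligned}
\]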

Here, our B values are the coefficients listed in the table. To determine the most accurate model we used several criteria, namely the Akaike Information Criterion (AIC), the Mean Squared Error (MSE), and R². Ultimately, we decided to include an interaction effect.

Figure 1.8

Figure 1.10

After constructing our model, we tested it. We used the hold-out method, training the model on 80% of the data and testing it on the remaining 20%. While accurate to an extent, this model does not accurately predict prices for shared rooms.

Using this model we obtained R² ≈ 48%, AIC ≈ 19,000, and MSE = 0.107.
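
The hold-out evaluation and the two error metrics reported above can be sketched as follows. This is an illustrative Python sketch, not the report's R code. The split helper, the fixed seed, and the toy numbers are all assumptions made for the example.

```python
import random


def holdout_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and split it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def mse(actual, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of the variance in actual that the predictions explain."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# 100 hypothetical observation indices split 80/20.
train, test = holdout_split(list(range(100)))
print(len(train), len(test))

# Toy check: always predicting the mean explains none of the variance.
print(mse([1, 2, 3], [2, 2, 2]), r_squared([1, 2, 3], [2, 2, 2]))
```

The metrics are meant to be computed on the test part only, so that they reflect how the model performs on data it has not seen during fitting.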

6. Conclusion

To conclude the report, we restate our preliminary questions and lay out our findings. The questions were (1) “What type of rooms do people rent out the most?”, (2) “Where are the most expensive or least expensive rooms located?”, and (3) “What are the average prices of rooms in each borough of NYC?” Our answers are as follows. (1) Based on the frequency of the rooms, we believe that the most popular room type is an entire home/apartment, with a frequency of 25,409, typically found in Manhattan. (2) Based on the distribution of price (see Figure 1.4), the most expensive rooms, with an average price of $196.88, are located in Manhattan, and the least expensive rooms, with an average price of $87.58, are located in the Bronx. (3) Based on the distribution of price by borough (see Figure 1.7), the average prices of a room in Manhattan, Brooklyn, Queens, the Bronx, and Staten Island are around $196.88, $124.44, $99.52, $87.58, and $114.81, respectively.

Our predictive model for price was based on nine predictors: room type, neighbourhood group, latitude and longitude (with an interaction effect between the two), number of reviews, yearly availability, reviews per month, calculated host listings count, and minimum nights. It had the following statistics: R² ≈ 48%, AIC ≈ 19,000, and MSE = 0.107.

Reference

The direction of our EDA and model was inspired by the following notebooks:

https://www.kaggle.com/josipdomazet/mining-nyc-airbnb-data-using-r/report#data-visualisation

https://www.kaggle.com/nishok03/price-prediction-with-xgb-gbr-data-exploration/notebo

https://www.kaggle.com/kmenjo/airbnb-simple-analysis