Data Analysis of NYC Airbnb
Project Report
By: Fullwin Liang
1. Can we predict the rental prices grouped by room type for an Airbnb?
This project, which analyzes public Airbnb data, is aimed at consumers and professionals alike.
We address several questions, such as “What type of rooms do people rent out the most?”,
“Where are the most expensive or least expensive rooms located?”, and “What are the average prices
of rooms in each borough of NYC?” In addition, we compare prices across room types and boroughs.
We answer these questions using Exploratory Data Analysis (EDA), and we then construct a predictive
model using machine learning tools. The purpose of this model is to predict the rental price of an
Airbnb by room type.
2. Data Description:
The Airbnb data set under scrutiny was obtained from Kaggle and covers only Airbnb listings in
NYC. It contains 48,895 observations and 16 variables. The variables include “User ID”, “Name”,
“Host ID”, “Host Name”, “Neighborhood Group” (i.e. borough), coordinate location (“latitude” and
“longitude”), “Room Type” (i.e. entire home/apartment, private room, or shared room), “Price”,
“Host Listing Counts”, “Yearly Availability of the Room”, “Minimum Nights” one must stay,
“Number of Reviews”, and “Reviews per Month.”
After obtaining our dataset, we ran into several issues that could potentially impact our analysis:
input errors, null values, and extraneous data. To combat these issues, we had to clean the data.
This involved distinguishing the types of errors (input errors versus null values) and determining
whether any of the affected data was salvageable. In the case of price, we found input errors such
as a price equal to zero; we deleted those observations since they could not meaningfully
contribute to our analysis.
A portion of the data was also irrelevant, so we eliminated those variables. These variables most
likely exist only to ease internal analysis of the data or to help catalog user information. Among
the eliminated variables were the User ID and Host Name. Since they offered no meaningful insight,
we reduced our data set so that we would not have to consider factors that did not contribute to
our analysis. The variable “reviews_per_month” is missing for about 20.56% of the observations. We
assumed that these values are null because the corresponding listings have 0 reviews, and
therefore set the null values to 0 for that variable, which removes the null values from the
dataset. As noted above, we also removed the observations where the price is equal to 0, since we
assume these are errors: there should be no free Airbnb listings in NYC.
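The cleaning steps described above can be sketched in a few lines. The report does not include its code (the analysis was done in R), so this is only a minimal Python illustration under stated assumptions: the rows, the column names `id` and `host_name`, and the helper names `clean` and `missing_share` are hypothetical.

```python
# Hypothetical rows mimicking a few of the Kaggle columns; values are made up.
rows = [
    {"id": 1, "host_name": "A", "price": 150, "reviews_per_month": 0.21},
    {"id": 2, "host_name": "B", "price": 0, "reviews_per_month": 1.50},
    {"id": 3, "host_name": "C", "price": 89, "reviews_per_month": None},
]

# Assumed names for the identifier columns that the report drops.
DROP_COLUMNS = {"id", "host_name"}


def missing_share(rows, column):
    """Fraction of rows in which the given column is missing (None)."""
    return sum(row[column] is None for row in rows) / len(rows)


def clean(rows):
    """Drop identifier columns, remove zero-price rows, fill missing review rates with 0."""
    cleaned = []
    for row in rows:
        if row["price"] == 0:
            continue  # treated as an input error, so the row is discarded
        kept = {k: v for k, v in row.items() if k not in DROP_COLUMNS}
        if kept["reviews_per_month"] is None:
            kept["reviews_per_month"] = 0.0  # assumption: no reviews means a rate of 0
        cleaned.append(kept)
    return cleaned


cleaned = clean(rows)
print(missing_share(rows, "reviews_per_month"), cleaned)
```

Checking the share of missing values before filling them, as `missing_share` does here, is what produces a figure like the 20.56% quoted above.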
3. Exploratory Data Analysis (Discrete Variables)
Figure 1.1
Based on Figure 1.1, we can get a rough idea of where Airbnb listings are located. Each borough is
differentiated by a unique color. Listings on Staten Island, which is colored in purple, are spaced
farther apart than elsewhere. This could be because Staten Island has the lowest population density
of the five boroughs and therefore fewer listings.
Figure 1.2
Figure 1.2 presents a bar chart of the number of Airbnb listings located in Manhattan, Brooklyn,
Queens, the Bronx, and Staten Island. The counts are 21661, 20104, 5666, 1091, and 373,
respectively.
Figure 1.3
Figure 1.3 shows the count of listings in Manhattan, Brooklyn, Queens, the Bronx, and Staten
Island, colored by room type: “entire home/apartment”, “private room”, and “shared room.” Based on
the charts, we can conclude that Manhattan is the borough with the most Airbnb listings and that
the most common room type is “entire home/apartment”.
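The counts behind bar charts like Figures 1.2 and 1.3 are simple frequency tables. A minimal Python sketch, assuming a hypothetical list of (borough, room type) pairs rather than the actual data set:

```python
from collections import Counter

# Hypothetical (borough, room type) pairs; the real data set has tens of thousands.
listings = [
    ("Manhattan", "Entire home/apt"),
    ("Manhattan", "Private room"),
    ("Brooklyn", "Private room"),
    ("Brooklyn", "Entire home/apt"),
    ("Queens", "Shared room"),
    ("Manhattan", "Entire home/apt"),
]

by_borough = Counter(borough for borough, _ in listings)   # heights of a Figure 1.2-style chart
by_type = Counter(room_type for _, room_type in listings)  # overall room-type frequencies
by_borough_and_type = Counter(listings)                    # segments of a Figure 1.3-style chart

top_borough, top_count = by_borough.most_common(1)[0]
top_type = by_type.most_common(1)[0][0]
print(top_borough, top_count, top_type)
```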
4. Continued EDA on Continuous Variables
Now that we have answered some pertinent questions regarding our discrete variables, such as the
quantity of room types and the number of Airbnb listings in different boroughs, we shift our focus
towards our continuous variable, price. We now want to ask which borough has the most expensive or
least expensive Airbnb rooms. We can answer this by creating a boxplot that compares the price
ranges of each borough.
Figure 1.4
Figure 1.4 indicates that Manhattan has the highest median, maximum, minimum, and interquartile
range for price of all boroughs. Therefore, we can say that Airbnb listings located in Manhattan
generally tend to cost more. The borough with the second highest median and interquartile range is
Brooklyn, while the remaining boroughs vary in how they rank on median and interquartile range.
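The quantities read off a boxplot (minimum, median, maximum, interquartile range) can also be computed directly. The sketch below is a Python illustration with made-up prices for three boroughs; the numbers and the helper name `box_stats` are assumptions and not values from the data set.

```python
from statistics import quantiles

# Made-up nightly prices per borough, for illustration only.
prices = {
    "Manhattan": [90, 120, 150, 200, 400],
    "Brooklyn": [50, 70, 90, 130, 180],
    "Bronx": [40, 55, 65, 80, 100],
}


def box_stats(values):
    """Return the summary numbers a boxplot displays for one group."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "median": q2, "max": max(values), "iqr": q3 - q1}


stats = {borough: box_stats(values) for borough, values in prices.items()}

# Rank boroughs from most to least expensive by median price.
ranked = sorted(stats, key=lambda b: stats[b]["median"], reverse=True)
print(ranked, stats)
```

Ranking by the median rather than the mean keeps a handful of extreme listings from dominating the comparison.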
Figure 1.5
Figure 1.6
Figure 1.5 shows a skewed distribution: the majority of the prices lie on the left (i.e. the
distribution is positively skewed). We therefore apply a log transformation to price to obtain
more meaningful information. Notice the change in distribution in Figure 1.6, which suggests that
price is approximately log-normally distributed. Note that we scaled by \(\log_{10} x\) rather than
\(\ln(x)\). Below we calculate some summary statistics from the data.
Price of Airbnb
Min. : 10.0
1st Qu.: 69.0
Median : 106.0
Mean : 152.8
3rd Qu.: 175.0
Max. : 10000.0
From this we can see that an Airbnb costs $152.80 on average, regardless of borough or room type.
The mean is well above the median of $106.00, which is consistent with the positive skew noted
above.
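To illustrate why a log transformation helps with positively skewed prices, the Python sketch below compares the skewness of a few hypothetical prices before and after taking the base-10 logarithm. The price list and the helper `skewness` are assumptions made for illustration; they are not the report's data or code.

```python
import math
from statistics import mean, median


def skewness(values):
    """Population skewness: third central moment divided by the cubed standard deviation."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical prices with a long right tail, loosely echoing the range in the summary.
prices = [10, 45, 69, 80, 106, 120, 175, 250, 900, 10000]
log_prices = [math.log10(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)

# A mean far above the median is a quick sign of positive skew.
print(mean(prices), median(prices), raw_skew, log_skew)
```

On data like this, the skewness of the log-transformed values is much closer to zero, which is the change one sees between a raw histogram and its log-scale counterpart.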
Figure 1.7
In Figure 1.7, we show our data after a log transformation. We again modeled price, this time to
get an idea of the distribution of rental prices by borough rather than as a whole. Similarly to
Figure 1.6, the price of Airbnb listings in each borough appears to follow a log-normal
distribution. As one might intuitively expect, Manhattan has the highest average price for an
Airbnb at about $196.88. Next is Brooklyn at an average of $124.44, followed by Staten Island with
a mean of $114.81 and Queens with a mean of $99.52. Lastly, the Bronx has a mean price of $87.58.
4. Predictive Model
The first step in the development of our model was to identify a variable of interest. Since we
wanted our analysis to be of interest to both consumers and producers alike, we decided to build a model
that would work to predict rental prices per room type given other independent variables. This would be
accomplished by supervised learning techniques.
Once we decided that price would be our response variable, we had to build a model. The first
step was determining which predictors to use. To find the best model we needed a subset selection
algorithm, and we used both greedy forward selection and backward elimination. We ran these in R
and had it select the model with the lowest AIC.
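Greedy forward selection by AIC can be sketched as follows. The report ran its selection in R and does not show the code, so this is a self-contained Python illustration rather than the author's procedure: the synthetic data, the feature names, and the helpers `fit_ols`, `aic`, and `forward_select` are all assumptions. The AIC used is the Gaussian form up to an additive constant, n·ln(RSS/n) + 2k. Backward elimination would work analogously, starting from the full model and removing one predictor at a time.

```python
import math
import random


def fit_ols(columns, y):
    """Least-squares fit with an intercept via the normal equations; returns (coefs, rss)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    rss = sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2 for r in range(n))
    return beta, rss


def aic(rss, n, k):
    """Gaussian AIC up to an additive constant, with k fitted coefficients."""
    return n * math.log(rss / n) + 2 * k


def forward_select(features, y):
    """Repeatedly add the feature that lowers AIC the most; stop when none lowers it."""
    n = len(y)
    chosen = []
    best = aic(fit_ols([], y)[1], n, 1)  # intercept-only baseline
    while True:
        candidates = []
        for name in features:
            if name in chosen:
                continue
            cols = [features[c] for c in chosen + [name]]
            candidates.append((aic(fit_ols(cols, y)[1], n, len(cols) + 1), name))
        if not candidates:
            break
        score, name = min(candidates)
        if score >= best:
            break
        best = score
        chosen.append(name)
    return chosen, best


# Synthetic data: two informative predictors and one pure-noise predictor.
rng = random.Random(0)
n = 200
features = {
    "signal_a": [rng.gauss(0, 1) for _ in range(n)],
    "signal_b": [rng.gauss(0, 1) for _ in range(n)],
    "noise": [rng.gauss(0, 1) for _ in range(n)],
}
y = [2 + 3 * a - 2 * b + rng.gauss(0, 0.5)
     for a, b in zip(features["signal_a"], features["signal_b"])]

chosen, best_aic = forward_select(features, y)
print(chosen, best_aic)
```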
Afterwards, we wanted to consider any possible interaction effects between our predictors. We
considered that longitude and latitude could interact. We therefore attempted to predict price
using room type, neighbourhood group, latitude and longitude (with an interaction effect between
the two), number of reviews, yearly availability, reviews per month, calculated host listings
count, and minimum nights. However, price was skewed, as seen in the earlier distribution, so a
log transformation on price again seemed appropriate, and we settled on the following regression
equation:
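\[
\begin{aligned}
\text{price} = {} & B_0 + B_1\,\text{room type} + B_2\,\text{neighbourhood group} + B_3\,\text{latitude} * B_4\,\text{longitude} + B_5\,\text{number of reviews} \\
& + B_6\,\text{availability365} + B_7\,\text{reviews\_per\_month} + B_8\,\text{calculated host listings count} + B_9\,\text{minimum\_nights}
\end{aligned}
\]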
Here, our B values are the coefficients listed in the accompanying table. To determine the most
accurate model we used several criteria, such as the Akaike Information Criterion (AIC), the Mean
Squared Error (MSE), and \(R^2\). Ultimately, we decided to include an interaction effect.
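An interaction effect between two numeric predictors is commonly represented by adding their product as an extra column. In R formula syntax, `latitude * longitude` expands to both main effects plus the product term. The Python sketch below shows only the product column; the helper name `add_interaction`, the `a:b` naming, and the coordinates are assumptions made for illustration.

```python
def add_interaction(rows, a, b):
    """Return copies of the rows with an extra column holding the product of a and b."""
    name = f"{a}:{b}"
    return [{**row, name: row[a] * row[b]} for row in rows]


# Two hypothetical listings with made-up coordinates.
rows = [
    {"latitude": 40.75, "longitude": -73.98},
    {"latitude": 40.68, "longitude": -73.95},
]

with_interaction = add_interaction(rows, "latitude", "longitude")
print(with_interaction[0])
```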
Figure 1.8
Figure 1.10
After constructing our model, we began to test it. We used the hold-out method: 80% of the data
was used for training and the remaining 20% or so for testing. While accurate to an extent, this
model does not manage to accurately predict prices for shared rooms. Using this model we obtained
\(R^2 \approx 48\%\), an AIC of about 19000, and MSE = 0.107.
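The hold-out evaluation described above can be sketched as follows. This Python example is an illustration under stated assumptions, not the report's R code: it uses synthetic data, a one-predictor least-squares line in place of the full model, and hypothetical helper names (`train_test_split`, `fit_line`, `evaluate`). It shows how MSE and R^2 are computed on the held-out 20%.

```python
import random


def train_test_split(rows, test_share=0.2, seed=0):
    """Shuffle a copy of the rows and split it into training and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_share))
    return shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Least-squares line y = a + b*x; returns (a, b)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    b = sxy / sxx
    return my - b * mx, b


def evaluate(pairs, a, b):
    """Return (MSE, R^2) of the line y = a + b*x on the given pairs."""
    n = len(pairs)
    my = sum(y for _, y in pairs) / n
    sse = sum((y - (a + b * x)) ** 2 for x, y in pairs)
    sst = sum((y - my) ** 2 for _, y in pairs)
    return sse / n, 1 - sse / sst


# Synthetic (predictor, log-price) pairs; the coefficients are made up.
rng = random.Random(1)
data = [(x, 2.0 + 0.01 * x + rng.gauss(0, 0.3)) for x in range(100)]

train, test = train_test_split(data)
a, b = fit_line(train)        # fit on the 80% training part only
mse, r2 = evaluate(test, a, b)  # score on the untouched 20%
print(len(train), len(test), mse, r2)
```

Because the model is fit on the training part only, the test-set MSE and R^2 estimate how it would perform on listings it has not seen.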
5. Conclusion
To conclude the report, we restate our preliminary questions and then lay out our findings. The
questions were: 1. “What type of rooms do people rent out the most?”, 2. “Where are the most
expensive or least expensive rooms located?”, and 3. “What are the average prices of rooms in each
borough of NYC?” Our answers are as follows. 1. Based on the frequency of the rooms listed, the
most popular room type is an entire home/apartment, with a frequency of 25409, typically found in
Manhattan. 2. Based on the distribution of price (see Figure 1.4), the most expensive rooms, with
an average price of $196.88, are located in Manhattan and the least expensive rooms, with an
average price of $87.58, are located in the Bronx. 3. Based on the distribution of price by
borough (see Figure 1.7), the average prices of a room in Manhattan, Brooklyn, Queens, the Bronx,
and Staten Island are around $196.88, $124.44, $99.52, $87.58, and $114.81, respectively.
Our predictive model for price was based on nine predictors: room type, neighbourhood group,
latitude and longitude (with an interaction effect between the two), number of reviews, yearly
availability, reviews per month, calculated host listings count, and minimum nights. It had the
following statistics: \(R^2 \approx 48\%\), an AIC of about 19000, and MSE = 0.107.
Reference
The direction of our EDA and model was inspired by the following notebooks:
https://www.kaggle.com/josipdomazet/mining-nyc-airbnb-data-using-r/report#data-visualisation
https://www.kaggle.com/nishok03/price-prediction-with-xgb-gbr-data-exploration/notebook
https://www.kaggle.com/kmenjo/airbnb-simple-analysis