1. Can we predict the rental prices grouped by room type for an Airbnb?
This project analyzes public Airbnb data and is aimed at consumers and professionals
alike. We address several questions, such as “What types of rooms do people rent out the
most?”, “Where are the most and least expensive rooms located?”, and “What is the average price
of rooms in each borough of NYC?” In addition to answering these, we compare prices across
room types and boroughs. We answer these questions using Exploratory Data Analysis (EDA),
and we then construct a predictive model, using machine learning tools to improve its
accuracy. The purpose of this model is to predict the rental price of an Airbnb listing by room type.
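The room-type and borough questions above reduce to grouped aggregations. The following is a minimal sketch in pandas, not the project's actual code. The column names `neighbourhood_group`, `room_type`, and `price`, the room-type labels, and the toy rows are all assumptions made up for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Toy rows invented for illustration only; they are NOT from the real dataset.
# Column names and room-type labels are assumptions.
listings = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Private room", "Shared room"],
    "price": [200, 100, 150, 50, 40],
})

# "What type of rooms do people rent out the most?"
room_type_counts = listings["room_type"].value_counts()

# "What are the average prices of rooms in each borough?"
avg_price_by_borough = listings.groupby("neighbourhood_group")["price"].mean()

# Price comparison between room types in different boroughs:
# one row per borough, one column per room type, mean price in each cell.
avg_price_table = listings.pivot_table(
    index="neighbourhood_group", columns="room_type",
    values="price", aggfunc="mean",
)

print(room_type_counts)
print(avg_price_by_borough)
print(avg_price_table)
```

Combinations that never occur in the data (for example, a shared room in a borough that has none) appear as NaN in the pivot table rather than as 0, which keeps "no listings" distinct from "free listings".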
2. Data Description:
The Airbnb dataset under scrutiny was obtained from Kaggle and covers only Airbnb
listings in NYC. It contains 48,895 observations and 16 variables. The variables
include “User ID”, “Name”, “Host ID”, “Host Name”, “Neighborhood Group” (i.e., borough),
coordinate location (“Latitude” and “Longitude”), “Room Type” (i.e., private home, apartment,
shared room), “Price”, “Host Listing Counts”, “Yearly Availability of the Room”, “Minimum Nights”
per stay, “Number of Reviews”, and “Reviews per Month”.
After obtaining the dataset, we ran into several issues that could affect our analysis:
input errors, null values, and extraneous data. To address them, we had to clean the data.
This involved identifying and distinguishing the types of errors (input errors versus null
values) and determining whether any of the affected data was salvageable. In the case of price,
we found input errors such as a price of zero; we deleted those observations, since they
could not meaningfully contribute to our analysis.
A portion of the data was also irrelevant, so we eliminated those variables. They most
likely existed only to ease previous internal analysis of the data or to help catalog
user information. Among the eliminated variables were User ID and Host Name. Since these
offered no meaningful insight, we reduced our dataset so that we would not have to consider
factors that did not contribute to our analysis. The variable “reviews_per_month” is missing
in about 20.56% of the observations. We assumed these values are null because the
corresponding listings have received 0 reviews, so we set the null values of that variable
to 0, which resolves the null values for that variable. As noted above, we also removed the
observations where the price is equal to 0. These are most likely errors in the dataset,
since we would not expect any free Airbnb listings in NYC.
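The cleaning steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code actually used in the project. The column names (`id`, `host_name`, `price`, `number_of_reviews`, `reviews_per_month`), the helper `clean_listings`, and the toy rows are hypothetical. The sketch also checks the assumption that a missing `reviews_per_month` coincides with zero reviews before filling it with 0.

```python
import numpy as np
import pandas as pd


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning steps from the text (hypothetical helper)."""
    # 1. Drop identifier-style columns that carry no analytic signal.
    cleaned = df.drop(columns=["id", "host_name"], errors="ignore")
    # 2. Remove rows with a price of zero, treated as input errors.
    cleaned = cleaned[cleaned["price"] != 0].copy()
    # 3. Treat a missing reviews_per_month as "no reviews", i.e. 0.
    cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0)
    return cleaned.reset_index(drop=True)


# Toy rows invented for illustration only; NOT from the real dataset.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "host_name": ["A", "B", "C", "D"],
    "price": [120, 0, 80, 60],
    "number_of_reviews": [10, 3, 0, 0],
    "reviews_per_month": [0.5, 0.2, np.nan, np.nan],
})

# Share of missing reviews_per_month values, in percent.
missing_share = raw["reviews_per_month"].isna().mean() * 100

# Check the assumption: every row with a missing value has zero reviews.
missing_rows = raw["reviews_per_month"].isna()
assumption_holds = bool((raw.loc[missing_rows, "number_of_reviews"] == 0).all())

cleaned = clean_listings(raw)
print(f"missing reviews_per_month: {missing_share:.2f}%")
print(f"missing values always have zero reviews: {assumption_holds}")
print(cleaned)
```

Running the assumption check on the real data before filling is a cheap safeguard: if it came back false, filling with 0 would misrepresent listings that do have reviews.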