# import libraries
import numpy as np
import pandas as pd
# show all columns (instead of cascading columns in the middle)
"display.max_columns", None)
pd.set_option(# don"t show numbers in scientific notation
"display.float_format", "{:.2f}".format)
pd.set_option(
# read the Python Pickle and PyArrow Parquet files
= pd.read_pickle("data/airbnb_amsterdam_listings.pkl")
df_list = pd.read_parquet("data/airbnb_amsterdam_calendar.parquet", engine="pyarrow") df_cal
Introduction
Data overview
Setting up the programming environment
Listings data
df_list.head()
id | host_acceptance_rate | host_is_superhost | host_listings_count | host_total_listings_count | neighbourhood_cleansed | latitude | longitude | room_type | accommodates | bedrooms | beds | amenities | price | minimum_nights | maximum_nights | has_availability | availability_30 | availability_60 | availability_90 | availability_365 | number_of_reviews | number_of_reviews_ltm | number_of_reviews_l30d | review_scores_rating | instant_bookable | reviews_per_month | price_in_euros | price_per_person | minimum_price | discount_per_5_days_booked | discount_per_10_days_booked | discount_per_30_and_more_days_booked | service_cost | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 23726706 | 0.95 | f | 1 | 1 | IJburg - Zeeburgereiland | 52.35 | 4.98 | Private room | 2 | 1.00 | 1.00 | 6 | $88.00 | 2 | 14 | t | 0 | 6 | 20 | 66 | 78 | 11 | 3 | 4.99 | t | 1.53 | None | $44 | $176 | 5% | 11% | 16% | $4.99 |
1 | 35815036 | 1.00 | t | 1 | 1 | Noord-Oost | 52.42 | 4.96 | Entire home/apt | 2 | NaN | 1.00 | 5 | $105.00 | 3 | 100 | t | 4 | 6 | 12 | 243 | 95 | 36 | 6 | 4.96 | f | 2.65 | None | $52.5 | $315 | 5% | 12% | 16% | $4.99 |
2 | 31553121 | 1.00 | f | 1 | 1 | Noord-West | 52.43 | 4.92 | Entire home/apt | 4 | 1.00 | 3.00 | 3 | $152.00 | 2 | 60 | t | 0 | 3 | 3 | 3 | 82 | 26 | 1 | 4.74 | f | 2.02 | None | $38 | $304 | 7% | 11% | 22% | $4.99 |
3 | 34745823 | 0.94 | f | 3 | 3 | Gaasperdam - Driemond | 52.30 | 5.01 | Entire home/apt | 2 | 1.00 | 2.00 | 8 | $87.00 | 2 | 1125 | t | 5 | 20 | 26 | 290 | 39 | 4 | 0 | 4.87 | f | 1.08 | None | $43.5 | $174 | 6% | 10% | 15% | $4.99 |
4 | 44586947 | 0.88 | t | 0 | 0 | Gaasperdam - Driemond | 52.31 | 5.03 | Private room | 4 | 2.00 | 3.00 | 4 | $160.00 | 2 | 31 | t | 9 | 32 | 62 | 152 | 15 | 12 | 3 | 5.00 | f | 0.68 | None | $40 | $320 | 9% | 22% | 20% | $4.99 |
This dataset needs extensive cleaning which will be performed in the next step.
Data dictionary for the listings data
Here is the definition of the listings data columns:
id
- Airbnb’s unique identifier for the listinghost_acceptance_rate
- rate at which a host accepts booking requesthost_is_superhost
- some Airbnb hosts are considered better than others, which this feature displayshost_listings_count
- the number of listings the host has (per Airbnb calculations)host_total_listings_count
- the number of listings the host has (per Airbnb calculations)neighborhood_cleansed
- all neighborhoods present in Amsterdamlatitude
- geographical coordinate going from North to Southlongitude
- geographical coordinate going from East to Westroom_type
- all room types available in Amsterdam on Airbnbaccommodates
- maximum number of people that can fit in the Airbnb listingbedrooms
- number of bedroomsbeds
- number of beds (does not reveal if they are two-person or one-person beds)amenities
- number of amenities present in the Airbnb listingprice
- price per day in US dollarsminimum_nights
- minimum number of nights you need to book for this listingmaximum_nights
- maximum number of nights you are allowed to book this listinghas_availability
- displays if this listing can be bookedavailability_30
- number of available spots in the next 30 daysnumber_of_reviews_l30d
- number of reviews created in the last 30 daysreview_scores_rating
- average rating given to the Airbnb listinginstant_bookable
- whether the guest can automatically book the listing without the host requiring to accept their booking request. An indicator of a commercial listing.price_per_person
- price per person per night in US dollarsminimum_price
- price in US dollars times minimum nightsdiscount_per_…
- displays the discount when 5, 10, or 30+ days are bookedservice_cost
- total service cost paid at booking
Calendar data
df_cal.head()
listing_id | date | available | price_in_dollar | minimum_nights | maximum_nights | |
---|---|---|---|---|---|---|
0 | 23726706 | 2022-06-05 | False | 90.00 | 2 | 1125 |
1 | 23726706 | 2022-06-06 | False | 90.00 | 2 | 1125 |
2 | 23726706 | 2022-06-07 | False | 90.00 | 2 | 1125 |
3 | 23726706 | 2022-06-08 | False | 90.00 | 2 | 1125 |
4 | 23726706 | 2022-06-09 | False | 85.00 | 2 | 1125 |
Data dictionary for the calendar data
Here is the definition of the calendar data columns:
listing_id
- Airbnb’s unique identifier for the listingdate
- the date in the listing’s calendaravailable
- whether the date is available for a bookingprice_in_dollar
- the price listed for the dayminimum_nights
- minimum nights for a booking made on this daymaximum_nights
- maximum nights for a booking made on this day
Cleaning the Listings DataFrame
To get an overview of the Listings DataFrame, called df_list
, we’ll need some details about the columns in the DataFrame, like the column names, their data types, and the number of non-null values.
df_list.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6165 entries, 0 to 6172
Data columns (total 34 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 6165 non-null int64
1 host_acceptance_rate 5365 non-null float64
2 host_is_superhost 6165 non-null object
3 host_listings_count 6165 non-null int64
4 host_total_listings_count 6165 non-null int64
5 neighbourhood_cleansed 6165 non-null object
6 latitude 6165 non-null float64
7 longitude 6165 non-null float64
8 room_type 6165 non-null object
9 accommodates 6165 non-null int64
10 bedrooms 5859 non-null float64
11 beds 6082 non-null float64
12 amenities 6165 non-null int64
13 price 6165 non-null object
14 minimum_nights 6165 non-null int64
15 maximum_nights 6165 non-null int64
16 has_availability 6165 non-null object
17 availability_30 6165 non-null int64
18 availability_60 6165 non-null int64
19 availability_90 6165 non-null int64
20 availability_365 6165 non-null int64
21 number_of_reviews 6165 non-null int64
22 number_of_reviews_ltm 6165 non-null int64
23 number_of_reviews_l30d 6165 non-null int64
24 review_scores_rating 5581 non-null float64
25 instant_bookable 6165 non-null object
26 reviews_per_month 5581 non-null float64
27 price_in_euros 0 non-null object
28 price_per_person 6165 non-null object
29 minimum_price 6165 non-null object
30 discount_per_5_days_booked 6165 non-null object
31 discount_per_10_days_booked 6165 non-null object
32 discount_per_30_and_more_days_booked 6165 non-null object
33 service_cost 6165 non-null object
dtypes: float64(7), int64(14), object(13)
memory usage: 1.6+ MB
= pd.read_pickle("data/airbnb_amsterdam_listings.pkl") df_list
STEP 1 - Perform this four-step process to change each of the three discount_per_...
columns into their proper format:
- Remove non-numeric characters, like the percent symbol, so you can perform mathematical calculations on the column
- Change the column into a float data type in order to convert the data into a ratio
- Multiply the whole column by 0.01 so you end up with a probability ratio instead of a percentage
- Overwrite the old discount_per_… column with this new column
'discount_per_5_days_booked':'discount_per_30_and_more_days_booked'] = (
df_list.loc[:,'discount_per_5_days_booked':'discount_per_30_and_more_days_booked']
df_list.loc[:,'%', '', regex=True)
.replace('float')
.astype(0.01)
.multiply(
)
'discount_per_5_days_booked':'discount_per_30_and_more_days_booked'].head() df_list.loc[:,
discount_per_5_days_booked | discount_per_10_days_booked | discount_per_30_and_more_days_booked | |
---|---|---|---|
0 | 0.05 | 0.11 | 0.16 |
1 | 0.05 | 0.12 | 0.16 |
2 | 0.07 | 0.11 | 0.22 |
3 | 0.06 | 0.10 | 0.15 |
4 | 0.09 | 0.22 | 0.20 |
STEP 2 - The columns host_is_superhost
, instant_bookable
, and has_availability
are all boolean columns in the sense that their data represents true and false values, but currently are recognized as objects.
"host_is_superhost", "instant_bookable", "has_availability"]].head() df_list[[
host_is_superhost | instant_bookable | has_availability | |
---|---|---|---|
0 | f | t | t |
1 | t | f | t |
2 | f | f | t |
3 | f | f | t |
4 | t | f | t |
The letters in these columns (t and f) are written as strings and not as boolean data types. This means we need to replace our string values with the boolean equivalent dtype.
Change the columns host_is_superhost, instant_bookable, and has_availability into a boolean data type for better data processing:
- Replace f and t with False and True
- Set the column as type bool
- Overwrite the old columns with the new values
"host_is_superhost", "instant_bookable", "has_availability"]] = (
df_list[["host_is_superhost", "instant_bookable", "has_availability"]]
df_list[["f": False, "t": True})
.replace({'bool')
.astype(
)
"host_is_superhost", "instant_bookable", "has_availability"]].head() df_list[[
host_is_superhost | instant_bookable | has_availability | |
---|---|---|---|
0 | False | True | True |
1 | True | False | True |
2 | False | False | True |
3 | False | False | True |
4 | True | False | True |
STEP 3 - A closer look at the prices in the four columns price
, price_per_person
, minimum_price
, and service_cost
reveals that they all follow the same pattern:
"price", "price_per_person", "minimum_price", 'service_cost']].head() df_list[[
price | price_per_person | minimum_price | service_cost | |
---|---|---|---|---|
0 | $88.00 | $44 | $176 | $4.99 |
1 | $105.00 | $52.5 | $315 | $4.99 |
2 | $152.00 | $38 | $304 | $4.99 |
3 | $87.00 | $43.5 | $174 | $4.99 |
4 | $160.00 | $40 | $320 | $4.99 |
'price'] = (
df_list['price']
df_list[str.replace('$', '', regex=True)
.str.replace(',', '', regex=True)
.'float')
.astype(
)
'price_per_person'] = (
df_list['price_per_person']
df_list[str.replace('$', '', regex=True)
.str.replace(',', '', regex=True)
.'float')
.astype(
)
'minimum_price'] = (
df_list['minimum_price']
df_list[str.replace('$', '', regex=True)
.str.replace(',', '', regex=True)
.'float')
.astype(
)
'service_cost'] = (
df_list['service_cost']
df_list[str.replace('$', '', regex=True)
.str.replace(',', '', regex=True)
.'float')
.astype(
)
"price", "price_per_person", "minimum_price", 'service_cost']].head() df_list[[
price | price_per_person | minimum_price | service_cost | |
---|---|---|---|---|
0 | 88.00 | 44.00 | 176.00 | 4.99 |
1 | 105.00 | 52.50 | 315.00 | 4.99 |
2 | 152.00 | 38.00 | 304.00 | 4.99 |
3 | 87.00 | 43.50 | 174.00 | 4.99 |
4 | 160.00 | 40.00 | 320.00 | 4.99 |
STEP 4 - Renaming columns
The following column names need to be changed:
price
intoprice_in_dollar
neighbourhood_cleansed
intoneighbourhood
= df_list.rename(
df_list ={
columns"price": "price_in_dollar",
"neighbourhood_cleansed": "neighbourhood"
} )
STEP 5 Taking a closer look at the neighbourhood
and room_type
columns reveals that these columns are assigned an object dtype. We want them to be a category dtype.
'neighbourhood', 'room_type']] = (
df_list[['neighbourhood', 'room_type']]
df_list[['category')
.astype( )
STEP 6 - Deleting irrelevant columns
We need to delete some columns that are irrelevant to this current use case.
Let’s inspect the price_in_euros
column first, because this column seems to contain only null values, which inherently do not add any meaning to the dataset.
'price_in_euros'].unique() df_list[
array([None], dtype=object)
The approach reveals that this column contains no unique values and is thus empty. We’ll trop it with other irrelevant columns.
= df_list.drop(
df_list =[
columns'price_in_euros',
'host_listings_count',
'host_total_listings_count',
'availability_60',
'availability_90',
'availability_365',
'number_of_reviews',
'number_of_reviews_ltm',
'reviews_per_month'
] )
Let’s now have a look at which data types we still need to change and which columns have some null values.
=True, show_counts=True) df_list.info(verbose
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6165 entries, 0 to 6172
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 6165 non-null int64
1 host_acceptance_rate 5365 non-null float64
2 host_is_superhost 6165 non-null bool
3 neighbourhood 6165 non-null category
4 latitude 6165 non-null float64
5 longitude 6165 non-null float64
6 room_type 6165 non-null category
7 accommodates 6165 non-null int64
8 bedrooms 5859 non-null float64
9 beds 6082 non-null float64
10 amenities 6165 non-null int64
11 price_in_dollar 6165 non-null float64
12 minimum_nights 6165 non-null int64
13 maximum_nights 6165 non-null int64
14 has_availability 6165 non-null bool
15 availability_30 6165 non-null int64
16 number_of_reviews_l30d 6165 non-null int64
17 review_scores_rating 5581 non-null float64
18 instant_bookable 6165 non-null bool
19 price_per_person 6165 non-null float64
20 minimum_price 6165 non-null float64
21 discount_per_5_days_booked 6165 non-null float64
22 discount_per_10_days_booked 6165 non-null float64
23 discount_per_30_and_more_days_booked 6165 non-null float64
24 service_cost 6165 non-null float64
dtypes: bool(3), category(2), float64(13), int64(7)
memory usage: 1.0 MB
STEP 7 - Some listings have no reviews and an unknown host acceptance rate. Most Airbnb users exclude such listings from their search results. To mimic this filtering approach, we’ll filter out any rows that do not have a review_scores_rating
and without a `host_acceptance_rate
.
= df_list.dropna(subset=['review_scores_rating', 'host_acceptance_rate']) df_list
STEP 8 - Which columns still have missing values and how much?
sum().loc[lambda x : x != 0] df_list.isnull().
bedrooms 264
beds 69
dtype: int64
Let’s make some simple assumptions based on the room_type
assigned to the listing. Which room types are found in the dataset.
"room_type"].value_counts() df_list[
Entire home/apt 3158
Private room 1633
Hotel room 67
Shared room 28
Name: room_type, dtype: int64
There are four room types. Let’s make the assumption that the columns bedrooms
and beds
are potentially influenced by room_type
.
Therefore, we can make the following rules:
- If you have a “Private room” or “Shared room” as
room_type
, then we believe the listing only has one bedroom. - If the listing has “Hotel room” or “Entire home/apt” as
room_type
, then we can divide the number of guests the listing accomodates by 2 and round up. - If any of these numbers are missing, then we can leave it empty.
Translate these requirements into a Python function, and you get:
def fill_empty_bedrooms(accommodates: int, bedrooms: int, room_type: str) -> int:
if (room_type == "Private room") or (room_type == "Shared room"):
return 1
elif (room_type == "Hotel room") or (room_type == "Entire home/apt"):
return np.ceil(accommodates / 2)
else:
return bedrooms
In the case of Pandas, we are using apply()
to semi-vectorize our function, but secretly this function just implements something that mimics a for loop. Using a lambda
together with apply()
allows us to access multiple columns to generate an outcome.
This approach is often good enough, but not always, especially if you are dealing with large datasets. Below we will run the apply()
function for output.
"bedrooms"] = df_list[["accommodates", "bedrooms", "room_type"]].apply(
df_list[lambda x: fill_empty_bedrooms(x["accommodates"], x["bedrooms"], x["room_type"]),
=1,
axis )
Thanks to our logic and assumptions, most listings now have a proper amount of defined rooms. However, there are still a few listings without any number of rooms defined. Remove all rows/entries that have an empty bedrooms, beds.
"bedrooms", "beds"]].isnull().sum() df_list[[
bedrooms 0
beds 69
dtype: int64
= df_list.dropna(subset=["beds"]) df_list
"bedrooms", "beds"]].isnull().sum() df_list[[
bedrooms 0
beds 0
dtype: int64
Memory usage
=True).sum() df_list.memory_usage(deep
835941
Calendar DataFrame
3) df_cal.head(
listing_id | date | available | price_in_dollar | minimum_nights | maximum_nights | |
---|---|---|---|---|---|---|
0 | 23726706 | 2022-06-05 | False | 90.00 | 2 | 1125 |
1 | 23726706 | 2022-06-06 | False | 90.00 | 2 | 1125 |
2 | 23726706 | 2022-06-07 | False | 90.00 | 2 | 1125 |
Minimum stay
You are looking to stay for a minimum of 3 days. Since you are unsure when it will be booked, you’d like to exclude all listing_ids
that go below that threshold of 3 days no matter what time of year. This is because you think that bookings with a minimum stay of 3 days are more likely to have discount prices.
With these excluded, you would like to see the expected booking price for 5 days:
- Using the helper code, create a conditional index for entries
listing_id
that are above the 3 day threshold - Remove them using
.loc[conditional_index]
over the conditional index. - Calculate the price of booking a listing for 5 days by multiplying the current day multiplied by 5, and assign this to a column called
five_day_dollar_price
# First start by making a copy, for debugging purposes
= df_cal.copy()
calendar_newdf
= calendar_newdf["minimum_nights"] >= 3
include_list
# Get all the listings with a minimum nights of 3+
# Use the include_list
= calendar_newdf.loc[include_list]
calendar_newdf
"five_day_dollar_price"] = calendar_newdf["price_in_dollar"] * 5 calendar_newdf[
3) calendar_newdf.head(
listing_id | date | available | price_in_dollar | minimum_nights | maximum_nights | five_day_dollar_price | |
---|---|---|---|---|---|---|---|
365 | 35815036 | 2022-06-05 | False | 105.00 | 3 | 1125 | 525.00 |
367 | 35815036 | 2022-06-07 | True | 105.00 | 3 | 1125 | 525.00 |
368 | 35815036 | 2022-06-08 | True | 105.00 | 3 | 1125 | 525.00 |
Now let’s transform our newly created DataFrame into a pivot table, where we aggregate our rows using the listing_id as the index, and the columns available
and five_day_dollar_price
as values.
= pd.pivot_table(
calendar_summarizeddf =calendar_newdf,
data="listing_id",
index=["available", "five_day_dollar_price"],
values=np.mean
aggfunc# .add_suffix('_avg')
)
calendar_summarizeddf.head()
available | five_day_dollar_price | |
---|---|---|
listing_id | ||
2818 | 0.21 | 346.90 |
44391 | 0.00 | 1200.00 |
49552 | 0.46 | 1162.50 |
55709 | 0.00 | 818.68 |
55868 | 0.00 | 749.22 |
Maximum price and date
Can you make a pivot table that states the maximum price_in_dollar
for every Airbnb listing?
= pd.pivot_table(
temp_sum_df =calendar_newdf,
data=["listing_id"],
index=["price_in_dollar"],
values=np.max
aggfunc'_max')
).add_suffix(
temp_sum_df.head()
price_in_dollar_max | |
---|---|
listing_id | |
2818 | 80.00 |
44391 | 240.00 |
49552 | 300.00 |
55709 | 250.00 |
55868 | 200.00 |
Merging
We are going to merge the pivot table that includes the five_day_dollar_price
for each listing with listings_df.
We have to keep in mind that we want to keep only those rows of Airbnb listing IDs that are present in both datasets.
= pd.merge(
final_df
df_list,
calendar_summarizeddf,=["id"],
left_on=["listing_id"],
right_on="inner"
how
)
3) final_df.head(
id | host_acceptance_rate | host_is_superhost | neighbourhood | latitude | longitude | room_type | accommodates | bedrooms | beds | amenities | price_in_dollar | minimum_nights | maximum_nights | has_availability | availability_30 | number_of_reviews_l30d | review_scores_rating | instant_bookable | price_per_person | minimum_price | discount_per_5_days_booked | discount_per_10_days_booked | discount_per_30_and_more_days_booked | service_cost | available | five_day_dollar_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 35815036 | 1.00 | True | Noord-Oost | 52.42 | 4.96 | Entire home/apt | 2 | 1.00 | 1.00 | 5 | 105.00 | 3 | 100 | True | 4 | 6 | 4.96 | False | 52.50 | 315.00 | 0.05 | 0.12 | 0.16 | 4.99 | 0.66 | 528.89 |
1 | 19572024 | 1.00 | False | Watergraafsmeer | 52.31 | 4.91 | Entire home/apt | 6 | 3.00 | 6.00 | 14 | 279.00 | 3 | 300 | True | 6 | 3 | 4.69 | False | 46.50 | 837.00 | 0.09 | 0.16 | 0.14 | 4.99 | 0.82 | 1496.55 |
2 | 2973384 | 0.38 | False | Watergraafsmeer | 52.31 | 4.91 | Entire home/apt | 5 | 3.00 | 3.00 | 7 | 185.00 | 6 | 21 | True | 0 | 0 | 4.83 | False | 37.00 | 1110.00 | 0.06 | 0.12 | 0.18 | 4.99 | 0.04 | 941.30 |
Grouping
Now, let’s perform a groupby where we look at the median values of five_day_dollar_price
and review_scores_rating
with respect to the room_type.
Do these results match your intuition?
='room_type')[['review_scores_rating', 'five_day_dollar_price']].median() final_df.groupby(by
(final_df='room_type')
.groupby(by=('review_scores_rating', 'median'),
.agg(review_scores_rating_median=('five_day_dollar_price', 'median')
five_day_dollar_price_median
) )
review_scores_rating_median | five_day_dollar_price_median | |
---|---|---|
room_type | ||
Entire home/apt | 4.88 | 975.00 |
Hotel room | 4.56 | 1110.16 |
Private room | 4.79 | 710.91 |
Shared room | 4.60 | 724.11 |
You might have expected that shared rooms are the cheapest and thus have the lowest rating with respect to median scores. The same can’t be said for the most expensive option — a hotel room. Will this influence your future considerations when booking 🤔?
(But before you let this influence your decisions too much, it might be better to assume that this data might be biased in favor of Airbnb and not hotels in general. 🤷)
Memory optimization
final_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2883 entries, 0 to 2882
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 2883 non-null int64
1 host_acceptance_rate 2883 non-null float64
2 host_is_superhost 2883 non-null bool
3 neighbourhood 2883 non-null category
4 latitude 2883 non-null float64
5 longitude 2883 non-null float64
6 room_type 2883 non-null category
7 accommodates 2883 non-null int64
8 bedrooms 2883 non-null float64
9 beds 2883 non-null float64
10 amenities 2883 non-null int64
11 price_in_dollar 2883 non-null float64
12 minimum_nights 2883 non-null int64
13 maximum_nights 2883 non-null int64
14 has_availability 2883 non-null bool
15 availability_30 2883 non-null int64
16 number_of_reviews_l30d 2883 non-null int64
17 review_scores_rating 2883 non-null float64
18 instant_bookable 2883 non-null bool
19 price_per_person 2883 non-null float64
20 minimum_price 2883 non-null float64
21 discount_per_5_days_booked 2883 non-null float64
22 discount_per_10_days_booked 2883 non-null float64
23 discount_per_30_and_more_days_booked 2883 non-null float64
24 service_cost 2883 non-null float64
25 available 2883 non-null float64
26 five_day_dollar_price 2883 non-null float64
dtypes: bool(3), category(2), float64(15), int64(7)
memory usage: 533.0 KB