Sandra Jurela - Streamlit app - Airbnb Amsterdam Listings

Introduction

Data overview

Setting up the programming environment

# import libraries
import numpy as np
import pandas as pd

# show all columns (instead of cascading columns in the middle)
pd.set_option("display.max_columns", None)
# don"t show numbers in scientific notation
pd.set_option("display.float_format", "{:.2f}".format)

# read the Python Pickle and PyArrow Parquet files
df_list = pd.read_pickle("data/airbnb_amsterdam_listings.pkl")
df_cal = pd.read_parquet("data/airbnb_amsterdam_calendar.parquet", engine="pyarrow")

Listings data

df_list.head()

	id	host_acceptance_rate	host_is_superhost	host_listings_count	host_total_listings_count	neighbourhood_cleansed	latitude	longitude	room_type	accommodates	bedrooms	beds	amenities	price	minimum_nights	maximum_nights	has_availability	availability_30	availability_60	availability_90	availability_365	number_of_reviews	number_of_reviews_ltm	number_of_reviews_l30d	review_scores_rating	instant_bookable	reviews_per_month	price_in_euros	price_per_person	minimum_price	discount_per_5_days_booked	discount_per_10_days_booked	discount_per_30_and_more_days_booked	service_cost
0	23726706	0.95	f	1	1	IJburg - Zeeburgereiland	52.35	4.98	Private room	2	1.00	1.00	6	$88.00	2	14	t	0	6	20	66	78	11	3	4.99	t	1.53	None	$44	$176	5%	11%	16%	$4.99
1	35815036	1.00	t	1	1	Noord-Oost	52.42	4.96	Entire home/apt	2	NaN	1.00	5	$105.00	3	100	t	4	6	12	243	95	36	6	4.96	f	2.65	None	$52.5	$315	5%	12%	16%	$4.99
2	31553121	1.00	f	1	1	Noord-West	52.43	4.92	Entire home/apt	4	1.00	3.00	3	$152.00	2	60	t	0	3	3	3	82	26	1	4.74	f	2.02	None	$38	$304	7%	11%	22%	$4.99
3	34745823	0.94	f	3	3	Gaasperdam - Driemond	52.30	5.01	Entire home/apt	2	1.00	2.00	8	$87.00	2	1125	t	5	20	26	290	39	4	0	4.87	f	1.08	None	$43.5	$174	6%	10%	15%	$4.99
4	44586947	0.88	t	0	0	Gaasperdam - Driemond	52.31	5.03	Private room	4	2.00	3.00	4	$160.00	2	31	t	9	32	62	152	15	12	3	5.00	f	0.68	None	$40	$320	9%	22%	20%	$4.99

This dataset needs extensive cleaning which will be performed in the next step.

Data dictionary for the listings data

Here is the definition of the listings data columns:

id - Airbnb’s unique identifier for the listing
host_acceptance_rate - rate at which a host accepts booking request
host_is_superhost - some Airbnb hosts are considered better than others, which this feature displays
host_listings_count - the number of listings the host has (per Airbnb calculations)
host_total_listings_count - the number of listings the host has (per Airbnb calculations)
neighborhood_cleansed - all neighborhoods present in Amsterdam
latitude - geographical coordinate going from North to South
longitude - geographical coordinate going from East to West
room_type - all room types available in Amsterdam on Airbnb
accommodates - maximum number of people that can fit in the Airbnb listing
bedrooms - number of bedrooms
beds - number of beds (does not reveal if they are two-person or one-person beds)
amenities - number of amenities present in the Airbnb listing
price - price per day in US dollars
minimum_nights - minimum number of nights you need to book for this listing
maximum_nights - maximum number of nights you are allowed to book this listing
has_availability - displays if this listing can be booked
availability_30 - number of available spots in the next 30 days
number_of_reviews_l30d - number of reviews created in the last 30 days
review_scores_rating - average rating given to the Airbnb listing
instant_bookable - whether the guest can automatically book the listing without the host requiring to accept their booking request. An indicator of a commercial listing.
price_per_person - price per person per night in US dollars
minimum_price - price in US dollars times minimum nights
discount_per_… - displays the discount when 5, 10, or 30+ days are booked
service_cost - total service cost paid at booking

Calendar data

df_cal.head()

	listing_id	date	available	price_in_dollar	minimum_nights	maximum_nights
0	23726706	2022-06-05	False	90.00	2	1125
1	23726706	2022-06-06	False	90.00	2	1125
2	23726706	2022-06-07	False	90.00	2	1125
3	23726706	2022-06-08	False	90.00	2	1125
4	23726706	2022-06-09	False	85.00	2	1125

Data dictionary for the calendar data

Here is the definition of the calendar data columns:

listing_id - Airbnb’s unique identifier for the listing
date - the date in the listing’s calendar
available - whether the date is available for a booking
price_in_dollar - the price listed for the day
minimum_nights - minimum nights for a booking made on this day
maximum_nights - maximum nights for a booking made on this day

Cleaning the Listings DataFrame

To get an overview of the Listings DataFrame, called df_list, we’ll need some details about the columns in the DataFrame, like the column names, their data types, and the number of non-null values.

df_list.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6165 entries, 0 to 6172
Data columns (total 34 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   id                                    6165 non-null   int64  
 1   host_acceptance_rate                  5365 non-null   float64
 2   host_is_superhost                     6165 non-null   object 
 3   host_listings_count                   6165 non-null   int64  
 4   host_total_listings_count             6165 non-null   int64  
 5   neighbourhood_cleansed                6165 non-null   object 
 6   latitude                              6165 non-null   float64
 7   longitude                             6165 non-null   float64
 8   room_type                             6165 non-null   object 
 9   accommodates                          6165 non-null   int64  
 10  bedrooms                              5859 non-null   float64
 11  beds                                  6082 non-null   float64
 12  amenities                             6165 non-null   int64  
 13  price                                 6165 non-null   object 
 14  minimum_nights                        6165 non-null   int64  
 15  maximum_nights                        6165 non-null   int64  
 16  has_availability                      6165 non-null   object 
 17  availability_30                       6165 non-null   int64  
 18  availability_60                       6165 non-null   int64  
 19  availability_90                       6165 non-null   int64  
 20  availability_365                      6165 non-null   int64  
 21  number_of_reviews                     6165 non-null   int64  
 22  number_of_reviews_ltm                 6165 non-null   int64  
 23  number_of_reviews_l30d                6165 non-null   int64  
 24  review_scores_rating                  5581 non-null   float64
 25  instant_bookable                      6165 non-null   object 
 26  reviews_per_month                     5581 non-null   float64
 27  price_in_euros                        0 non-null      object 
 28  price_per_person                      6165 non-null   object 
 29  minimum_price                         6165 non-null   object 
 30  discount_per_5_days_booked            6165 non-null   object 
 31  discount_per_10_days_booked           6165 non-null   object 
 32  discount_per_30_and_more_days_booked  6165 non-null   object 
 33  service_cost                          6165 non-null   object 
dtypes: float64(7), int64(14), object(13)
memory usage: 1.6+ MB

df_list = pd.read_pickle("data/airbnb_amsterdam_listings.pkl")

STEP 1 - Perform this four-step process to change each of the three discount_per_... columns into their proper format:

Remove non-numeric characters, like the percent symbol, so you can perform mathematical calculations on the column
Change the column into a float data type in order to convert the data into a ratio
Multiply the whole column by 0.01 so you end up with a probability ratio instead of a percentage
Overwrite the old discount_per_… column with this new column

df_list.loc[:,'discount_per_5_days_booked':'discount_per_30_and_more_days_booked'] = (
    df_list.loc[:,'discount_per_5_days_booked':'discount_per_30_and_more_days_booked']
    .replace('%', '', regex=True)
    .astype('float')
    .multiply(0.01)
)

df_list.loc[:,'discount_per_5_days_booked':'discount_per_30_and_more_days_booked'].head()

	discount_per_5_days_booked	discount_per_10_days_booked	discount_per_30_and_more_days_booked
0	0.05	0.11	0.16
1	0.05	0.12	0.16
2	0.07	0.11	0.22
3	0.06	0.10	0.15
4	0.09	0.22	0.20

STEP 2 - The columns host_is_superhost, instant_bookable, and has_availability are all boolean columns in the sense that their data represents true and false values, but currently are recognized as objects.

df_list[["host_is_superhost", "instant_bookable", "has_availability"]].head()

	host_is_superhost	instant_bookable	has_availability
0	f	t	t
1	t	f	t
2	f	f	t
3	f	f	t
4	t	f	t

The letters in these columns (t and f) are written as strings and not as boolean data types. This means we need to replace our string values with the boolean equivalent dtype.

Change the columns host_is_superhost, instant_bookable, and has_availability into a boolean data type for better data processing:

Replace f and t with False and True
Set the column as type bool
Overwrite the old columns with the new values

df_list[["host_is_superhost", "instant_bookable", "has_availability"]] = (
    df_list[["host_is_superhost", "instant_bookable", "has_availability"]]
    .replace({"f": False, "t": True})
    .astype('bool')
)

df_list[["host_is_superhost", "instant_bookable", "has_availability"]].head()

	host_is_superhost	instant_bookable	has_availability
0	False	True	True
1	True	False	True
2	False	False	True
3	False	False	True
4	True	False	True

STEP 3 - A closer look at the prices in the four columns price, price_per_person, minimum_price, and service_cost reveals that they all follow the same pattern:

df_list[["price", "price_per_person", "minimum_price", 'service_cost']].head()

	price	price_per_person	minimum_price	service_cost
0	$88.00	$44	$176	$4.99
1	$105.00	$52.5	$315	$4.99
2	$152.00	$38	$304	$4.99
3	$87.00	$43.5	$174	$4.99
4	$160.00	$40	$320	$4.99

df_list['price'] = (
    df_list['price']
    .str.replace('$', '', regex=True)
    .str.replace(',', '', regex=True)
    .astype('float')
)

df_list['price_per_person'] = (
    df_list['price_per_person']
    .str.replace('$', '', regex=True)
    .str.replace(',', '', regex=True)
    .astype('float')
)

df_list['minimum_price'] = (
    df_list['minimum_price']
    .str.replace('$', '', regex=True)
    .str.replace(',', '', regex=True)
    .astype('float')
)

df_list['service_cost'] = (
    df_list['service_cost']
    .str.replace('$', '', regex=True)
    .str.replace(',', '', regex=True)
    .astype('float')
)

df_list[["price", "price_per_person", "minimum_price", 'service_cost']].head()

	price	price_per_person	minimum_price	service_cost
0	88.00	44.00	176.00	4.99
1	105.00	52.50	315.00	4.99
2	152.00	38.00	304.00	4.99
3	87.00	43.50	174.00	4.99
4	160.00	40.00	320.00	4.99

STEP 4 - Renaming columns

The following column names need to be changed:

price into price_in_dollar
neighbourhood_cleansed into neighbourhood

df_list = df_list.rename(
  columns={
    "price": "price_in_dollar", 
    "neighbourhood_cleansed": "neighbourhood"
  }
)

STEP 5 Taking a closer look at the neighbourhood and room_type columns reveals that these columns are assigned an object dtype. We want them to be a category dtype.

df_list[['neighbourhood', 'room_type']] = (
    df_list[['neighbourhood', 'room_type']]
    .astype('category')
)

STEP 6 - Deleting irrelevant columns

We need to delete some columns that are irrelevant to this current use case.

Let’s inspect the price_in_euros column first, because this column seems to contain only null values, which inherently do not add any meaning to the dataset.

df_list['price_in_euros'].unique()

array([None], dtype=object)

The approach reveals that this column contains no unique values and is thus empty. We’ll trop it with other irrelevant columns.

df_list = df_list.drop(
    columns=[
        'price_in_euros',
        'host_listings_count', 
        'host_total_listings_count', 
        'availability_60',
        'availability_90', 
        'availability_365', 
        'number_of_reviews',
        'number_of_reviews_ltm', 
        'reviews_per_month'
    ]
)

Let’s now have a look at which data types we still need to change and which columns have some null values.

df_list.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6165 entries, 0 to 6172
Data columns (total 25 columns):
 #   Column                                Non-Null Count  Dtype   
---  ------                                --------------  -----   
 0   id                                    6165 non-null   int64   
 1   host_acceptance_rate                  5365 non-null   float64 
 2   host_is_superhost                     6165 non-null   bool    
 3   neighbourhood                         6165 non-null   category
 4   latitude                              6165 non-null   float64 
 5   longitude                             6165 non-null   float64 
 6   room_type                             6165 non-null   category
 7   accommodates                          6165 non-null   int64   
 8   bedrooms                              5859 non-null   float64 
 9   beds                                  6082 non-null   float64 
 10  amenities                             6165 non-null   int64   
 11  price_in_dollar                       6165 non-null   float64 
 12  minimum_nights                        6165 non-null   int64   
 13  maximum_nights                        6165 non-null   int64   
 14  has_availability                      6165 non-null   bool    
 15  availability_30                       6165 non-null   int64   
 16  number_of_reviews_l30d                6165 non-null   int64   
 17  review_scores_rating                  5581 non-null   float64 
 18  instant_bookable                      6165 non-null   bool    
 19  price_per_person                      6165 non-null   float64 
 20  minimum_price                         6165 non-null   float64 
 21  discount_per_5_days_booked            6165 non-null   float64 
 22  discount_per_10_days_booked           6165 non-null   float64 
 23  discount_per_30_and_more_days_booked  6165 non-null   float64 
 24  service_cost                          6165 non-null   float64 
dtypes: bool(3), category(2), float64(13), int64(7)
memory usage: 1.0 MB

STEP 7 - Some listings have no reviews and an unknown host acceptance rate. Most Airbnb users exclude such listings from their search results. To mimic this filtering approach, we’ll filter out any rows that do not have a review_scores_rating and without a `host_acceptance_rate.

df_list = df_list.dropna(subset=['review_scores_rating', 'host_acceptance_rate'])

STEP 8 - Which columns still have missing values and how much?

df_list.isnull().sum().loc[lambda x : x != 0]

bedrooms    264
beds         69
dtype: int64

Let’s make some simple assumptions based on the room_type assigned to the listing. Which room types are found in the dataset.

df_list["room_type"].value_counts()

Entire home/apt    3158
Private room       1633
Hotel room           67
Shared room          28
Name: room_type, dtype: int64

There are four room types. Let’s make the assumption that the columns bedrooms and beds are potentially influenced by room_type.

Therefore, we can make the following rules:

If you have a “Private room” or “Shared room” as room_type, then we believe the listing only has one bedroom.
If the listing has “Hotel room” or “Entire home/apt” as room_type, then we can divide the number of guests the listing accomodates by 2 and round up.
If any of these numbers are missing, then we can leave it empty.

Translate these requirements into a Python function, and you get:

def fill_empty_bedrooms(accommodates: int, bedrooms: int, room_type: str) -> int:
    if (room_type == "Private room") or (room_type == "Shared room"):
        return 1
    elif (room_type == "Hotel room") or (room_type == "Entire home/apt"):
        return np.ceil(accommodates / 2)
    else:
        return bedrooms

In the case of Pandas, we are using apply() to semi-vectorize our function, but secretly this function just implements something that mimics a for loop. Using a lambda together with apply() allows us to access multiple columns to generate an outcome.

This approach is often good enough, but not always, especially if you are dealing with large datasets. Below we will run the apply() function for output.

df_list["bedrooms"] = df_list[["accommodates", "bedrooms", "room_type"]].apply(
    lambda x: fill_empty_bedrooms(x["accommodates"], x["bedrooms"], x["room_type"]),
    axis=1,
)

Thanks to our logic and assumptions, most listings now have a proper amount of defined rooms. However, there are still a few listings without any number of rooms defined. Remove all rows/entries that have an empty bedrooms, beds.

df_list[["bedrooms", "beds"]].isnull().sum()

bedrooms     0
beds        69
dtype: int64

df_list = df_list.dropna(subset=["beds"])

df_list[["bedrooms", "beds"]].isnull().sum()

bedrooms    0
beds        0
dtype: int64

Memory usage

df_list.memory_usage(deep=True).sum()

Calendar DataFrame

df_cal.head(3)

	listing_id	date	available	price_in_dollar	minimum_nights	maximum_nights
0	23726706	2022-06-05	False	90.00	2	1125
1	23726706	2022-06-06	False	90.00	2	1125
2	23726706	2022-06-07	False	90.00	2	1125

Minimum stay

You are looking to stay for a minimum of 3 days. Since you are unsure when it will be booked, you’d like to exclude all listing_ids that go below that threshold of 3 days no matter what time of year. This is because you think that bookings with a minimum stay of 3 days are more likely to have discount prices.

With these excluded, you would like to see the expected booking price for 5 days:

Using the helper code, create a conditional index for entries listing_id that are above the 3 day threshold
Remove them using .loc[conditional_index] over the conditional index.
Calculate the price of booking a listing for 5 days by multiplying the current day multiplied by 5, and assign this to a column called five_day_dollar_price

# First start by making a copy, for debugging purposes
calendar_newdf = df_cal.copy()

include_list = calendar_newdf["minimum_nights"] >= 3

# Get all the listings with a minimum nights of 3+
# Use the include_list
calendar_newdf = calendar_newdf.loc[include_list]

calendar_newdf["five_day_dollar_price"] = calendar_newdf["price_in_dollar"] * 5

calendar_newdf.head(3)

	listing_id	date	available	price_in_dollar	minimum_nights	maximum_nights	five_day_dollar_price
365	35815036	2022-06-05	False	105.00	3	1125	525.00
367	35815036	2022-06-07	True	105.00	3	1125	525.00
368	35815036	2022-06-08	True	105.00	3	1125	525.00

Now let’s transform our newly created DataFrame into a pivot table, where we aggregate our rows using the listing_id as the index, and the columns available and five_day_dollar_price as values.

calendar_summarizeddf = pd.pivot_table(
    data=calendar_newdf,
    index="listing_id",
    values=["available", "five_day_dollar_price"],
    aggfunc=np.mean
)                      # .add_suffix('_avg')

calendar_summarizeddf.head()

	available	five_day_dollar_price
listing_id
2818	0.21	346.90
44391	0.00	1200.00
49552	0.46	1162.50
55709	0.00	818.68
55868	0.00	749.22

Maximum price and date

Can you make a pivot table that states the maximum price_in_dollar for every Airbnb listing?

temp_sum_df = pd.pivot_table(
    data=calendar_newdf,
    index=["listing_id"],
    values=["price_in_dollar"],
    aggfunc=np.max
).add_suffix('_max')

temp_sum_df.head()

	price_in_dollar_max
listing_id
2818	80.00
44391	240.00
49552	300.00
55709	250.00
55868	200.00

Merging

We are going to merge the pivot table that includes the five_day_dollar_price for each listing with listings_df. We have to keep in mind that we want to keep only those rows of Airbnb listing IDs that are present in both datasets.

final_df = pd.merge(
    df_list,
    calendar_summarizeddf,
    left_on=["id"],
    right_on=["listing_id"],
    how="inner"
)

final_df.head(3)

	id	host_acceptance_rate	host_is_superhost	neighbourhood	latitude	longitude	room_type	accommodates	bedrooms	beds	amenities	price_in_dollar	minimum_nights	maximum_nights	has_availability	availability_30	number_of_reviews_l30d	review_scores_rating	instant_bookable	price_per_person	minimum_price	discount_per_5_days_booked	discount_per_10_days_booked	discount_per_30_and_more_days_booked	service_cost	available	five_day_dollar_price
0	35815036	1.00	True	Noord-Oost	52.42	4.96	Entire home/apt	2	1.00	1.00	5	105.00	3	100	True	4	6	4.96	False	52.50	315.00	0.05	0.12	0.16	4.99	0.66	528.89
1	19572024	1.00	False	Watergraafsmeer	52.31	4.91	Entire home/apt	6	3.00	6.00	14	279.00	3	300	True	6	3	4.69	False	46.50	837.00	0.09	0.16	0.14	4.99	0.82	1496.55
2	2973384	0.38	False	Watergraafsmeer	52.31	4.91	Entire home/apt	5	3.00	3.00	7	185.00	6	21	True	0	0	4.83	False	37.00	1110.00	0.06	0.12	0.18	4.99	0.04	941.30

Grouping

Now, let’s perform a groupby where we look at the median values of five_day_dollar_price and review_scores_rating with respect to the room_type. Do these results match your intuition?

final_df.groupby(by='room_type')[['review_scores_rating', 'five_day_dollar_price']].median()

(final_df
    .groupby(by='room_type')
    .agg(review_scores_rating_median=('review_scores_rating', 'median'),
         five_day_dollar_price_median=('five_day_dollar_price', 'median')
    )
)

	review_scores_rating_median	five_day_dollar_price_median
room_type
Entire home/apt	4.88	975.00
Hotel room	4.56	1110.16
Private room	4.79	710.91
Shared room	4.60	724.11

You might have expected that shared rooms are the cheapest and thus have the lowest rating with respect to median scores. The same can’t be said for the most expensive option — a hotel room. Will this influence your future considerations when booking 🤔?

(But before you let this influence your decisions too much, it might be better to assume that this data might be biased in favor of Airbnb and not hotels in general. 🤷)

Memory optimization

final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2883 entries, 0 to 2882
Data columns (total 27 columns):
 #   Column                                Non-Null Count  Dtype   
---  ------                                --------------  -----   
 0   id                                    2883 non-null   int64   
 1   host_acceptance_rate                  2883 non-null   float64 
 2   host_is_superhost                     2883 non-null   bool    
 3   neighbourhood                         2883 non-null   category
 4   latitude                              2883 non-null   float64 
 5   longitude                             2883 non-null   float64 
 6   room_type                             2883 non-null   category
 7   accommodates                          2883 non-null   int64   
 8   bedrooms                              2883 non-null   float64 
 9   beds                                  2883 non-null   float64 
 10  amenities                             2883 non-null   int64   
 11  price_in_dollar                       2883 non-null   float64 
 12  minimum_nights                        2883 non-null   int64   
 13  maximum_nights                        2883 non-null   int64   
 14  has_availability                      2883 non-null   bool    
 15  availability_30                       2883 non-null   int64   
 16  number_of_reviews_l30d                2883 non-null   int64   
 17  review_scores_rating                  2883 non-null   float64 
 18  instant_bookable                      2883 non-null   bool    
 19  price_per_person                      2883 non-null   float64 
 20  minimum_price                         2883 non-null   float64 
 21  discount_per_5_days_booked            2883 non-null   float64 
 22  discount_per_10_days_booked           2883 non-null   float64 
 23  discount_per_30_and_more_days_booked  2883 non-null   float64 
 24  service_cost                          2883 non-null   float64 
 25  available                             2883 non-null   float64 
 26  five_day_dollar_price                 2883 non-null   float64 
dtypes: bool(3), category(2), float64(15), int64(7)
memory usage: 533.0 KB