Streamlit app - Airbnb Amsterdam Listings

Cleaning data with Pandas and creating a streamlit app.
python
sreamlit
data cleaning
data wrangling
Author

Sandra Jurela

Published

May 25, 2023

Introduction

Data overview

Setting up the programming environment

# import libraries
import numpy as np
import pandas as pd

# show all columns (instead of cascading columns in the middle)
pd.set_option("display.max_columns", None)
# don"t show numbers in scientific notation
pd.set_option("display.float_format", "{:.2f}".format)

# read the Python Pickle and PyArrow Parquet files
df_list = pd.read_pickle("data/airbnb_amsterdam_listings.pkl")
df_cal = pd.read_parquet("data/airbnb_amsterdam_calendar.parquet", engine="pyarrow")

Listings data

df_list.head()
id host_acceptance_rate host_is_superhost host_listings_count host_total_listings_count neighbourhood_cleansed latitude longitude room_type accommodates bedrooms beds amenities price minimum_nights maximum_nights has_availability availability_30 availability_60 availability_90 availability_365 number_of_reviews number_of_reviews_ltm number_of_reviews_l30d review_scores_rating instant_bookable reviews_per_month price_in_euros price_per_person minimum_price discount_per_5_days_booked discount_per_10_days_booked discount_per_30_and_more_days_booked service_cost
0 23726706 0.95 f 1 1 IJburg - Zeeburgereiland 52.35 4.98 Private room 2 1.00 1.00 6 $88.00 2 14 t 0 6 20 66 78 11 3 4.99 t 1.53 None $44 $176 5% 11% 16% $4.99
1 35815036 1.00 t 1 1 Noord-Oost 52.42 4.96 Entire home/apt 2 NaN 1.00 5 $105.00 3 100 t 4 6 12 243 95 36 6 4.96 f 2.65 None $52.5 $315 5% 12% 16% $4.99
2 31553121 1.00 f 1 1 Noord-West 52.43 4.92 Entire home/apt 4 1.00 3.00 3 $152.00 2 60 t 0 3 3 3 82 26 1 4.74 f 2.02 None $38 $304 7% 11% 22% $4.99
3 34745823 0.94 f 3 3 Gaasperdam - Driemond 52.30 5.01 Entire home/apt 2 1.00 2.00 8 $87.00 2 1125 t 5 20 26 290 39 4 0 4.87 f 1.08 None $43.5 $174 6% 10% 15% $4.99
4 44586947 0.88 t 0 0 Gaasperdam - Driemond 52.31 5.03 Private room 4 2.00 3.00 4 $160.00 2 31 t 9 32 62 152 15 12 3 5.00 f 0.68 None $40 $320 9% 22% 20% $4.99


This dataset needs extensive cleaning which will be performed in the next step.

Data dictionary for the listings data

Here is the definition of the listings data columns:

  • id - Airbnb’s unique identifier for the listing
  • host_acceptance_rate - rate at which a host accepts booking request
  • host_is_superhost - some Airbnb hosts are considered better than others, which this feature displays
  • host_listings_count - the number of listings the host has (per Airbnb calculations)
  • host_total_listings_count - the number of listings the host has (per Airbnb calculations)
  • neighborhood_cleansed - all neighborhoods present in Amsterdam
  • latitude - geographical coordinate going from North to South
  • longitude - geographical coordinate going from East to West
  • room_type - all room types available in Amsterdam on Airbnb
  • accommodates - maximum number of people that can fit in the Airbnb listing
  • bedrooms - number of bedrooms
  • beds - number of beds (does not reveal if they are two-person or one-person beds)
  • amenities - number of amenities present in the Airbnb listing
  • price - price per day in US dollars
  • minimum_nights - minimum number of nights you need to book for this listing
  • maximum_nights - maximum number of nights you are allowed to book this listing
  • has_availability - displays if this listing can be booked
  • availability_30 - number of available spots in the next 30 days
  • number_of_reviews_l30d - number of reviews created in the last 30 days
  • review_scores_rating - average rating given to the Airbnb listing
  • instant_bookable - whether the guest can automatically book the listing without the host requiring to accept their booking request. An indicator of a commercial listing.
  • price_per_person - price per person per night in US dollars
  • minimum_price - price in US dollars times minimum nights
  • discount_per_… - displays the discount when 5, 10, or 30+ days are booked
  • service_cost - total service cost paid at booking

Calendar data

df_cal.head()
listing_id date available price_in_dollar minimum_nights maximum_nights
0 23726706 2022-06-05 False 90.00 2 1125
1 23726706 2022-06-06 False 90.00 2 1125
2 23726706 2022-06-07 False 90.00 2 1125
3 23726706 2022-06-08 False 90.00 2 1125
4 23726706 2022-06-09 False 85.00 2 1125

Data dictionary for the calendar data

Here is the definition of the calendar data columns:

  • listing_id - Airbnb’s unique identifier for the listing
  • date - the date in the listing’s calendar
  • available - whether the date is available for a booking
  • price_in_dollar - the price listed for the day
  • minimum_nights - minimum nights for a booking made on this day
  • maximum_nights - maximum nights for a booking made on this day

Cleaning the Listings DataFrame

To get an overview of the Listings DataFrame, called df_list, we’ll need some details about the columns in the DataFrame, like the column names, their data types, and the number of non-null values.

df_list.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6165 entries, 0 to 6172
Data columns (total 34 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   id                                    6165 non-null   int64  
 1   host_acceptance_rate                  5365 non-null   float64
 2   host_is_superhost                     6165 non-null   object 
 3   host_listings_count                   6165 non-null   int64  
 4   host_total_listings_count             6165 non-null   int64  
 5   neighbourhood_cleansed                6165 non-null   object 
 6   latitude                              6165 non-null   float64
 7   longitude                             6165 non-null   float64
 8   room_type                             6165 non-null   object 
 9   accommodates                          6165 non-null   int64  
 10  bedrooms                              5859 non-null   float64
 11  beds                                  6082 non-null   float64
 12  amenities                             6165 non-null   int64  
 13  price                                 6165 non-null   object 
 14  minimum_nights                        6165 non-null   int64  
 15  maximum_nights                        6165 non-null   int64  
 16  has_availability                      6165 non-null   object 
 17  availability_30                       6165 non-null   int64  
 18  availability_60                       6165 non-null   int64  
 19  availability_90                       6165 non-null   int64  
 20  availability_365                      6165 non-null   int64  
 21  number_of_reviews                     6165 non-null   int64  
 22  number_of_reviews_ltm                 6165 non-null   int64  
 23  number_of_reviews_l30d                6165 non-null   int64  
 24  review_scores_rating                  5581 non-null   float64
 25  instant_bookable                      6165 non-null   object 
 26  reviews_per_month                     5581 non-null   float64
 27  price_in_euros                        0 non-null      object 
 28  price_per_person                      6165 non-null   object 
 29  minimum_price                         6165 non-null   object 
 30  discount_per_5_days_booked            6165 non-null   object 
 31  discount_per_10_days_booked           6165 non-null   object 
 32  discount_per_30_and_more_days_booked  6165 non-null   object 
 33  service_cost                          6165 non-null   object 
dtypes: float64(7), int64(14), object(13)
memory usage: 1.6+ MB
df_list = pd.read_pickle("data/airbnb_amsterdam_listings.pkl")

STEP 1 - Perform this four-step process to change each of the three discount_per_... columns into their proper format:

  • Remove non-numeric characters, like the percent symbol, so you can perform mathematical calculations on the column
  • Change the column into a float data type in order to convert the data into a ratio
  • Multiply the whole column by 0.01 so you end up with a probability ratio instead of a percentage
  • Overwrite the old discount_per_… column with this new column
df_list.loc[:,'discount_per_5_days_booked':'discount_per_30_and_more_days_booked'] = (
    df_list.loc[:,'discount_per_5_days_booked':'discount_per_30_and_more_days_booked']
    .replace('%', '', regex=True)
    .astype('float')
    .multiply(0.01)
)

df_list.loc[:,'discount_per_5_days_booked':'discount_per_30_and_more_days_booked'].head()
discount_per_5_days_booked discount_per_10_days_booked discount_per_30_and_more_days_booked
0 0.05 0.11 0.16
1 0.05 0.12 0.16
2 0.07 0.11 0.22
3 0.06 0.10 0.15
4 0.09 0.22 0.20

STEP 2 - The columns host_is_superhost, instant_bookable, and has_availability are all boolean columns in the sense that their data represents true and false values, but currently are recognized as objects.

df_list[["host_is_superhost", "instant_bookable", "has_availability"]].head()
host_is_superhost instant_bookable has_availability
0 f t t
1 t f t
2 f f t
3 f f t
4 t f t

The letters in these columns (t and f) are written as strings and not as boolean data types. This means we need to replace our string values with the boolean equivalent dtype.

Change the columns host_is_superhost, instant_bookable, and has_availability into a boolean data type for better data processing:

  • Replace f and t with False and True
  • Set the column as type bool
  • Overwrite the old columns with the new values
df_list[["host_is_superhost", "instant_bookable", "has_availability"]] = (
    df_list[["host_is_superhost", "instant_bookable", "has_availability"]]
    .replace({"f": False, "t": True})
    .astype('bool')
)

df_list[["host_is_superhost", "instant_bookable", "has_availability"]].head()
host_is_superhost instant_bookable has_availability
0 False True True
1 True False True
2 False False True
3 False False True
4 True False True

STEP 3 - A closer look at the prices in the four columns price, price_per_person, minimum_price, and service_cost reveals that they all follow the same pattern:

df_list[["price", "price_per_person", "minimum_price", 'service_cost']].head()
price price_per_person minimum_price service_cost
0 $88.00 $44 $176 $4.99
1 $105.00 $52.5 $315 $4.99
2 $152.00 $38 $304 $4.99
3 $87.00 $43.5 $174 $4.99
4 $160.00 $40 $320 $4.99
df_list['price'] = (
    df_list['price']
    .str.replace('$', '', regex=True)
    .str.replace(',', '', regex=True)
    .astype('float')
)

df_list['price_per_person'] = (
    df_list['price_per_person']
    .str.replace('$', '', regex=True)
    .str.replace(',', '', regex=True)
    .astype('float')
)

df_list['minimum_price'] = (
    df_list['minimum_price']
    .str.replace('$', '', regex=True)
    .str.replace(',', '', regex=True)
    .astype('float')
)

df_list['service_cost'] = (
    df_list['service_cost']
    .str.replace('$', '', regex=True)
    .str.replace(',', '', regex=True)
    .astype('float')
)

df_list[["price", "price_per_person", "minimum_price", 'service_cost']].head()
price price_per_person minimum_price service_cost
0 88.00 44.00 176.00 4.99
1 105.00 52.50 315.00 4.99
2 152.00 38.00 304.00 4.99
3 87.00 43.50 174.00 4.99
4 160.00 40.00 320.00 4.99

STEP 4 - Renaming columns

The following column names need to be changed:

  • price into price_in_dollar
  • neighbourhood_cleansed into neighbourhood
df_list = df_list.rename(
  columns={
    "price": "price_in_dollar", 
    "neighbourhood_cleansed": "neighbourhood"
  }
)

STEP 5 Taking a closer look at the neighbourhood and room_type columns reveals that these columns are assigned an object dtype. We want them to be a category dtype.

df_list[['neighbourhood', 'room_type']] = (
    df_list[['neighbourhood', 'room_type']]
    .astype('category')
)

STEP 6 - Deleting irrelevant columns

We need to delete some columns that are irrelevant to this current use case.

Let’s inspect the price_in_euros column first, because this column seems to contain only null values, which inherently do not add any meaning to the dataset.

df_list['price_in_euros'].unique()
array([None], dtype=object)

The approach reveals that this column contains no unique values and is thus empty. We’ll trop it with other irrelevant columns.

df_list = df_list.drop(
    columns=[
        'price_in_euros',
        'host_listings_count', 
        'host_total_listings_count', 
        'availability_60',
        'availability_90', 
        'availability_365', 
        'number_of_reviews',
        'number_of_reviews_ltm', 
        'reviews_per_month'
    ]
)

Let’s now have a look at which data types we still need to change and which columns have some null values.

df_list.info(verbose=True, show_counts=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6165 entries, 0 to 6172
Data columns (total 25 columns):
 #   Column                                Non-Null Count  Dtype   
---  ------                                --------------  -----   
 0   id                                    6165 non-null   int64   
 1   host_acceptance_rate                  5365 non-null   float64 
 2   host_is_superhost                     6165 non-null   bool    
 3   neighbourhood                         6165 non-null   category
 4   latitude                              6165 non-null   float64 
 5   longitude                             6165 non-null   float64 
 6   room_type                             6165 non-null   category
 7   accommodates                          6165 non-null   int64   
 8   bedrooms                              5859 non-null   float64 
 9   beds                                  6082 non-null   float64 
 10  amenities                             6165 non-null   int64   
 11  price_in_dollar                       6165 non-null   float64 
 12  minimum_nights                        6165 non-null   int64   
 13  maximum_nights                        6165 non-null   int64   
 14  has_availability                      6165 non-null   bool    
 15  availability_30                       6165 non-null   int64   
 16  number_of_reviews_l30d                6165 non-null   int64   
 17  review_scores_rating                  5581 non-null   float64 
 18  instant_bookable                      6165 non-null   bool    
 19  price_per_person                      6165 non-null   float64 
 20  minimum_price                         6165 non-null   float64 
 21  discount_per_5_days_booked            6165 non-null   float64 
 22  discount_per_10_days_booked           6165 non-null   float64 
 23  discount_per_30_and_more_days_booked  6165 non-null   float64 
 24  service_cost                          6165 non-null   float64 
dtypes: bool(3), category(2), float64(13), int64(7)
memory usage: 1.0 MB

STEP 7 - Some listings have no reviews and an unknown host acceptance rate. Most Airbnb users exclude such listings from their search results. To mimic this filtering approach, we’ll filter out any rows that do not have a review_scores_rating and without a `host_acceptance_rate.

df_list = df_list.dropna(subset=['review_scores_rating', 'host_acceptance_rate'])

STEP 8 - Which columns still have missing values and how much?

df_list.isnull().sum().loc[lambda x : x != 0]
bedrooms    264
beds         69
dtype: int64

Let’s make some simple assumptions based on the room_type assigned to the listing. Which room types are found in the dataset.

df_list["room_type"].value_counts()
Entire home/apt    3158
Private room       1633
Hotel room           67
Shared room          28
Name: room_type, dtype: int64

There are four room types. Let’s make the assumption that the columns bedrooms and beds are potentially influenced by room_type.

Therefore, we can make the following rules:

  • If you have a “Private room” or “Shared room” as room_type, then we believe the listing only has one bedroom.
  • If the listing has “Hotel room” or “Entire home/apt” as room_type, then we can divide the number of guests the listing accomodates by 2 and round up.
  • If any of these numbers are missing, then we can leave it empty.

Translate these requirements into a Python function, and you get:

def fill_empty_bedrooms(accommodates: int, bedrooms: int, room_type: str) -> int:
    if (room_type == "Private room") or (room_type == "Shared room"):
        return 1
    elif (room_type == "Hotel room") or (room_type == "Entire home/apt"):
        return np.ceil(accommodates / 2)
    else:
        return bedrooms

In the case of Pandas, we are using apply() to semi-vectorize our function, but secretly this function just implements something that mimics a for loop. Using a lambda together with apply() allows us to access multiple columns to generate an outcome.

This approach is often good enough, but not always, especially if you are dealing with large datasets. Below we will run the apply() function for output.

df_list["bedrooms"] = df_list[["accommodates", "bedrooms", "room_type"]].apply(
    lambda x: fill_empty_bedrooms(x["accommodates"], x["bedrooms"], x["room_type"]),
    axis=1,
)

Thanks to our logic and assumptions, most listings now have a proper amount of defined rooms. However, there are still a few listings without any number of rooms defined. Remove all rows/entries that have an empty bedrooms, beds.

df_list[["bedrooms", "beds"]].isnull().sum()
bedrooms     0
beds        69
dtype: int64
df_list = df_list.dropna(subset=["beds"])
df_list[["bedrooms", "beds"]].isnull().sum()
bedrooms    0
beds        0
dtype: int64
Memory usage
df_list.memory_usage(deep=True).sum()
835941

Calendar DataFrame

df_cal.head(3)
listing_id date available price_in_dollar minimum_nights maximum_nights
0 23726706 2022-06-05 False 90.00 2 1125
1 23726706 2022-06-06 False 90.00 2 1125
2 23726706 2022-06-07 False 90.00 2 1125

Minimum stay

You are looking to stay for a minimum of 3 days. Since you are unsure when it will be booked, you’d like to exclude all listing_ids that go below that threshold of 3 days no matter what time of year. This is because you think that bookings with a minimum stay of 3 days are more likely to have discount prices.

With these excluded, you would like to see the expected booking price for 5 days:

  • Using the helper code, create a conditional index for entries listing_id that are above the 3 day threshold
  • Remove them using .loc[conditional_index] over the conditional index.
  • Calculate the price of booking a listing for 5 days by multiplying the current day multiplied by 5, and assign this to a column called five_day_dollar_price
# First start by making a copy, for debugging purposes
calendar_newdf = df_cal.copy()

include_list = calendar_newdf["minimum_nights"] >= 3

# Get all the listings with a minimum nights of 3+
# Use the include_list
calendar_newdf = calendar_newdf.loc[include_list]

calendar_newdf["five_day_dollar_price"] = calendar_newdf["price_in_dollar"] * 5
calendar_newdf.head(3)
listing_id date available price_in_dollar minimum_nights maximum_nights five_day_dollar_price
365 35815036 2022-06-05 False 105.00 3 1125 525.00
367 35815036 2022-06-07 True 105.00 3 1125 525.00
368 35815036 2022-06-08 True 105.00 3 1125 525.00

Now let’s transform our newly created DataFrame into a pivot table, where we aggregate our rows using the listing_id as the index, and the columns available and five_day_dollar_price as values.

calendar_summarizeddf = pd.pivot_table(
    data=calendar_newdf,
    index="listing_id",
    values=["available", "five_day_dollar_price"],
    aggfunc=np.mean
)                      # .add_suffix('_avg')

calendar_summarizeddf.head()
available five_day_dollar_price
listing_id
2818 0.21 346.90
44391 0.00 1200.00
49552 0.46 1162.50
55709 0.00 818.68
55868 0.00 749.22

Maximum price and date

Can you make a pivot table that states the maximum price_in_dollar for every Airbnb listing?

temp_sum_df = pd.pivot_table(
    data=calendar_newdf,
    index=["listing_id"],
    values=["price_in_dollar"],
    aggfunc=np.max
).add_suffix('_max')

temp_sum_df.head()
price_in_dollar_max
listing_id
2818 80.00
44391 240.00
49552 300.00
55709 250.00
55868 200.00

Merging

We are going to merge the pivot table that includes the five_day_dollar_price for each listing with listings_df. We have to keep in mind that we want to keep only those rows of Airbnb listing IDs that are present in both datasets.

final_df = pd.merge(
    df_list,
    calendar_summarizeddf,
    left_on=["id"],
    right_on=["listing_id"],
    how="inner"
)

final_df.head(3)
id host_acceptance_rate host_is_superhost neighbourhood latitude longitude room_type accommodates bedrooms beds amenities price_in_dollar minimum_nights maximum_nights has_availability availability_30 number_of_reviews_l30d review_scores_rating instant_bookable price_per_person minimum_price discount_per_5_days_booked discount_per_10_days_booked discount_per_30_and_more_days_booked service_cost available five_day_dollar_price
0 35815036 1.00 True Noord-Oost 52.42 4.96 Entire home/apt 2 1.00 1.00 5 105.00 3 100 True 4 6 4.96 False 52.50 315.00 0.05 0.12 0.16 4.99 0.66 528.89
1 19572024 1.00 False Watergraafsmeer 52.31 4.91 Entire home/apt 6 3.00 6.00 14 279.00 3 300 True 6 3 4.69 False 46.50 837.00 0.09 0.16 0.14 4.99 0.82 1496.55
2 2973384 0.38 False Watergraafsmeer 52.31 4.91 Entire home/apt 5 3.00 3.00 7 185.00 6 21 True 0 0 4.83 False 37.00 1110.00 0.06 0.12 0.18 4.99 0.04 941.30

Grouping

Now, let’s perform a groupby where we look at the median values of five_day_dollar_price and review_scores_rating with respect to the room_type. Do these results match your intuition?

final_df.groupby(by='room_type')[['review_scores_rating', 'five_day_dollar_price']].median()
(final_df
    .groupby(by='room_type')
    .agg(review_scores_rating_median=('review_scores_rating', 'median'),
         five_day_dollar_price_median=('five_day_dollar_price', 'median')
    )
)
review_scores_rating_median five_day_dollar_price_median
room_type
Entire home/apt 4.88 975.00
Hotel room 4.56 1110.16
Private room 4.79 710.91
Shared room 4.60 724.11

You might have expected that shared rooms are the cheapest and thus have the lowest rating with respect to median scores. The same can’t be said for the most expensive option — a hotel room. Will this influence your future considerations when booking 🤔?

(But before you let this influence your decisions too much, it might be better to assume that this data might be biased in favor of Airbnb and not hotels in general. 🤷)

Memory optimization

final_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2883 entries, 0 to 2882
Data columns (total 27 columns):
 #   Column                                Non-Null Count  Dtype   
---  ------                                --------------  -----   
 0   id                                    2883 non-null   int64   
 1   host_acceptance_rate                  2883 non-null   float64 
 2   host_is_superhost                     2883 non-null   bool    
 3   neighbourhood                         2883 non-null   category
 4   latitude                              2883 non-null   float64 
 5   longitude                             2883 non-null   float64 
 6   room_type                             2883 non-null   category
 7   accommodates                          2883 non-null   int64   
 8   bedrooms                              2883 non-null   float64 
 9   beds                                  2883 non-null   float64 
 10  amenities                             2883 non-null   int64   
 11  price_in_dollar                       2883 non-null   float64 
 12  minimum_nights                        2883 non-null   int64   
 13  maximum_nights                        2883 non-null   int64   
 14  has_availability                      2883 non-null   bool    
 15  availability_30                       2883 non-null   int64   
 16  number_of_reviews_l30d                2883 non-null   int64   
 17  review_scores_rating                  2883 non-null   float64 
 18  instant_bookable                      2883 non-null   bool    
 19  price_per_person                      2883 non-null   float64 
 20  minimum_price                         2883 non-null   float64 
 21  discount_per_5_days_booked            2883 non-null   float64 
 22  discount_per_10_days_booked           2883 non-null   float64 
 23  discount_per_30_and_more_days_booked  2883 non-null   float64 
 24  service_cost                          2883 non-null   float64 
 25  available                             2883 non-null   float64 
 26  five_day_dollar_price                 2883 non-null   float64 
dtypes: bool(3), category(2), float64(15), int64(7)
memory usage: 533.0 KB

Streamlit application