Test - Altair Viz

Testing Altair rendering.
python
altair
Author

Sandra Jurela

Published

May 20, 2023

Warning

This is a notebook about Vancouver street trees that is not yet finished!!

What do you think about it?

Imports

# Import libraries 
import altair as alt
import pandas as pd
import json

# Lift Altair's default 5,000-row limit so the full dataset can be embedded in the chart specification
alt.data_transformers.disable_max_rows()

trees_df = pd.read_csv('data/vancouver_trees.csv')

# Glance at the original df
trees_df
std_street on_street species_name neighbourhood_name date_planted diameter street_side_name genus_name assigned civic_number plant_area curb tree_id common_name height_range_id on_street_block cultivar_name root_barrier latitude longitude
0 W 13TH AV MAPLE ST PSEUDOPLATANUS Kitsilano NaN 9.00 EVEN ACER N 1996 10 Y 13310 SYCAMORE MAPLE 4 2900 NaN N 49.259856 -123.150586
1 WALES ST WALES ST PLATANOIDES Renfrew-Collingwood 2018-11-28 3.00 ODD ACER N 5291 7 Y 259084 PRINCETON GOLD MAPLE 1 5200 PRINCETON GOLD N 49.236650 -123.051831
2 W BROADWAY W BROADWAY RUBRUM Kitsilano 1996-04-19 14.00 EVEN ACER N 3618 C Y 167986 KARPICK RED MAPLE 3 3600 KARPICK N 49.264250 -123.184020
3 PENTICTON ST PENTICTON ST CALLERYANA Renfrew-Collingwood 2006-03-06 3.75 EVEN PYRUS N 2502 5 Y 213386 CHANTICLEER PEAR 1 2500 CHANTICLEER Y 49.261036 -123.052921
4 RHODES ST RHODES ST GLYPTOSTROBOIDES Renfrew-Collingwood 2001-11-01 3.00 ODD METASEQUOIA N 5639 N Y 189223 DAWN REDWOOD 2 5600 NaN N 49.233354 -123.050249
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
29995 ROBSON ST ROBSON ST CAMPESTRE West End NaN 7.00 ODD ACER N 1015 c Y 122814 HEDGE MAPLE 2 1000 NaN N 49.283666 -123.123231
29996 OSLER ST CONNAUGHT DRIVE PLATANOIDES Shaughnessy 2007-04-16 8.00 ODD ACER N 4690 10 Y 132211 NORWAY MAPLE 1 1000 NaN Y 49.243636 -123.129480
29997 BEATRICE ST BEATRICE ST CERASIFERA Victoria-Fraserview NaN 17.30 EVEN PRUNUS N 6218 9 Y 59355 PISSARD PLUM 3 6200 ATROPURPUREUM N 49.227406 -123.066936
29998 ANGUS DRIVE ANGUS DRIVE BILOBA Shaughnessy 2006-02-17 4.00 ODD GINKGO N 1551 9 Y 207753 GINKGO OR MAIDENHAIR TREE 1 1500 NaN Y 49.254431 -123.140382
29999 MAIN ST MAIN ST EUCHLORA X Riley Park NaN 12.00 ODD TILIA N 4323 C Y 92997 CRIMEAN LINDEN 4 4300 NaN N 49.246969 -123.101328

30000 rows × 20 columns


Identify and drop irrelevant columns

# Check columns of the original df
trees_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   std_street          30000 non-null  object 
 1   on_street           30000 non-null  object 
 2   species_name        30000 non-null  object 
 3   neighbourhood_name  30000 non-null  object 
 4   date_planted        14085 non-null  object 
 5   diameter            30000 non-null  float64
 6   street_side_name    30000 non-null  object 
 7   genus_name          30000 non-null  object 
 8   assigned            30000 non-null  object 
 9   civic_number        30000 non-null  int64  
 10  plant_area          29722 non-null  object 
 11  curb                30000 non-null  object 
 12  tree_id             30000 non-null  int64  
 13  common_name         30000 non-null  object 
 14  height_range_id     30000 non-null  int64  
 15  on_street_block     30000 non-null  int64  
 16  cultivar_name       16178 non-null  object 
 17  root_barrier        30000 non-null  object 
 18  latitude            30000 non-null  float64
 19  longitude           30000 non-null  float64
dtypes: float64(3), int64(4), object(13)
memory usage: 4.6+ MB

Based on the summary above and the dataset schema from the City of Vancouver Open Data Portal - Street Trees, the columns can be provisionally sorted into four groups:

  1. Biological classification and names, such as genus_name, species_name, common_name, cultivar_name
  2. Growth-related characteristics, such as date_planted, diameter, height_range_id
  3. Coordinates and areas, such as latitude, longitude, neighbourhood_name
  4. Other specific location / orientation / identification information

For the questions in this analysis, the columns in the fourth group and the tree coordinates (latitude, longitude) are irrelevant and will be dropped. To keep the focus on the highest level of tree classification, species_name, common_name and cultivar_name will also be dropped, leaving only genus_name.

trees_df = pd.read_csv('data/vancouver_trees.csv',
                       usecols=['neighbourhood_name',
                                'date_planted',
                                'diameter', 
                                'genus_name',
                                'height_range_id'],
                       parse_dates=['date_planted'])
                                    
trees_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   neighbourhood_name  30000 non-null  object        
 1   date_planted        14085 non-null  datetime64[ns]
 2   diameter            30000 non-null  float64       
 3   genus_name          30000 non-null  object        
 4   height_range_id     30000 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 1.1+ MB
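
To see what usecols and parse_dates do without loading the full file, the sketch below reads a tiny in-memory CSV. The three rows are a hypothetical excerpt assembled from the preview above (with the dropped columns represented by tree_id alone), and the names sample_csv, keep and sample_df are illustrative only.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt; the real file has 20 columns and 30,000 rows.
sample_csv = io.StringIO(
    "tree_id,genus_name,neighbourhood_name,date_planted,diameter,height_range_id\n"
    "13310,ACER,Kitsilano,,9.0,4\n"
    "259084,ACER,Renfrew-Collingwood,2018-11-28,3.0,1\n"
    "167986,ACER,Kitsilano,1996-04-19,14.0,3\n"
)

keep = ['neighbourhood_name', 'date_planted', 'diameter',
        'genus_name', 'height_range_id']

# usecols skips unwanted columns at read time; parse_dates converts the
# date strings to a datetime dtype, leaving missing dates as NaT.
sample_df = pd.read_csv(sample_csv, usecols=keep, parse_dates=['date_planted'])

print(sample_df.dtypes)
```

Selecting columns at read time keeps memory use down, which is consistent with the drop from 4.6+ MB to 1.1+ MB reported above.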

Deal with null values

Based on the df information, just over half of the values in date_planted are missing (15,915 of 30,000). Since reviewing tree growth is one of the objectives of this analysis, observations without a planting date are considered uninformative and are dropped from the original df.

This will not cause a problem with the representativeness of the data. Just trust me for now! :)

# Drop observations without value of date_planted
trees_df = trees_df.dropna(subset=['date_planted'])

trees_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 14085 entries, 1 to 29998
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   neighbourhood_name  14085 non-null  object        
 1   date_planted        14085 non-null  datetime64[ns]
 2   diameter            14085 non-null  float64       
 3   genus_name          14085 non-null  object        
 4   height_range_id     14085 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 660.2+ KB
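
One way to test the representativeness claim, rather than take it on trust, is to compare how a category such as genus_name is distributed before and after the rows are dropped. The helper name share_shift and the toy data below are made up for illustration; this is a sketch of the check, not a result for the real dataset.

```python
import pandas as pd

def share_shift(df, category, required):
    """Compare category shares before and after dropping rows where `required` is null."""
    before = df[category].value_counts(normalize=True)
    after = df.dropna(subset=[required])[category].value_counts(normalize=True)
    # Align on the union of categories; a category that disappears gets share 0.
    out = pd.DataFrame({'before': before, 'after': after}).fillna(0.0)
    out['shift'] = out['after'] - out['before']
    return out.sort_values('shift')

# Toy data: every TILIA row lacks a planting date, every PRUNUS row has one.
toy = pd.DataFrame({
    'genus_name': ['ACER', 'ACER', 'ACER', 'ACER',
                   'PRUNUS', 'PRUNUS', 'TILIA', 'TILIA'],
    'date_planted': pd.to_datetime(
        ['1996-04-19', None, '2007-04-16', None,
         '2001-11-01', '2006-03-06', None, None]),
})

shift_table = share_shift(toy, 'genus_name', 'date_planted')
print(shift_table)
```

Shifts close to zero for every category would support the claim; large shifts would suggest that the trees with a recorded planting date differ from the full sample.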

Further examination for erroneous data

# Examine the remaining columns in more detail
# (datetime_is_numeric applies to pandas 1.x; the argument was removed in pandas 2.0)
trees_df.describe(include='all', datetime_is_numeric=True)
neighbourhood_name date_planted diameter genus_name height_range_id
count 14085 14085 14085.000000 14085 14085.000000
unique 22 NaN NaN 68 NaN
top Renfrew-Collingwood NaN NaN ACER NaN
freq 1323 NaN NaN 3970 NaN
mean NaN 2003-09-20 17:40:42.172523904 6.352586 NaN 1.822932
min NaN 1989-10-27 00:00:00 0.000000 NaN 0.000000
25% NaN 1997-12-15 00:00:00 3.000000 NaN 1.000000
50% NaN 2003-04-01 00:00:00 5.000000 NaN 2.000000
75% NaN 2009-11-13 00:00:00 8.000000 NaN 2.000000
max NaN 2019-06-03 00:00:00 317.000000 NaN 9.000000
std NaN NaN 5.273737 NaN 0.983476


Note that the minimums of diameter and height_range_id are both zero. For height_range_id, 0 is a valid value that represents a height of 0 to 10 ft. The diameter, however, is measured at breast height and should never be 0, so these observations are treated as invalid and removed from the df.

# Index labels of rows with an invalid (zero) diameter
zero_diameter_idx = trees_df[trees_df['diameter'] == 0].index

trees_df = trees_df.drop(zero_diameter_idx)

trees_df.describe(include='all', datetime_is_numeric=True)
neighbourhood_name date_planted diameter genus_name height_range_id
count 14083 14083 14083.000000 14083 14083.000000
unique 22 NaN NaN 68 NaN
top Renfrew-Collingwood NaN NaN ACER NaN
freq 1323 NaN NaN 3970 NaN
mean NaN 2003-09-20 23:57:38.893701504 6.353489 NaN 1.822978
min NaN 1989-10-27 00:00:00 0.500000 NaN 0.000000
25% NaN 1997-12-15 12:00:00 3.000000 NaN 1.000000
50% NaN 2003-04-01 00:00:00 5.000000 NaN 2.000000
75% NaN 2009-11-13 00:00:00 8.000000 NaN 2.000000
max NaN 2019-06-03 00:00:00 317.000000 NaN 9.000000
std NaN NaN 5.273568 NaN 0.983520
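
An equivalent way to remove the invalid rows is a boolean mask, which avoids in-place modification and makes it easy to report how many rows were dropped. The sketch uses a made-up four-row frame; toy_trees, valid and n_removed are illustrative names.

```python
import pandas as pd

toy_trees = pd.DataFrame({
    'genus_name': ['ACER', 'PYRUS', 'TILIA', 'GINKGO'],
    'diameter': [3.0, 0.0, 12.0, 0.0],   # DBH in inches; 0 is not a valid measurement
})

# Keep only rows with a positive diameter.
valid = toy_trees[toy_trees['diameter'] > 0]

n_removed = len(toy_trees) - len(valid)
print(f"Removed {n_removed} rows with zero diameter")
```

Unlike the == 0 test, the > 0 condition would also drop negative or missing diameters, so it is slightly stricter; on the data summarized above both approaches should remove the same two rows.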


Dataset Description

The cleaned target dataframe trees_df has 5 columns and contains 14,083 trees (observations) across 68 distinct genera. According to the City of Vancouver Open Data Portal - Street Trees, where the dataset was originally obtained, the columns are briefly described below:

  • Categorical columns

neighbourhood_name: City’s defined local area in which the tree is located.

genus_name: Genus name of trees.

  • Quantitative columns

diameter: DBH in inches (DBH stands for diameter of tree at breast height).

height_range_id: Height class from 0 to 10, one class per 10 feet (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft).

  • Datetime columns

date_planted: The date of planting.
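
The height_range_id coding described above can be turned into readable labels with a small helper. The function name height_range_label is hypothetical, and the mapping assumes the pattern given in the column description (each id covers 10 ft, with 10 meaning 100 ft or more).

```python
def height_range_label(range_id: int) -> str:
    """Translate a height_range_id (0-10) into a readable range in feet."""
    if not 0 <= range_id <= 10:
        raise ValueError(f"height_range_id must be between 0 and 10, got {range_id}")
    if range_id == 10:
        return "100+ ft"          # top class is open-ended
    low = range_id * 10
    return f"{low}-{low + 10} ft"

for rid in (0, 1, 2, 10):
    print(rid, "->", height_range_label(rid))
```

Such labels could be used in chart axes or tooltips in place of the raw ids.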

Q1

Q2

In the past 30 years, how many trees have been planted every single year? And what is the number of trees planted by genus each year?

The dataframe provides the specific planting date for each tree. From this, the age of each tree is calculated first, and then the number of trees planted per year is plotted. Since day-level precision is not a priority here, only the planting year is extracted and used to calculate the age as of 2021.

# Extract year of planted and calculate age of trees
trees_df_yr = trees_df.assign(year=trees_df['date_planted'].dt.year)

# Calculate the age of trees till 2021.
trees_df_age = trees_df_yr.assign(age=(2021-trees_df_yr['year']))

# Plot distribution of number of trees planted by year
plot_2_title = alt.TitleParams(
    "Figure 2 Number of street trees planted each year",
    subtitle="(Data available from 1989 to 2019)")

plot_2_year = alt.Chart(trees_df_age).mark_bar().encode(
    alt.X('year:N',title=None),
    alt.Y('count():Q',title='Number of trees planted')).properties(title=plot_2_title)

plot_2_year

Figure 2 indicates that the City of Vancouver's street tree planting peaked between 1995 and 2013. Within this period, the highest single-year totals were in 1998 and 2013. Before 1995 and after 2014, the number of trees planted was noticeably lower, especially in 2016, when fewer than 50 new trees were planted on public boulevards in Vancouver. Urban forestry is a systemic undertaking: how many trees are planted depends on a range of factors, such as the public budget and tree replacement plans driven by species distribution, insects, diseases, or environmental stress. The figure suggests that the City of Vancouver has maintained an active public tree planting program that benefits the well-being of Vancouver residents.
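
The peaks and troughs read off Figure 2 can be confirmed numerically by tabulating the same counts in pandas. The sketch below uses a handful of made-up planting years (toy_years is an illustrative name); on the real data the same calls would be applied to trees_df_age['year'].

```python
import pandas as pd

# Made-up planting years standing in for trees_df_age['year'].
toy_years = pd.Series([1998, 1998, 1998, 2013, 2013, 2016], name='year')

# Count trees per planting year, ordered chronologically.
per_year = toy_years.value_counts().sort_index()

print(per_year)
print("Busiest year:", per_year.idxmax())
print("Quietest year:", per_year.idxmin())
```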

Next, the number of trees planted per year is plotted again, this time with a dropdown selection by genus.

# Specify the subtitle color and bold it to draw attention
plot_3_title = alt.TitleParams(
    "Figure 3 Number of street trees planted each year by genus (from 1989 to 2019)",
    subtitle = "Dropdown selection is available by genus",
    subtitleColor='steelblue', subtitleFontWeight='bold')

genus = sorted(trees_df_age['genus_name'].unique())

dropdown_genus = alt.binding_select(name='Genus', options=genus)

select_genus = alt.selection_single(fields=['genus_name'], bind=dropdown_genus)

plot_3_genus_year_bar = alt.Chart(trees_df_age).mark_bar().encode(
    alt.X('year:N',title=None),
    alt.Y('count():Q',stack=False,title='Number of trees planted per genus'),
    alt.Color('genus_name:N',title='Genus name')
).add_selection(select_genus).encode(
    opacity=alt.condition(select_genus, alt.value(0.9), alt.value(0.0))
).properties(title=plot_3_title)

plot_3_genus_year_bar


Figure 3, with its dropdown selection, combines the required information in a single plot and lets readers efficiently explore the number of trees planted from 1989 to 2019 for each genus.
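
For readers who prefer numbers to an interactive chart, the counts behind each dropdown choice can be laid out as a year-by-genus table. This is a sketch on made-up rows (toy and by_genus_year are illustrative names); on the real data the inputs would be trees_df_age['year'] and trees_df_age['genus_name'].

```python
import pandas as pd

toy = pd.DataFrame({
    'year': [1998, 1998, 1998, 2013, 2013, 2016],
    'genus_name': ['ACER', 'ACER', 'PRUNUS', 'ACER', 'TILIA', 'PRUNUS'],
})

# Rows are planting years, columns are genera, cells are tree counts.
by_genus_year = pd.crosstab(toy['year'], toy['genus_name'])

print(by_genus_year)
print(by_genus_year['ACER'])   # the yearly counts the dropdown would show for ACER
```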

The answer to question 2 would be a valuable reference for members of the public who are interested in the history of tree planting in Vancouver. For researchers, it also offers first-hand insight into urban forestry and street tree replacement planning.

To be continued…