This is a notebook about Vancouver street trees that is not yet finished!!
What do you think about it?
Imports
# Import libraries import altair as altimport pandas as pdimport json# embed the dataset within the visualization specificationalt.data_transformers.disable_max_rows()trees_df = pd.read_csv('data/vancouver_trees.csv')# Glance at the original dftrees_df
Trees biological classifications and names, such as genus_name, species_name, common_name, cultivar_name
Trees growth related characteristics, such as date_planted, diameter, height_range_id
Trees coordinates and areas, such as latitude, longitude, neighbourhood_name
Other specific location / orientation / identification information
As per the questions, the irrelevant columns under the fourth group and trees coordinates will be dropped. Also, to narrow down the focus to the highest level of tree classification, the columns of species_name, common_name, cultivar_name will also be dropped and only keep the column of genus_name.
Based on the df information, the column of date_planted has almost half of values missed. Since reviewing trees growth is one of the objectives in this analysis, any observations without date of planted will be considered as uninformative data and dropped from the original df.
This will not cause a problem with the representativeness of the data. Just trust me for now! :)
# Drop observations without value of date_plantedtrees_df = trees_df.dropna(subset=['date_planted'])trees_df.info()
# Exam the remaining columns for more detailstrees_df.describe(include='all', datetime_is_numeric=True)
neighbourhood_name
date_planted
diameter
genus_name
height_range_id
count
14085
14085
14085.000000
14085
14085.000000
unique
22
NaN
NaN
68
NaN
top
Renfrew-Collingwood
NaN
NaN
ACER
NaN
freq
1323
NaN
NaN
3970
NaN
mean
NaN
2003-09-20 17:40:42.172523904
6.352586
NaN
1.822932
min
NaN
1989-10-27 00:00:00
0.000000
NaN
0.000000
25%
NaN
1997-12-15 00:00:00
3.000000
NaN
1.000000
50%
NaN
2003-04-01 00:00:00
5.000000
NaN
2.000000
75%
NaN
2009-11-13 00:00:00
8.000000
NaN
2.000000
max
NaN
2019-06-03 00:00:00
317.000000
NaN
9.000000
std
NaN
NaN
5.273737
NaN
0.983476
It has been noticed that the minimum of diameter and height_range_id are zero. For height_range_id, 0 represents the range of height is from 0 to 10 ft. However, since the diameter figure is the diameter of tree at breast height, it should not have the value of 0. So these observations will be considered as invalid data and removed from the df.
The cleaned target dataframe trees_df is composed of 5 columns. There are 68 distinct genera within the total of 14,083 trees (observations). According to City of Vancouver Open Data Portal - Street Trees where the dataset was originally obtained, the brief descriptions of columns are listed as below:
Categorical columns
neighbourhood_name: City’s defined local area in which the tree is located.
genus_name: Genus name of trees.
Quantitative columns
diameter: DBH in inches (DBH stands for diameter of tree at breast height).
Based on distribution of street trees planted in Vancouver by genus, which genus is the most popular one?
Biological diversity is one of the interests for the analysis. The quick guess for the answer would be Acer as Acer is a genus of trees commonly known as maples. A maple leaf is on the Canadian flag, and the maple has been chosen as a national symbol in Canada.
# To visualize distrubition of genus for whole dfplot_1_title = alt.TitleParams("Figure 1 Number of street trees planted per genus", subtitle ="Acer is the most popular genus of street trees")plot_1_genus = alt.Chart(trees_df).mark_bar().encode( alt.X('count():Q',title='Number of Trees'), alt.Y('genus_name:N',title='Genus',sort='-x'))# Add text annotation for the number of trees for each genustext_1_genus = plot_1_genus.mark_text(align='left',dx=2).encode(text='count():Q')plot_1_genus = (plot_1_genus + text_1_genus).properties(title=plot_1_title,width=550)plot_1_genus
From Figure 1, it has been confirmed that Acer is the most popular genus of trees based on the dataframe. The number of Acer trees is more than double of Prunus trees that have the second largest number in the Vancouver street trees. All the rest of the genera have less than 1,000 trees shown in the dataframe, and 47 out of the total 68 genera have less than 100 trees.
Q2
In the past 30 years, how many trees have been planted every single year? And what is the number of trees planted by genus each year?
The dataframe has provide the specific date of being planted for each tree. From the available data, the age of current trees will be calculated first and then plot the number of trees planted by year. Since the accuracy is not a priority in this case, only the year of tree planted will be extracted for calculating the age till 2021.
# Extract year of planted and calculate age of treestrees_df_yr = trees_df.assign(year=trees_df['date_planted'].dt.year)# Calculate the age of trees till 2021.trees_df_age = trees_df_yr.assign(age=(2021-trees_df_yr['year']))# Plot distribution of number of trees planted by yearplot_2_title = alt.TitleParams("Figure 2 Number of street trees planted each year", subtitle ="(Data available from 1989 to 2019)")plot_2_year = alt.Chart(trees_df_age).mark_bar().encode( alt.X('year:N',title=None), alt.Y('count():Q',title='Number of trees planted')).properties(title=plot_2_title)plot_2_year
Figure 2 indicates that there was a peak period between 1995 and 2013 to plant street trees by the City of Vancouver. During this period, the city had planted the highest number of trees in a single year in 1998 and in 2013. Before 1995 and after 2014, the number of trees planted was relatively lower, especially in 2016 when there were less than 50 new trees on public boulevards in Vancouver. Urban forestry is a systemic project. How many trees are planted is determined by a group of factors, such as public budget, tree replacement plan due to species distribution, insects, diseases, or environmental stress, etc. It is a clear message that the City of Vancouver has maintained a dynamic public trees planting program that benefits the wellbeings of residents in Vancouver.
Moving further, to plot number of trees planted by year and add dropdown selection by genus.
# Specify the subtitle color and bold it to draw attentionplot_3_title = alt.TitleParams("Figure 3 Number of street trees planted each year by genus(from 1989 to 2019)", subtitle ="Dropdown selection is available by genus", subtitleColor='steelblue', subtitleFontWeight='bold')genus =sorted(trees_df_age['genus_name'].unique())dropdown_genus = alt.binding_select(name='Genus', options=genus)select_genus = alt.selection_single(fields=['genus_name'], bind=dropdown_genus)plot_3_genus_year_bar = alt.Chart(trees_df_age).mark_bar().encode( alt.X('year:N',title=None), alt.Y('count():Q',stack=False,title='Number of trees planted per genus'), alt.Color('genus_name:N',title='Genus name')).add_selection(select_genus).encode( opacity=alt.condition(select_genus, alt.value(0.9), alt.value(0.0))).properties(title=plot_3_title)plot_3_genus_year_bar
Figure 3 with dropdown selection has combined multiple information required in one plot and provide the audiences with convenience to efficiently explore number of trees planted from 1989 to 2019 for each genus.
The answer to the question 2 would be valuable reference for the public who are interested with tree planting history in Vancouver. In addition, for researchers, the answer will provide a first-hand insight into the urban forestry and street trees replacement planning.