Sandra Jurela - Test

Warning

This is a notebook about Vancouver street trees that is not yet finished!!

What do you think about it?

Imports

# Import libraries 
import altair as alt
import pandas as pd
import json

# Lift Altair's default 5,000-row limit so the full dataset can be embedded in the chart specification
alt.data_transformers.disable_max_rows()

trees_df = pd.read_csv('data/vancouver_trees.csv')

# Glance at the original df
trees_df

	std_street	on_street	species_name	neighbourhood_name	date_planted	diameter	street_side_name	genus_name	assigned	civic_number	plant_area	curb	tree_id	common_name	height_range_id	on_street_block	cultivar_name	root_barrier	latitude	longitude
0	W 13TH AV	MAPLE ST	PSEUDOPLATANUS	Kitsilano	NaN	9.00	EVEN	ACER	N	1996	10	Y	13310	SYCAMORE MAPLE	4	2900	NaN	N	49.259856	-123.150586
1	WALES ST	WALES ST	PLATANOIDES	Renfrew-Collingwood	2018-11-28	3.00	ODD	ACER	N	5291	7	Y	259084	PRINCETON GOLD MAPLE	1	5200	PRINCETON GOLD	N	49.236650	-123.051831
2	W BROADWAY	W BROADWAY	RUBRUM	Kitsilano	1996-04-19	14.00	EVEN	ACER	N	3618	C	Y	167986	KARPICK RED MAPLE	3	3600	KARPICK	N	49.264250	-123.184020
3	PENTICTON ST	PENTICTON ST	CALLERYANA	Renfrew-Collingwood	2006-03-06	3.75	EVEN	PYRUS	N	2502	5	Y	213386	CHANTICLEER PEAR	1	2500	CHANTICLEER	Y	49.261036	-123.052921
4	RHODES ST	RHODES ST	GLYPTOSTROBOIDES	Renfrew-Collingwood	2001-11-01	3.00	ODD	METASEQUOIA	N	5639	N	Y	189223	DAWN REDWOOD	2	5600	NaN	N	49.233354	-123.050249
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
29995	ROBSON ST	ROBSON ST	CAMPESTRE	West End	NaN	7.00	ODD	ACER	N	1015	c	Y	122814	HEDGE MAPLE	2	1000	NaN	N	49.283666	-123.123231
29996	OSLER ST	CONNAUGHT DRIVE	PLATANOIDES	Shaughnessy	2007-04-16	8.00	ODD	ACER	N	4690	10	Y	132211	NORWAY MAPLE	1	1000	NaN	Y	49.243636	-123.129480
29997	BEATRICE ST	BEATRICE ST	CERASIFERA	Victoria-Fraserview	NaN	17.30	EVEN	PRUNUS	N	6218	9	Y	59355	PISSARD PLUM	3	6200	ATROPURPUREUM	N	49.227406	-123.066936
29998	ANGUS DRIVE	ANGUS DRIVE	BILOBA	Shaughnessy	2006-02-17	4.00	ODD	GINKGO	N	1551	9	Y	207753	GINKGO OR MAIDENHAIR TREE	1	1500	NaN	Y	49.254431	-123.140382
29999	MAIN ST	MAIN ST	EUCHLORA X	Riley Park	NaN	12.00	ODD	TILIA	N	4323	C	Y	92997	CRIMEAN LINDEN	4	4300	NaN	N	49.246969	-123.101328

30000 rows × 20 columns

Identify and drop irrelevant columns

# Check columns of the original df
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   std_street          30000 non-null  object 
 1   on_street           30000 non-null  object 
 2   species_name        30000 non-null  object 
 3   neighbourhood_name  30000 non-null  object 
 4   date_planted        14085 non-null  object 
 5   diameter            30000 non-null  float64
 6   street_side_name    30000 non-null  object 
 7   genus_name          30000 non-null  object 
 8   assigned            30000 non-null  object 
 9   civic_number        30000 non-null  int64  
 10  plant_area          29722 non-null  object 
 11  curb                30000 non-null  object 
 12  tree_id             30000 non-null  int64  
 13  common_name         30000 non-null  object 
 14  height_range_id     30000 non-null  int64  
 15  on_street_block     30000 non-null  int64  
 16  cultivar_name       16178 non-null  object 
 17  root_barrier        30000 non-null  object 
 18  latitude            30000 non-null  float64
 19  longitude           30000 non-null  float64
dtypes: float64(3), int64(4), object(13)
memory usage: 4.6+ MB

Based on the information above and the dataset schema from the City of Vancouver Open Data Portal - Street Trees, the columns can be provisionally sorted into four groups:

Tree biological classifications and names, such as genus_name, species_name, common_name, cultivar_name
Tree growth-related characteristics, such as date_planted, diameter, height_range_id
Tree coordinates and areas, such as latitude, longitude, neighbourhood_name
Other specific location / orientation / identification information

For the questions addressed here, the columns in the fourth group and the tree coordinates (latitude, longitude) are irrelevant and will be dropped. To keep the focus on the highest level of tree classification, species_name, common_name, and cultivar_name will also be dropped, leaving only genus_name.

trees_df = pd.read_csv('data/vancouver_trees.csv',
                       usecols=['neighbourhood_name',
                                'date_planted',
                                'diameter', 
                                'genus_name',
                                'height_range_id'],
                       parse_dates=['date_planted'])
                                    
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   neighbourhood_name  30000 non-null  object        
 1   date_planted        14085 non-null  datetime64[ns]
 2   diameter            30000 non-null  float64       
 3   genus_name          30000 non-null  object        
 4   height_range_id     30000 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 1.1+ MB

Deal with null values

Based on the df information, date_planted is missing just over half of its values (only 14,085 of 30,000 rows are non-null). Since reviewing tree growth is one of the objectives of this analysis, observations without a planting date are considered uninformative and are dropped from the original df.

This will not cause a problem with the representativeness of the data. Just trust me for now! :)
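Rather than taking that claim on trust, one rough check is to compare each neighbourhood's share of trees before and after the rows without a planting date are dropped. The sketch below runs on a small made-up frame: `toy` and its values are hypothetical stand-ins for trees_df, and `share_by` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for trees_df (not the real dataset)
toy = pd.DataFrame({
    "neighbourhood_name": ["Kitsilano", "Kitsilano", "West End",
                           "West End", "Riley Park", "Riley Park"],
    "date_planted": pd.to_datetime(["2001-05-01", None, "1999-03-10",
                                    None, "2010-11-02", "2012-04-20"]),
})


def share_by(df, column):
    """Return the share of rows that falls in each category of `column`."""
    return df[column].value_counts(normalize=True)


# Fraction of rows with no planting date
missing_share = toy["date_planted"].isna().mean()

# Neighbourhood shares before and after dropping rows without a date
before = share_by(toy, "neighbourhood_name")
after = share_by(toy.dropna(subset=["date_planted"]), "neighbourhood_name")

comparison = pd.DataFrame({"before": before, "after": after}).fillna(0)
comparison["shift"] = comparison["after"] - comparison["before"]
print(comparison)
```

Shifts close to zero for every neighbourhood would support the claim; large shifts would mean the dropped rows are concentrated in particular areas. The same comparison could be run on genus_name.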

# Drop observations without value of date_planted
trees_df = trees_df.dropna(subset=['date_planted'])

trees_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14085 entries, 1 to 29998
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   neighbourhood_name  14085 non-null  object        
 1   date_planted        14085 non-null  datetime64[ns]
 2   diameter            14085 non-null  float64       
 3   genus_name          14085 non-null  object        
 4   height_range_id     14085 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 660.2+ KB

Further examination for erroneous data

# Examine the remaining columns in more detail
trees_df.describe(include='all', datetime_is_numeric=True)

	neighbourhood_name	date_planted	diameter	genus_name	height_range_id
count	14085	14085	14085.000000	14085	14085.000000
unique	22	NaN	NaN	68	NaN
top	Renfrew-Collingwood	NaN	NaN	ACER	NaN
freq	1323	NaN	NaN	3970	NaN
mean	NaN	2003-09-20 17:40:42.172523904	6.352586	NaN	1.822932
min	NaN	1989-10-27 00:00:00	0.000000	NaN	0.000000
25%	NaN	1997-12-15 00:00:00	3.000000	NaN	1.000000
50%	NaN	2003-04-01 00:00:00	5.000000	NaN	2.000000
75%	NaN	2009-11-13 00:00:00	8.000000	NaN	2.000000
max	NaN	2019-06-03 00:00:00	317.000000	NaN	9.000000
std	NaN	NaN	5.273737	NaN	0.983476

The minimum values of diameter and height_range_id are both zero. For height_range_id, 0 is valid: it represents a height of 0 to 10 ft. Diameter, however, is measured at breast height, so it should never be 0. These observations are therefore treated as invalid data and removed from the df.

# Collect the index labels of rows with a diameter of 0, then drop them
zero_diameter_idx = trees_df[trees_df['diameter'] == 0].index

trees_df.drop(zero_diameter_idx, inplace=True)

trees_df.describe(include='all', datetime_is_numeric=True)

	neighbourhood_name	date_planted	diameter	genus_name	height_range_id
count	14083	14083	14083.000000	14083	14083.000000
unique	22	NaN	NaN	68	NaN
top	Renfrew-Collingwood	NaN	NaN	ACER	NaN
freq	1323	NaN	NaN	3970	NaN
mean	NaN	2003-09-20 23:57:38.893701504	6.353489	NaN	1.822978
min	NaN	1989-10-27 00:00:00	0.500000	NaN	0.000000
25%	NaN	1997-12-15 12:00:00	3.000000	NaN	1.000000
50%	NaN	2003-04-01 00:00:00	5.000000	NaN	2.000000
75%	NaN	2009-11-13 00:00:00	8.000000	NaN	2.000000
max	NaN	2019-06-03 00:00:00	317.000000	NaN	9.000000
std	NaN	NaN	5.273568	NaN	0.983520

Dataset Description

The cleaned target dataframe trees_df has 5 columns and 14,083 trees (observations) across 68 distinct genera. The column descriptions below are taken from the City of Vancouver Open Data Portal - Street Trees, where the dataset was originally obtained:

Categorical columns

neighbourhood_name: City’s defined local area in which the tree is located.

genus_name: Genus name of trees.

Quantitative columns

diameter: DBH in inches (DBH stands for diameter of tree at breast height).

height_range_id: An ID from 0 to 10, one for every 10 feet of height (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft).

Datetime columns

date_planted: The date of planting.
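To make the height_range_id encoding described above concrete, here is a minimal sketch that turns an ID into a readable label. The function name `height_range_label` is hypothetical and not part of the dataset or the notebook; the mapping follows the examples in the column description.

```python
def height_range_label(height_range_id: int) -> str:
    """Convert a height_range_id (0-10) into a label in feet."""
    if not 0 <= height_range_id <= 10:
        raise ValueError("height_range_id must be between 0 and 10")
    if height_range_id == 10:
        # The top bucket is open-ended
        return "100+ ft"
    low = height_range_id * 10
    return f"{low}-{low + 10} ft"


for example_id in (0, 2, 10):
    print(example_id, "->", height_range_label(example_id))
```

A helper like this could be applied with `trees_df['height_range_id'].map(height_range_label)` if readable labels are wanted in a chart.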

Q1

Based on distribution of street trees planted in Vancouver by genus, which genus is the most popular one?

Biological diversity is one of the interests of this analysis. A quick guess at the answer would be Acer, the genus of trees commonly known as maples: a maple leaf is on the Canadian flag, and the maple is a national symbol of Canada.

# Visualize the distribution of genus for the whole df
plot_1_title = alt.TitleParams(
    "Figure 1 Number of street trees planted per genus",
     subtitle = "Acer is the most popular genus of street trees")

plot_1_genus = alt.Chart(trees_df).mark_bar().encode(
    alt.X('count():Q',title='Number of Trees'),
    alt.Y('genus_name:N',title='Genus',sort='-x')
)

# Add text annotation for the number of trees for each genus
text_1_genus = plot_1_genus.mark_text(align='left',dx=2).encode(text='count():Q')

plot_1_genus = (plot_1_genus + text_1_genus).properties(title=plot_1_title,width=550)

plot_1_genus


Figure 1 confirms that Acer is the most popular genus of trees in the dataframe. There are more than twice as many Acer trees as Prunus trees, the second most common genus among Vancouver street trees. Every other genus has fewer than 1,000 trees in the dataframe, and 47 of the 68 genera have fewer than 100 trees.
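The figures quoted for Figure 1 can also be read off a frequency table rather than the chart. The sketch below shows one way to do that with `value_counts`; it runs on a tiny hypothetical frame (`toy`, with made-up counts), so the numbers it prints are not the notebook's results.

```python
import pandas as pd

# Hypothetical toy data: 5 ACER, 2 PRUNUS, 1 TILIA (not the real counts)
toy = pd.DataFrame({"genus_name": ["ACER"] * 5 + ["PRUNUS"] * 2 + ["TILIA"]})

# Trees per genus, sorted from most to least common
genus_counts = toy["genus_name"].value_counts()

top_genus = genus_counts.index[0]
# How many times more common the top genus is than the runner-up
ratio_to_second = genus_counts.iloc[0] / genus_counts.iloc[1]
# Number of genera below a chosen threshold (2 here; 100 in the text above)
n_small_genera = int((genus_counts < 2).sum())

print(top_genus, ratio_to_second, n_small_genera)
```

On the real data, replacing `toy` with trees_df and the threshold with 100 or 1,000 should reproduce the counts stated above.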

Q2

In the past 30 years, how many trees have been planted every single year? And what is the number of trees planted by genus each year?

The dataframe provides the planting date of each tree. From this, the age of each tree is calculated first, and then the number of trees planted per year is plotted. Since day-level precision is not a priority here, only the planting year is extracted, and the age is computed as of 2021.

# Extract the planting year from date_planted
trees_df_yr = trees_df.assign(year=trees_df['date_planted'].dt.year)

# Calculate the age of trees till 2021.
trees_df_age = trees_df_yr.assign(age=(2021-trees_df_yr['year']))

# Plot distribution of number of trees planted by year
plot_2_title = alt.TitleParams(
    "Figure 2 Number of street trees planted each year",
     subtitle = "(Data available from 1989 to 2019)")

plot_2_year = alt.Chart(trees_df_age).mark_bar().encode(
    alt.X('year:N',title=None),
    alt.Y('count():Q',title='Number of trees planted')).properties(title=plot_2_title)

plot_2_year


Figure 2 indicates that street tree planting by the City of Vancouver peaked between 1995 and 2013. Within this period, the highest single-year totals occurred in 1998 and 2013. Before 1995 and after 2014, fewer trees were planted, especially in 2016, for which the dataframe records fewer than 50 new trees on public boulevards in Vancouver. Urban forestry is a systemic project: the number of trees planted depends on a range of factors, such as the public budget and tree replacement plans driven by species distribution, insects, diseases, or environmental stress. The figure suggests that the City of Vancouver has maintained an active public tree planting program that benefits the well-being of Vancouver residents.
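The yearly totals behind Figure 2 can be tabulated directly, which makes it easier to name the peak year. The sketch below uses a small hypothetical frame (`toy`, with made-up dates) and also fills in years with no plantings, since a year with zero trees would otherwise be absent from the table.

```python
import pandas as pd

# Hypothetical toy data standing in for trees_df (not the real dataset)
toy = pd.DataFrame({
    "date_planted": pd.to_datetime([
        "1998-03-01", "1998-11-15", "1998-12-01",
        "2000-04-10", "2001-02-02", "2001-10-30",
    ])
})

# Count trees per planting year, in chronological order
per_year = toy["date_planted"].dt.year.value_counts().sort_index()

# Add any missing years in the range with a count of 0
all_years = range(per_year.index.min(), per_year.index.max() + 1)
per_year_full = per_year.reindex(all_years, fill_value=0)

peak_year = per_year_full.idxmax()
print(per_year_full)
print("Peak year:", peak_year)
```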

Next, plot the number of trees planted by year and add a dropdown selection by genus.

# Specify the subtitle color and bold it to draw attention
plot_3_title = alt.TitleParams(
    "Figure 3 Number of street trees planted each year by genus (from 1989 to 2019)",
    subtitle = "Dropdown selection is available by genus",
    subtitleColor='steelblue', subtitleFontWeight='bold')

genus = sorted(trees_df_age['genus_name'].unique())

dropdown_genus = alt.binding_select(name='Genus', options=genus)

select_genus = alt.selection_single(fields=['genus_name'], bind=dropdown_genus)

plot_3_genus_year_bar = alt.Chart(trees_df_age).mark_bar().encode(
    alt.X('year:N',title=None),
    alt.Y('count():Q',stack=False,title='Number of trees planted per genus'),
    alt.Color('genus_name:N',title='Genus name')
).add_selection(select_genus).encode(
    opacity=alt.condition(select_genus, alt.value(0.9), alt.value(0.0))
).properties(title=plot_3_title)

plot_3_genus_year_bar


Figure 3 combines several pieces of information in one plot. Its dropdown selection lets readers efficiently explore the number of trees planted from 1989 to 2019 for each genus.

The answer to question 2 would be a valuable reference for members of the public who are interested in the tree planting history of Vancouver. For researchers, it offers first-hand insight into urban forestry and street tree replacement planning.

To be continued…