Shiny App - Mass Shootings in the USA

EDA with Shiny app on mass shootings between August 20th, 1982 and December 6th, 2023.
r
shiny
eda
data cleaning
data wrangling
Author

Sandra Jurela

Published

March 1, 2023

This post will be regularly updated with each new case.

Last update on January 5, 2024.


Introduction

Mass shootings have been a topic of intense discussion in the United States. A public “database” of mass shootings since 1982 has been made available by the Mother Jones, a non-profit organization. This “database” is stored in a Google spreadsheet. You can access it here and download as a CSV file.

There are many definitions of mass shooting. Here is what Britannica has to say:

Mass shooting, also called active shooter incident, as defined by the U.S. Federal Bureau of Investigation (FBI), an event in which one or more individuals are “actively engaged in killing or attempting to kill people in a populated area. Implicit in this definition is the shooter’s use of a firearm.” The FBI has not set a minimum number of casualties to qualify an event as a mass shooting, but U.S. statute (the Investigative Assistance for Violent Crimes Act of 2012) defines a “mass killing” as “3 or more killings in a single incident”.

Data overview

library(tidyverse)
library(tidygeocoder)
library(plotly)
theme_set(theme_classic())

mass_shootings <- read_csv("data/mass_shootings_usa_1982-2023.csv")

mass_shootings %>% glimpse()
Rows: 149
Columns: 24
$ case                             <chr> "UNLV shooting", "Maine bowling alley…
$ location...2                     <chr> "Las Vegas, Nevada", "Lewiston, Maine…
$ date                             <chr> "12/6/23", "10/25/23", "8/26/23", "8/…
$ summary                          <chr> "Anthony Polito, 67, a former univers…
$ fatalities                       <dbl> 3, 18, 3, 3, 5, 3, 8, 5, 6, 3, 7, 11,…
$ injured                          <dbl> 1, 13, 0, 6, 2, 6, 7, 8, 6, 5, 1, 10,…
$ total_victims                    <dbl> 4, 31, 3, 9, 7, 9, 15, 13, 12, 8, 8, …
$ location...8                     <chr> "School", "Other", "workplace", "Othe…
$ age_of_shooter                   <chr> "67", "40", "21", "59", "40", "18", "…
$ prior_signs_mental_health_issues <chr> "-", "yes", "yes", "-", "yes", "yes",…
$ mental_health_details            <chr> "-", "Card reportedly spoke of \"hear…
$ weapons_obtained_legally         <chr> "-", "-", "yes", "-", "yes", "yes", "…
$ where_obtained                   <chr> "-", "Yes", "local gun stores", "-", …
$ weapon_type                      <chr> "semiautomatic handgun", "semiautomat…
$ weapon_details                   <chr> "-", "AR-15-style rifle (Rugar SFAR)"…
$ race                             <chr> "White", "White", "White", "White", "…
$ gender                           <chr> "M", "M", "M", "M", "M", "M", "M", "M…
$ sources                          <chr> "https://www.nbcnews.com/news/us-news…
$ mental_health_sources            <chr> "-", "https://www.nytimes.com/article…
$ sources_additional_age           <chr> "-", "-", "-", "-", "-", "-", "-", "-…
$ latitude                         <chr> "-", "-", "-", "-", "-", "-", "-", "-…
$ longitude                        <chr> "-", "-", "-", "-", "-", "-", "-", "-…
$ type                             <chr> "mass", "Spree", "mass", "mass", "mas…
$ year                             <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2…

We have 149 cases, described with 24 variables. At first glance, this dataset clearly needs extensive cleaning.

Data cleaning

🧹 Step 1. Initial cleaning

The first cleaning step includes:

  • selecting columns of interest,
  • replacing the character value "-" with NA in all columns with character data type,
  • converting date column from character to date data type,
  • renaming location columns,
  • converting character data type to numeric for specific columns.
mass_shootings_cln <- mass_shootings %>% 
  select(1:6, 8:10, 12, 16, 17, 21:24) %>% 
  mutate(across(where(is.character), ~na_if(., "-"))) %>% 
  mutate(date = lubridate::mdy(date)) %>% 
  rename(location = location...2, location_2 = location...8) %>% 
  mutate_at(c("injured", "age_of_shooter", "latitude", "longitude"), as.numeric)
  
mass_shootings_cln %>% glimpse()
Rows: 149
Columns: 16
$ case                             <chr> "UNLV shooting", "Maine bowling alley…
$ location                         <chr> "Las Vegas, Nevada", "Lewiston, Maine…
$ date                             <date> 2023-12-06, 2023-10-25, 2023-08-26, …
$ summary                          <chr> "Anthony Polito, 67, a former univers…
$ fatalities                       <dbl> 3, 18, 3, 3, 5, 3, 8, 5, 6, 3, 7, 11,…
$ injured                          <dbl> 1, 13, 0, 6, 2, 6, 7, 8, 6, 5, 1, 10,…
$ location_2                       <chr> "School", "Other", "workplace", "Othe…
$ age_of_shooter                   <dbl> 67, 40, 21, 59, 40, 18, 33, 25, 28, 4…
$ prior_signs_mental_health_issues <chr> NA, "yes", "yes", NA, "yes", "yes", "…
$ weapons_obtained_legally         <chr> NA, NA, "yes", NA, "yes", "yes", "yes…
$ race                             <chr> "White", "White", "White", "White", "…
$ gender                           <chr> "M", "M", "M", "M", "M", "M", "M", "M…
$ latitude                         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ longitude                        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ type                             <chr> "mass", "Spree", "mass", "mass", "mas…
$ year                             <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2…

🔎 Are there any duplicates? No.

sum(duplicated(mass_shootings))
[1] 0

🔎 Number of missing values, NA, per column.

mass_shootings_cln %>% 
  summarise_all(~sum(is.na(.))) %>% 
  # transposing for better visibility
  pivot_longer(cols = everything(), names_to = "column", values_to = "n_missing")
# A tibble: 16 × 2
   column                           n_missing
   <chr>                                <int>
 1 case                                     0
 2 location                                 0
 3 date                                     0
 4 summary                                  0
 5 fatalities                               0
 6 injured                                  0
 7 location_2                               0
 8 age_of_shooter                           2
 9 prior_signs_mental_health_issues        30
10 weapons_obtained_legally                19
11 race                                    12
12 gender                                   0
13 latitude                                23
14 longitude                               23
15 type                                     0
16 year                                     0

23 of the most recent cases don’t have location coordinates at all. We’ll address this in the final cleanup step.

🧹 Step 2. Fixing unique values of categorical variables

🔎 Let’s take a look at the unique values of the gender column.

mass_shootings_cln %>% 
  count(gender, sort = TRUE) 
# A tibble: 6 × 2
  gender                                                                       n
  <chr>                                                                    <int>
1 "M"                                                                         73
2 "Male"                                                                      70
3 "Female"                                                                     2
4 "Male & Female"                                                              2
5 "F"                                                                          1
6 "F (\"identifies as transgender\" and \"Audrey Hale is a biological wom…     1

Almost all categorical variables need unique values correction.

To make a long story short, I’ll correct them all in one step using case_when function, and we’ll look at them later during the analysis.

mass_shootings_cln <- mass_shootings_cln %>% 
  mutate(gender = case_when(gender == "F" ~ "Female",
                            gender == "M" ~ "Male", 
                            gender == "(see summary)" ~ "Male",
                            gender %>% str_detect("transgender")~"Female (transgender)",
                            TRUE ~ gender),
         race = case_when(race == "white" ~ "White",
                          race == "black" ~ "Black",
                          race == "unclear" ~ "Unclear",
                          TRUE ~ race),
         location_2 = 
           case_when(location_2 %in% c("workplace", "\nWorkplace") ~ "Workplace",
                                location_2 == "Other\n" ~ "Other",
                                location_2 == "religious" ~ "Religious",
                                TRUE ~ location_2),
         prior_signs_mental_health_issues = 
           case_when(prior_signs_mental_health_issues == "yes" ~ "Yes",
                     prior_signs_mental_health_issues == "TBD" ~ "To be determined",
                     TRUE ~ prior_signs_mental_health_issues),
         weapons_obtained_legally = 
           case_when(weapons_obtained_legally %in% c("yes", "\nYes") ~ "Yes",
                     weapons_obtained_legally == "TBD" ~ "To be determined",
                     weapons_obtained_legally %>% str_detect("Kelley") ~ "Unknown",
                     weapons_obtained_legally %>% str_detect("some") ~ "Partially",
                     TRUE ~ weapons_obtained_legally), 
         type = case_when(type == "mass" ~ "Mass",
                          TRUE ~ type))

🧹 Step 3. Geocoding locations with missing coordinates

There are 23 cases with missing location coordinates. In this step we’ll convert locations to coordinates with geocoding. and use them later to create a leaflet map for a shiny app.

The tidygeocoder package provides geocoding services. It’s designed to work easily with the tidyverse. It also provides access to several different geocoding services, including LocationIQ which I’m going to use here. LocationIQ is a freemium service that provides a free tier, which doesn’t require you to give them your billing details. When you sign up to LocationIQ, they’ll take you to the Manage Your API Access Tokens page, which is where we obtain our API token. Next, you need to provide the tidygeocoder package with your API key.

You can also use the Nominatim (“osm”) geocoding service (OpenStreetMap) which can be specified with the method argument (method = "osm"). I found LocationIQ to be faster.

The first step is to select only locations with missing coordinates and geocode them.

geocoded_locations <- mass_shootings_cln %>% 
  filter(is.na(latitude) | is.na(longitude)) %>% 
  select(location) %>% 
  geocode(location, method = "iq")

geocoded_locations %>% 
  mutate(across(where(is.numeric), ~ num(., digits = 6)))
# A tibble: 23 × 3
   location                         lat        long
   <chr>                      <num:.6!>   <num:.6!>
 1 Las Vegas, Nevada          36.167256 -115.148516
 2 Lewiston, Maine            44.095108  -70.215545
 3 Jacksonville, Florida      30.332184  -81.655651
 4 Trabuco Canyon, California 33.662623 -117.589380
 5 Philadelphia, Pennsylvania 39.952724  -75.163526
 6 Farmington, New Mexico     36.729115 -108.205445
 7 Allen, Texas               33.103174  -96.670550
 8 Louisville, Kentucky       38.254238  -85.759407
 9 Nashville, Tennessee       36.162277  -86.774298
10 East Lansing, Michigan     42.732031  -84.472168
# ℹ 13 more rows

The next step is to join mass shootings table with geocoded locations and replace missing latitudes and longitudes with geocoded.

mass_shootings_cln <- mass_shootings_cln %>% 
  left_join(geocoded_locations, by = "location") %>% 
  mutate(latitude = ifelse(is.na(latitude), lat, latitude),
         longitude = ifelse(is.na(longitude), long, longitude))

🔎 Checking for null values.

sum(is.na(mass_shootings_cln$latitude))
[1] 0
sum(is.na(mass_shootings_cln$longitude))
[1] 0

OK, this looks fine.

Exploratory data analysis (EDA)

❕ Writing a function

To count unique values for all categorical variables separately, I’ll write a function, count_unique, to avoid copying and pasting a block of code several times.

Here we have a special case where we have to pass a dataframe column name (variable) to a function argument. The solution is to embrace the argument by surrounding it in doubled braces, like group_by({{ var }}).

count_unique <- function(data, var) {
  
  data %>%
    group_by({{ var }}) %>%    
    summarise(count = n(), .groups = "drop") %>% 
    mutate(percent = scales::percent(count/sum(count), accuracy = 0.1)) %>% 
    arrange(desc(count))

}

📄 Breakdown by categorical variables

Gender

count_unique(mass_shootings_cln, gender)
# A tibble: 4 × 3
  gender               count percent
  <chr>                <int> <chr>  
1 Male                   143 96.0%  
2 Female                   3 2.0%   
3 Male & Female            2 1.3%   
4 Female (transgender)     1 0.7%   

Race

count_unique(mass_shootings_cln, race) 
# A tibble: 8 × 3
  race            count percent
  <chr>           <int> <chr>  
1 White              80 53.7%  
2 Black              26 17.4%  
3 Latino             12 8.1%   
4 <NA>               12 8.1%   
5 Asian              10 6.7%   
6 Other               5 3.4%   
7 Native American     3 2.0%   
8 Unclear             1 0.7%   

Specific location

count_unique(mass_shootings_cln, location_2)
# A tibble: 6 × 3
  location_2 count percent
  <chr>      <int> <chr>  
1 Other         58 38.9%  
2 Workplace     53 35.6%  
3 School        23 15.4%  
4 Religious      8 5.4%   
5 Military       6 4.0%   
6 Airport        1 0.7%   

Prior signs of mental health issues

count_unique(mass_shootings_cln, prior_signs_mental_health_issues)
# A tibble: 6 × 3
  prior_signs_mental_health_issues count percent
  <chr>                            <int> <chr>  
1 Yes                                 72 48.3%  
2 <NA>                                30 20.1%  
3 Unclear                             24 16.1%  
4 No                                  17 11.4%  
5 To be determined                     5 3.4%   
6 Unknown                              1 0.7%   

Weapons obtained legally

count_unique(mass_shootings_cln, weapons_obtained_legally)
# A tibble: 6 × 3
  weapons_obtained_legally count percent
  <chr>                    <int> <chr>  
1 Yes                         99 66.4%  
2 <NA>                        19 12.8%  
3 No                          16 10.7%  
4 To be determined             7 4.7%   
5 Unknown                      7 4.7%   
6 Partially                    1 0.7%   

Type

count_unique(mass_shootings_cln, type)
# A tibble: 2 × 3
  type  count percent
  <chr> <int> <chr>  
1 Mass    127 85.2%  
2 Spree    22 14.8%  

Note: Spree shootings have three or more victims in a short time in multiple locations.

📊 Age of shooter distribution

Code for creating the age_group column
# create "age group" column
mass_shootings_cln <-  mass_shootings_cln %>% 
  mutate(age_group = case_when(
    age_of_shooter >= 10 & age_of_shooter <= 14 ~ "10-14",
    age_of_shooter <= 19 ~ "15-19",
    age_of_shooter <= 24 ~ "20-24",
    age_of_shooter <= 29 ~ "25-29",
    age_of_shooter <= 34 ~ "30-34",
    age_of_shooter <= 39 ~ "35-39",
    age_of_shooter <= 44 ~ "40-44",
    age_of_shooter <= 49 ~ "45-49",
    age_of_shooter <= 54 ~ "50-54",
    age_of_shooter <= 59 ~ "55-59",
    age_of_shooter <= 64 ~ "60-64",
    age_of_shooter <= 69 ~ "65-69",
    age_of_shooter <= 74 ~ "70-74"))
p1 <- mass_shootings_cln %>% 
  filter(!is.na(age_group)) %>% 
  group_by(age_group) %>% 
  summarise(count = n(), .groups = "drop") %>% 
  mutate(percent = scales::percent(count/sum(count), accuracy = 0.1)) %>% 
  mutate(label_text = str_glue("Age group: {age_group}
                               Count: {count}
                               Percent: {percent}")) %>%
  ggplot(aes(x = age_group, y = count, text = label_text)) +
  geom_col(width = 0.7, fill = "indianred") +
  labs(title = "Age Distribution", x = "age group") 

ggplotly(p1, tooltip = "text")


  • The vast majority of shooters were between 15 and 50 years old.
  • The age distribution is bimodal, with one mode around 23 years of age and a second mode around 41 years of age.
  • Most shooters were in the 20-24 age group (18.4 %), followed by 40-44 (17 %) and 25-29 (15.6 %).

🔎 Who was the youngest shooter?

mass_shootings_cln %>% 
  slice_min(age_of_shooter, n = 1) %>% 
  select(case, date, summary, fatalities) %>% 
  knitr::kable()
case date summary fatalities
Westside Middle School killings 1998-03-24 Mitchell Scott Johnson, 13, and Andrew Douglas Golden, 11, two juveniles, ambushed students and teachers as they left the school; they were apprehended by police at the scene. 5

📊 Number of cases per year

p2 <- mass_shootings_cln %>%
  group_by(year) %>%
  summarise(count = n()) %>% 
  ggplot(aes(year, count)) +
  geom_col(fill = "steelblue") + 
  geom_smooth(method = "loess", se = FALSE, color = "indianred", size = 0.7) +
  labs(title = "Number of Cases per Year") 

ggplotly(p2)


  • We can see an increase in mass shootings in the last 12 years.
  • 2020 has a smaller number of cases probably due to Covid restrictions.

📊 Fatalities-Injured relationship

p3 <- mass_shootings_cln %>%
  ggplot(aes(x = fatalities, y = injured)) +
  geom_jitter() +
  scale_y_sqrt() +
  labs(title = "Fatalities-Injured Relationship")
  
ggplotly(p3)


Please note that the Injured values are square root scaled for better visibility, but you can see the actual values by hovering over the points.

Summary of fatalities

summary(mass_shootings_cln$fatalities)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   4.000   6.000   7.711   8.000  58.000 

Summary of injured people

summary(mass_shootings_cln$injured)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    3.00   10.93   10.00  546.00 

📊 Total fatalities by state

🛠️ Data manipulation

# create us states with abbreviations tibble
states_with_abbr <- 
  tibble(state = state.name, abbr = state.abb) %>% 
  bind_rows(tibble(state = "District of Columbia", abbr = "DC"))

# data manipulation
by_state <- mass_shootings_cln %>% 
  # recode D.d. to District of Columbia
  mutate(location = ifelse(location == "Washington, D.C.", 
                           "Washington, District of Columbia", 
                           location)) %>% 
  # separate location into city and state
  separate(location, c("city", "state"), sep = ", ") %>% 
  # group and summarize
  group_by(state) %>% 
  summarise(total_cases = n(),
            total_fatalities = sum(fatalities), .groups = "drop") %>% 
  # add us states abbreviations
  left_join(states_with_abbr, by = "state") %>% 
  # rearrange columns
  select(state, abbr, everything())

📈 Top ten states regarding number of cases and fatalities

by_state %>% 
  arrange(-total_cases, -total_fatalities) %>% 
  head(10)
# A tibble: 10 × 4
   state        abbr  total_cases total_fatalities
   <chr>        <chr>       <int>            <dbl>
 1 California   CA             26              178
 2 Texas        TX             13              159
 3 Florida      FL             13              129
 4 Colorado     CO              8               53
 5 Washington   WA              7               37
 6 Pennsylvania PA              6               32
 7 New York     NY              5               40
 8 Wisconsin    WI              5               28
 9 Illinois     IL              5               25
10 Virginia     VA              4               53

📊 Total fatalities by state visualization

by_state %>% 
  plot_geo(locationmode = 'USA-states') %>% 
  add_trace(z = ~total_fatalities,
            locations = ~abbr,
            color = ~total_fatalities,
            colors = ~"Reds") %>% 
  layout(
    geo = list(
      scope = "usa",
      projection = list(type = "albers usa"),
      lakecolor = toRGB("white")
    )
  )

Shiny app

The app you can see below is embedded in this quarto document since my website is static. It was originally published on shinyapps.io, where you can also interact with it.

Note: If you don’t see the application, I’ve run out of my 25 active hours (when my applications are not idle). Sorry, this is a free account, and my app will not be available again until the following month cycle. Hope you get lucky! 😊

📢 By clicking on each circle, you can read a summary of the mass shooting case.


Thanks for reading!