Shiny App - Mass Shootings in the USA

EDA with Shiny app on mass shootings between August 20th, 1982 and December 31st, 2025.
r
shiny
eda
data cleaning
data wrangling
Author

Sandra Jurela

Published

March 1, 2023

This post will be regularly updated with each new case.

Last update on January 22, 2026.


Introduction

Mass shootings have been a topic of intense discussion in the United States. A public “database” of mass shootings since 1982 has been made available by Mother Jones, a non-profit organization. This “database” is stored in a Google spreadsheet. You can access it here and download it as a CSV file.

There are many definitions of mass shooting. Here is what Britannica has to say:

Mass shooting, also called active shooter incident, as defined by the U.S. Federal Bureau of Investigation (FBI), an event in which one or more individuals are “actively engaged in killing or attempting to kill people in a populated area. Implicit in this definition is the shooter’s use of a firearm.” The FBI has not set a minimum number of casualties to qualify an event as a mass shooting, but U.S. statute (the Investigative Assistance for Violent Crimes Act of 2012) defines a “mass killing” as “3 or more killings in a single incident”.

Data overview

library(tidyverse)
library(tidygeocoder)
library(plotly)
theme_set(theme_classic())

mass_shootings <- read_csv("data/mass_shootings_usa_1982-2025.csv")

mass_shootings %>% glimpse()
Rows: 158
Columns: 24
$ case                             <chr> "Brown University and MIT shooting", …
$ location...2                     <chr> "Providence, Rhode Island", "Grand Bl…
$ date                             <chr> "12/13/2025", "9/28/2025", "9/27/2025…
$ summary                          <chr> "Claudio Neves Valente, 48, a former …
$ fatalities                       <dbl> 3, 4, 3, 3, 4, 4, 3, 4, 4, 3, 18, 3, …
$ injured                          <dbl> 9, 8, 8, 0, 0, 1, 3, 9, 10, 1, 13, 0,…
$ total_victims                    <dbl> 12, 12, 11, 3, 4, 5, 6, 13, 14, 4, 31…
$ location...8                     <chr> "school", "Religious", "Other", "othe…
$ age_of_shooter                   <chr> "48", "40", "39", "32", "45", "27", "…
$ prior_signs_mental_health_issues <chr> "yes", "yes", "yes", "Yes", "Yes", "Y…
$ mental_health_details            <chr> "-", "-", "The suspect had a history …
$ weapons_obtained_legally         <chr> "-", "-", "-", "-", "Yes", "Yes", "-"…
$ where_obtained                   <chr> "-", "-", "-", "-", "-", "-", "-", "-…
$ weapon_type                      <chr> "semiautomatic handgun", "semiautomat…
$ weapon_details                   <chr> "-", "-", "short-barreled AR-style ri…
$ race                             <chr> "white", "white", "white", "-", "whit…
$ gender                           <chr> "M", "M", "M", "M", "M", "M", "M", "M…
$ sources                          <chr> "https://www.nytimes.com/2025/12/19/u…
$ mental_health_sources            <chr> "https://www.washingtonpost.com/natio…
$ sources_additional_age           <chr> "-", "-", "-", "-", "-", "-", "-", "-…
$ latitude                         <chr> "-", "-", "-", "-", "-", "-", "-", "-…
$ longitude                        <chr> "-", "-", "-", "-", "-", "-", "-", "-…
$ type                             <chr> "Spree", "Mass", "Mass", "Mass", "Mas…
$ year                             <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 2…

We have 158 cases, described with 24 variables. At first glance, this dataset clearly needs extensive cleaning.

Data cleaning

🧹 Step 1. Initial cleaning

The first cleaning step includes:

  • selecting columns of interest,
  • replacing the character value "-" with NA in all columns with character data type,
  • converting date column from character to date data type,
  • renaming location columns,
  • converting character data type to numeric for specific columns.
mass_shootings_cln <- mass_shootings %>% 
  select(1:6, 8:10, 12, 16, 17, 21:24) %>% 
  mutate(across(where(is.character), ~na_if(., "-"))) %>% 
  mutate(date = lubridate::mdy(date)) %>% 
  rename(location = location...2, location_2 = location...8) %>% 
  mutate(across(c(injured, age_of_shooter, latitude, longitude), as.numeric))
  
mass_shootings_cln %>% glimpse()
Rows: 158
Columns: 16
$ case                             <chr> "Brown University and MIT shooting", …
$ location                         <chr> "Providence, Rhode Island", "Grand Bl…
$ date                             <date> 2025-12-13, 2025-09-28, 2025-09-27, …
$ summary                          <chr> "Claudio Neves Valente, 48, a former …
$ fatalities                       <dbl> 3, 4, 3, 3, 4, 4, 3, 4, 4, 3, 18, 3, …
$ injured                          <dbl> 9, 8, 8, 0, 0, 1, 3, 9, 10, 1, 13, 0,…
$ location_2                       <chr> "school", "Religious", "Other", "othe…
$ age_of_shooter                   <dbl> 48, 40, 39, 32, 45, 27, 26, 14, 44, 6…
$ prior_signs_mental_health_issues <chr> "yes", "yes", "yes", "Yes", "Yes", "Y…
$ weapons_obtained_legally         <chr> NA, NA, NA, NA, "Yes", "Yes", NA, NA,…
$ race                             <chr> "white", "white", "white", NA, "white…
$ gender                           <chr> "M", "M", "M", "M", "M", "M", "M", "M…
$ latitude                         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ longitude                        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ type                             <chr> "Spree", "Mass", "Mass", "Mass", "Mas…
$ year                             <dbl> 2025, 2025, 2025, 2025, 2025, 2025, 2…

🔎 Are there any duplicates? No.

sum(duplicated(mass_shootings))
[1] 0

🔎 Number of missing values, NA, per column.

mass_shootings_cln %>% 
  summarise(across(everything(), ~ sum(is.na(.x)))) %>% 
  # transposing for better visibility
  pivot_longer(cols = everything(), names_to = "column", values_to = "n_missing")
# A tibble: 16 × 2
   column                           n_missing
   <chr>                                <int>
 1 case                                     0
 2 location                                 0
 3 date                                     0
 4 summary                                  0
 5 fatalities                               0
 6 injured                                  0
 7 location_2                               0
 8 age_of_shooter                           2
 9 prior_signs_mental_health_issues        31
10 weapons_obtained_legally                26
11 race                                    14
12 gender                                   0
13 latitude                                32
14 longitude                               32
15 type                                     0
16 year                                     0

32 of the most recent cases don’t have location coordinates at all. We’ll address this in the final cleanup step.

🧹 Step 2. Fixing unique values of categorical variables

🔎 Let’s take a look at the unique values of the gender column.

mass_shootings_cln %>% 
  count(gender, sort = TRUE) 
# A tibble: 6 × 2
  gender                                                                       n
  <chr>                                                                    <int>
1 "M"                                                                         82
2 "Male"                                                                      70
3 "Female"                                                                     2
4 "Male & Female"                                                              2
5 "F"                                                                          1
6 "F (\"identifies as transgender\" and \"Audrey Hale is a biological wom…     1

Almost all categorical variables contain inconsistent values that need to be corrected.

To make a long story short, I’ll correct them all in one step using the case_when function, and we’ll look at them later during the analysis.

mass_shootings_cln <- mass_shootings_cln %>% 
  mutate(gender = case_when(gender == "F" ~ "Female",
                            gender == "M" ~ "Male", 
                            gender == "(see summary)" ~ "Male",
                            gender %>% str_detect("transgender") ~ "Female (transgender)",
                            TRUE ~ gender),
         race = case_when(race == "white" ~ "White",
                          race == "black" ~ "Black",
                          race == "unclear" ~ "Unclear",
                          TRUE ~ race),
         location_2 = 
           case_when(location_2 %in% c("workplace", "\nWorkplace") ~ "Workplace",
                     location_2 %in% c("Other\n", "other") ~ "Other",
                     location_2 == "school" ~ "School",
                     location_2 == "religious" ~ "Religious",
                     TRUE ~ location_2),
         prior_signs_mental_health_issues = 
           case_when(prior_signs_mental_health_issues == "yes" ~ "Yes",
                     prior_signs_mental_health_issues == "TBD" ~ "To be determined",
                     TRUE ~ prior_signs_mental_health_issues),
         weapons_obtained_legally = 
           case_when(weapons_obtained_legally %in% c("yes", "\nYes") ~ "Yes",
                     weapons_obtained_legally == "TBD" ~ "To be determined",
                     weapons_obtained_legally %>% str_detect("Kelley") ~ "Unknown",
                     weapons_obtained_legally %>% str_detect("some") ~ "Partially",
                     TRUE ~ weapons_obtained_legally), 
         type = case_when(type == "mass" ~ "Mass",
                          TRUE ~ type))
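
Many of these corrections differ only in letter case or stray whitespace (for example "\nWorkplace" or "Other\n"). As a complementary sketch, not the approach used in this post, such values could be normalized first with stringr, leaving case_when for the genuinely irregular ones. The helper name normalize_label and the sample vector are made up for illustration.

```r
library(stringr)

# Hypothetical helper: trim and collapse whitespace (including newlines),
# then capitalize the first letter of each word.
normalize_label <- function(x) {
  str_to_title(str_squish(x))
}

# Illustrative values similar to those in the location_2 column
raw <- c("workplace", "\nWorkplace", "Other\n", "other", "school", "Religious", NA)
normalize_label(raw)
```

This only suits columns whose values are ordinary words. An abbreviation such as "TBD" would be turned into "Tbd", so columns like that still need explicit rules.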

🧹 Step 3. Geocoding locations with missing coordinates

There are 32 cases with missing location coordinates. In this step we’ll convert their locations to coordinates by geocoding, and use them later to create a leaflet map for the Shiny app.

The tidygeocoder package provides geocoding services and is designed to work easily with the tidyverse. It gives access to several different geocoding services, including LocationIQ, which I’m going to use here. LocationIQ is a freemium service with a free tier that doesn’t require you to give them your billing details. When you sign up, LocationIQ takes you to the Manage Your API Access Tokens page, which is where you obtain your API token. Next, you need to provide the tidygeocoder package with that token as your API key.
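
A minimal sketch of how the key can be provided: to my knowledge, tidygeocoder looks for the LocationIQ key in an environment variable named LOCATIONIQ_API_KEY (please check the package’s API key reference for your version). The token value below is a placeholder, not a real key.

```r
# Option 1: set the key for the current R session only.
# "your-token-here" is a placeholder; replace it with your own token.
Sys.setenv(LOCATIONIQ_API_KEY = "your-token-here")

# Option 2 (persistent): add the line
#   LOCATIONIQ_API_KEY=your-token-here
# to your ~/.Renviron file and restart R.

# Confirm that R can see the variable, without printing the token itself
nzchar(Sys.getenv("LOCATIONIQ_API_KEY"))
```

Keeping the token in an environment variable rather than in the script means it doesn’t end up in a public repository along with the code.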

You can also use the Nominatim (“osm”) geocoding service (OpenStreetMap) which can be specified with the method argument (method = "osm"). I found LocationIQ to be faster.

The first step is to select only locations with missing coordinates and geocode them.

geocoded_locations <- mass_shootings_cln %>% 
  filter(is.na(latitude) | is.na(longitude)) %>% 
  select(location) %>% 
  geocode(location, method = "iq")

geocoded_locations %>% 
  mutate(across(where(is.numeric), ~ num(., digits = 6)))
# A tibble: 32 × 3
   location                                    lat        long
   <chr>                                 <num:.6!>   <num:.6!>
 1 Providence, Rhode Island              40.104680  -75.468655
 2 Grand Blanc, Michigan                 42.927528  -83.629952
 3 Southport Yacht Basin, North Carolina 34.399433  -77.642468
 4 Austin, Texas                         30.271129  -97.743700
 5 Anaconda, Montana                     46.129468 -112.953131
 6 Manhattan, New York                   40.757955  -73.985532
 7 Reno, Nevada                          39.526179 -119.812658
 8 Winder, Georgia                       33.991908  -83.718437
 9 Fordyce, Arkansas                     33.813716  -92.412930
10 Las Vegas, Nevada                     36.167426 -115.148413
# ℹ 22 more rows

The next step is to join the mass shootings table with the geocoded locations and replace the missing latitudes and longitudes with the geocoded values.

mass_shootings_cln <- mass_shootings_cln %>% 
  left_join(geocoded_locations, by = "location") %>% 
  mutate(latitude = ifelse(is.na(latitude), lat, latitude),
         longitude = ifelse(is.na(longitude), long, longitude))
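
One thing to watch with this join: if the same location string occurs in more than one case with missing coordinates, the lookup table contains repeated rows, and left_join then multiplies the matching cases. A minimal sketch of a safer variant deduplicates the lookup table first and uses coalesce() in place of ifelse(). The cases and lookup tibbles are invented for illustration, and their coordinates are rough values.

```r
library(dplyr)

# Toy data: two cases in the same city, plus one case that already has coordinates
cases <- tibble(
  case      = c("case A", "case B", "case C"),
  location  = c("Springfield, Illinois", "Springfield, Illinois", "Austin, Texas"),
  latitude  = c(NA, NA, 30.27),
  longitude = c(NA, NA, -97.74)
)

# Toy lookup table as a geocoder would return it: one row per input row,
# so the repeated location appears twice
lookup <- tibble(
  location = c("Springfield, Illinois", "Springfield, Illinois"),
  lat      = c(39.8, 39.8),
  long     = c(-89.6, -89.6)
)

cases_filled <- cases %>%
  # distinct() keeps one row per location, so the join cannot add rows
  left_join(distinct(lookup), by = "location") %>%
  # coalesce() keeps existing coordinates and fills only the missing ones
  mutate(latitude  = coalesce(latitude, lat),
         longitude = coalesce(longitude, long)) %>%
  select(-lat, -long)

# The number of cases should be unchanged after the join
nrow(cases_filled) == nrow(cases)
```

In the real pipeline, the deduplication could happen even earlier, with distinct(location) before the geocode() call, which would also save API requests.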

🔎 Checking for null values.

sum(is.na(mass_shootings_cln$latitude))
[1] 0
sum(is.na(mass_shootings_cln$longitude))
[1] 0

OK, no coordinates are missing anymore.
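
A zero count of missing values only shows that every location received some coordinates, not that they are the right ones. In the geocoding output above, for instance, the coordinates returned for Providence, Rhode Island (40.10, -75.47) lie in Pennsylvania. A rough plausibility check could compare each geocoded point with the approximate center of its state, using the state.center and state.area datasets that ship with R. The function name haversine_miles and the 1.5 threshold are my own assumptions for this sketch, not an established rule.

```r
# Great-circle distance in miles between two points (haversine formula)
haversine_miles <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 3958.8 * asin(sqrt(a))   # 3958.8 = mean Earth radius in miles
}

# Two rows copied from the geocoding output above
check <- data.frame(
  state = c("Rhode Island", "Texas"),
  lat   = c(40.104680, 30.271129),
  long  = c(-75.468655, -97.743700)
)

# Built-in reference data: approximate state centers and areas (square miles)
idx <- match(check$state, state.name)
check$dist_to_center <- haversine_miles(check$lat, check$long,
                                        state.center$y[idx], state.center$x[idx])

# Heuristic (an assumption): a point farther from the state center than
# 1.5 times the square root of the state's area deserves a manual look
check$suspicious <- check$dist_to_center > 1.5 * sqrt(state.area[idx])
check
```

This is only a screening aid. state.center places Alaska and Hawaii off the West Coast, and the District of Columbia is not in state.name, so those cases would need to be checked by hand.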

Exploratory data analysis (EDA)

❕ Writing a function

To count unique values for all categorical variables separately, I’ll write a function, count_unique, to avoid copying and pasting a block of code several times.

Here we have a special case where we have to pass a dataframe column name (variable) to a function argument. The solution is to embrace the argument by surrounding it in doubled braces, like group_by({{ var }}).

count_unique <- function(data, var) {
  
  data %>%
    group_by({{ var }}) %>%    
    summarise(count = n(), .groups = "drop") %>% 
    mutate(percent = scales::percent(count/sum(count), accuracy = 0.1)) %>% 
    arrange(desc(count))

}

📄 Breakdown by categorical variables

Gender

count_unique(mass_shootings_cln, gender)
# A tibble: 4 × 3
  gender               count percent
  <chr>                <int> <chr>  
1 Male                   152 96.2%  
2 Female                   3 1.9%   
3 Male & Female            2 1.3%   
4 Female (transgender)     1 0.6%   

Race

count_unique(mass_shootings_cln, race) 
# A tibble: 8 × 3
  race            count percent
  <chr>           <int> <chr>  
1 White              87 55.1%  
2 Black              26 16.5%  
3 <NA>               14 8.9%   
4 Latino             12 7.6%   
5 Asian              10 6.3%   
6 Other               5 3.2%   
7 Native American     3 1.9%   
8 Unclear             1 0.6%   

Specific location

count_unique(mass_shootings_cln, location_2)
# A tibble: 6 × 3
  location_2 count percent
  <chr>      <int> <chr>  
1 Other         60 38.0%  
2 Workplace     57 36.1%  
3 School        25 15.8%  
4 Religious      9 5.7%   
5 Military       6 3.8%   
6 Airport        1 0.6%   

Prior signs of mental health issues

count_unique(mass_shootings_cln, prior_signs_mental_health_issues)
# A tibble: 6 × 3
  prior_signs_mental_health_issues count percent
  <chr>                            <int> <chr>  
1 Yes                                 80 50.6%  
2 <NA>                                31 19.6%  
3 Unclear                             24 15.2%  
4 No                                  18 11.4%  
5 To be determined                     4 2.5%   
6 Unknown                              1 0.6%   

Weapons obtained legally

count_unique(mass_shootings_cln, weapons_obtained_legally)
# A tibble: 6 × 3
  weapons_obtained_legally count percent
  <chr>                    <int> <chr>  
1 Yes                        101 63.9%  
2 <NA>                        26 16.5%  
3 No                          16 10.1%  
4 To be determined             7 4.4%   
5 Unknown                      7 4.4%   
6 Partially                    1 0.6%   

Type

count_unique(mass_shootings_cln, type)
# A tibble: 2 × 3
  type  count percent
  <chr> <int> <chr>  
1 Mass    135 85.4%  
2 Spree    23 14.6%  

Note: Spree shootings have three or more victims in a short time in multiple locations.

📊 Age of shooter distribution

Code for creating the age_group column
# create "age group" column
mass_shootings_cln <-  mass_shootings_cln %>% 
  mutate(age_group = case_when(
    age_of_shooter >= 10 & age_of_shooter <= 14 ~ "10-14",
    age_of_shooter <= 19 ~ "15-19",
    age_of_shooter <= 24 ~ "20-24",
    age_of_shooter <= 29 ~ "25-29",
    age_of_shooter <= 34 ~ "30-34",
    age_of_shooter <= 39 ~ "35-39",
    age_of_shooter <= 44 ~ "40-44",
    age_of_shooter <= 49 ~ "45-49",
    age_of_shooter <= 54 ~ "50-54",
    age_of_shooter <= 59 ~ "55-59",
    age_of_shooter <= 64 ~ "60-64",
    age_of_shooter <= 69 ~ "65-69",
    age_of_shooter <= 74 ~ "70-74"))
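
As an aside, the same binning could be written more compactly with base R’s cut(). This is only an alternative sketch with a made-up ages vector. It assumes whole-number ages between 10 and 74, like the case_when version above.

```r
ages <- c(11, 14, 15, 23, 48, 74, NA)   # illustrative values only

lower <- seq(10, 70, by = 5)            # lower bound of each 5-year group
age_group <- cut(
  ages,
  breaks = c(lower, 75),                # 10, 15, ..., 70, 75
  labels = paste(lower, lower + 4, sep = "-"),
  right  = FALSE                        # intervals are [10, 15), [15, 20), ...
)

as.character(age_group)
```

cut() returns a factor whose levels are already in ascending order, and ages outside the 10 to 74 range become NA instead of silently falling into a neighboring group.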
p1 <- mass_shootings_cln %>% 
  filter(!is.na(age_group)) %>% 
  group_by(age_group) %>% 
  summarise(count = n(), .groups = "drop") %>% 
  mutate(percent = scales::percent(count/sum(count), accuracy = 0.1)) %>% 
  mutate(label_text = str_glue("Age group: {age_group}
                               Count: {count}
                               Percent: {percent}")) %>%
  ggplot(aes(x = age_group, y = count, text = label_text)) +
  geom_col(width = 0.7, fill = "indianred") +
  labs(title = "Age Distribution", x = "age group") 

ggplotly(p1, tooltip = "text")
[Interactive bar chart: Age Distribution. x-axis: age group (10-14 to 70-74), y-axis: count]


  • The vast majority of shooters were between 15 and 50 years old.
  • The age distribution is bimodal, with one mode around 23 years of age and a second mode around 41 years of age.
  • The largest age groups were 20-24 and 40-44 (17.3% each), followed by 25-29 (16.0%).

🔎 Who was the youngest shooter?

mass_shootings_cln %>% 
  slice_min(age_of_shooter, n = 1) %>% 
  select(case, date, summary, fatalities) %>% 
  knitr::kable()
case date summary fatalities
Westside Middle School killings 1998-03-24 Mitchell Scott Johnson, 13, and Andrew Douglas Golden, 11, two juveniles, ambushed students and teachers as they left the school; they were apprehended by police at the scene. 5

📊 Number of cases per year

p2 <- mass_shootings_cln %>%
  group_by(year) %>%
  summarise(count = n()) %>% 
  ggplot(aes(year, count)) +
  geom_col(fill = "steelblue") + 
  geom_smooth(method = "loess", se = FALSE, color = "indianred", linewidth = 0.7) +
  labs(title = "Number of Cases per Year") 

ggplotly(p2)
[Interactive bar chart with loess trend line: Number of Cases per Year. x-axis: year, y-axis: count]


  • We can see an increase in mass shootings in the last 12 years.
  • 2020 has fewer cases, probably due to COVID-19 restrictions.
  • There were only 2 cases of mass shootings in 2024, which is very encouraging!
  • Unfortunately, we don’t see the same trend in 2025.

📊 Fatalities-Injured relationship

p3 <- mass_shootings_cln %>%
  ggplot(aes(x = fatalities, y = injured)) +
  geom_jitter() +
  scale_y_sqrt() +
  labs(title = "Fatalities-Injured Relationship")
  
ggplotly(p3)
[Interactive scatter plot: Fatalities-Injured Relationship. x-axis: fatalities, y-axis: injured (square root scale)]


Please note that the Injured values are square root scaled for better visibility, but you can see the actual values by hovering over the points.

Summary of fatalities

summary(mass_shootings_cln$fatalities)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   4.000   5.000   7.487   8.000  60.000 

Summary of injured people

summary(mass_shootings_cln$injured)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    3.00   10.61    9.00  546.00 

📊 Total fatalities by state

🛠️ Data manipulation

# create us states with abbreviations tibble
states_with_abbr <- 
  tibble(state = state.name, abbr = state.abb) %>% 
  bind_rows(tibble(state = "District of Columbia", abbr = "DC"))

# data manipulation
by_state <- mass_shootings_cln %>% 
  # recode D.C. to District of Columbia
  mutate(location = ifelse(location == "Washington, D.C.", 
                           "Washington, District of Columbia", 
                           location)) %>% 
  # separate location into city and state
  separate(location, c("city", "state"), sep = ", ") %>% 
  # group and summarize
  group_by(state) %>% 
  summarise(total_cases = n(),
            total_fatalities = sum(fatalities), .groups = "drop") %>% 
  # add us states abbreviations
  left_join(states_with_abbr, by = "state") %>% 
  # rearrange columns
  select(state, abbr, everything())
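
Because the abbreviations are attached with a left_join, any state name that doesn’t match exactly (a typo in the spreadsheet, for example) would silently end up with a missing abbr and disappear from the map. A quick check, sketched here on toy data with a deliberately misspelled state, lists such rows with anti_join. The toy tibble is invented for illustration.

```r
library(dplyr)
library(tidyr)

# Same lookup table as above: 50 states plus the District of Columbia
states_with_abbr <- tibble(state = state.name, abbr = state.abb) %>%
  bind_rows(tibble(state = "District of Columbia", abbr = "DC"))

toy <- tibble(location = c("Austin, Texas",
                           "Washington, District of Columbia",
                           "Reno, Nevadda"))   # misspelled on purpose

# Rows whose state has no match in the lookup table
unmatched <- toy %>%
  separate(location, c("city", "state"), sep = ", ") %>%
  anti_join(states_with_abbr, by = "state")

unmatched
```

An empty result means every state name was recognized. Anything else points to a value worth fixing before plotting.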

📈 Top ten states by number of cases and fatalities

by_state %>% 
  arrange(-total_cases, -total_fatalities) %>% 
  head(10)
# A tibble: 10 × 4
   state        abbr  total_cases total_fatalities
   <chr>        <chr>       <int>            <dbl>
 1 California   CA             26              178
 2 Texas        TX             14              162
 3 Florida      FL             13              129
 4 Colorado     CO              8               53
 5 Washington   WA              7               37
 6 New York     NY              6               44
 7 Pennsylvania PA              6               32
 8 Wisconsin    WI              5               28
 9 Illinois     IL              5               25
10 Michigan     MI              5               22

📊 Total fatalities by state visualization

by_state %>% 
  plot_geo(locationmode = 'USA-states') %>% 
  add_trace(z = ~total_fatalities,
            locations = ~abbr,
            color = ~total_fatalities,
            colors = "Reds") %>% 
  layout(
    geo = list(
      scope = "usa",
      projection = list(type = "albers usa"),
      lakecolor = toRGB("white")
    )
  )

Shiny app

The app you can see below is embedded in this Quarto document since my website is static. It was originally published on shinyapps.io, where you can also interact with it.

Note: If you don’t see the application, I’ve used up the 25 active hours (hours when my applications are not idle) included in my free account. Sorry, the app will not be available again until the next monthly cycle. Hope you get lucky! 😊

📢 By clicking on each circle, you can read a summary of the mass shooting case.


Thanks for reading!