fastRhockey is an R package designed to pull play-by-play (and boxscore) data from the newest version of the Premier Hockey Federation (PHF) website. In the past, there have been a few scrapers for the PHF (formerly the NWHL), but they’ve all been deprecated since the league changed website formats.
With the seventh season of the league kicking off on November 6th and games being broadcast on ESPN+, this package was created to provide access to play-by-play data and keep pushing women’s hockey analytics forward.
In the spring of 2021, the Big Data Cup and the data it made available revolutionized what we were able to do, thanks to the detailed play-by-play data for the season and the x/y location data. That wave continued with the inaugural WHKYHAC conference in July, which produced some amazing conversations and projects in the women’s hockey space.
In the past, the lack of data and poor access to data have been the biggest barriers to entry in women’s hockey analytics, barriers that this package is intended to lower.
Installation
You can install the released version of fastRhockey from GitHub with:
# Install with the pacman package (pacman itself is installed first if missing):
if (!requireNamespace('pacman', quietly = TRUE)){
  install.packages('pacman')
}
pacman::p_load_current_gh("sportsdataverse/fastRhockey", dependencies = TRUE, update = TRUE)
If you would prefer the devtools installation:
# Install devtools first if it is not already available
if (!requireNamespace('devtools', quietly = TRUE)){
  install.packages('devtools')
}
devtools::install_github(repo = "sportsdataverse/fastRhockey")
Once the package has been installed, there’s a ton of stuff you can do. Let’s start by finding a game we’re interested in, say, the 2021 Isobel Cup Championship that the Boston Pride won.
library(fastRhockey)
library(dplyr) # provides the %>% pipe used below

# input the season that you're interested in looking up the schedule for
phf_schedule(season = 2021) %>%
  dplyr::filter(game_type == "Playoffs") %>%
  dplyr::filter(home_team_short == "MIN" & away_team_short == "BOS") %>%
  dplyr::select(game_id,
                date_group, facility,
                game_type,
                home_team, away_team,
                home_score, away_score,
                winner)
#> ── PHF Schedule Information from PremierHockeyFederation.com ───────────────────
#> ℹ Data updated: 2023-03-08 07:48:46 UTC
#> # A tibble: 1 × 9
#> game_id date_group facility game_…¹ home_…² away_…³ home_…⁴ away_…⁵ winner
#> <int> <chr> <chr> <chr> <chr> <chr> <int> <int> <chr>
#> 1 379254 2021-03-27 Warrior Ice… Playof… Minnes… Boston… 3 4 Bosto…
#> # … with abbreviated variable names ¹game_type, ²home_team, ³away_team,
#> # ⁴home_score, ⁵away_score
A couple of quick filters/selects later and we’ve pared down the data into a very manageable return. We can see that the Boston Pride beat the Minnesota Whitecaps 4-3 at Warrior Ice Arena on March 27th, 2021. The other important column in this return is the game_id column.
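To carry the game_id forward without copying it by hand, the same filter logic can be sketched in base R. The data frame below is a toy stand-in for the schedule, not a call to phf_schedule: the first row copies values from the output above, while the second row (its id and its "Regular Season" label) is a made-up placeholder so the filter has something to exclude.

```r
# Toy stand-in for a schedule; only the columns needed for the filter.
schedule <- data.frame(
  game_id         = c(379254L, 999999L),            # 999999 is a placeholder id
  game_type       = c("Playoffs", "Regular Season"), # second label is assumed
  home_team_short = c("MIN", "BOS"),
  away_team_short = c("BOS", "MIN"),
  stringsAsFactors = FALSE
)

# Same conditions as the dplyr::filter() calls: playoff game, MIN hosting BOS.
is_final <- schedule$game_type == "Playoffs" &
  schedule$home_team_short == "MIN" &
  schedule$away_team_short == "BOS"

final_id <- schedule$game_id[is_final]

# Guard against zero or multiple matches before using the id downstream.
stopifnot(length(final_id) == 1)
print(final_id)
```

Checking that exactly one game matched is a cheap way to catch a filter that is too loose or too strict before the id is passed to another function.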
Let’s take that game_id and plug it into another fastRhockey function, this time using the phf_team_box function to pull the boxscore data from this game.
x <- 379254
box <- phf_team_box(game_id = x)
box %>%
  dplyr::select(game_id,
                team,
                total_scoring, total_shots,
                successful_power_play, power_play_opportunities,
                faceoff_percent, takeaways)
#> ── PHF Team Boxscore Information from PremierHockeyFederation.com ──────────────
#> ℹ Data updated: 2023-03-08 07:48:53 UTC
#> # A tibble: 2 × 8
#> game_id team total_scoring total_shots successful_p…¹ power…² faceo…³ takea…⁴
#> <dbl> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 379254 bos 4 30 2 3 0.684 29
#> 2 379254 min 3 30 1 2 0.316 32
#> # … with abbreviated variable names ¹successful_power_play,
#> # ²power_play_opportunities, ³faceoff_percent, ⁴takeaways
Once again, I’ve selected some specific columns, but this is an example of the data that is returned by the phf_team_box function! We have counting stat data on shots/goals, both aggregated and by period, power play data, faceoff data, and how often a team takes/gives away the puck. It’s definitely helpful data and I believe that there are some really fun projects that can be done with just the phf_team_box function, but the really good stuff is still coming.
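As one small example of what the boxscore columns support, here is a sketch that derives shooting and power play percentages. The data frame is typed in by hand from the output above rather than fetched, and using total_shots as the shooting percentage denominator is an assumption about what that column counts.

```r
# Values copied from the phf_team_box output shown above.
box_toy <- data.frame(
  team                     = c("bos", "min"),
  total_scoring            = c(4L, 3L),
  total_shots              = c(30L, 30L),
  successful_power_play    = c(2, 1),
  power_play_opportunities = c(3, 2),
  stringsAsFactors = FALSE
)

# Goals per shot (assumes total_shots is the appropriate denominator).
box_toy$shooting_pct <- box_toy$total_scoring / box_toy$total_shots

# Power play goals per power play opportunity.
box_toy$power_play_pct <-
  box_toy$successful_power_play / box_toy$power_play_opportunities

print(box_toy[, c("team", "shooting_pct", "power_play_pct")])
```

In a single game these rates are noisy, so they are more useful once box scores from many games are stacked together.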
Turn your attention to phf_pbp, the function that was created to return PHF play-by-play data for a given game (i.e. the whole reason that fastRhockey exists). It’s a similar format to the boxscore function where the only input necessary is the game_id that you want.
Loading a single game should take about 5 seconds. Once it does, it’s time to have some fun. The phf_pbp function returns 79 columns, some with “boring” data, like who the teams are, etc. But then you get to the columns that look at how much time is remaining in a period, what the home skater vs away skater numbers are, what event occurred, who was involved, and so on.
# pull the play-by-play for the same game_id used for the boxscore
pbp <- phf_pbp(game_id = x)
dplyr::glimpse(pbp)
#> Rows: 207
#> Columns: 79
#> $ play_type <chr> "Goalie", "Goalie", "Faceoff", "Shot", "Shot…
#> $ team <chr> "Boston Pride", "Minnesota Whitecaps", "Bost…
#> $ time <chr> "00:00", "00:00", "00:01", "00:17", "00:24",…
#> $ play_description <chr> "Starting Goalie #35 Lovisa Selander", "Star…
#> $ scoring_team_abbrev <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ scoring_team_on_ice <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_name_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_name_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_name_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_name_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_name_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_name_6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defending_team_abbrev <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defending_team_on_ice <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_name_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_name_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_name_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_name_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_name_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_name_6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ period_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ game_id <dbl> 379254, 379254, 379254, 379254, 379254, 3792…
#> $ game_date <chr> "Playoffs", "Playoffs", "Playoffs", "Playoff…
#> $ home_team <chr> "Minnesota Whitecaps", "Minnesota Whitecaps"…
#> $ home_location <chr> "Minnesota", "Minnesota", "Minnesota", "Minn…
#> $ home_nickname <chr> "Whitecaps", "Whitecaps", "Whitecaps", "Whit…
#> $ home_abbreviation <chr> "MIN", "MIN", "MIN", "MIN", "MIN", "MIN", "M…
#> $ home_score_total <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ away_team <chr> "Boston Pride", "Boston Pride", "Boston Prid…
#> $ away_location <chr> "Boston", "Boston", "Boston", "Boston", "Bos…
#> $ away_nickname <chr> "Pride", "Pride", "Pride", "Pride", "Pride",…
#> $ away_abbreviation <chr> "BOS", "BOS", "BOS", "BOS", "BOS", "BOS", "B…
#> $ away_score_total <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
#> $ home_goalie <chr> NA, "Amanda Leveille", "Amanda Leveille", "A…
#> $ away_goalie <chr> "Lovisa Selander", "Lovisa Selander", "Lovis…
#> $ home_goalie_jersey <chr> NA, "29", "29", "29", "29", "29", "29", "29"…
#> $ away_goalie_jersey <chr> "35", "35", "35", "35", "35", "35", "35", "3…
#> $ goalie_change <chr> "Starting", "Starting", NA, NA, NA, NA, NA, …
#> $ penalty <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ penalty_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ penalty_level <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ penalty_length <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ on_ice_situation <chr> "Even Strength", "Even Strength", "Even Stre…
#> $ shot_type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ shot_result <chr> NA, NA, NA, "saved", "saved", NA, "saved", N…
#> $ score <chr> "0 - 0 T", "0 - 0 T", "0 - 0 T", "0 - 0 T", …
#> $ minute_start <dbl> 0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4,…
#> $ second_start <dbl> 0, 0, 1, 17, 24, 35, 47, 22, 36, 6, 12, 31, …
#> $ clock <chr> "20:00", "20:00", "19:59", "19:43", "19:36",…
#> $ offensive_player_jersey_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_jersey_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_jersey_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_jersey_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_jersey_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_jersey_6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_jersey_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_jersey_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_jersey_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_jersey_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_jersey_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_jersey_6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ leader <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T",…
#> $ away_goals <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0",…
#> $ home_goals <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0",…
#> $ sec_from_start <dbl> 0, 0, 1, 17, 24, 95, 107, 142, 156, 186, 192…
#> $ power_play_seconds <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ start_power_play <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ end_power_play <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ goalie_involved <chr> NA, NA, NA, "Amanda Leveille", "Amanda Levei…
#> $ time_elapsed <chr> "00:00", "00:00", "00:01", "00:17", "00:24",…
#> $ time_remaining <chr> "20:00", "20:00", "19:59", "19:43", "19:36",…
#> $ player_name_1 <chr> "Starting Goalie Lovisa Selander", "Startin…
#> $ player_name_2 <chr> NA, NA, "Meaghan Pezon", "Amanda Leveille", …
#> $ player_name_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ player_jersey_1 <chr> "35", "29", "14", "17", "17", "21", "21", "2…
#> $ player_jersey_2 <chr> NA, NA, "15", "29", "29", "18", "35", "20", …
#> $ player_jersey_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ home_skaters <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
#> $ away_skaters <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
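The timing columns in that glimpse relate to each other in a simple way, which is handy when re-deriving them or sanity-checking a game. The sketch below uses the first eleven values of minute_start and second_start printed above. The formula and the fixed 20-minute period length are assumptions that happen to match the printed first-period values; overtime periods may well differ.

```r
# First eleven first-period values copied from the glimpse output above.
period_id    <- rep(1L, 11)
minute_start <- c(0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 3)
second_start <- c(0, 0, 1, 17, 24, 35, 47, 22, 36, 6, 12)

period_length <- 20 * 60 # assumption: 20-minute regulation periods

# Seconds elapsed within the period, then since the start of the game.
sec_in_period  <- minute_start * 60 + second_start
sec_from_start <- (period_id - 1) * period_length + sec_in_period

# Time left in the period, formatted as MM:SS like the clock column.
remaining <- period_length - sec_in_period
clock <- sprintf("%02d:%02d",
                 as.integer(remaining %/% 60),
                 as.integer(remaining %% 60))

print(data.frame(sec_from_start, clock))
```

The recomputed values line up with the sec_from_start and clock columns shown in the glimpse for these rows.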
There’s data on who took a shot (if a shot occurs), as well as who the primary (and secondary) assisters were and who the goalie was. Penalties are recorded along with the time assigned for a trip to the box.
One of the more interesting findings from the PHF set-up was that they ID all five offensive players on the ice when a goal is scored, so that’s available as well. Unfortunately, it’s hard to derive any sort of plus/minus stat from this since it’s only the offensive players at the time of a goal. If the offensive and defensive lineups were provided we could create a +/-, but that remains out of reach for now.
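Even without a true +/-, the on-ice columns support a one-sided tally: how many goals each player was on the ice for. This is only a sketch of that idea. The player names are placeholders, the rows are invented, and only three of the offensive_player_name_* columns are included to keep it short.

```r
# Invented goal rows; each row is one goal, each column one on-ice skater.
goals_toy <- data.frame(
  offensive_player_name_1 = c("Player A", "Player A", "Player F"),
  offensive_player_name_2 = c("Player B", "Player C", "Player G"),
  offensive_player_name_3 = c("Player C", NA,         "Player A"),
  stringsAsFactors = FALSE
)

# Pick up every on-ice column by its shared prefix.
on_ice_cols <- grep("^offensive_player_name_", names(goals_toy), value = TRUE)

# Flatten to one long vector of names; table() drops the NA slots by default.
on_ice <- unlist(goals_toy[on_ice_cols], use.names = FALSE)
goals_for_on_ice <- sort(table(on_ice), decreasing = TRUE)

print(goals_for_on_ice)
```

This covers only the "plus" half; the "minus" half would need the defending lineup at the time of each goal, which is exactly the piece described as missing.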
Here’s an example of the things that one can now build with the play-by-play data that is generated from phf_pbp. This is a quick graph showing cumulative shot attempts by point in the game for Boston and Minnesota.
library(ggplot2)

pbp %>%
  # flag every event that counts as a shot attempt
  dplyr::mutate(shot = ifelse(play_type %in% c("PP Goal", "Goal",
                                               "Pen Shot", "Shot",
                                               "Shot BLK"), 1, 0)) %>%
  dplyr::group_by(team) %>%
  # running total of shot attempts per team
  dplyr::mutate(total_shots = cumsum(shot)) %>%
  ggplot() +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_line(aes(x = sec_from_start, y = total_shots, color = team),
            linewidth = 2) +
  scale_color_manual(values = c("Boston Pride" = "#b18c1e",
                                "Minnesota Whitecaps" = "#1c449c")) +
  labs(y = "Total Shots",
       title = "Boston Pride vs Minnesota Whitecaps - 3/27/2021",
       subtitle = "Total Shots by Minute of Game") +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    axis.line = element_line(linewidth = 1),
    legend.position = "bottom",
    axis.title.x = element_blank(),
    axis.text.x = element_text(size = 11),
    plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
    plot.subtitle = element_text(face = "italic", hjust = 0.5, size = 12),
    legend.title = element_blank()
  ) +
  scale_x_continuous(breaks = c(1200, 2400, 3600, 3800),
                     labels = c("End 1st", "End 2nd", "End 3rd", " ")) +
  scale_y_continuous(limits = c(0, 40))
It’s a simple graph, but one that can easily help illustrate game flow. The Pride’s shots came in bunches, with a ton arriving about halfway through both the first and third periods. Minnesota started the game slowly, but their shots came fairly consistently throughout the game.
There’s so much more that can be explored from this play-by-play data, whether you want to explore how winning a faceoff leads to a shot attempt or the chaos that can follow giveaways.
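As a starting point for the faceoff question, here is a rough sketch that flags faceoffs followed by a shot attempt from the same team within a short window. The rows are invented, the 10-second window is arbitrary, and it assumes the team on a Faceoff row is the team that won the draw, which should be checked against real data first.

```r
# Invented events; play_type labels reuse ones that appear in this README.
pbp_toy <- data.frame(
  play_type = c("Faceoff", "Shot", "Faceoff", "Shot",
                "Shot BLK", "Faceoff", "Goal"),
  team = c("Boston Pride", "Boston Pride",
           "Minnesota Whitecaps", "Minnesota Whitecaps",
           "Boston Pride", "Boston Pride", "Boston Pride"),
  sec_from_start = c(1, 17, 95, 100, 142, 300, 304),
  stringsAsFactors = FALSE
)

shot_types <- c("PP Goal", "Goal", "Pen Shot", "Shot", "Shot BLK")
window <- 10 # seconds after the faceoff; arbitrary choice

faceoff_rows <- which(pbp_toy$play_type == "Faceoff")

# For each faceoff, look for a later shot attempt by the same team
# that happens no more than `window` seconds afterwards.
led_to_shot <- vapply(faceoff_rows, function(i) {
  later   <- seq_len(nrow(pbp_toy)) > i
  is_shot <- pbp_toy$play_type %in% shot_types
  same    <- pbp_toy$team == pbp_toy$team[i]
  in_time <- pbp_toy$sec_from_start - pbp_toy$sec_from_start[i] <= window
  any(later & is_shot & same & in_time)
}, logical(1))

print(data.frame(faceoff_row = faceoff_rows, led_to_shot))
print(mean(led_to_shot)) # share of faceoffs followed by a quick shot
```

The same pattern works for giveaways by swapping the triggering play_type and looking for shots from the opposing team instead.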
That’s a quick primer on the main functions of the package. phf_schedule returns schedule information and game_ids, which can be used in phf_team_box or phf_pbp to return boxscore or play-by-play data. phf_game_all wraps the boxscore/play-by-play and several other game summary tables into one and returns a list with the dataframes: plays, team_box, skaters, goalies, game_details, scoring_summary, shootout_summary, penalty_summary, officials, team_staff, timeouts.
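Since the schedule hands back many game_ids, a natural next step is looping over them. The sketch below shows one way to do that while skipping games that fail to load. collect_games and fake_fetch are hypothetical helpers written for this example, not part of fastRhockey, and the fetch function is passed in as an argument so the example runs without touching the network.

```r
# Hypothetical helper: apply `fetch` to each id and stack the results,
# skipping any game whose fetch raises an error.
collect_games <- function(game_ids, fetch) {
  results <- lapply(game_ids, function(id) {
    tryCatch(
      fetch(id),
      error = function(e) {
        message("Skipping game ", id, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  results <- Filter(Negate(is.null), results)
  if (length(results) == 0) {
    return(data.frame())
  }
  do.call(rbind, results)
}

# Stand-in for a real fetcher: game 2 fails on purpose.
fake_fetch <- function(id) {
  if (id == 2L) stop("no data for this game")
  data.frame(game_id = id,
             play_type = c("Faceoff", "Shot"),
             stringsAsFactors = FALSE)
}

all_plays <- collect_games(c(1L, 2L, 3L), fake_fetch)
print(all_plays)
```

With the package installed, the stub could be swapped for something like function(id) fastRhockey::phf_pbp(game_id = id); stacking with rbind assumes every game returns the same columns.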
The last function that may be of some use is phf_league_info, which essentially pulls a lot of background info on the league and the IDs that are used. The output from this function gets wrapped into phf_schedule, which is its main purpose.
If you look with fastRhockey::, there are more functions available, but those are helper functions to pull raw data (phf_game_raw) and then to process the raw data into a usable format (helper_phf____).
Citations
To cite the fastRhockey R package in publications, use:
BibTeX Citation
@misc{howell_gilani_fastRhockey_2021,
author = {Ben Howell and Saiem Gilani},
title = {fastRhockey: The SportsDataverse's R Package for Hockey Data.},
url = {https://fastRhockey.sportsdataverse.org/},
year = {2021}
}