
fastRhockey is an R package designed to pull play-by-play (and boxscore) data from the newest version of the Premier Hockey Federation (PHF) website. In the past, there have been a few scrapers for the PHF (formerly the NWHL), but they’ve all been deprecated since the league changed website formats.

With the seventh season of the league kicking off on November 6th, and games being broadcast on ESPN+, this package was created to provide access to play-by-play data and keep pushing women’s hockey analytics forward.

In the spring of 2021, the Big Data Cup and the data it made available revolutionized what we were able to do, thanks to detailed play-by-play data for the season and x/y location data. That wave continued with the inaugural WHKYHAC conference in July, which produced some amazing conversations and projects in the women’s hockey space.

In the past, the lack of data and poor access to it have been the biggest barriers to entry in women’s hockey analytics, barriers that this package is intended to lower.


Installation

You can install the released version of fastRhockey from GitHub with:

# You can install using the pacman package using the following code:
if (!requireNamespace('pacman', quietly = TRUE)){
  install.packages('pacman')
}
pacman::p_load_current_gh("sportsdataverse/fastRhockey", dependencies = TRUE, update = TRUE)

If you would prefer the devtools installation:

# Alternatively, install with the devtools package:
if (!requireNamespace('devtools', quietly = TRUE)){
  install.packages('devtools')
}
devtools::install_github(repo = "sportsdataverse/fastRhockey")

Once the package has been installed, there’s a ton of stuff you can do. Let’s start by finding a game we’re interested in, say, the 2021 Isobel Cup Championship that the Boston Pride won.

library(fastRhockey)
library(dplyr)  # provides the %>% pipe used throughout

# input the season whose schedule you want to look up
phf_schedule(season = 2021) %>%
  dplyr::filter(game_type == "Playoffs") %>%
  dplyr::filter(home_team_short == "MIN" & away_team_short == "BOS") %>%
  dplyr::select(game_id,
                date_group, facility,
                game_type, 
                home_team, away_team,
                home_score, away_score,
                winner)
#> ── PHF Schedule Information from PremierHockeyFederation.com ───────────────────
#>  Data updated: 2023-03-08 07:48:46 UTC
#> # A tibble: 1 × 9
#>   game_id date_group facility     game_…¹ home_…² away_…³ home_…⁴ away_…⁵ winner
#>     <int> <chr>      <chr>        <chr>   <chr>   <chr>     <int>   <int> <chr> 
#> 1  379254 2021-03-27 Warrior Ice… Playof… Minnes… Boston…       3       4 Bosto…
#> # … with abbreviated variable names ¹​game_type, ²​home_team, ³​away_team,
#> #   ⁴​home_score, ⁵​away_score

A couple of quick filters/selects later and we’ve pared down the data into a very manageable return. We can see that the Boston Pride beat the Minnesota Whitecaps 4-3 at Warrior Ice Arena on March 27th, 2021. The other important column in this return is the game_id column.
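Rather than copying the game_id by hand, it can be pulled straight out of the filtered schedule. The snippet below is a minimal sketch on a toy data frame that stands in for the phf_schedule return. The first row uses values from the output above, while the second row (including its game_id and its "Regular Season" label) is made up purely for illustration.

```r
library(dplyr)

# Toy stand-in for the phf_schedule() return.
# Row 1 mirrors the output above; row 2 is a made-up (hypothetical) game.
schedule_toy <- data.frame(
  game_id         = c(379254L, 111111L),
  game_type       = c("Playoffs", "Regular Season"),
  home_team_short = c("MIN", "BOS"),
  away_team_short = c("BOS", "MIN"),
  stringsAsFactors = FALSE
)

# Same filters as above, then pull() returns the id as a plain vector.
final_id <- schedule_toy %>%
  dplyr::filter(game_type == "Playoffs",
                home_team_short == "MIN",
                away_team_short == "BOS") %>%
  dplyr::pull(game_id)

final_id
```

The resulting vector can then be passed as the game_id argument of the game-level functions.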

Let’s take that game_id and plug it into another fastRhockey function, this time using the phf_team_box function to pull the boxscore data from this game.

x <- 379254

box <- phf_team_box(game_id = x)

box %>%
  dplyr::select(game_id, 
                team, 
                total_scoring, total_shots,
                successful_power_play, power_play_opportunities,
                faceoff_percent, takeaways)
#> ── PHF Team Boxscore Information from PremierHockeyFederation.com ──────────────
#>  Data updated: 2023-03-08 07:48:53 UTC
#> # A tibble: 2 × 8
#>   game_id team  total_scoring total_shots successful_p…¹ power…² faceo…³ takea…⁴
#>     <dbl> <chr>         <int>       <int>          <dbl>   <dbl>   <dbl>   <dbl>
#> 1  379254 bos               4          30              2       3   0.684      29
#> 2  379254 min               3          30              1       2   0.316      32
#> # … with abbreviated variable names ¹​successful_power_play,
#> #   ²​power_play_opportunities, ³​faceoff_percent, ⁴​takeaways

Once again, I’ve selected some specific columns, but this is an example of the data that is returned by the phf_team_box function! We have counting stat data on shots/goals, both aggregated and by period, power play data, faceoff data, and how often a team takes/gives away the puck. It’s definitely helpful data and I believe that there are some really fun projects that can be done with just the phf_team_box function, but the really good stuff is still coming.
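As one small example of what the boxscore columns allow, here is a sketch that derives a power play conversion rate. It runs on a toy data frame built from the values printed above, so it works offline. The column name power_play_pct is my own and is not something the package returns.

```r
library(dplyr)

# Toy copy of the boxscore columns shown above (values taken from that output).
box_toy <- data.frame(
  team                     = c("bos", "min"),
  successful_power_play    = c(2, 1),
  power_play_opportunities = c(3, 2),
  stringsAsFactors = FALSE
)

# power_play_pct is a derived, hypothetical column name.
# Teams with zero opportunities get NA instead of a division by zero.
box_rates <- box_toy %>%
  dplyr::mutate(
    power_play_pct = ifelse(power_play_opportunities > 0,
                            successful_power_play / power_play_opportunities,
                            NA_real_)
  )

box_rates
```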

Turn your attention to phf_pbp, the function that was created to return PHF play-by-play data for a given game (i.e. the whole reason that fastRhockey exists). It’s a similar format to the boxscore function where the only input necessary is the game_id that you want.

a <- Sys.time()

pbp <- phf_pbp(game_id = x)
  
Sys.time() - a
#> Time difference of 6.959072 secs

Loading a single game should take roughly 5 to 7 seconds. Once it does, it’s time to have some fun. The phf_pbp function returns 79 columns, some with “boring” data, like who the teams are, etc. But then you get to the columns that look at how much time is remaining in a period, what the home skater vs away skater numbers are, what event occurred, who was involved, and so on.

dplyr::glimpse(pbp)
#> Rows: 207
#> Columns: 79
#> $ play_type                 <chr> "Goalie", "Goalie", "Faceoff", "Shot", "Shot…
#> $ team                      <chr> "Boston Pride", "Minnesota Whitecaps", "Bost…
#> $ time                      <chr> "00:00", "00:00", "00:01", "00:17", "00:24",…
#> $ play_description          <chr> "Starting Goalie #35 Lovisa Selander", "Star…
#> $ scoring_team_abbrev       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ scoring_team_on_ice       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_name_1   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_name_2   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_name_3   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_name_4   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_name_5   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_name_6   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defending_team_abbrev     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defending_team_on_ice     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_name_1   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_name_2   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_name_3   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_name_4   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_name_5   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_name_6   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ period_id                 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ game_id                   <dbl> 379254, 379254, 379254, 379254, 379254, 3792…
#> $ game_date                 <chr> "Playoffs", "Playoffs", "Playoffs", "Playoff…
#> $ home_team                 <chr> "Minnesota Whitecaps", "Minnesota Whitecaps"…
#> $ home_location             <chr> "Minnesota", "Minnesota", "Minnesota", "Minn…
#> $ home_nickname             <chr> "Whitecaps", "Whitecaps", "Whitecaps", "Whit…
#> $ home_abbreviation         <chr> "MIN", "MIN", "MIN", "MIN", "MIN", "MIN", "M…
#> $ home_score_total          <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ away_team                 <chr> "Boston Pride", "Boston Pride", "Boston Prid…
#> $ away_location             <chr> "Boston", "Boston", "Boston", "Boston", "Bos…
#> $ away_nickname             <chr> "Pride", "Pride", "Pride", "Pride", "Pride",…
#> $ away_abbreviation         <chr> "BOS", "BOS", "BOS", "BOS", "BOS", "BOS", "B…
#> $ away_score_total          <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
#> $ home_goalie               <chr> NA, "Amanda Leveille", "Amanda Leveille", "A…
#> $ away_goalie               <chr> "Lovisa Selander", "Lovisa Selander", "Lovis…
#> $ home_goalie_jersey        <chr> NA, "29", "29", "29", "29", "29", "29", "29"…
#> $ away_goalie_jersey        <chr> "35", "35", "35", "35", "35", "35", "35", "3…
#> $ goalie_change             <chr> "Starting", "Starting", NA, NA, NA, NA, NA, …
#> $ penalty                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ penalty_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ penalty_level             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ penalty_length            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ on_ice_situation          <chr> "Even Strength", "Even Strength", "Even Stre…
#> $ shot_type                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ shot_result               <chr> NA, NA, NA, "saved", "saved", NA, "saved", N…
#> $ score                     <chr> "0 - 0 T", "0 - 0 T", "0 - 0 T", "0 - 0 T", …
#> $ minute_start              <dbl> 0, 0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4,…
#> $ second_start              <dbl> 0, 0, 1, 17, 24, 35, 47, 22, 36, 6, 12, 31, …
#> $ clock                     <chr> "20:00", "20:00", "19:59", "19:43", "19:36",…
#> $ offensive_player_jersey_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_jersey_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_jersey_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_jersey_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_jersey_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ offensive_player_jersey_6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_jersey_1 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_jersey_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_jersey_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_jersey_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_jersey_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ defensive_player_jersey_6 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ leader                    <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T",…
#> $ away_goals                <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0",…
#> $ home_goals                <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0",…
#> $ sec_from_start            <dbl> 0, 0, 1, 17, 24, 95, 107, 142, 156, 186, 192…
#> $ power_play_seconds        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ start_power_play          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ end_power_play            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ goalie_involved           <chr> NA, NA, NA, "Amanda Leveille", "Amanda Levei…
#> $ time_elapsed              <chr> "00:00", "00:00", "00:01", "00:17", "00:24",…
#> $ time_remaining            <chr> "20:00", "20:00", "19:59", "19:43", "19:36",…
#> $ player_name_1             <chr> "Starting Goalie  Lovisa Selander", "Startin…
#> $ player_name_2             <chr> NA, NA, "Meaghan Pezon", "Amanda Leveille", …
#> $ player_name_3             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ player_jersey_1           <chr> "35", "29", "14", "17", "17", "21", "21", "2…
#> $ player_jersey_2           <chr> NA, NA, "15", "29", "29", "18", "35", "20", …
#> $ player_jersey_3           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ home_skaters              <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
#> $ away_skaters              <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…

There’s data on who took a shot (if a shot occurs), as well as who the primary (and secondary) assisters were and who the goalie was. Penalties are recorded, along with the time assigned for a trip to the box.
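The time columns are related in a simple way. In the output above, an event at minute 1, second 35 of the first period has a sec_from_start of 95. The helper below is a sketch of that arithmetic, assuming 20-minute periods throughout. The function name clock_to_seconds is hypothetical and not part of the package, and overtime periods of a different length would need their own handling.

```r
# Convert an "MM:SS" elapsed-time string plus a period number into seconds
# from the start of the game, assuming every period lasts 1200 seconds.
# clock_to_seconds is a hypothetical helper, not a fastRhockey function.
clock_to_seconds <- function(time_elapsed, period_id, period_length = 1200) {
  parts   <- strsplit(time_elapsed, ":", fixed = TRUE)
  minutes <- as.numeric(vapply(parts, `[`, character(1), 1))
  seconds <- as.numeric(vapply(parts, `[`, character(1), 2))
  (period_id - 1) * period_length + minutes * 60 + seconds
}

# Two first-period events and one second-period event.
clock_to_seconds(c("00:17", "01:35", "02:22"), period_id = c(1, 1, 2))
```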

One of the more interesting findings from the PHF set-up was that they ID all five offensive players on the ice when a goal is scored, so that’s available as well. Unfortunately, it’s hard to derive any sort of plus/minus stat from this since it’s only the offensive players at the time of a goal. If the offensive and defensive lineups were provided we could create a +/-, but that remains out of reach for now.

Here’s an example of the things that one can now build with the play-by-play data that is generated from phf_pbp. This is a quick graph showing cumulative shot attempts by point in the game for Boston and Minnesota.

library(ggplot2)

pbp %>%
  # flag every shot attempt (goals, shots, penalty shots, and blocked shots)
  dplyr::mutate(shot = ifelse(play_type %in% c("PP Goal", "Goal",
                                    "Pen Shot", "Shot", 
                                    "Shot BLK"), 1, 0)) %>%
  dplyr::group_by(team) %>%
  # running count of shot attempts for each team
  dplyr::mutate(total_shots = cumsum(shot)) %>%
  ggplot() +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_line(aes(x = sec_from_start, y = total_shots, color = team),
            linewidth = 2) +
  scale_color_manual(values = c("Boston Pride" = "#b18c1e",
                                "Minnesota Whitecaps" = "#1c449c")) +
  labs(y = "Total Shots",
       title = "Boston Pride vs Minnesota Whitecaps - 3/27/2021",
       subtitle = "Total Shots by Minute of Game") +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    axis.line = element_line(linewidth = 1),
    legend.position = "bottom",
    axis.title.x = element_blank(),
    axis.text.x = element_text(size = 11),
    plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
    plot.subtitle = element_text(face = "italic", hjust = 0.5, size = 12),
    legend.title = element_blank()
  ) +
  scale_x_continuous(breaks = c(1200, 2400, 3600, 3800),
                     labels = c("End 1st", "End 2nd", "End 3rd", " ")) +
  scale_y_continuous(limits = c(0, 40))

It’s a simple graph, but one that can easily help illustrate game flow. The Pride’s shots came in bunches, with a ton of them arriving about halfway through both the first and third periods. Minnesota started the game slowly, but their shots came fairly consistently throughout the game.

There’s so much more that can be explored from this play-by-play data, whether you want to explore how winning a faceoff leads to a shot attempt or the chaos that can follow giveaways.
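As a starting point for the faceoff question, here is a rough sketch that checks whether the event right after a faceoff is a shot attempt by the same team within a few seconds. It runs on a small made-up data frame that borrows the play_type, team, and sec_from_start column names from phf_pbp. It assumes the team on a faceoff row is the team that won the draw, which I have not verified, and the 10-second window is an arbitrary choice.

```r
library(dplyr)

# Made-up events for illustration only; column names follow phf_pbp().
pbp_toy <- data.frame(
  play_type = c("Faceoff", "Shot", "Faceoff", "Shot BLK",
                "Faceoff", "Goal", "Faceoff"),
  team = c("Boston Pride", "Boston Pride", "Minnesota Whitecaps",
           "Boston Pride", "Boston Pride", "Boston Pride",
           "Minnesota Whitecaps"),
  sec_from_start = c(1, 8, 60, 65, 100, 130, 140),
  stringsAsFactors = FALSE
)

shot_types <- c("PP Goal", "Goal", "Pen Shot", "Shot", "Shot BLK")

faceoffs <- pbp_toy %>%
  dplyr::arrange(sec_from_start) %>%
  dplyr::mutate(
    # look one event ahead with lead()
    next_type = dplyr::lead(play_type),
    next_team = dplyr::lead(team),
    next_gap  = dplyr::lead(sec_from_start) - sec_from_start,
    # TRUE only when the next event is a same-team attempt within 10 seconds
    led_to_shot = next_type %in% shot_types &
      next_team == team &
      next_gap <= 10
  ) %>%
  dplyr::filter(play_type == "Faceoff")

# Share of faceoffs followed by a quick same-team shot attempt
mean(faceoffs$led_to_shot)
```

On real data this would be run per game (or grouped by game_id) so that lead() never looks across two different games.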

That’s a quick primer on the main functions of the package. phf_schedule returns schedule information and game_ids, which can be used in phf_team_box or phf_pbp to return boxscore or play-by-play data. phf_game_all wraps the boxscore/play-by-play and several other game summary tables into one and returns a list with the dataframes: plays, team_box, skaters, goalies, game_details, scoring_summary, shootout_summary, penalty_summary, officials, team_staff, timeouts.
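Since every game-level function takes a single game_id, pulling several games means looping. Below is a minimal sketch of one way to do that while skipping games that fail to load. The fetch_games helper and the stub_pbp stand-in are both hypothetical. In practice phf_pbp would be passed as fetch_fun, assuming each call returns the same set of columns so that rbind can stack them.

```r
# Fetch several games with any single-game function and stack the results.
# fetch_games is a hypothetical helper, not part of fastRhockey.
fetch_games <- function(game_ids, fetch_fun) {
  results <- lapply(game_ids, function(id) {
    tryCatch(
      fetch_fun(game_id = id),
      error = function(e) {
        # Report the failure and carry on with the remaining games.
        message("Skipping game ", id, ": ", conditionMessage(e))
        NULL
      }
    )
  })
  # Drop the failed games, then row-bind what is left.
  results <- Filter(Negate(is.null), results)
  do.call(rbind, results)
}

# Hypothetical stub standing in for phf_pbp(); it fails on id 2 on purpose.
stub_pbp <- function(game_id) {
  if (game_id == 2) stop("no data for this game")
  data.frame(game_id = game_id, play_type = c("Faceoff", "Shot"),
             stringsAsFactors = FALSE)
}

all_pbp <- fetch_games(c(1, 2, 3), stub_pbp)
all_pbp
```

If the columns differ between games, dplyr::bind_rows is a more forgiving replacement for the rbind step.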

The last function that may be of some use is phf_league_info, which essentially pulls a lot of background info on the league and the IDs that are used. The output from this function gets wrapped into phf_schedule, which is its main purpose.

If you look with fastRhockey::, there are more functions available, but those are helper functions to pull raw data (phf_game_raw) and then to process the raw data into a usable format (helper_phf____).


Citations

To cite the fastRhockey R package in publications, use:

BibTeX Citation

@misc{howell_gilani_fastRhockey_2021,
  author = {Ben Howell and Saiem Gilani},
  title = {fastRhockey: The SportsDataverse's R Package for Hockey Data.},
  url = {https://fastRhockey.sportsdataverse.org/},
  year = {2021}
}