Thursday, April 30, 2015

Meta Analytics

In this post we take a step back from analytics' measures and metrics, and focus instead on the analytics movement as a whole. From this vantage point we can better respond to the self-perceived shortcomings of the community, such as the notion that analytics aren't accepted by NHL management. More importantly, we see how we can effectively communicate and acquire knowledge as a community. This search for a deeper understanding of the game is why we're all here.


Knowledge


If knowledge is what is known, then the acquisition of knowledge is the process of converting the unknown into the known. Fifteen years ago, few if any knew that shot attempt comparisons would be more predictive of future wins than winning itself. Vic Ferrari and other bloggers discovered this unknown, and knowledge was gained.

One of the greatest fallacies is to treat the unknown as though it doesn't exist at all. Within the analytics community this has resulted in a belittling of 'intangibles' such as work ethic, leadership, and team identity. Just because these are not formally measured (publicly), we cannot discount the impact they potentially have on the success of a hockey club (we would also be discounting the body of research on organizational culture and sport psychology). A myopic fixation on the known has led many to believe the game of hockey comes down to possession and luck. Such a view discounts what we haven't yet measured, and in a sense don't yet know.

The fallacy of treating the unknown as non-existent provides insight into the perceived lack of acceptance of analytics at the NHL level. In all likelihood we underestimate the degree to which NHL teams use analytics because we don't know the degree to which NHL teams use analytics. Brian Burke is still pegged as 'anti-analytics' despite his multiple assertions to the contrary. At the MIT Sloan Conference this past February, Burke stated that "analytics is a tool in the toolbox, and it’s an important tool", and in a video titled Brian Burke Still Not A Big Analytics Fan For Evaluating Players, he admits "we use [analytics] a great deal".

In this video, which predates the Summer of Analytics by nearly two years, Calgary Flames Director of Video & Statistical Analysis Chris Snow discusses the PUCKS software, used by 17 NHL teams. This excerpt in particular is noteworthy:
"Our video coach is marking all the events that the league does not. So things such as exits from your own zone, breaking out, entries, dump-ins, anything that will be important for the coaches as a teaching tool [...]. Let's say that we really value how and where we dump the puck. He could have a category that says 'dump-in' and there could be a sub-category such as a 'soft chip' to myself or a 'hard rim'. So we could very quickly drill down and look at how a particular player dumped the puck in as we attempted to get our forecheck going."

PUCKS software is more geared towards video analysis than statistical analysis, but the point is that teams care a great deal about this information and have been tracking it for a long time. It's also in a team's best interest to keep quiet about their analytical activities. As Oilers analytics consultant Tyler Dellow put it, "if you do something good now that crosses my plate I'm not telling anyone about it."

Analytics are intel—business critical information. Considering the competitive advantage they confer, teams have incentives to mislead outsiders about their use of analytics. There are counter-incentives to this, namely that knowledge is best nurtured in a community (think open source), and indeed this post highlights how the analytics community is in itself a competitive advantage. Still, professional sports teams have the monetary incentives and the resources to use and advance analytics. Almost a year removed from the Summer of Analytics, it's likely that the community is now behind most NHL teams with respect to data-driven hockey knowledge. Assuming otherwise would be foolish and serves no practical purpose other than to massage the ego. Such is the nature of the knowledge-fallacy that treats the unknown as non-existent.

Known metrics like Corsi and PDO have their merits, but a focus on what we don't yet know and a determination to make it known are what lead to breakthroughs and new discoveries. As knowledge seekers, it is our responsibility to venture into the unknown and shine a light on what we discover. The challenge is twofold: how do we discover what is unknown, and how do we make it known? To answer these questions, we turn to the Scientific Method.

Discovering the Unknown

I know that I know nothing. —Socrates
Scientific papers begin with the literature review for good reason. Discovering the unknown first requires that we know what's already known. Regularly reviewing the literature helps to refine our current knowledge base, and often uncovers new research and insights previously unknown to us.

The key when reviewing the literature is to seek out the limitations in the research. Limitations highlight the unknowns, providing a clear direction for further research. This is so important that many scientific articles end with a section entitled "Future Directions" which explicitly states what future research can be done to account for the limitations. In Stefan Wolejszo's Five Core Components of a Critical Approach to Hockey Analytics, he writes "put limits of methods front and center." As Tyler Dellow put it, “your work is better if you try and figure out where the mistakes are in it.”

Finding limitations in our own work and the work of others is challenging. One useful approach is to deconstruct the definitions and methodologies used. To use a personal example, this blog joined the Passing Project in response to the shot quality debate. In Tom Awad's paper Does Shot Quality Exist?, shot quality is defined using five factors. The critical question to ask is: do these five factors capture shot quality, or is there more to it? If there is more to shot quality, then the conclusion cannot fully speak to shot quality's existence or lack thereof, and further research is necessary.

Note that this does not invalidate Awad’s research. Not at all. In fact his research improved upon previous methodologies investigating shot quality, and his findings revealed insights into each of the five factors he investigated. Awad’s research provided the foundation for projects like the Passing Project, the Shot Quality Project, and Steve Valiquette’s work on Red and Green shots.

The literature review, in addition to expanding our own knowledge, allows readers to seek out the prerequisite knowledge they need to understand the matter at hand. This is hugely beneficial because it removes barriers to entry, allowing the community to grow at a faster rate. As Wolejszo notes in his excellent piece on confirmation bias, "[...] the success of the scientific method has stemmed from the fact that scientists are very motivated to disprove the work of others rather than a critical attitude that individual scientists bring to their own work." In other words, we need one another to hold each other accountable and to push knowledge forward. The more contributors the community has, the better it is at doing this.

Note that this post is not suggesting that every analytics article be worthy of submission to a scientific journal. Nor is it suggesting that every analytics article have a formal lit review. However, citing the work our contribution builds on—recognizing the shoulders of giants we stand on—is the critical first step to advancing knowledge.

Observation

"They asked the panel ‘what should you do to get into the business?’ and not one person said watch games. They all said work on your algorithms, or whatever else you do. This is still an eyeballs business." —Brian Burke at the Sloan Conference, Feb 2015.
Discovering the unknown requires scientific observation on two fronts: a survey of the literature, as discussed, and observation of the subject itself. According to Wolejszo, "research consistently shows that the best method of mitigating the potential for confirmation bias is having experience within the substantive domain". With respect to hockey analytics, this means that watching and playing the game are vital to producing good research. Steve Valiquette and Chris Boyle, two researchers who have challenged the notion that shot quality is relatively insignificant, are both goaltenders (Valiquette being a former NHLer). This is not a coincidence. Goalies are trained to stop shots and they know, from experience, that some shots are harder to save than others.

Zone entry research is a great example of a discovery realized because the authors were up to date on the literature and watching the games with a critical eye. The genesis of the research seems to be a March 2011 post in which Travis Hughes shows game footage of the Flyers turning the puck over in the neutral zone and suggests the Flyers keep it simple by getting more pucks in deep. The following month Eric Tulsky, with the help of Geoff Detweiler, published zone entry results for their first tracked game. A revelatory finding was the metric 'shot attempts per entry', which combined their novel data with previous knowledge on Corsi (shot attempts). Tulsky and Detweiler discovered a new way to judge performance because they worked off the literature and drew inspiration from the game itself.


Making the Unknown Known


Collecting data is critical to research projects like Eric Tulsky’s and Steve Valiquette’s. Novel data records information previously unconsidered, and begins the process of making the unknown known. Data collection requires that we watch the games. In other words, we are scientifically mandated to partake in the eye-test.

Rather than doing away with the eye-test, the goal is to proceduralize the eye-test so it doesn't fall prey to biases. The collected data points must have operational definitions that limit observer subjectivity. For example if we're collecting 'shots on goal', what a shot on goal is and isn't must be explicitly defined. Inter-rater reliability testing is also critical. This ensures that different observers collect the same data for the same set of occurrences. Ryan Stimson's Passing Project and Tulsky's zone entry research both use operational definitions and test for inter-rater reliability. The NHL does neither.
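One common way to put a number on inter-rater reliability is Cohen's kappa, which compares how often two observers agree with how often they would agree by chance. The sketch below is only an illustration: the entry labels and both trackers' logs are invented, and nothing here is meant to suggest that the Passing Project or the zone entry research uses this particular statistic.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same events."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of events")
    n = len(rater_a)
    # Observed agreement: share of events given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: expected overlap if each rater labelled at random
    # according to their own label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: two trackers classify the same ten zone entries.
tracker_1 = ["carry", "dump", "carry", "carry", "dump", "pass", "dump", "carry", "dump", "carry"]
tracker_2 = ["carry", "dump", "carry", "dump", "dump", "pass", "dump", "carry", "carry", "carry"]

kappa = cohens_kappa(tracker_1, tracker_2)
print(f"kappa = {kappa:.2f}")
```

A kappa of 1 means the two trackers agree on every event, and a kappa near 0 means they agree no more often than chance would predict. A low value is a signal that the operational definitions need tightening before more games are tracked.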

The availability of the NHL’s data is a luxury in that the data collection process is taken care of, but it has also created a disconnect in the scientific process. The NHL's data collectors couldn't care less about how the data is analyzed, and those who analyze the data take the collection process for granted. The result is a disturbing amount of error in the data and a fanbase polarized between those who watch the games and those who crunch the numbers. This is a false dichotomy. The reality is that the eye-test and analytics are both integral components of the scientific process.


Conclusion


As a community, we have the potential to push the boundaries of hockey knowledge further than they’ve ever been. This requires a critical reading of the literature, focused not on what we already know but on what we don’t. Watch the games. If possible, play the game. At the end of the day we’re studying the game of hockey, and insights into how we can further study hockey come from the game itself. As a community we have the ability to collect large sets of novel data, which are the foundation of new discoveries and knowledge. Join a public tracking project such as Emmanuel Perry’s Blueline Events or Ryan Stimson’s Passing Project. As more professional teams invest resources into proprietary analytics, we have no choice but to use our advantage in numbers.

The other option is irrelevancy.

* * *

References

McKenzie: The real story of how Corsi got its name. Bob McKenzie. http://www.tsn.ca/mckenzie-the-real-story-of-how-corsi-got-its-name-1.100011

SSAC15: The Future of the Game. https://www.youtube.com/watch?v=XfqMn3Dg-aE

Brian Burke Still Not A Big Analytics Fan For Evaluating Players. https://www.youtube.com/watch?v=miaOJ_ln6rU



SSAC15: Changing on the Fly: The State of Advanced Analytics in the NHL. https://www.youtube.com/watch?v=cjR4lX36i0E






Introducing the Shot Quality Project. Chris Boyle. http://www.sportsnet.ca/hockey/nhl/introducing-the-shot-quality-project/

Video Breakdown: How Steve Valiquette will change how we think about goaltending. Kevin Power. http://www.blueshirtbanter.com/2015/1/6/7500845/video-breakdown-how-steve-valiquette-will-change-how-we-think-about

Revisiting Confirmation Bias. Stefan Wolejszo. http://www.storiesnumberstell.com/revisiting-confirmation-bias/

Step 1, Identify The Problem: Flyers Offensive Game Lacks Simplicity. Travis Hughes. http://www.broadstreethockey.com/2011/3/15/2050075/philadelphia-flyers-turnovers-problem


Using Zone Entry To Separate Offensive, Neutral, And Defensive Zone Performance. Eric Tulsky, Geoffrey Detweiler, Robert Spencer, Corey Sznajder. http://www.sloansportsconference.com/wp-content/uploads/2013/Using%20Zone%20Entry%20Data%20To%20Separate%20Offensive,%20Neutral,%20And%20Defensive%20Zone%20Performance.pdf

Preliminary Analysis of Error in NHL's RTSS Data. 'C' of Stats. http://cofstats.blogspot.ca/2015/02/draft-preliminary-analysis-of-error-in.html

Wednesday, February 25, 2015

NHL Analytics: The Good, The Bad, & The Future

The NHL has officially adopted advanced enhanced #fancystats analytics. Now more than ever, we need to take a critical look at where these analytics stand, and where they are going.


SAT (formerly, Corsi)


SAT (shot attempts or Corsi) is an indirect way of measuring a player or team's offensive zone possession. Since possession time is not directly measured by the league, SAT% measures this by comparing the total shot attempts a team takes versus the total shot attempts that same team allows. This possession proxy is so powerful that it has proven to be more predictive of future wins than any other single metric.
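As a quick illustration of the arithmetic, SAT% is the share of all shot attempts in a sample that belong to one team. The function name and the counts below are made up for the example; they are not taken from any real game.

```python
def sat_pct(attempts_for, attempts_against):
    """Percentage of all shot attempts in a sample that were taken by one team."""
    total = attempts_for + attempts_against
    if total == 0:
        raise ValueError("no shot attempts recorded")
    return 100.0 * attempts_for / total


# A shot attempt is any shot on goal, missed shot, or blocked shot (invented counts).
team_for = 30 + 12 + 14      # 56 attempts taken
team_against = 25 + 10 + 9   # 44 attempts allowed

print(sat_pct(team_for, team_against))
```

A value above 50 means the team took more attempts than it allowed, which is what the metric treats as a proxy for having the puck in the offensive zone more often.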

But SAT metrics are not without limitations. To them, every shot attempt is considered equal whether that shot is a one-timer from the slot or a backhander from the point. This equal weighting is useful for approximating possession, but as a result SAT metrics provide no insights into the quality of shots teams generate. As the NHL and its teams adopt these metrics, it's more important than ever to grasp what these metrics are, what they are not, and how they can be used. A misuse now has real consequences.

A former NHL coach is on the record as saying:
"Analytics to me are no more than stats. So if you're for goals and assists and points then why wouldn't you be looking at these other stats too? I was a big fan of [our analytics guy]; I pushed hard for the hire. I like going in depth and checking everything off. You know, if you're going to stand back and say 'hey it takes 95 points to make the playoffs,' can't you say that 'hey if you get a Corsi of fifty percent, seventy percent of those teams make the playoffs,' or 'if you get a team Corsi of fifty-two and a half percent, ninety percent of those teams are going to make the playoffs.' The problem with just looking at 'hey 95 points and you're in', it's hard to go back and, you know, measure 'ok what do you need?' When you get into the team Corsi, and this is where things worked really well with [our analytics guy], is you're actually able to go in and check off every part of your system-work, compare it to other teams that are tops in certain categories, see what they're doing, see if you need to adjust it, where's your personnel at, what are the challenges. But it just adds another layer of putting together a system and a team."
At first glance, this seems like a refreshing use of advanced analytics in the NHL. But upon further analysis, serious misconceptions are evident. Let's go through this, red flag by red flag.

"Analytics to me are no more than stats. So if you're for goals and assists and points then why wouldn't you be looking at these other stats too?"
In and of itself this rhetorical question is innocuous. This note serves only to point out the critical difference between "goals, assists and points" versus "other stats". Goals (and by association, assists) are the only stats that count for wins. Team points are the only stats that count for playoffs. This isn't to say that "other stats" aren't important; indeed the ultimate goal of any organization is to find the right combination of factors - at every level - that lead to success. But an NHL organization must understand that goals and points are the only true measures of success. (These are Laws #1 and #2 of the influential and evolving 2008 piece, The Ten Laws of Hockey Analytics).

"[...] can't you say that 'hey if you get a Corsi of fifty percent, 70% of those teams make the playoffs,' or 'if you get a team Corsi of fifty-two and a half percent, 90% of those teams are going to make the playoffs.'"
The coach is venturing into dangerous territory. His if-Corsi-then-playoffs semantics imply causality. It's true that SAT% and the probability of making the playoffs are strongly related, but there is no evidence to suggest that SAT% causes teams to make the playoffs.

"When you get into the team Corsi [...] you're actually able to go in and check off every part of your system-work".
Further questioning would likely yield nuance, but in and of itself this statement is not true. There's a hell of a lot more to the game and its systems than a comparison of shot attempts. Corsi/shot attempts/SAT is an approximation for possession. Nothing more, nothing less.

Later in the interview, the coach says: "Last year we were at like 44% Corsi [...] and we were able to push that to right around 51 [...]. And that number of 51 over the long term will eventually pay some dividends."
The coach is stating that taking the majority of shots (i.e. having a Corsi greater than 50%) will cause goals and wins - in his words "pay some dividends". This is in direct violation of Hockey Analytics Law #7 which warns: "do not confuse correlation with causation". This is critical. Variables that are non-causal are not to be manipulated directly. Attempts to do so will have no effect on the desired outcome, or worse, have detrimental effects. In the case of this team, they increased their SAT% from 44 to 51 relative to their prior year, yet their record and goal differential declined.


The take-home message


SAT% is not a causal variable. It's an output, not an input. Good teams have good SAT percentages. A good SAT percentage doesn't make a team good.

The ironic reality with SAT is that it is only useful insofar as teams and players do not try to manipulate it directly. The following passage from The Ten Laws of Hockey Analytics serves as a good reminder to this fact:
We know that [SAT] tells us something. But [SAT] is still a measurement of the result and only an indirect observation of the process. It is also a reflection of hockey culture and era. As an example, consider the Traditional Russian Style of hockey. This style absolutely emphasized puck possession, but in this era and culture the puck was never to be wasted on a poor scoring chance – the name of the game was high shot quality. [SAT] would not have been a useful tool.
What contributes to puck possession? [SAT] describes it, today, to a degree. The search is ongoing for better measures of the underlying process. 


PDO (officially, SPSV%)


[Note: PDO is a better acronym than SPSV, so we're sticking with PDO. Percent Defense Offense is a good reading even though this was not the original intent.]

When the coach cited above states that his team's SAT% "will eventually pay some dividends", he's likely referring to PDO, the so-called 'puck luck' stat. PDO adds together a team's save percentage with its shooting percentage, and amazingly most teams are unable to maintain a high or low PDO. Thus, the logic goes, a high PDO reflects a team's good fortunes, and a low PDO bad fortunes.

But keep in mind what the metric actually measures: save percentage + shooting percentage. There are factors other than blind luck that can affect these statistics. A team's poor defensive play can result in a lower save percentage. A team that shoots from the perimeter will have a lower shooting percentage.
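For concreteness, here is a minimal sketch of the calculation. It assumes both percentages are expressed on a 0 to 100 scale, so that a league-average PDO sits around 100; the function name and the goal and shot counts are invented for illustration.

```python
def pdo(goals_for, shots_for, goals_against, shots_against):
    """Shooting percentage plus save percentage, both on a 0-100 scale."""
    shooting_pct = 100.0 * goals_for / shots_for
    save_pct = 100.0 * (shots_against - goals_against) / shots_against
    return shooting_pct + save_pct


# Invented sample: 10 goals on 100 shots for, 8 goals allowed on 100 shots against.
print(pdo(goals_for=10, shots_for=100, goals_against=8, shots_against=100))
```

The two inputs are separate quantities, one driven by the team's shooters and one by its goaltending and defensive play, which is why a single PDO number cannot say which of them is responsible for a hot or cold stretch.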

PDO is a neat metric, but it raises more questions than it answers. Is the team getting lucky? Or is the team's play actually generating higher quality scoring chances? Or both? Is the goalie not playing well? Getting unlucky? Fatigued? Or is the team allowing higher quality chances against? Are the forwards not back-checking? Are coverages in the D-zone being blown? Is the team turning the puck over too much?

This frustration has been expressed within NHL organizations. Elliotte Friedman quotes one assistant GM as saying, “The thing I hate most about [PDO] is how the (bleep) am I supposed to guess when a player’s luck is supposed to change? Do I just guess? If I trade him when he’s lucky and he continues to stay lucky, are you going to tell your fans, ‘Well, the law of averages said he wasn’t supposed to continue like this’?”

This frustration highlights the current limitations of hockey analytics. SAT% indicates possession and PDO reveals potentially unsustainable levels of shooting and save percentage, but they don't tell us why or how. Hence the search for "better measures of the underlying process."


Zone Exits, Entries and Passing Data.


Zone exits, entries, and passing data are the foundation of the next level of analytics. These datasets capture the systems within the game of hockey - the inputs, the "underlying process". Zone entry data has already shown that controlled carry-ins create more than twice as many shots and goals as dump-ins. Unlike SAT and PDO, these findings prescribe clear directives for teams and players. In this case, teams should carry the puck into the zone whenever possible.
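To show how tracked entries turn into a comparison like this, the sketch below computes shot attempts per entry for each entry type. The log format, the function name, and every number in it are invented for illustration; they do not reproduce the published zone entry results.

```python
from collections import defaultdict

# Invented tracking log: (entry type, shot attempts generated on that entry).
entries = [
    ("carry", 1), ("dump", 0), ("carry", 2), ("dump", 1),
    ("carry", 0), ("dump", 0), ("carry", 1), ("dump", 0),
]


def attempts_per_entry(log):
    """Average number of shot attempts generated per entry, by entry type."""
    totals = defaultdict(lambda: [0, 0])  # entry type -> [entries, shot attempts]
    for entry_type, attempts in log:
        totals[entry_type][0] += 1
        totals[entry_type][1] += attempts
    return {kind: shots / count for kind, (count, shots) in totals.items()}


rates = attempts_per_entry(entries)
for kind, rate in sorted(rates.items()):
    print(f"{kind}: {rate:.2f} shot attempts per entry")
```

The same grouping can be done per player instead of per entry type, which is how a comparison between two linemates' entries would be built from the same log.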

Beyond the direct application at the team level, this data pinpoints specific strengths and weaknesses in a player's game. The same study on zone entries uncovered that Danny Briere was better at entering the offensive zone with possession than his linemate Wayne Simmonds. Yet it was Simmonds who attempted most of the zone entries, a strategic flaw the Flyers could have corrected had they been armed with the information.

Passing data puts the shot quality debate to rest. Early studies indicated that shot location was the only meaningful factor in determining a shot's success rate, but this research didn't consider the sequence of events preceding each shot. From his own datasets, former NHL goaltender Stephen Valiquette has separated shots into "red shots" and "green shots" based on specific criteria, such as whether the shot results from a pass across the Royal Road. His preliminary findings show that 76% of all NHL goals come from green shots. Chris Boyle's Shot Quality Project reveals that a goalie's save percentage is only "0.651 on shots immediately following a pass". Among other pieces of data, Ryan Stimson's Passing Project records the zone from which passes originate, providing insights into rush shots and shots generated from the cycle.

The major hurdle facing these new measures is data acquisition. Zone exits, entries, and passing data are not collected by the NHL, so it's left to dedicated hockey researchers to watch the games and record the data. Many are waiting on Sportvision's "advanced player tracking" technology to provide this data, but questions remain. What data, if any, will be released to the public? And when?

These datasets will also have their own limitations. Assessing intent is critical with regard to passing, but to Sportvision a pass in traffic would be indistinguishable from a play in which the puck is knocked off the player's stick. The technology will undoubtedly revolutionize hockey analytics, but the fact remains that human observation will be an instrumental part of the data collection process for the foreseeable future.

Who said science was easy?


* * *


References

What Statistics Are Meaningful In A Given Season? Steve Burtch. http://www.pensionplanpuppets.com/2013/7/10/4508094/what-statistics-are-meaningful-in-a-given-season-corsi-fenwick-PDO-hits-fights-blocked-shots

The Ten Laws of Hockey Analytics. Alan Ryder(?) http://hockeyanalytics.com/2008/01/the-ten-laws-of-hockey-analytics/

30 Thoughts: Analytics say Oilers luck should turn. Elliotte Friedman. http://www.sportsnet.ca/hockey/nhl/elliotte-friedman-nhl-30-thoughts-edmonton-oilers-john-davidson-montreal-canadiens/

Using Zone Entry Data To Separate Offensive, Neutral, And Defensive Zone Performance. Eric Tulsky, Geoffrey Detweiler, Robert Spencer, Corey Sznajder. http://www.sloansportsconference.com/wp-content/uploads/2013/Using%20Zone%20Entry%20Data%20To%20Separate%20Offensive,%20Neutral,%20And%20Defensive%20Zone%20Performance.pdf

Does Shot Quality Exist? Tom Awad. http://www.hockeyprospectus.com/puck/article.php?articleid=540

Unmasked: Analytics provide new evaluation tools. Kevin Woodley. http://www.nhl.com/ice/news.htm?id=744483


Thursday, February 5, 2015

Preliminary Analysis of Error in NHL's RTSS Data

[Edit: Error in RTSS data was first documented in 2007 by Alan Ryder in this paper. Thanks to an anonymous commenter for pointing this out.]


Introduction


It's no secret that the NHL's Real Time Scoring System (RTSS) is flawed. In 2009, hockey analytics pioneer JLikens wrote about the prevalence of bias in the NHL's shot data. Michael Schuckers and Brian MacDonald have since developed models to correct for the level of statistically-evident bias in each NHL rink. These types of adjustments are helpful, but the true level of error cannot be observed without a comparative data set. Few have taken on the challenge of collecting the data themselves. Chris Boyle is one of these brave souls. In 2013, he conducted a comparative analysis in which he found that 10 of 32 shots on goal "were off by more than 10 feet" and that approximately half of the shots "were accurate to within 5 feet."

The analysis herein is similar to Boyle's, but thanks to the incredible work done by WAR On Ice, we can take it a few analytical steps further. Using WAR On Ice's shot plots, we assess the RTSS location error for goals, shots on goal, missed shots, and blocked shots. Accurate shot locations are plotted as an overlay on the RTSS plots, allowing us to clearly see the discrepancies.


Methods


Game selection.
To select which games to analyze, three criteria had to be met. The games had to be (1) recently played, (2) in different rinks and, (3) with no selection bias on my part. The easiest solution was to pick one day with a decent number of games. This day was January 21st 2015.

Location capture.
To capture the location from which each shot was taken, NHL Gamecenter was used. Slow-motion replay captured the location of the puck at the moment before it began moving towards the goal as a result of the shooting motion or, in the case of deflections, at the moment it was tipped. While I don't have video software that returns precise x,y coordinates, the offensive zone provides concrete reference points like the faceoff circles and hash marks. Using these I was able to locate shots with a high level of precision.

RTSS shot locations are plotted by WAR On Ice's coloured letters: the red G for goal; the blue S for shot on goal; the black M for missed shot; and the green B for blocked shot. The Actual shot locations are plotted with bolded rings. Black lines have been added to connect RTSS and Actual plots where it may not be obvious that they represent the same shot.


Goal Data

We begin our analysis with goal data. This is the shot-type that affords the official NHL "scorers" the greatest possible benefit of the doubt. This is the only shot-type that comes guaranteed with replays and a 45-second break. In other words, we can assume (but we won't) that goal location data is the most accurate of all the shot-types. Goals are also the rarest of the shot-types, making them easier to distinguish on the plots.

For this comparison the Actual plot rings are colour-coded to reflect the degree of error in the RTSS plots. I'm using the following criteria to measure error: the distance between the Actual and the RTSS plot, the shooting angle difference between the Actual and the RTSS plot, and whether they differ in being in/out of the scoring chance area. Because I want to publish this before the Analytics Conference and don't have time to measure the actual deviations in distance and shooting angle, I'm using the following three-point scale:
  1. Accurate (green). RTSS and Actual plots have some degree of overlap, and all of the error criteria listed above are met.
  2. Acceptable (orange). RTSS and Actual are not a significant distance apart, and at least 2 of the 3 error criteria are met. 
  3. Unacceptable (red). One or fewer criteria are met.
(Feel free to be your own judge, too.)
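For anyone who does have coordinates for both plots, the three criteria could be checked mechanically along the following lines. This is a sketch under loud assumptions, not the method used here, where the ratings were made by eye: the net position, the distance and angle tolerances, and the circular "scoring chance area" are all hypothetical placeholders, and the overlap requirement for an Accurate rating is simplified to meeting all three criteria.

```python
import math

NET = (89.0, 0.0)          # assumed net location, in feet from centre ice
MAX_DISTANCE_FT = 5.0      # hypothetical tolerance between the two plots
MAX_ANGLE_DEG = 10.0       # hypothetical tolerance in shooting angle
CHANCE_RADIUS_FT = 30.0    # crude, hypothetical stand-in for the scoring chance area


def shooting_angle(x, y):
    """Signed angle in degrees between the shot and the line straight out from the net."""
    return math.degrees(math.atan2(y - NET[1], NET[0] - x))


def in_scoring_area(x, y):
    """True if the location is inside the placeholder scoring chance area."""
    return math.hypot(NET[0] - x, NET[1] - y) <= CHANCE_RADIUS_FT


def rate_plot(actual, rtss):
    """Return 'accurate', 'acceptable' or 'unacceptable' for one shot."""
    distance = math.hypot(actual[0] - rtss[0], actual[1] - rtss[1])
    angle_gap = abs(shooting_angle(*actual) - shooting_angle(*rtss))
    angle_gap = min(angle_gap, 360.0 - angle_gap)  # handle wrap-around behind the net
    criteria_met = sum([
        distance <= MAX_DISTANCE_FT,
        angle_gap <= MAX_ANGLE_DEG,
        in_scoring_area(*actual) == in_scoring_area(*rtss),
    ])
    if criteria_met == 3:
        return "accurate"
    if criteria_met == 2:
        return "acceptable"
    return "unacceptable"


# Invented coordinate pairs: (actual location, RTSS location).
print(rate_plot((75, 5), (74, 6)))
print(rate_plot((60, 0), (58, 0)))
print(rate_plot((80, 20), (60, -5)))
```

Changing the tolerances will move shots between categories, so any numbers produced this way are only as meaningful as the thresholds chosen.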


Results


Note: war-on-ice adjusts RTSS location data, so these results reflect the error in RTSS that persists after being adjusted by war-on-ice.


Game 1 - Toronto @ Ottawa at the Canadian Tire Centre.
  • 4/7 Accurate
  • 2/7 Acceptable
  • 1/7 Unacceptable


Game 2 - Chicago @ Pittsburgh at Consol Energy Center.

  • 3/4 Accurate
  • 1/4 Acceptable


Game 3 - Columbus @ Winnipeg at the MTS Centre

  • 3/4 Accurate
  • 1/4 Unacceptable


Game 4 - Boston @ Colorado at the Pepsi Center.

  • 3/4 Accurate
  • 1/4 Acceptable


Game 5 - Calgary @ Anaheim at the Honda Center

  • 3/8 Accurate
  • 2/8 Acceptable
  • 3/8 Unacceptable



Game 6 - Los Angeles @ San Jose at the SAP Center

  • 1/5 Accurate
  • 2/5 Acceptable
  • 2/5 Unacceptable
  • (The barred goal represents Couture's empty-netter and is therefore not counted)


Totals: 32 non-empty net goals were scored in 6 different buildings.
  • 53% of the goals were accurately plotted (17/32)
  • 25% were acceptably plotted (8/32)
  • 22% were unacceptably plotted (7/32)
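The totals can be reproduced from the per-game counts listed above. The short script below simply sums the three categories across the six games and converts them to percentages; the variable names are arbitrary.

```python
# Per-game goal ratings from the six games above: (accurate, acceptable, unacceptable).
games = {
    "Toronto @ Ottawa": (4, 2, 1),
    "Chicago @ Pittsburgh": (3, 1, 0),
    "Columbus @ Winnipeg": (3, 0, 1),
    "Boston @ Colorado": (3, 1, 0),
    "Calgary @ Anaheim": (3, 2, 3),
    "Los Angeles @ San Jose": (1, 2, 2),
}

# Sum each category across games, then express it as a share of all goals.
totals = [sum(column) for column in zip(*games.values())]
goal_count = sum(totals)
shares = [round(100 * count / goal_count, 1) for count in totals]

for label, count, share in zip(("accurate", "acceptable", "unacceptable"), totals, shares):
    print(f"{label}: {count}/{goal_count} ({share}%)")
```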


Goals at MSG


WAR On Ice co-founder @acthomas noted that their models deal with a relatively high level of error/bias with data from Madison Square Garden. Using the same methods as above, we look at the 5 most recent games played at MSG going back from Jan 21st 2015. This provides a recent sample of games without any selection bias on my part.


Results


Game 1 - January 20th 2015 vs. Ottawa
  • 2/5 Accurate
  • 2/5 Acceptable
  • 1/5 Unacceptable


Game 2 - January 13th 2015 vs. NY Islanders
  • 1/3 Accurate
  • 1/3 Acceptable
  • 1/3 Unacceptable


Game 3 - January 3rd 2015 vs. Buffalo
  • 1/7 Accurate
  • 3/7 Acceptable
  • 3/7 Unacceptable - 1 of these represents a first period goal that is missing from WAR On Ice's shot plot. This could be due to two goals having the exact same coordinates and therefore overlapping each other perfectly.


Game 4 - December 27th 2014 vs. New Jersey
  • 1/3 Accurate
  • 1/3 Acceptable - significant change in shooting angle
  • 1/3 Unacceptable - significant change in shooting angle, and difference in scoring chance area.


Game 5 - December 23rd 2014 vs. Washington
  • 1/6 Accurate
  • 2/6 Acceptable
  • 3/6 Unacceptable - the most egregious of these is a goal that was initially credited to St. Louis in front of the net, but was later credited to Rick Nash from the right circle. This is not necessarily the fault of the scorers, but rather a fault in the nature of real time data.  

Totals at MSG: 24 non-empty net goals were scored
  • 25% were plotted accurately (6/24)
  • 37.5% were plotted acceptably (9/24)
  • 37.5% were plotted unacceptably (9/24)
The data suggests that MSG does indeed have a problem with collecting location data, above and beyond the error already exhibited in every single arena for which data was collected.


Shots on Goal, Missed Shots, Blocked Shots


Below are the shot plots from the first, second, and third periods of the Calgary @ San Jose game which took place on January 17th, 2015.

In the modified plots below, the Actual plot rings are given the same shot-type colour code that WAR On Ice uses to plot the RTSS data. The Actual rings are given diagonal bars in cases where a shot-type discrepancy occurs between the Actual plots and the RTSS plots. The ring colour represents the Actual shot-type, and the bar colour represents the shot-type recorded by RTSS. A red bar indicates RTSS didn't track the shot at all.


1st Period




2nd Period


3rd Period


The same three-point accuracy test used for goals is applied to each shot. Note that due to the volume of shots, it's more difficult to relate each Actual plot to its RTSS counterpart. I've done this as best as I can, offering the benefit of the doubt to RTSS where there is discretion.

In the first three periods of this game, a total of 33 shots on goal and 20 missed shots were taken.


Shots on Goal

  • 30.3% were accurately plotted (10/33)
  • 48.5% were acceptably plotted (16/33)
  • 21% were unacceptably plotted (7/33)


Missed Shots

  • 35% were accurately plotted (7/20)
  • 45% were acceptably plotted (9/20)
  • 20% were unacceptably plotted (4/20)

Shots on goal and missed shots scored similarly. This is to be expected because there is no inherent difference between tracking a shot on goal and a missed shot in terms of location.


Blocked Shots


The Actual and RTSS plots for blocked shots have different operational definitions, making a location comparison impossible. The Actual plots show the location from where each shot is taken. Conversely, RTSS data plots the location of where the shot gets blocked. This is a critical consideration when developing scoring chance metrics using RTSS blocked shot location data. If the NHL is getting serious about stats like Corsi, the league must understand that blocked shots are important because of the shot itself, not because the shot gets blocked. As such, the NHL should ensure that location data for blocked shots captures where on the ice the shot is taken from. (You can track both locations if your heart so desires.) 
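
One way to keep both locations is sketched below in Python. The `ShotAttempt` record, its field names, and the coordinate convention are all assumptions made for illustration; they do not describe the RTSS format or any tracker's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (x, y) rink coordinates; the units and origin are assumptions.
Point = Tuple[float, float]


@dataclass
class ShotAttempt:
    shooter: str
    shot_type: str                      # "SOG", "MS" or "BS"
    origin: Point                       # where the shot is taken from
    blocked_at: Optional[Point] = None  # where the block happens, if any

    def location_for_scoring_chance(self) -> Point:
        # Scoring chance metrics care about where the shot came from,
        # so the origin is used even when the shot is blocked.
        return self.origin


# Made-up example: a blocked shot with both locations recorded.
shot = ShotAttempt("Player A", "BS", origin=(60.0, 20.0), blocked_at=(75.0, 8.0))
print(shot.location_for_scoring_chance())
```

Storing the block location as an optional extra keeps the shot origin comparable across all three shot types.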


Insertion, Deletion, & Substitution Errors


I currently have count discrepancies in the number of shots recorded by RTSS and myself for two complete games. Moving forward, I will compare all of my tracked shot data against RTSS data. It provides for a level of quality control, however minimal, and keeps tabs on the number of shot errors in RTSS data.

I compared my data to RTSS by going through both sets shot by shot, tagging every discrepancy and reviewing it using NHL Gamecenter video. In this analysis, only non-discretionary discrepancies are included. Three types of errors are tracked:
  • Insertion errors - RTSS records a shot where a shot should not be recorded (i.e. false positive)
  • Deletion errors - RTSS does not record a shot where a shot should be recorded (i.e. false negative)
  • Substitution errors - RTSS records the shooting player or shot type incorrectly.
Each shooting play can only be credited with one error, so that the total number of errors represents the total number of faulty shooting plays.
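
The three error types can be sketched in Python as follows. This is a simplified illustration, not the review process described above: it assumes both logs have already been matched play by play and share a play id (here a hypothetical period-and-time string), and the sample logs are invented.

```python
def classify_errors(actual, rtss):
    """Compare two shot logs keyed by a shared play id.

    Each log maps play id -> (shooter, shot_type). Returns a dict of
    error type -> list of play ids, with at most one error per play.
    """
    errors = {"insertion": [], "deletion": [], "substitution": []}
    for play in sorted(set(actual) | set(rtss)):
        if play not in actual:
            errors["insertion"].append(play)     # RTSS recorded a shot that was not one
        elif play not in rtss:
            errors["deletion"].append(play)      # RTSS missed a real shot
        elif actual[play] != rtss[play]:
            errors["substitution"].append(play)  # wrong shooter or shot type
    return errors


# Invented sample logs (hypothetical play ids and players).
actual = {
    "P1-02:15": ("Player A", "SOG"),
    "P1-05:40": ("Player B", "MS"),
    "P1-09:03": ("Player C", "BS"),
}
rtss = {
    "P1-02:15": ("Player A", "SOG"),
    "P1-05:40": ("Player D", "MS"),   # shooter differs
    "P1-11:30": ("Player E", "SOG"),  # no such shot in the actual log
}

result = classify_errors(actual, rtss)
print(result)
```

Because every play id falls into exactly one branch, the three lists together count the faulty shooting plays without double counting.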


Results


Edmonton @ Calgary - January 31st 2015
  • 3 insertion errors
  • 12 deletion errors
  • 8 substitution errors
Calgary @ San Jose - January 17th 2015
  • 6 insertion errors
  • 11 deletion errors
  • 1 substitution error
As a result, RTSS recorded (for both games combined):
  • a total of 82 shots on goal, when only 75 actually took place (+9% error).
  • a total of 119 missed + blocked shots, when 132 actually took place (-10% error)
Note that WAR On Ice uses Schuckers & MacDonald's rink effect model to adjust shot counts - this may correct some of the error presented here.
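
The percentages above are signed errors relative to the actual count. A short Python check, using only the counts quoted in this section (the function and variable names are illustrative):

```python
def percent_error(recorded, actual):
    """Signed error of the recorded count relative to the actual count."""
    return 100 * (recorded - actual) / actual


# Counts for both games combined, as quoted above.
recorded = {"shots on goal": 82, "missed + blocked shots": 119}
actual = {"shots on goal": 75, "missed + blocked shots": 132}

for label in recorded:
    print(f"{label}: {percent_error(recorded[label], actual[label]):+.1f}%")

# All shot attempts: the two categories added together.
total_error = percent_error(sum(recorded.values()), sum(actual.values()))
print(f"all shot attempts: {total_error:+.1f}%")
```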


Discussion


The error in RTSS counts for total shot attempts is small: 201 recorded against 207 actual across the two games, or about -3%. Insertions partly offset deletions, and substitution errors do not affect the totals. However, the errors could affect analyses of individual players (e.g. iCF) over small sample sizes.

Not surprisingly, shots on goal are over-reported. Scorers in general are trigger happy, often recording dump-ins and broken plays that trickle on to goal as shots. At least RTSS records the zone from which shots are taken; for stats like Corsi, shots from the neutral and defensive zones can be parsed out of the counts. For goalies the result is inflated save percentages, but in relative terms this should have little to no effect, especially over large sample sizes.

The RTSS location data is where error abounds. This has a major impact on location-based metrics such as scoring chances. Consistently across all shot types, at least 20% of shots are plotted with a high degree of error. As expected, (non-MSG) goals fare the best, being plotted accurately 53% of the time. Shots on goal and missed shots are plotted accurately roughly 30 to 35% of the time. Blocked shots present an entirely different problem, as RTSS tracks the location of where the shot gets blocked and not where the shot is taken. For scoring chance metrics, this means that RTSS blocked shot plots are virtually inadmissible.

The RTSS location data isn't useless - WAR On Ice has developed predictive scoring chance metrics despite the error. But the potential for improvement is huge, and rests on collecting accurate data. So where do we go from here?

With Sportvision slated to insert itself into every puck and jersey by the start of next season, many of the problems discussed here could be solved. For one, the technology has the ability to track precise shot locations. As of now though, we don't know if and when this data will be made public. Sportvision also has the potential to capture novel data sets such as passing data, but it could be years until they develop the code to parse the raw data, and the question surrounding public availability remains.

The idea that Sportvision is going to solve the community's analytical needs is misguided. Technology has always worked best in conjunction with human ability, and that could not be more true for hockey analysis. Chips in pucks and jerseys capture the location and speed of on-ice events, but only humans can judge intent. Intent-based judgments are needed to differentiate skilled plays from lucky ones, a critical factor for predictive models. This is something technology simply cannot assess.

Better data is here, right now, displaying itself openly every time we watch a game. It's time for the community to step up and collect the data - any data we want! - using technology as an aid and adhering to scientific principles. The potential is massive, it's present, and it doesn't rest in the hands of the NHL or Sportvision. It's in our hands.


***

References


"Product Recall Notice for 'Shot Quality'". Alan Ryder. http://hockeyanalytics.com/2007/06/product-recall-notice-for-shot-quality/

"Fancystats community shocked, saddened to learn of passing of hockey analytics pioneer "JLikens", a.k.a. Edmonton lawyer Tore Purdy". Bruce McCurdy. http://blogs.edmontonjournal.com/2014/05/25/fancystats-community-shocked-saddened-to-learn-of-passing-of-hockey-analytics-pioneer-jlikens/

"Home Recording Bias: Shots on Goal". JLikens. http://objectivenhl.blogspot.ca/2009/03/in-previous-posts-it-was-shown-how-some.html

"Accounting for Rink Effects in the National Hockey League's Real Time Scoring System". Michael Schuckers and Brian MacDonald. http://arxiv.org/pdf/1412.1035.pdf

"How Reliable is the NHL.com Shot Tracker?". Chris Boyle. http://www.habseyesontheprize.com/2013/2/20/4005122/how-reliable-is-the-nhl-com-shot-tracker

"NHL.com adding Corsi, Fenwick, enhanced stats next month". Greg Wyshynski. http://sports.yahoo.com/blogs/nhl-puck-daddy/nhl-com-adding-corsi--fenwick--enhanced-stats-next-month-233506566.html

"NHL, Sportvision test program to track players, puck". Corey Masisak. http://www.nhl.com/ice/news.htm?id=750201

Monday, January 12, 2015

Game Report - Calgary vs Vancouver 01/10/15

NOTE: SU stands for setup, and it's awarded to a player who sets up a shot attempt. In other words, SUs are Corsi assists. Why collect this data? Because the passing sequence leading up to a shot reveals so much about the shooting play as a whole, offering a rich and descriptive data set. See the Glossary for a complete list of terms and definitions this blog uses.

Game Breakdown - 5v5 Play


First Period

Vancouver vastly outplays Calgary in all shot and scoring chance breakdowns by at least a 2 to 1 margin. The SUSAs are a bit closer (14 to 8). This is to say that while the Canucks fire 26 shot attempts at the Flames' net, they only set up 14 of them. Calgary sets up 8 of their 10 shot attempts, and they convert on their one SC SUSOG (Scoring Chance SetUp Shot On Goal) of the period, which Gaudreau and Jones set up beautifully for Backlund.



Second Period

Vancouver still outshoots Calgary by a wide margin in all categories, although it's not as bad as in the first period (but still pretty bad).



Third Period 

Vancouver is still down 1-0 and they decide to throw everything at the net. Despite 24 shot attempts, only 3 of them are scoring chances, and only one of these is a scoring chance shot on goal (SC SOG). The shoot-from-anywhere-and-crash-the-net strategy doesn't work for the Canucks and they drop a 1-0 game to the Flames.



Player Breakdown


This data provides a picture of the players involved in shooting attempt plays, both as shooters and passers.

SACo is shot attempt contribution, which is the sum of a player's shot attempts (SA) and setups (SU - both SU1s and SU2s are counted). This differs from other sites' definition of shot attempt contribution, which credits any player on the ice at the time of the shot. SOGCo (shot on goal contribution) only counts SU1s, because this is the pass that sets up the shot (in theory the SU2 pass has nothing to do with whether the SU1 pass sets up a shot).

SC Co - scoring chance contribution - sums a player's SC SA (scoring chance shot attempts) with his SC SU (scoring chance setups). Note that players are awarded with a SC SU only if their pass directly contributes to the shot being a scoring chance. In other words, a player can set up a shot attempt, that shot attempt can be a scoring chance, but the SU player will not be awarded with a SC SU if his pass doesn't directly lead to the shot being a scoring chance (i.e. the SC is the result of the shooter's efforts alone). Alright - on to the results!
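
To make the three definitions concrete, here is a minimal Python sketch. The shot log layout, the field names, and the sample plays are hypothetical. It also assumes that SOGCo credits the shooter and the SU1 passer on shots on goal, and that the SC SU credit is a judgment the tracker has already recorded per play.

```python
from collections import Counter

# Hypothetical shot log: one dict per shot attempt (players A, B, C are made up).
shots = [
    {"shooter": "A", "su1": "B", "su2": "C", "on_goal": True, "sc": True, "sc_su": ["B"]},
    {"shooter": "B", "su1": "A", "su2": None, "on_goal": False, "sc": False, "sc_su": []},
    {"shooter": "A", "su1": None, "su2": None, "on_goal": True, "sc": True, "sc_su": []},
]


def contributions(shots):
    """Return (SACo, SOGCo, SC Co) counters keyed by player."""
    saco, sogco, scco = Counter(), Counter(), Counter()
    for s in shots:
        # SACo: the shooter plus both setup passers.
        for player in (s["shooter"], s["su1"], s["su2"]):
            if player:
                saco[player] += 1
        # SOGCo: the shooter plus the SU1 passer only, on shots on goal.
        if s["on_goal"]:
            for player in (s["shooter"], s["su1"]):
                if player:
                    sogco[player] += 1
        # SC Co: the shooter plus any passer credited with an SC SU.
        if s["sc"]:
            scco[s["shooter"]] += 1
            for player in s["sc_su"]:
                scco[player] += 1
    return saco, sogco, scco


saco, sogco, scco = contributions(shots)
print(dict(saco), dict(sogco), dict(scco))
```

Keeping the SC SU credit as an explicit field reflects the rule above: a setup on a scoring chance does not automatically earn an SC SU.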

The Calgary Flames

Mikael Backlund. No wonder he scored. He was the biggest offensive contributor on the night for the Flames. No other Flame contributed more shot attempts and shots on goal, and both he and Byron contributed to the most scoring chances at 3 each (all 5v5).



The Vancouver Canucks

The Canucks took a lot of shot attempts, Edler contributing the most at 15 (8 of which he took himself, setting up 7). Despite this, he was only involved in 2 shots on goal, and no scoring chances. Daniel Sedin and Burrows were the biggest contributors to scoring chances.



Sunday, January 11, 2015

Glossary

SC - Scoring Chance. Any shot taken from the home plate scoring chance area.

SA - Shot Attempt. SAs branch into these mutually exclusive subsets:
  1. SOG - Shot on Goal
  2. MS - Missed Shot
  3. BS - Blocked Shot
  4. P - Post

SU - Setup. Awarded to players who pass the puck to a teammate who then takes a shot. SU is like an assist for a Shot Attempt (so naturally we also collect SU2s - the 2nd assist of a shot attempt).

Tr - Transition play. A shot in which the first or second setup pass comes from the defensive or neutral zones.

Cy - Cycle play. Shot attempts in which both the first and second setup passes come from the offensive zone.
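
The Tr and Cy definitions reduce to a small rule on the zones of the two setup passes. A minimal Python sketch follows; the zone codes "D", "N" and "O" are illustrative, and leaving a shot with fewer than two offensive-zone setup passes unlabelled is an assumption, since the definitions above do not cover that case.

```python
def classify_play(su1_zone, su2_zone=None):
    """Label a shot 'Tr', 'Cy' or None from the zones of its setup passes.

    Zones are 'D' (defensive), 'N' (neutral) or 'O' (offensive);
    None means that setup pass did not occur.
    """
    zones = [z for z in (su1_zone, su2_zone) if z is not None]
    if any(z in ("D", "N") for z in zones):
        return "Tr"  # at least one setup pass came from outside the offensive zone
    if len(zones) == 2:
        return "Cy"  # both setup passes came from the offensive zone
    return None      # assumption: not covered by either definition


print(classify_play("O", "N"))
print(classify_play("O", "O"))
print(classify_play("O"))
```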

Combinations

These acronyms are used in conjunction with one another to describe a play. For example, SC SA is a scoring chance shot attempt. SUSOG is a Setup Shot on Goal (I'm paying special attention to these because SUSOGs are about 50% more likely to go in than SOGs that aren't set up by a pass).
You can get as crazy as you like with these - I read a Tr SC SUSOG as a transition scoring chance setup shot on goal.

Saturday, January 10, 2015

Game Report - Florida vs. Calgary 01/09/2015

Glossary


SA - Shot Attempt. Any shot that is on goal, missed, or blocked. (Also included are non-shots - see Methods for a definition).
SOG - Shot on Goal.
SU - Setup. Awarded to players who pass the puck to a teammate who then takes a shot. In other words these are shot "assists." SUSA is a Setup Shot Attempt; SUSOG is a Setup Shot on Goal. I'm paying special attention to these because SUSOGs are about 50% more likely to go in than SOGs that aren't set up by a pass.
Tr - Transition. A shot in which the first or second setup pass comes from the defensive or neutral zones. TrSA is a shot attempt in transition.
Cy - Cycle. Any shot where both the first and second setup passes come from the offensive zone. CySOG is a shot on goal coming from the cycle.
SC - Scoring Chance. SC SA is a scoring chance shot attempt. SC SUSOG is a scoring chance setup by a teammate that results in a shot on goal.

Breaking Down the 5v5 Shots


The first period may have been the worst period of hockey the Flames have played. Ever. Florida had more than twice as many shot attempts, and they set up 5 scoring chance shots on goal (all in the first 15 minutes of play). To put this in perspective, Calgary only allowed Detroit to set up one scoring chance shot on goal in the entirety of Wednesday's game.




The Flames somehow came out of the first period with a 2-2 tie, thanks to a brutal giveaway by Ekblad and a weak powerplay goal by TJ Brodie (Al Montoya stopped the puck, then kicked it in. This was a 6-1 game if Luongo was in net.)

The rest of the game evened out, but the Flames got exactly what they deserved (no matter how bad Montoya is).



Goal Breakdown


First Period

0 - 1
Matt Stajan (EV) unassisted.
Ekblad is pressured and sends a blind pass from behind his net into the scoring chance area. Stajan says thank you and blasts the puck home. Tough one for the 18 year-old.

1 - 1
Jonathan Huberdeau (EV) assisted by B. Boyes.
Hiller goes behind the net to play a slow-moving puck. Not slow enough, apparently, and it goes right by Hiller to Brad Boyes. He whips it out front to Huberdeau who has a wide open net.

2 - 1
Brad Boyes (EV) assisted by J. Huberdeau.
Another giveaway by the Flames in their own zone, this time by Wideman. He gives Huberdeau a gentle pass up the sideboards, who fires a beautiful pass across to Boyes in front of the net. He tips it in past Hiller.

2 - 2
TJ Brodie (PP) assisted by M. Giordano and J. Colborne.
The Flames get set up in the offensive zone. A couple of passes around the perimeter get the puck to TJ Brodie. He's at the top of the circle against the boards, and fires a wrist shot on net (i.e. not a great shot). Montoya makes the save, then kicks the puck into his own net.

Second Period

2 - 3
Mikael Backlund (EV) assisted by L. Bouma and D. Wideman.
Bouma misses the net and the puck rings around the boards to Wideman. Wideman shoots the puck (surprise). The rebound goes to Backlund in the scoring chance area, and he backhands the puck in.

3 - 3
Sean Bergenheim (EV) assisted by T. Fleischmann and D. Bolland.
This play was a microcosm of the game. Calgary gives the puck away twice in their own zone in a span of 3 seconds. Lo and behold, the puck ends up in the back of their net.

3 - 4
TJ Brodie (EV) assisted by L. Bouma and M. Backlund.
Brodie takes a slapshot from the blueline. There are a few bodies in front, and the puck squeaks through Montoya's leg.

4 - 4
Jimmy Hayes (EV) assisted by J. Jokinen and E. Gudbranson
Florida wins a faceoff in the offensive zone and is quickly able to set up Hayes, who takes a one-timer from the high slot. Not the hardest shot in the world, and a save Hiller needs to make, but a nice setup by Florida nonetheless.

Third Period

5 - 4 
Brian Campbell (EV) unassisted.
It's hard to believe that there could be a worse goaltending performance than the one by Montoya. Enter Jonas Hiller. The arc on Campbell's fanned point shot would make any basketball player proud.

5 - 5
Matt Stajan (EV) assisted by D. Jones and L. Bouma.
No matter what you say about Matt "franchise player" Stajan, he had a knack for going to the net in this game, which is never a bad idea, especially in a game like this. Jones gets a good opportunity, and Stajan finds the pin-balling puck in front of the net and pots it.

6 - 5
T. Fleischmann assisted by D. Bolland and S. Bergenheim.
After another breakdown, Calgary gets exactly what they deserve. Wideman mistakes Bergenheim for his figure skating partner behind the net, allowing Bergenheim to retrieve the puck. He sends it out front, and a few bounces later Fleischmann scores the game winning goal.


***
The raw data is available upon request.
***

DISCLAIMER: My data differs from other sources. I haven't compared my data against these other sources. But keep in mind the play-by-play is collected live. I watch, re-watch, and re-re-watch certain plays to make sure I record them correctly.


Friday, January 9, 2015

Game Report - Detroit vs Calgary 01/07/15

Glossary


SA - Shot Attempt. Any shot that is on goal, missed, or blocked. (Also included are non-shots - see Methods for a definition).
SOG - Shot on Goal.
SU - Setup. Awarded to players who pass the puck to a teammate who then takes a shot. In other words, these players "set up" a shot. SUSA is a Setup Shot Attempt; SUSOG is a Setup Shot on Goal. I'm paying special attention to these because SUSOGs are about 50% more likely to go in than non-SUSOGs.
Tr - Transition. A shot in which the first or second setup pass comes from the defensive or neutral zones. TrSA is a shot attempt in transition.
Cy - Cycle. Any shot where both the first and second setup passes come from the offensive zone. CySOG is a shot on goal coming from the cycle.
SC - Scoring Chance. SC SA is a scoring chance shot attempt. SC SUSOG is a scoring chance setup by a teammate that results in a shot on goal.

Breaking Down the 5v5 Shots


All Shot Attempts

This was an incredibly even game across all shot categories. Detroit barely won the Corsi battle. Scoring Chance Shot Attempts were even.



Setup Shots

I'm keeping special tabs on SUSAs (Setup Shot Attempts) because it's been shown that shots that are the result of a pass (i.e. SUSOGs) are about 50% more likely to go in. Again, Detroit and Calgary were incredibly close in this respect. We can also see that 5v5 play wasn't the most exciting. Combined, the teams were only able to set up 4 scoring chances (2 each).


Transition and Cycle Shots.

Transition shots are any shots in which either of the two setup passes comes from the defensive or neutral zones. For cycle shots, both setup passes come from the offensive zone. This shows us how teams are generating their shots. Both teams generated more attempts from transition play. Sheahan's goal came from transition, and considering Engelland's gaffe on the play this looks like a good way to exploit bad defensemen. Raymond's goal came from a rebound off a transition shot.




The most glaring difference between the teams comes from the cycle. Detroit was able to set up 2 scoring chance shots on goal (SC CySOG), one of which resulted in Zetterberg's goal. Calgary generated 0 such plays. Zero. This, in conjunction with Engelland's error, was where the game was lost for the Flames.


Breaking Down the Goals


Mason Raymond (EV) assisted by M. Backlund and D. Jones.

A transition play that saw Raymond fire the puck at Mrazek from the sideboards. Not the best shot in and of itself, but it stunned Mrazek and gave Raymond the time to pick up the rebound and wrap the puck around the net and in. Great individual effort by Raymond.

Riley Sheahan (EV) assisted by D. Helm.

A transition setup that went from Sheahan to Helm, back to Sheahan in the neutral zone. Sheahan then skates the puck into the offensive zone in an innocuous looking one-on-one between him and Engelland. Engelland lets Sheahan in way too deep, and with one flick of the stick Sheahan is right in front of the net. He roofs the puck over Ramo. A beautiful individual effort by Sheahan, but a good defenseman does not let that happen. (I knew I'd have to rag on Engelland sooner or later, but I wasn't expecting it to be 8 minutes into the first game.)

Henrik Zetterberg (EV) assisted by J. Abdelkader and G. Nyquist.

A scoring chance setup from the cycle (sort of). Zetterberg floats in behind the defense, right in front of the net. Nyquist spots him from the boards and fires a pass his way. It was lucky that the puck found its way to Zetterberg after being tipped by Abdelkader, but Zetterberg cannot be allowed that much free ice directly in front of the net. Monahan needs to cover him.

Justin Abdelkader (PP) assisted by G. Nyquist and H. Zetterberg.

Detroit does exactly what any powerplay aims to do: set up a player in the scoring chance area. Nyquist sets up Abdelkader so beautifully that Abdelkader has the time to hit the post and bury the rebound before any defenders (goalie included) get back into position.

Mikael Backlund (SH) assisted by P. Byron and T. Brodie.

Detroit does exactly what any powerplay aims NOT to do: allow a shorthanded goal. A bad pass by Weiss gets blocked, leading to a 2-on-1 for Calgary. Byron is able to feed the puck to Backlund in the scoring chance area. With Mrazek sliding across, Backlund backhands the puck along the ice and in.


Game Summary

This was an incredibly even game, both teams generating similar shooting and scoring chance numbers. Detroit won the game by taking advantage of Engelland, setting up a scoring chance for Zetterberg, and converting on the powerplay.


***
The raw data is available upon request.
***

DISCLAIMER: My data differs from other sources. For example in this game I tracked 24 scoring chances (5v5), whereas war-on-ice has 30. I reviewed some of the plays where our data differs, and I'm sticking with my numbers (it's likely we define scoring chances differently). I'll conduct a more thorough comparison and update the results as necessary.


Wednesday, January 7, 2015

Appendix to Methods - Data Accuracy

Accurate data is the foundation of any research project. If the raw data is bad, resulting analyses suffer. This post isn't suggesting that data from the mainstream stats providers is useless, but it is not error-free. I've personally noticed errors in NHL play-by-play files, and rink bias is a well documented phenomenon. I'm unaware of the precise methodology used by these stats providers, but their data sets suggest they do not follow the scientific method.

This blog and the Passing Project as a whole are putting sound methodologies in place to ensure the accuracy of our data. Most importantly, we test for inter-rater reliability: the degree to which two or more "raters" record the same data for the same games. While this is difficult to do with few data trackers, we test all games for which there are multiple trackers. The reliability of the data increases as the number of trackers increases (just one of the many reasons you should join the Passing Project! Hit us up on Twitter @cofstats / @RK_Stimp or send an email to hockeypassingstats@gmail.com).

Inter-rater reliability testing is critical for several reasons:

  1. It corrects for errors. We're human. We miss things, we make typos, you name it. 
  2. It corrects for bias. I'm a Flames fan. I think I'm unbiased because I started this blog to gain a deeper understanding of the Calgary Flames, good or bad. But at the end of the day it doesn't matter what I think.
  3. Inter-rater reliability testing highlights data points in which there is disagreement among trackers. These specific plays can be reviewed to ensure trackers know how to appropriately code that type of play. Tracker disagreement can also suggest the need for improved data definitions.
It cannot be said enough: the accuracy of the data is critical. Without it, all else fails. We're doing our due diligence here at the Passing Project.
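
For anyone curious what an agreement score can look like in practice, below is a minimal Python sketch of Cohen's kappa, a common statistic for two raters. This post does not say which statistic the Passing Project uses, so kappa is swapped in here purely as an example, and the two tracker logs are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same plays in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of plays")
    n = len(rater_a)
    # Share of plays on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented shot-type codes from two hypothetical trackers for the same 8 plays.
tracker_1 = ["SOG", "SOG", "MS", "BS", "SOG", "MS", "BS", "SOG"]
tracker_2 = ["SOG", "MS", "MS", "BS", "SOG", "MS", "SOG", "SOG"]

kappa = cohens_kappa(tracker_1, tracker_2)
print(f"kappa = {kappa:.2f}")
```

The plays where the two lists differ are exactly the ones worth reviewing on video, which is the third benefit listed above.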

***
If you'd like to join the Passing Project and collect data for an NHL team, you can reach out to Ryan Stimson on Twitter @RK_Stimp or by email at hockeypassingstats@gmail.com
***



References

http://en.wikipedia.org/wiki/Inter-rater_reliability