How can we determine an appropriate rate of decay for our scoring algorithms?

We build scoring algorithms for different customers in an enterprise environment. They typically measure most goals in weekly, monthly, or quarterly intervals. We use a scoring algorithm to score individuals based on performance relative to goals and other metrics.

We assume that the rate of decay should vary with the goal interval they primarily use: a company focused on weekly goals should have a faster decay rate than one focused on quarterly goals.

What's a good way to determine how quickly to begin the decay process based on past performance relative to metric interval length?


The game League of Legends utilised an Elo rating system which incorporated a decay mechanism. The decay worked on a user's score and reduced it depending on both user performance and time. Here, the user score was an indicator of skill level and was used in another formula to calibrate the actual scores they received from other in-game activities. So, while this system may not be directly applicable in a customer engagement scenario, it's worth looking into first.

The system calculated score based on this formula:

Ra_New = Ra_Old + K(Sa - Ea)
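
For concreteness, here is a minimal Python sketch of that update, assuming the usual logistic expected-score formula and an illustrative K-factor of 32 (K varies by implementation):

    def expected_score(r_a, r_b):
        # Expected score of player A against player B under the standard Elo model.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    def elo_update(r_a, r_b, actual_score, k=32):
        # New rating for A: Ra_New = Ra_Old + K * (Sa - Ea), with Sa = 1 win, 0.5 draw, 0 loss.
        return r_a + k * (actual_score - expected_score(r_a, r_b))

    # Example: a 1500-rated player beats a 1600-rated player.
    print(round(elo_update(1500, 1600, 1.0), 1))  # ~1520.5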

The system had various categories of players, such as:

  • Bronze: Between 0 and 1149 (Team: 0-1249) (Top 100%)
  • Silver: Between 1150 and 1499 (Team: 1250-1449) (Top 68%-13%) Majority of Active Player Base
  • Gold: Between 1500 and 1849 (Team: 1450-1649) (Top 13%-1.5%)
  • Platinum: Between 1850 and 2199 (Team: 1650-1849) (Top 1.5%-0.1%)
  • Diamond: 2200 and above (Team: 1850+) (Top 0.1%)

and varying decay rates based on certain player properties (sketched in code after this list):

  • Elo decayed at a rate of 50 Elo for Diamonds, 35 Elo for Platinums, 25 Elo for Golds, 10 Elo for Silver, and 0 Elo for Bronze for every 4 consecutive weeks of inactivity and every 7 days thereafter.
  • For normal rating, inactivity was defined as no activity in any queue.
  • For ranked rating, inactivity was defined as no activity in the specific queue
  • Ranked decay only applied to people who were ranked above 1400 rating.
  • The decay timer was reset after a game was played in that specific queue.
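
A rough Python sketch of those decay rules (the tier values come from the list above; the 28-day grace period and weekly cadence would need recalibrating for a performance-scoring setting):

    DECAY_PER_PERIOD = {"Diamond": 50, "Platinum": 35, "Gold": 25, "Silver": 10, "Bronze": 0}

    def decayed_rating(rating, tier, days_inactive):
        # No decay until 4 consecutive weeks of inactivity, then one hit per 7 days.
        if days_inactive < 28:
            return rating
        periods = 1 + (days_inactive - 28) // 7
        return rating - periods * DECAY_PER_PERIOD[tier]

    print(decayed_rating(2250, "Diamond", 42))  # two extra weeks past the grace period -> 2100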

The major changes required for performance scoring would be:

  • Less aggressive scoring than in this system
  • Decay rates calibrated to maintain a bell-curve distribution, centred according to the level of competition you wish to achieve.
  • A bell-curve spread such that the leaders are not out of reach of the majority of players, since that may end up demotivating users.

13 Answers

It sounds like you are after a robust way to measure the similarity of two pages.

Given that the structure of the page won't change that much, we can reduce the problem to testing whether the text on the page is roughly the same. Of course, with this approach the problems alluded to by nickf regarding a photographers page are still there but if you are mainly concerned with Yahoo! news or the like this should be okay.

To compare two pages, you can use a method from machine learning called "string kernels". Here's an early paper, a recent set of slides on an R package, and a video lecture.

Very roughly, a string kernel looks at how many words, pairs of words, triples of words, etc. two documents have in common. If A and B are two documents and k is a string kernel, then the higher the value of k(A,B), the more similar the two documents.

If you set a threshold t and only say two documents are the same for k(A,B) > t you should have a reasonably good way of doing what you want. Of course, you'll have to tune the threshold to get the best results for your application.
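
As a rough illustration (not the kernels from the linked material), a word n-gram "spectrum" kernel can be sketched in a few lines of Python; the max_n of 3 and the threshold are arbitrary starting points to tune:

    from collections import Counter

    def ngrams(text, n):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    def spectrum_kernel(a, b, max_n=3):
        # k(A, B): summed dot products of word n-gram count vectors for n = 1..max_n.
        score = 0
        for n in range(1, max_n + 1):
            ca, cb = ngrams(a, n), ngrams(b, n)
            score += sum(ca[g] * cb[g] for g in ca if g in cb)
        return score

    # Treat two pages as "the same" when k(A, B) exceeds a tuned threshold t.
    print(spectrum_kernel("breaking news story today", "breaking news story tonight") > 2)  # True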

You can detect that two pages are the same by using some sort of similarity metric such as cosine similarity. Then you would have to define a minimum threshold for accepting that the two documents are the same. For example, I would pick a value close to 1 when applying the cosine measure, since it ranges from -1 for totally different to 1 for identical.
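
A small sketch of that idea over bag-of-words counts (with non-negative counts the value actually stays in [0, 1], so in practice you pick a threshold near 1):

    import math
    from collections import Counter

    def cosine_similarity(a, b):
        va, vb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(va[w] * vb[w] for w in va if w in vb)
        norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
        return dot / norm if norm else 0.0

    print(cosine_similarity("the quick brown fox", "the quick brown dog"))  # 0.75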

For this kind of problem I find a search through academic papers to be much better than asking StackOverflow; when dealing with specifics, the experts are often much smarter than the crowd.

Every webcrawler or search engine has this problem and has solved it. There is probably a good approach using a kernel based method like the accepted answer is suggesting, but you probably want to start with simpler techniques that are known to work well. You can move to kernel methods afterwards and test to see if they improve your results.

You'll probably be looking at generating a Rabin fingerprint as a first step; see 'Fingerprinting by random polynomials' (Rabin, 1986).

I use vgrep for that sort of stuff.

It's a little known tool called visual-grep which relies on advanced technology like the sapient ocular device and visual cortex for very quickly determining the sameness of pages side-by-side, and it's remarkably accurate and efficient (it ought to be since it's been under development for quite a long time).

Marking community wiki in case the humor police are out today :-).

Depending on what you're doing, you might be interested in TemplateMaker. You give it some strings (such as web pages) and it marks out the bits that change.

In your Yahoo! News example, you'd fetch the page once and tell TemplateMaker to learn it. Then you'd fetch it again and tell it to learn that one.

When you were happy that your TemplateMaker knew what was the same every time, you could fetch another page and ask TemplateMaker whether it matched the template from the others. (It would give you the pieces that had changed, if you were interested in that.)

You could use a web browser component to render a screenshot of the two pages, and then compare the images. Might be the simplest option.

Without intimate knowledge of the structure of the pages you're trying to compare, this could be very tricky. That is, how is a machine supposed to tell that a page with a couple of different pictures is the same? If it's a news site with ads then it should be the same, but if it's a photographer's portfolio, then it's definitely different.

If you do know the structure of the page, then what I'd do is manually select portions of the page (using IDs, CSS selectors, XPath, etc) to compare. For example, only compare the #content divs between page refreshes. From there, you might need to add a tolerance level to a char-by-char comparison.
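
For example, a quick BeautifulSoup sketch of that approach (the #content selector is just a placeholder for whatever part of the page you actually care about):

    from bs4 import BeautifulSoup

    def content_text(html, selector="#content"):
        node = BeautifulSoup(html, "html.parser").select_one(selector)
        return node.get_text(" ", strip=True) if node else ""

    old = '<div id="content">Top story: markets rise</div><div id="ads">Ad A</div>'
    new = '<div id="content">Top story: markets rise</div><div id="ads">Ad B</div>'
    print(content_text(old) == content_text(new))  # True: only the ad block changed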

There's a service which does a similar thing, actually. It's called Rsspect (written by Ryan North of Qwantz fame), which will detect changes to any website and create an RSS feed out of it, even if you don't control the page.

You could generate an MD5 hash of each of them, then compare those. Like you said, easy enough.
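
Something like this, bearing in mind that any change at all (even in ads or markup) produces a different digest, so it only catches byte-for-byte identical pages:

    import hashlib

    def page_digest(html):
        return hashlib.md5(html.encode("utf-8")).hexdigest()

    print(page_digest("<html>same</html>") == page_digest("<html>same</html>"))  # True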

What you're looking for is a technique for comparing two pages that have arbitrary elements that can change. It's a hard problem.

  1. Identify the areas in a page which can change and you don't care about. Careful! They will always move around.
  2. Hash or do some checksum of the DOM of just the parts of the page you DO care about. Careful! These also will always be changing.

You are up against the first rule of screen scraping: the page is inherently volatile. So it's a tough problem. Your solution will never be robust enough to account for the infinite variety of subtle changes your source data will be subject to, unless you also have direct control over the source pages and can design your solution against that.

Good luck! I've had experience with systems that tried to solve this problem and it's indeed a tough nut to crack.

The way to do this is not to compare the whole page, because, as you say, a human wouldn't be tricked by that either. Say you are interested in the news articles of a Yahoo! page; then you should look just at the news section. From there you can do whatever you like: a hash or a literal comparison between the new and old versions.

The first thought that came into my head was to process the pages into XML documents with BeautifulSoup (Python), run a diff on them, and count the number of lines different. If the count is > X%, they are different. Not very robust and probably prone to error, but that'd be the quick hack I'd do for testing.
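
Roughly, in Python with difflib standing in for an external diff (the 10% cut-off is an arbitrary starting point):

    import difflib

    def roughly_same(text_a, text_b, max_changed_fraction=0.10):
        a_lines, b_lines = text_a.splitlines(), text_b.splitlines()
        changed = sum(1 for line in difflib.ndiff(a_lines, b_lines) if line.startswith(("+ ", "- ")))
        total = max(len(a_lines) + len(b_lines), 1)
        return changed / total <= max_changed_fraction

    print(roughly_same("headline\nstory body\nfooter", "headline\nstory body\nfooter"))  # True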

You might want to have a look at this page which discusses comparing two XML documents:
http://www.ibm.com/developerworks/xml/library/x-diff/index.html

An HTML document can be coerced into an XML document with Beautiful Soup and then compared using the techniques listed there.

I had a similar problem. I was trying to devise a safe linking system for a directory of user submitted links. A user would publish a page on a blog or news site and submit the link to the index. A human would verify the link to be appropriate then add the page into the index.

The problem was to come up with a way to automate checks that ensured the link was still appropriate over time. For instance, did someone modify the page weeks later and insert racial slurs? Did the news site start telling people 'you must subscribe to read this story'?

I ended up extracting the paragraph <p> elements and comparing the cached copy to the current version word for word. In simplest terms, the idea was to pull the paragraph text from both versions and diff the words.
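
The original snippet isn't reproduced here, but a minimal sketch of that comparison might look like this:

    from bs4 import BeautifulSoup

    def paragraph_words(html):
        soup = BeautifulSoup(html, "html.parser")
        return [w for p in soup.find_all("p") for w in p.get_text().split()]

    def changed_words(cached_html, current_html):
        # Words present in one version but not the other; fed to the weighting step below.
        return set(paragraph_words(cached_html)) ^ set(paragraph_words(current_html))

    print(changed_words("<p>read the story</p>", "<p>subscribe to read the story</p>"))
    # {'subscribe', 'to'}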

After that, a series of sorters would work on it while ignoring common words 'if but can or and' while treating other words (profanity, etc) with a heavier weight.

This resulted in a scoring system that would all but ignore minor edits and revisions (typos, sentence structure, etc) but quickly reveal if the content needed to be examined again. A score was then returned, scores above a threshold would be put in a queue for a human to re-verify.

This also helped to account for major cosmetic changes to the site. I would not trust it to run completely on its own, but it did do its job predictably well with a little help from humans. Admittedly, the system was not as efficient as it could have been as far as the methodology goes.


Beyond Engagement: Aligning Algorithmic Recommendations With Prosocial Goals

Much of the media we see online — whether from social media, news aggregators, or trending topics — is algorithmically selected and personalized. Content moderation addresses what should not appear on these platforms, such as misinformation and hate speech. But what should we see, out of the thousands or millions of items available? Content selection algorithms are at the core of our modern media infrastructure, so it is essential that we make principled choices about their goals.

The algorithms making these selections are known as “recommender systems.” On the Internet, they have a profound influence over what we read and watch, the companies and products we encounter, and even the job listings we see. These algorithms are also implicated in problems like addiction, depression, and polarization. In September 2020, Partnership on AI (PAI) brought together a diverse group of 40 interdisciplinary researchers, platform product managers, policy experts, journalists, and civil society representatives to discuss the present and future of recommender systems. This unique workshop on recommender-driven media covered three topics:

  • How recommenders choose content today.
  • What should be the goal of recommenders, if not audience engagement?
  • Emerging technical methods for making recommenders support such goals.

Several promising directions for future recommender development emerged from the workshop’s presentations and subsequent discussions. These included: more understandable user controls, the development of survey-based measures to refine content selection, paying users for better data, recommending feeds not items, and creating a marketplace of feeds. The workshop also resulted in the first-ever bibliography of research articles on recommender alignment, as contributed by workshop participants.

How recommenders choose content

Recommender systems first emerged in the mid-1990s to help users filter the increasing deluge of posts on Usenet, then the main discussion forum for the fledgling Internet. One of the very first systems, GroupLens, asked users for “single keystroke ratings” and tried to predict which items each user would rate highly, based on the ratings of similar users. Netflix’s early recommender systems similarly operated on user-contributed star ratings. But it proved difficult to get users to rate each post they read or movie they watched, so recommender designers began turning to signals like whether a user clicked a headline or bought a product. By the mid-2000s, systems like Google News relied on a user’s click history to select personalized information.

Today’s recommender systems use many different kinds of user behavior to determine what to show each user, from clicks to comments to watch time. These are usually combined in a scoring formula which weights each type of interaction according to how strong a signal of value it’s thought to be. The result is a measure of “engagement,” and the algorithmic core of most recommender systems is a machine learning model that tries to predict which items will get the most engagement.
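
As a toy illustration of such a scoring formula (the interaction types and weights below are made up for the example, not any platform's actual values):

    ENGAGEMENT_WEIGHTS = {"click": 1.0, "like": 2.0, "comment": 4.0, "share": 6.0, "watch_seconds": 0.05}

    def engagement_score(predicted):
        # Weighted sum of predicted interactions for one candidate item.
        return sum(ENGAGEMENT_WEIGHTS[signal] * value for signal, value in predicted.items())

    candidates = {
        "item_a": {"click": 0.4, "like": 0.1, "comment": 0.02, "share": 0.01, "watch_seconds": 30},
        "item_b": {"click": 0.2, "like": 0.3, "comment": 0.10, "share": 0.05, "watch_seconds": 90},
    }
    ranked = sorted(candidates, key=lambda item: engagement_score(candidates[item]), reverse=True)
    print(ranked)  # ['item_b', 'item_a']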

Engagement is closely aligned to both product and business goals, because a system which produces no engagement is a system which no one uses. This is true regardless of the type of content on the platform (e.g. news, movies, social media posts) and regardless of business model (e.g. ads, subscriptions, philanthropy). The problem is that not everything that is engaging is good for us — an issue that has been recognized since the days of sensationalized yellow journalism. The potential harmful effects of optimizing for engagement, from the promotion of conspiracy theories to increased political polarization to addictive behavior, have been widely discussed, and the question of whether and how different platforms are contributing to these problems is complex.

Even so, engagement dominates practical recommendations, including at public-interest news organizations like the BBC. Sometimes high engagement means the system has shown the user something important or sparked a meaningful debate, but sensational or extreme content can also be engaging. Recommender systems need more nuanced goals, and better information about what users need and want.

Building better metrics

Most modern AI systems are based on optimization, and if engagement is not a healthy objective then we need to design better processes for measuring the things we care about. The challenge that recommender designers face is expressing high-level concepts and values in terms of low-level data such as clicks and comments.

From a talk by Rachel Thomas, Director, USF Center for Applied Data Ethics

There’s a huge gap between the high-level qualities on the left side of the above graphic and the low-level data on the right, which includes user clicks and likes, the digital representation of the content itself, and metadata such as user and item location. The concepts we care about have to be operationalized — that is, translated from abstract ideas to something that can be repeatedly measured — before they can be used to drive AI systems. As a simple example, the “timeliness” of an item can be defined so that posts within the last day or the last week are considered most timely and gradually age out. Recent PAI research analyzed how Facebook operationalized the much more complex concept of a “meaningful social interaction,” and how YouTube operationalized “user satisfaction” as something more than just the time spent watching videos.
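
A sketch of that timeliness example (the one-day and one-week windows are the illustrative values from the text; a real system would tune them):

    def timeliness(age_hours, fresh_hours=24, stale_hours=24 * 7):
        # Full score within a day, zero after a week, linear age-out in between.
        if age_hours <= fresh_hours:
            return 1.0
        if age_hours >= stale_hours:
            return 0.0
        return 1.0 - (age_hours - fresh_hours) / (stale_hours - fresh_hours)

    print(timeliness(12), timeliness(96), timeliness(200))  # 1.0 0.5 0.0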

More complex ideas like “credibility” or “diversity” have proven quite difficult to translate into algorithmic terms. The News Quality Initiative (NewsQ) has been working with panels of journalists and technologists to try to define appropriate goals for content selection. As one of the journalists put it, “any confusion that existed among journalists regarding principles, standards, definitions, and ethics has only travelled downstream to platforms.”

The NewsQ panel studying opinion journalism recommended that “opinion” content should be clearly labelled and separated from “news” content, but noted that these labels are not used consistently by publishers. The NewsQ analysis of local journalism counted the number of news outlets which appeared in the top five stories from each place, e.g. “in the Des Moines feed we reviewed, 85 of 100 [top five] articles were pulled from four outlets.” The report calls for increased outlet diversity, but does not specify what an acceptable number of outlets in the top five results would be. While there is a deep history of journalistic practice and standards that can guide the design of news recommenders, defining a consensus set of values and translating them into algorithmic terms remains a challenge.

Even well-chosen metrics suffer from a number of problems. Using a metric as a goal or incentive changes its meaning, a very general problem sometimes known as Goodhart’s law. Metrics also break down when the world changes, just as a number of machine learning models stopped working when COVID reshaped the economy. And of course, qualitative research is essential: If you don’t know what is happening to your users, you can’t know that you should be measuring something new. Still, metrics are indispensable tools for grappling with scale.

As is true of AI in general, many of the problems with recommenders can be traced to mismatches between a theoretical concept and how it’s operationalized. For example, early news recommender systems operationalized “valuable to user” as “user clicked on the headline.” Clicks are indeed a signal of user interest, but what we now call “clickbait” lives entirely in the difference between user value and user clicks.

Controls and surveys

Many of the potential problems with recommenders might be alleviated by giving users more control. Yet few users actually use controls when offered: workshop participants who run recommenders noted that only one or two percent of their users actually adjust a control. It’s possible that this is because the controls that have been offered so far aren’t that well-designed or effective. For example, it’s not immediately obvious what will happen when you click on “see less often” on Twitter or “hide post” on Facebook. Better feedback on what such controls do might encourage their use, and one interesting idea is to show users a preview of how their feed will change.

Providing better control over content selection is crucial because it gives users agency, and also a kind of transparency as controls reveal something of how the underlying algorithm works. Yet even if controls were ten times as popular, most users would still not be using them. This means that a recommender’s default settings need to account for goals other than engagement, using some other source of data.

Surveys can offer much richer data than engagement because a survey question can ask almost anything. Surveys can help clarify why people use products the way they do, and how they feel about the results. Facebook asks users whether particular posts are “worth your time,” while YouTube sometimes asks users how satisfied they are with their recommendations. These are much more nuanced concepts than “engagement.” This sort of survey response data is usually used to train a machine learning model to predict people’s survey responses, just as recommenders already try to predict whether a user will click on something. Predicted survey answers are a nuanced signal that can be added into standard recommender scoring rules and directly weighted against other predicted behavior such as clicks and likes.
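
A sketch of what that combination might look like in a scoring rule (the signal names and weights are assumptions for illustration):

    def item_score(predicted, engagement_weight=1.0, survey_weight=3.0):
        # predicted holds model outputs in [0, 1] for each signal, including a predicted survey answer.
        engagement = predicted["click"] + predicted["like"] + predicted["share"]
        return engagement_weight * engagement + survey_weight * predicted["worth_your_time"]

    clickbait = {"click": 0.9, "like": 0.1, "share": 0.1, "worth_your_time": 0.1}
    in_depth = {"click": 0.4, "like": 0.2, "share": 0.1, "worth_your_time": 0.8}
    print(item_score(clickbait) < item_score(in_depth))  # True: the survey signal reorders the ranking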

Despite their flexibility, there are a number of problems with using surveys to control content selection. The biggest problem is that people don’t come to platforms to fill out surveys. Too many surveys cause survey fatigue, where people become less likely to respond to surveys in the future. This severely limits the amount of data that can be collected through surveys, which makes the models constructed from survey responses far less accurate. Also, certain types of people are more or less likely to respond to surveys and this leads to survey bias. Opt-in surveys also don’t provide good data on individuals over time.

Many of these problems could be solved by paying a panel of users for more frequent feedback. It’s easy to tell if showing an article leads to a click or a share, but much harder to tell if it contributes to user well-being or healthy public discussion. To drive recommender behavior, we’ll need to know not just what users think about any particular item, but how their opinion changes over time as the algorithm adjusts to try to find a good mix of content.

Well-being metrics

AI developers are not the first people to think about the problem of capturing deep human values in metrics. Ever since GDP was introduced as a standardized measure of economic activity in 1934, it has come under criticism for being a narrow and myopic goal, and researchers have looked to replace it with more meaningful measures.

In the last decade, well-being measures have become increasingly used in policy-making. While the question “what is happiness?” is ancient, the 20th century saw the advent of positive psychology and systematic research into this question. Well-being is a multi-dimensional construct, and both subjective and objective measures are needed to get a clear picture. For example, the OECD Better Life Index includes both crime rates and surveys asking if people “feel safe walking alone at night.” To demonstrate how well-being metrics work, workshop participants took a survey which measured well-being across a variety of dimensions.

Well-being survey results for Recommender Alignment (RA) workshop participants (orange) and all other takers, provided by Happiness Alliance.

The IEEE has collected hundreds of existing well-being metrics that might be relevant to AI systems into a standard known as IEEE 7010. It also advocates a method to measure the well-being effect of a product change: take a well-being measurement before and after the change, and compare that to the difference in a control group of non-users. Facebook’s “meaningful social interaction” research is also framed in terms of well-being.

The well-being metrics used in policy-making don’t provide any sort of grand answer to the question of what an AI system like a recommender should be trying to achieve, and they’re not going to be specific enough for many AI domains. But they do have the advantage of representing a normative consensus that has developed over decades, and they provide guidance on the types of human experience worth considering when designing a metric.

Recommender recommendations

Despite widely differing perspectives, concerns, and types of recommenders, a number of themes emerged from the presentations and discussions. The practice of recommender alignment is in its infancy, but there are a number of places where progress can be made in the near future.

While recommenders have offered user controls of one sort or another for many years, they still mostly aren’t used. That might change if the controls were better and not simply a button that says “I don’t want to see this item.” More expressive controls could adjust the proportions of different topics, or the degree to which internal signals like news source credibility estimates shape content ranking. It remains challenging to communicate to users what a control does and give appropriate feedback. One possibility is showing which items would be added or removed as the user adjusts a control, and there is great opportunity for experimentation.

Engagement isn’t going away, but we need better measures to capture what it misses. YouTube has begun incorporating data from user satisfaction surveys into recommendations, while Facebook uses several different types of questions. Carefully designed survey measures can provide critical feedback about how a recommender system is enacting high-level values. Standardized questions would provide guidance on what matters for recommenders, how to evaluate them, and allow comparison between different systems.

Voluntary survey responses have not produced enough data to provide accurate, personalized signals of what should be recommended. The next step is to pay users for more detailed data, such as by asking them to answer a question daily. Such data streams could provide rich enough feedback to attempt more sophisticated techniques for content selection, such as reinforcement learning methods which try to figure out what will make the user answer the survey question more positively.

Almost all production recommender systems are based on scoring individual items, then showing the user the top-ranked content. User feedback is used to train this item scoring process, but these models can’t learn how to balance a mix of different types of content in the overall feed. For example, a naive news ranker might fill all top 10 slots with articles about Trump because they get high engagement. Even if each of these stories is individually worthwhile, the feed should likely include other topics. Existing recommender algorithms and controls are almost entirely about items, instead of taking a more holistic view of the feed.
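
One toy way to make the contrast concrete: a greedy re-ranker that builds the whole feed while capping how many slots any single topic can occupy (the cap and scores are illustrative):

    def build_feed(items, n=10, max_per_topic=3):
        # items: (id, topic, predicted_score) tuples; pick by score but limit each topic's share.
        feed, per_topic = [], {}
        for item_id, topic, score in sorted(items, key=lambda x: x[2], reverse=True):
            if per_topic.get(topic, 0) < max_per_topic:
                feed.append(item_id)
                per_topic[topic] = per_topic.get(topic, 0) + 1
            if len(feed) == n:
                break
        return feed

    items = [(f"trump-{i}", "politics", 0.9 - i * 0.01) for i in range(10)]
    items += [("local-1", "local", 0.5), ("science-1", "science", 0.45), ("sports-1", "sports", 0.4)]
    print(build_feed(items, n=6))  # ['trump-0', 'trump-1', 'trump-2', 'local-1', 'science-1', 'sports-1']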

Each color represents a different type of content: posts from friends, political news, funny videos, etc. Recommendation algorithms currently rank each item against the others, as on the left. However, they should rank sets of items – whole feeds – to ensure a good mix of items overall, as on the right.
Image by Dylan Hadfield-Menell

Rather than trying to create one feed algorithm that fits everyone, we could have a variety of different feeds to serve different interests and values. If you trust the BBC, you might also trust a recommendation algorithm that they create, or their recommended settings for a particular platform. Similarly, you might trust a feed created by a doctor for health information. We normally think of media pluralism as being about a diversity of sources, but it may be important to have a diversity of feeds as well.

Further reading

This list of research articles was solicited from workshop attendees. It’s by no means comprehensive, but is meant to serve as an introduction to the field of recommender alignment, including the intersection of recommenders with social issues such as addiction and polarization.

What is recommender alignment?

Jonathan Stray, Steven Adler, and Dylan Hadfield-Menell (2020)

Connects AI alignment to the practical engineering of recommender systems. Or see the paper video.

If people are getting science information (e.g. COVID19) through recommenders, then those algorithms need to understand what quality science is.

The definitive reference on AI alignment generally. Also see a five-minute video from Russell introducing the concept.

A moral and political philosophy analysis arguing that “values” for AI can only come from social deliberation.

A discussion of the challenges and opportunities for algorithmic news distribution, from the view of political theory.

Silvia Milano, Mariarosaria Taddeo and Luciano Floridi (2019)

A systematic analysis of the ethical areas of concern regarding recommender systems

How social choice theory on voting intersects with AI systems, and what problems it can’t solve.

Dylan Hadfield-Menell et al.

Introduces inverse reward design (IRD) “as the problem of inferring the true objective based on the designed reward and the training.”

Smitha Milli, Luca Belli, Moritz Hardt (2020)

Determines how to weight different interactions on Twitter (e.g. retweet vs. fav) as signals of user “value” through latent variable inference from observed use of the “see less often” control.

A description of YouTube’s deep learning-based ranking framework, and how it balances “engagement objectives” with “user satisfaction objectives.”

Lu, A. Dumitrache, D. Graus (2020)

A news organization successfully used their recsys to increase the diversity of audience reading.

A user study that tested sliders to control topics, algorithms, etc. on a recsys mockup.

Users appreciate recommender controls, even if they don’t work. Either way, people find creative workarounds to get what they want.

Jack Bandy, Nicholas Diakopoulos (2019)

An analysis of Apple News content in the human-curated vs. algorithmically curated sections

Mark Ledwich, Anna Zaitsev (2020)

Analysis of recommender traffic flows between different categories of more and less extreme content.

Consumption of far-right content on YouTube is consistent with broader patterns across the web, which complicates the causal role of recommender systems.

Participatory Design and Multi-stakeholder Recommenders

Researchers worked with multiple stakeholders in a donated food delivery non-profit, developing a quantitative ranking model from their input. Also a talk.

Samantha Robertson and Niloufar Salehi (2020)

Existing societal inequalities can constrain users’ ability to exploit algorithmically provided choices, for example due to a lack of information or the cost burden of choosing the “best” option.

A review of the field of multi-stakeholder recommendation algorithms, which attempt to simultaneously account for the needs of users, creators, platforms and the public.

Filter bubbles, polarization and conflict

Jennifer McCoy, Tahmina Rahman, and Murat Somer (2018)

One of the best reviews of why polarization is something we should care about: it can be a cause (not just a correlate) of democratic erosion.

Getting people to follow counter-ideological news sources on Facebook for a month slightly reduced polarization.

Measures polarization trends across nine OECD countries. Tends to disfavor the emergence of the internet and rising economic inequality as explanations.

In the US, growth in polarization in recent years is largest for the demographic groups least likely to use the internet and social media.

Reviews empirical investigations of filter bubbles etc. and argues that the concept is under-specified and the evidence for their existence is poor.

The most recent papers at the intersection of fairness and recommenders.

An analysis of artist exposure dynamics in Spotify’s music recommendations, and experiments with a popularity-based diversity metric.

A definition of recommender fairness in terms of pairwise rankings of items in different subgroups, and experiments in a production Google recommender.

Paying people to deactivate Facebook for four weeks caused slight increases in well-being measures and slight decreases in polarization measures.

Daria Kuss, Mark Griffiths (2017)

A review of the definition of “social networking addiction,” and the evidence for its existence, including demographic and contextual variations.

Betul Keles, Niall McCrae and Annmarie Grealish (2020)

Social media use is correlated with depression and anxiety. However, there are considerable caveats on causal inference due to methodological limitations.

A detailed report of how platforms use recommender systems and the public policy implications.

Jennifer Cobbe, Jatinder Singh (2019)

One of the most nuanced analyses of possible regulatory frameworks, focussing on “harmful” content and liability laws.

An analysis of recent Facebook and YouTube recommender changes, and how recommenders can be oriented toward well-being metrics.

Survey analysis, by Facebook researchers, of what makes a social interaction (online or off) “meaningful,” finding similarities across the U.S., India, and Japan.

Case study of multi-objective optimization in a corporate setting: metrics used at Groupon to integrate expected transaction profit, likelihood of purchase, etc.


Research on Sports Performance Prediction Based on BP Neural Network

Artificial neural networks have the advantages of self-training and fault tolerance, while the BP neural network has a simple learning algorithm and powerful learning capability; the BP neural network algorithm has therefore been widely used in practice. This paper studies sports performance prediction based on 5G and artificial neural network algorithms. It uses the BP neural network algorithm as the prediction modelling method to predict the result of the men’s 100m at the 30th Olympic Games, supported by the MATLAB neural network toolbox. According to the experimental results, the scheme proposed in this paper performs better than the other prediction strategies. There is still much work to be done to explore the feasibility and application of the BP neural network in this kind of prediction. The model has high prediction accuracy and provides a new method for the prediction of sports performance. The results show that the BP neural network algorithm can be used to predict sports performance with high prediction accuracy and strong generalization ability.

1. Introduction

The artificial neural network is a network information processing system inspired by the human brain. It has strong nonlinear dynamic processing ability and does not need to know the distribution form of the data or the relationships between variables [1, 2]. When the input-output relationships of hybrid systems are too complex to be expressed in general terms, a neural network can readily realize the highly nonlinear mapping between them [3, 4]. Neural networks have achieved good results in pattern recognition, automatic control, and many other fields, and in recent years the neural network model has been successfully applied to economic prediction. For this reason, we use the BP learning algorithm of the artificial neural network to study the prediction of sports achievement [5, 6].

Based on existing results, sports performance prediction is used for large-scale games such as the Olympic Games, the Asian Games, and the National Games. Such predictions not only provide clear training and competition goals for athletes and coaches, but also let us track and judge the development laws and characteristics of sports achievement [7]. This kind of forecast therefore occupies an important position in sports achievement prediction research: the more accurate the prediction, the more directly it influences the formulation of training plans and objectives. This paper conducts research on sports performance prediction based on 5G and artificial neural network algorithms [8]. It uses the BP neural network algorithm as the prediction modelling method to predict the result of the men’s 100m at the 30th Olympic Games, supported by the MATLAB neural network toolbox. According to the forecasting results, artificial neural networks have clear advantages. There is still much work to be done to explore the feasibility and application of BP neural networks in this kind of prediction [9].

The results show that the BP neural network algorithm can be used to predict sports performance with high prediction accuracy and strong generalization ability, and it also helps reveal the laws of achievement development and the characteristics of sports development [10]. However, because this kind of prediction is based on a small amount of data, the randomness in the process that produces the data is large, and there are many hidden influencing factors, the data are highly uncertain [11]. It is therefore difficult to ensure the accuracy of this kind of prediction, and selecting a suitable, high-precision prediction method is the key.

How to evaluate sports performance is an issue worth studying, and many scholars have studied evaluation methods for course performance. The usual method is to make a scoring rule, in which the judge marks according to how well the rule is met and adds up the scores. However, this method is greatly affected by the experience and level of the judge. The AHP method is widely used in various types of performance evaluation [12]. In comparison with the traditional direct scoring method, it achieves significant progress, combining qualitative and quantitative evaluation and improving the accuracy of the evaluation. The BP (back propagation) network was proposed by the team headed by Rumelhart and McClelland in 1986 [13, 14]; it is a multilayer feed-forward network trained using the error back-propagation algorithm. In this paper, we take advantage of the BP network to establish a prediction model, which can improve the prediction performance.

The BP network is currently one of the most widely applied neural network models. It can learn and store a large number of input-output mappings without the mathematical equation describing the mapping being specified in advance. This method has wide application prospects in the performance evaluation of sports aerobics, and the BP algorithm can describe well the nonlinear relationship between sports performance and various factors [15]. Neural network prediction, however, is limited in our research by its large training-sample requirements and weak generalization ability.

Above all, although there are many methods for sports performance prediction, their prediction time and accuracy cannot meet the technical requirements, so a new prediction algorithm is needed to improve performance. On this basis, we use the BP neural network algorithm to predict sports performance.

The contributions of this paper are summarized as follows: (1) it proposes a new prediction modelling method which combines the BP algorithm with 5G; (2) it uses the combined algorithm to predict the result of the men’s 100m at the 30th Olympic Games, supported by the MATLAB neural network toolbox.

This paper is organized as follows. Section 2 presents related work. Section 3 gives the method used to establish the prediction network. In Section 4, the experiment is presented and analyzed. Section 5 gives the training results and their analysis. Finally, Section 6 sums up the conclusions and suggests future research topics.

2. Related Work

Based on existing results, sports performance prediction is used for the Olympic Games, Asian Games, National Games, and other major sports events [16, 17]. Such predictions not only provide clear training and competition goals for athletes and coaches, but also let us track and judge the development laws and characteristics of sports achievements. The artificial neural network was first proposed in the 1940s [18]. It formed a new machine learning theory by imitating the way the human brain processes and solves problems. Structurally, an artificial neural network is composed of neurons; it mainly simulates the interactions between these neurons and embeds this mode of action into the network structure [19].

The main feature of artificial neural networks is parallel data processing. Although the structure of a single neuron is relatively simple, the structure formed by combining a large number of neurons is still very complicated. Yan et al. [20] provided a basis for the application of neural network modelling in biomechanics and opened up a broad prospect for research in this area. They took the shot put as an example and used neural network technology to establish a transformation model between feature quantities and the original information [21]. In addition, some scholars have explored the generalized inverse transformation of sports biomechanics information [22], with good restoration of the original data. Essentially, the neural network acquires knowledge through learning, so it can be used to establish more complex models of causality in human motion.

3. Methods

The BP neural network is a multilayer feed-forward neural network trained with the error back-propagation algorithm [23, 24]. It consists of an input layer, an output layer, and one or more hidden layers, each of which has several nodes; each node represents a neuron, and nodes in adjacent layers are connected by weights. The nodes between layers are fully interconnected, while there are no connections between the nodes within a layer [25, 26].

3.1. Information Forward Propagation Process

In the BP neural network, the sigmoid activation function, which is differentiable and strictly increasing, lets the output strike a good balance between linearity and nonlinearity, so the nonlinear mapping between input and output can be realized. The network is suitable for medium- and long-term forecasts and has the advantages of a good approximation effect, fast calculation speed, and high precision [27, 28]. At the same time, its theoretical basis is solid, the derivation is rigorous, the formulas are symmetric, and it has a strong nonlinear fitting ability. A network with hidden layers can represent linear or nonlinear regression models. It is generally believed that increasing the number of hidden layers can reduce network error; of course, it also complicates the network and increases the training time and the tendency to overfit. Therefore, a 3-layer BP network (that is, 1 hidden layer) was used in this study [29, 30].

The number of nodes in the hidden layer is related not only to the number of nodes in the input and output layers but also to the complexity of the problem to be solved, the type of transfer function, and the characteristics of the sample data. A common condition for determining the number of hidden-layer nodes is that it must be less than n - 1, where n is the number of training samples [31]. The numbers of inputs and outputs for the training samples in this study are 5 and 1, respectively, so the number of hidden-layer nodes was set to 3. With the parameters determined, a neural network was established and trained, and the results of the 100 m, 200 m, and 400 m events at the 30th Olympic Games were predicted; in total, the results of 7 events were predicted using rolling prediction [32, 33]. That is, the results of the 23rd to 27th Games are used to predict the result of the 28th, the 24th to 28th to predict the 29th, and the 25th to 29th to predict the 30th, forming a rolling training scheme that is repeated until the required prediction accuracy is reached [34, 35]. The result that satisfies the accuracy requirement is the final forecast, shown in Figure 1.
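
For illustration, the rolling scheme can be sketched in a few lines of Python (the paper itself uses the MATLAB neural network toolbox; the times below are made-up placeholders rather than the actual Olympic results, and scikit-learn's MLPRegressor stands in for the 5-input, 3-hidden-node, 1-output BP network):

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Placeholder winning times for Games 23-30 (illustrative values only).
    times = [9.99, 9.92, 9.96, 9.84, 9.87, 9.85, 9.69, 9.63]

    def rolling_windows(series, width=5):
        # Each sample: results of 5 consecutive Games -> result of the next one.
        X = np.array([series[i:i + width] for i in range(len(series) - width)])
        y = np.array(series[width:])
        return X, y

    X, y = rolling_windows(times)
    model = MLPRegressor(hidden_layer_sizes=(3,), max_iter=5000, random_state=0)
    model.fit(X[:-1], y[:-1])                         # train on the windows targeting the 28th and 29th Games
    print(round(float(model.predict(X[-1:])[0]), 2))  # rolling prediction for the 30th Games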


1 Answer

A little understanding of the actual meanings (and mechanics) of both loss and accuracy will be of much help here (refer also to this answer of mine, although I will reuse some parts).

For the sake of simplicity, I will limit the discussion to the case of binary classification, but the idea is generally applicable. Here is the equation of the (logistic) loss for a single sample, which is averaged over all samples to give the total loss:

loss[i] = -( y[i]*log(p[i]) + (1 - y[i])*log(1 - p[i]) )

where:

  • y[i] are the true labels (0 or 1)
  • p[i] are the predictions (real numbers in [0,1]), usually interpreted as probabilities
  • output[i] (not shown in the equation) is the rounding of p[i], converting it also to 0 or 1; it is this quantity that enters the calculation of accuracy, implicitly involving a threshold (normally 0.5 for binary classification), so that if p[i] > 0.5, then output[i] = 1, otherwise output[i] = 0.

Now, let's suppose that we have a true label y[k] = 1, for which, at an early point during training, we make a rather poor prediction of p[k] = 0.1. Then, plugging the numbers into the loss equation above:

  • the contribution of this sample to the loss is loss[k] = -log(0.1) = 2.3
  • since p[k] < 0.5, we'll have output[k] = 0, hence its contribution to the accuracy will be 0 (wrong classification)

Suppose now that, at the next training step, we are indeed getting better, with p[k] = 0.22; now we have:

  • loss[k] = -log(0.22) = 1.51
  • since it is still p[k] < 0.5, we again have a wrong classification (output[k] = 0) with zero contribution to the accuracy

Hopefully you are starting to get the idea, but let's look at one more, later snapshot, where we get, say, p[k] = 0.49; then:

  • loss[k] = -log(0.49) = 0.71
  • still output[k] = 0, i.e. wrong classification with zero contribution to the accuracy

As you can see, our classifier indeed got better in this particular sample, i.e. it went from a loss of 2.3 to 1.5 to 0.71, but this improvement has still not shown up in the accuracy, which cares only for correct classifications: from an accuracy viewpoint, it doesn't matter that we get better estimates for our p[k], as long as these estimates remain below the threshold of 0.5.

The moment our p[k] exceeds the threshold of 0.5, the loss continues to decrease smoothly as it has been so far, but now we have a jump in the accuracy contribution of this sample from 0 to 1/n, where n is the total number of samples.

Similarly, you can confirm by yourself that, once our p[k] has exceeded 0.5, hence giving a correct classification (and now contributing positively to the accuracy), further improvements of it (i.e. getting closer to 1.0) still continue to decrease the loss, but have no further impact on the accuracy.

Similar arguments hold for cases where the true label is y[m] = 0 and the corresponding estimates p[m] start somewhere above the 0.5 threshold; and even if the initial estimates of p[m] are below 0.5 (hence providing correct classifications and already contributing positively to the accuracy), their convergence towards 0.0 will decrease the loss without improving the accuracy any further.

Putting the pieces together, hopefully you can now convince yourself that a smoothly decreasing loss and a more "stepwise" increasing accuracy not only are not incompatible, but they make perfect sense indeed.
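
If you want to check the numbers yourself, a tiny script reproduces the snapshots above for a single positive sample (y = 1):

    import math

    for p in [0.10, 0.22, 0.49, 0.51, 0.80, 0.95]:
        loss = -math.log(p)        # per-sample logistic loss when the true label is 1
        correct = int(p > 0.5)     # contribution to accuracy after thresholding at 0.5
        print(f"p={p:.2f}  loss={loss:.2f}  correct={correct}")
    # The loss falls smoothly at every step, while "correct" only flips once p crosses 0.5.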

On a more general level: from the strict perspective of mathematical optimization, there is no such thing as "accuracy" - there is only the loss. Accuracy gets into the discussion only from a business perspective (and a different business logic might even call for a threshold different from the default 0.5). Quoting from my own linked answer:


Automated Machine Learning (AutoML) Walkthrough

The DataRobot Automated Machine Learning product accelerates your AI success by combining cutting-edge machine learning technology with the team you have in place. The platform incorporates the knowledge, experience, and best practices of the world's leading data scientists, delivering unmatched levels of automation, accuracy, transparency, and collaboration to help your business become an AI-driven enterprise.

Use Case

This guide will demonstrate the basics of how to build, select, deploy, and monitor a regression or classification model using the automated machine learning capabilities of DataRobot. The potential application of these capabilities spans many industries: banking, insurance, healthcare, retail, and many more. The use case highlighted throughout these examples comes from the healthcare industry.

Healthcare providers understand that high hospital readmission rates spell trouble for patient outcomes. But excessive rates may also threaten a hospital’s financial health, especially in a value-based reimbursement environment. Readmissions are already one of the costliest episodes to treat, with hospital costs reaching $41.3 billion for patients readmitted within 30 days of discharge, according to the Agency for Healthcare Research and Quality (AHRQ).

The training dataset used throughout this document is from a research study that can be found online at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3996476/. The resulting models predict the likelihood that a discharged hospital patient will be readmitted within 30 days of their discharge.

Automated Regression & Classification Modeling

STEP 1: Load and Profile Your Data

To get started with DataRobot, you will log in and load a prepared training dataset. DataRobot currently supports .csv, .tsv, .dsv, .xls, .xlsx, .sas7bdat, .bz2, .gz, .zip, .tar, and .tgz file types, plus Apache Avro, Parquet, and ORC (Figure 1). (Note: If you wish to use Avro or Parquet data, contact your DataRobot representative for access to the feature.)

These files can be uploaded locally, from a URL or Hadoop/HDFS, or read directly from a variety of enterprise databases via JDBC. Directly loading data from production databases for model building allows you to quickly train and retrain models, and eliminates the need to export data to a file for ingestion.

DataRobot supports any database that provides a JDBC driver—meaning most databases in the market today can connect to DataRobot. Drivers for Postgres, Oracle, MySQL, Amazon Redshift, Microsoft SQL Server, and Hadoop Hive are most commonly used.

After you load your data, DataRobot performs exploratory data analysis (EDA) which detects the data types and determines the number of unique, missing, mean, median, standard deviation, and minimum and maximum values. This information is helpful for getting a sense of the dataset shape and distribution. DataRobot also creates a Data Quality Assessment, which flags potential problems in your data.

Start Modeling

STEP 2: Select a Prediction Target

Next, select a prediction target (the name of the column in your data that captures what you are trying to predict) from the uploaded database (Figure 2). DataRobot will analyze your training dataset and automatically determine the type of analysis (in this case, classification).

Figure 2. Target Selection

DataRobot automatically partitions your data. If you want to customize the model building process, you can modify a variety of advanced parameters, optimization metrics, feature lists, transformations, partitioning, and sampling options. The default modeling mode is “Quick” Autopilot, which employs an effective and efficient use of DataRobot’s automation capabilities. For more control over which algorithms DataRobot runs, there are Manual and full Autopilot options. You can even configure some of the default processes carried out by Autopilot.

STEP 3: Begin the Modeling Process

Click the Start button to begin training models. Once the modeling process begins, the platform further analyzes the training data to create the Importance column (Figure 3). This Importance grading provides a quick cue to better understand the most influential variables for your chosen prediction target.

Target Leakage

If DataRobot detects target leakage (i.e., information that would probably not be available at the time of prediction), the feature is marked with a warning flag in the Importance column. If the leaky feature is significantly correlated with the target, DataRobot will automatically omit it from the model building process. It might also flag features that suggest partial target leakage. For these features, you should ask yourself whether the information would be available at the time of prediction; if it will not, remove it from your dataset to limit the risk of skewing your analysis.

You can easily see how many features contain useful information, and edit feature lists used for modeling. You can also drill down on variables to view distributions, add features, and apply basic transformations.

DataRobot Modeling Strategy

DataRobot supports popular advanced machine learning techniques and open source tools such as Apache Spark, H2O, Scala, Python, R, TensorFlow, Facebook Prophet, Keras, DeepAR, Eureqa, LightGBM, and XGBoost. During the automated modeling process, it analyzes the characteristics of the training data and the selected prediction target and selects the most appropriate machine learning algorithms to apply. DataRobot optimizes data automatically for each algorithm, performing operations like one-hot encoding, missing value imputation, text mining, and standardization to transform features for optimal results.

DataRobot streamlines model development by automatically building and ranking models (including blenders/ensembles of models) based on the techniques advanced data scientists use, including boosting, bagging, random forests, kernel-based methods, generalized linear models, deep learning, and many others. By cost-effectively evaluating a near-infinite combination of data transformations, features, algorithms, and tuning parameters in parallel across a large cluster of commodity servers, DataRobot delivers the best predictive model in the shortest amount of time.

STEP 4: Evaluate the Results of Automated Modeling

After automated modeling is complete, the Leaderboard will rank each machine learning model so you can evaluate and select the one you want to use (Figure 4). The model that is “Recommended for Deployment” is tagged so that you can begin your analysis there, or just move it forward to deployment.

When you select a model, you see options for Evaluate, Understand, Describe, and Predict. To estimate possible model performance, the Evaluate options include industry standard Lift Chart, ROC Curve, Confusion Matrix, Feature Fit, and Advanced Tuning. You can even optimize your binary classification problems using the Profit Curve tool that allows you to add custom values to specific outcomes. There are also options for measuring models by Learning Curves, Speed versus Accuracy, and Comparisons. The interactive charts to evaluate models are very detailed, but don't require a background in data science in order to understand what they convey.

Transparency

STEP 5: Review how your Chosen Model Works

DataRobot offers superior transparency, interpretability, and explainability, so you can easily understand how models were built and have the confidence to explain why a model made the prediction it did.

In the Describe tab, you can view the end-to-end model blueprint containing details of the specific feature engineering tasks and algorithms DataRobot uses to run the model (Figure 5). You can also review the size of the model and how long it ran, which may be important if you need to do low-latency scoring.

In the Understand tab, popular exploratory capabilities include Feature Impact, Feature Effects, Prediction Explanations, and Word Cloud. These all help you understand what drives the model’s predictions.

Interpreting Models: Global Impact

Feature Impact measures how much each feature contributes to the overall accuracy of the model (Figure 6). For example, the reason why a patient was discharged from a hospital is directly related to the likelihood of a patient being readmitted. This insight can be invaluable for guiding your organization to focus on what matters most.

The Feature Effects chart displays model details on a per-feature basis (a feature's effect on the overall prediction), depicting how a model understands the relationship between each variable and the target (Figure 7). It provides specific values within each column that are likely large factors in determining whether someone will be readmitted to the hospital.

Interpreting Models: Local Impact

Prediction Explanations reveal the reasons why DataRobot generated a particular prediction for a data point so you can back up decisions with detailed reasoning (Figure 8). They provide a quantitative indicator of variable effect on individual predictions.

Figure 8. Prediction Explanations

The Insights tab provides more graphical representations of your model. There are tree-based variable rankings, hotspots, and variable effects to illustrate the magnitude and direction of a feature's effect on a model's predictions, and also text mining charts, anomaly detection, and a word cloud of keyword relevancy.

Interpreting Text Features

The Word Cloud tab provides a graphic of the most relevant words and short phrases in a word cloud format (Figure 9). The tab is only available for models trained with data that contains unstructured text.

STEP 6: Generate Model Documentation

DataRobot can automatically generate model compliance documentation—a detailed report containing an overview of the model development process, with full insight into the model assumptions, limitations, performance, and validation detail. This feature is ideal for organizations in highly regulated industries that have compliance teams that need to review all aspects of a model before it can be put into production. Of course, having this degree of transparency into a model has clear benefits for organizations in any industry.

Making Predictions

STEP 7: Make Predictions

Every model built in DataRobot is immediately ready for deployment. You can:

A. Upload a new dataset to DataRobot to be scored in batch and downloaded (Figure 10).

Figure 10. Make Predictions

B. Create a REST API endpoint to score data directly from applications (Figure 11); a generic sketch of such a call appears after this list. An independent prediction server is available to support low-latency, high-throughput prediction requirements.

C. Export the model for in-place scoring in Hadoop (Figure 12). (Note that in-place scoring on Hadoop is not available for Managed AI Cloud deployments.)

D. Download scoring code, either as editable source code or self-contained executables, to embed directly in applications to speed up computationally intensive operations (Figure 13).
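For option B, the exact request format depends on your deployment, so the snippet below is only a generic sketch of posting rows to a REST scoring endpoint from R. The URL, authorization header, and payload shape are placeholders, not DataRobot's documented API; consult your deployment's integration page for the real values.

```r
# Generic sketch of scoring new rows against a REST prediction endpoint.
# The endpoint URL, auth header, and payload shape are placeholders.
library(httr)
library(jsonlite)

new_patients <- data.frame(
  num_lab_procedures = c(41, 33),
  num_medications    = c(15, 9),
  time_in_hospital   = c(3, 7)
)

resp <- POST(
  url  = "https://your-prediction-server.example.com/deployments/DEPLOYMENT_ID/predictions",
  add_headers(Authorization = "Bearer YOUR_API_TOKEN"),
  body = toJSON(new_patients, dataframe = "rows"),
  content_type_json()
)

stop_for_status(resp)                                  # fail loudly on HTTP errors
predictions <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
```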

Monitor and Manage your models

STEP 8: Monitor and Manage Deployed Models

With DataRobot you can proactively monitor and manage all deployed machine learning models (including models created outside of DataRobot) to maintain peak prediction performance. This ensures that the machine learning models driving your business are accurate and consistent throughout changing market conditions.
At a glance you can view a summary of metrics from all models in production, including the number of requests (predictions) and key health statistics:

  • Service Health looks at core performance metrics from an operations or engineering perspective: latency, throughput, errors, and usage (Figure 14).
    Figure 14. Service Health
  • Data Drift proactively looks for changes in the data characteristics over time to let you know if there are trends that could impact model reliability (Figure 15).

You can also analyze data drift to assess if the model is reliable, even before you get the actual values back. You’re essentially analyzing how the data you’ve scored this model on differs from the data the model was trained on. DataRobot compares the most important features in the model (as measured by its Feature Impact score) and how different each feature’s distribution is from the training data.

Green dots indicate features that haven't changed much. Yellow dots indicate features that have changed but aren't very important. You should examine these, but changes with these features don't necessarily mandate action, especially if you have lots of models. Red dots indicate important features that have drifted. The more red dots you have, the greater the likelihood that your model needs to be replaced.
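The exact drift metric DataRobot uses isn't described here, but the idea can be illustrated with a population-stability-index (PSI) style comparison between a feature's training distribution and its scoring distribution. The binning and data below are arbitrary and purely for illustration.

```r
# Illustrative drift check: compare a feature's scoring-time distribution to its
# training distribution with a population-stability-index (PSI) style metric.
# Not DataRobot's exact computation; the bins and data are arbitrary.
psi <- function(train_x, score_x, n_bins = 10) {
  breaks <- quantile(train_x, probs = seq(0, 1, length.out = n_bins + 1))
  breaks[1] <- -Inf; breaks[length(breaks)] <- Inf
  p_train <- as.numeric(table(cut(train_x, breaks))) / length(train_x)
  p_score <- as.numeric(table(cut(score_x, breaks))) / length(score_x)
  p_train <- pmax(p_train, 1e-6)   # avoid division by zero / log(0)
  p_score <- pmax(p_score, 1e-6)
  sum((p_score - p_train) * log(p_score / p_train))
}

set.seed(7)
train_age <- rnorm(5000, mean = 55, sd = 12)   # feature as seen at training time
score_age <- rnorm(2000, mean = 60, sd = 12)   # scored population has shifted older
psi(train_age, score_age)                      # larger values indicate more drift
```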

From here you can apply “embedded DataRobot data science” expertise to review model performance and detect model decay. By clicking on a model you can see how the predictions the model has made have changed over time. Dramatic changes here can indicate that your model has gone off track.

Replacing a Model

If you decide to replace a model that's drifted, simply paste the URL of a retrained DataRobot model (a model trained on more recent data from the same data source), or of one that has compatible features. After DataRobot validates that the model matches, you can select a reason for the replacement, which is kept in a permanent archive. From this point forward, new prediction requests will go against the new model with no impact on downstream processes. If you ever decide to restore the previous model, you can easily do that through the same process.

Prediction Applications

Once you have deployed a model you can launch a prediction application. Simply go to the Applications tab and select an application (Figure 17).

This example shows how to launch the Predictor application. The first step is to click Launch in the Applications Gallery and fill out the fields below. Click Model from Deployment and indicate the deployment from which you want to make predictions (Figure 18). Then click Launch. You can also create and evaluate different scenarios based on your model with the What-If application.

Figure 18. Launch Deployment

Once the launch is complete you will be taken to your Current Applications page. You will see your application listed; this may take a moment to finish. Now you can open the application and make predictions by filling out the relevant fields (Figure 19).

Figure 19. Create New Record

Conclusion

DataRobot’s regression and classification capabilities are available as a fully-managed software service (SaaS), or in several Enterprise configurations to match your business needs and IT requirements. All configurations feature a constantly expanding set of diverse, best-in-class algorithms from R, Python, H2O, Spark, and other sources, giving you the best set of tools for your machine learning and AI challenges.

(This community post references research from this article: Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, “Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,” BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014. )


For an overall explanation of how ROC curves are computed consider this excellent answer: https://stats.stackexchange.com/a/105577/112731

To your question: first, if you want to compare different approaches, comparing their ROC curves and area under curve (AUC) values directly will be a good idea, as those give you overall information about how powerful your approaches are on your problem.

Second: you will need to choose a threshold appropriate for your goal. The tradeoff with this is that you will need to decrease one of the TPR (true positive rate, or sensitivity), or TNR (true negative rate, or specificity) in order to increase the other - there is no way around this$^1$. So, depending on your problem, you might e.g. happen to need a low false positive rate (FPR = 1 - TNR), which in turn will require you to have a high TNR - so this will definitely depend on details of your problem.

Having said this, to choose a threshold you will usually look at both the ROC curve and the distribution of TPR and TNR over the threshold. Those should provide the required information for you to choose a reasonable tradeoff. As you want to do this in R, here's a minimal example of how this could look:
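(The original snippet isn't reproduced here; the following is an illustrative base-R stand-in using simulated data and a simple logistic regression.)

```r
# Illustrative stand-in using simulated data: fit a classifier, then inspect the
# ROC curve and how TPR/TNR trade off as the threshold moves.
set.seed(42)
n    <- 500
x    <- rnorm(n)
y    <- rbinom(n, 1, plogis(2 * x))              # true class labels
fit  <- glm(y ~ x, family = binomial)
prob <- predict(fit, type = "response")          # predicted class probabilities

thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(t) mean(prob[y == 1] >= t))   # sensitivity
tnr <- sapply(thresholds, function(t) mean(prob[y == 0] <  t))   # specificity

# ROC curve: TPR against FPR (= 1 - TNR)
plot(1 - tnr, tpr, type = "l", xlab = "FPR", ylab = "TPR", main = "ROC curve")
abline(0, 1, lty = 2)

# TPR and TNR as functions of the threshold
matplot(thresholds, cbind(tpr, tnr), type = "l", lty = 1, col = c(1, 2),
        xlab = "threshold", ylab = "rate")
legend("bottom", legend = c("TPR", "TNR"), col = c(1, 2), lty = 1)
```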

So in this example, for about equal TPR and TNR, you would want to choose a threshold around 0.5. If you instead wanted, e.g., a very low FPR, you would choose a higher threshold. After choosing a threshold, you can use the predicted class probabilities to immediately determine the predicted class:
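Again as an illustrative continuation of the sketch above:

```r
threshold  <- 0.5                                    # chosen from the plots above
pred_class <- ifelse(prob >= threshold, "positive", "negative")
table(predicted = pred_class, actual = y)            # resulting confusion matrix
```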

$^1$ For completeness: this is how each predicted class probability from your model is turned into either a "positive" prediction (usually when it lies above the threshold) or a "negative" prediction (usually when it lies below it).


Algorithms should contribute to the Happiness of Society

The following is the opening speech of Arjan van den Born, Academic Director of the Jheronimus Academy of Data Science (JADS), that we worked on for the Den Bosch Data Week of which I am the cofounder and curator.

Before we start our discussions, can I ask you: Who of you is happy? Can you raise your hand? Who of you is not happy? Here I want to teach you two key aspects of happiness. First, it is not easy to determine whether you're happy, but you know it when you are not happy. Second, it feels awkward to say in public that you're not happy. That is still somewhat of a taboo, which is remarkable in a world where stress and burnout are the epidemic diseases of our time.

Let's continue with a third question. Who has ever met a person that does not want to be happy? No one, right? Happiness is like the holy grail of life — the ultimate goal and desire of every human being. Happiness balances mind and matter, spirit and experience, yin and yang — being happy means being on the right track. Happiness is even related to hormones that make us feel good. That's why all people want to find happiness. But not all of us find it — we either don't know where to look, what to look for, or how to find it.

Today I will tell you a story about how we as data scientists, and JADS as an institution, can contribute to your happiness. And what we should avoid doing, so we won't make you less happy.

The human pursuit of happiness has been ingrained in our actions for as long as we can remember. The ancient Greek philosophers already discussed the great importance of happiness, and many scientists since have followed in the footsteps of Aristotle and Plato. A special mention should go to the utilitarians of the 18th and 19th centuries. They not only discussed the importance of happiness, but also introduced happiness as a moral compass, advocating the greatest happiness principle as our guide for ethical behavior. Jeremy Bentham stated that actions are moral when they promote happiness. The thinking of these utilitarians even made its way into the United States Declaration of Independence of 1776, in which Thomas Jefferson wrote about the universal human right to pursue happiness.

Although happiness has always been an important theme, its significance seems to be increasing in today’s world. More and more individuals are becoming aware of the great importance of pursuing happiness. Perhaps it is the loss of religion in Western society that makes us more interested in being happy. Or is it that we are losing some of our ability to be happy in this modern age with all its distractions and information overload?

Whatever the reasons, the modern homo sapiens seems to be gripped by the hunt for happiness. No wonder the most popular course in the whole history of Harvard University is Positive Psychology 1504, a.k.a. the happiness course. Here students are taught how to be happy in 22 lectures covering the psychological aspects of life fulfillment and flourishing: empathy, friendship, love, achievement, creativity, spirituality, happiness, and, of course, humor. But it is not only students who are actively hunting for happiness. It seems that increasingly all individuals are pursuing happiness with the greatest fervor. We even have magazines called Happinez, coaches who seem to be able to train us in happiness, and holidays and retreats, all of which seem to promise happiness.

And it is not only individuals who chase happiness. Organizations, cities, and even nations are pursuing it. Bhutan was the first country that aimed to maximize its Gross National Happiness, and a great many countries, regions, and cities have followed its example. Since last year we have also been measuring happiness at the municipal level. According to these measures the happiest municipalities in the Netherlands are Ede, Alphen a/d Rijn, and Amstelveen, and the least happy are Rotterdam and Amsterdam. Out of the 50 bigger cities in the Netherlands, Den Bosch scored 19th, Tilburg came in 32nd, and Eindhoven appeared at spot 38. Looking at these data, one could perhaps say that there is an inverse correlation between the results of the local professional football club and the happiness of a city's citizens. But that would say nothing about causation.

“Happiness is possible and can be ‘engineered’”

As happiness becomes more important in our society, it should not be a surprise that this subject fascinates many scientists. Happiness is nowadays a serious field of study. Here I want to pay tribute to one of the pioneers and world authorities on the scientific study of happiness, Ruut Veenhoven. He is the founding director of the World Database of Happiness and has been described as “the godfather of happiness studies”. Of all his findings, one conclusion has struck me the most: “Happiness is possible and can be ‘engineered’.”

This is where we come in. JADS is about building two bridges: the first between engineering and the social sciences, and the second between society and science. Happiness sits squarely at this crossroads. It is something to which we should and can contribute: one of the most important societal questions, and a challenge where modern, engineering-based methods bring us further.

So how can engineering, data science, and artificial intelligence contribute to our happiness? Here we are much inspired by Frans de Waal, one of the world's finest primatologists and a local boy, born and raised here in ’s-Hertogenbosch. He does not see himself as a primatologist, but rather as a psychologist specialized in primates. He once pondered that his field of study, the psychology of primates, has developed tremendously since he started working at the end of the 1970s. In the same period, “normal” psychology, the psychology of humans, has made little progress, at least by comparison with the great advances in primatology. Frans de Waal argued that perhaps the main explanation for this progress in primate psychology has been the absence of opportunity. Where “normal”, i.e. human, psychologists were able to ask people how they feel and why they do what they do, this was simply impossible for someone studying primates. To overcome this burden, primatologists started observing and measuring. They started observing which primates smile, when they smile, and in what kinds of groups they smile. And they started measuring, for instance, the stress levels in their bodies.

Measuring objective characteristics, instead of gauging the subjective, socially desirable answers too often found in interviews and surveys, paved the way for the enormous advance in our understanding of primates. This progress has been so rapid that many scholars argue that human psychology can now learn from primate psychology. For instance, Meyer and Hamel argue that studies of stress in nonhuman primates can now play a valuable role in helping to elucidate the mechanisms underlying the role of stress in humans.

The example of Frans de Waal shows the power of measuring. This should not be a huge surprise: many of the big scientific breakthroughs are related to increases in our ability to measure. Without the microscope of Anton van Leeuwenhoek we would not have known anything about the complex world of microbes. Up until that point, people had no idea that there was a whole world of microorganisms too small to be seen by the naked eye. With this discovery, it became possible to start learning about, among other things, the causes of diseases. And the list goes on and on. New instruments lead to novel measurements that confirm or refute existing theories, but they also contribute heavily to the development of new ones.

In Frans de Waal's early days, primate research was painstakingly tedious and labor-intensive. Every movement, every smile, and every change in position of the apes needed to be observed, coded, and analyzed. Identifying correlations and causes of specific primate behavior often took many years of study. Progress was slow and costly, and only through the determination and persistence of many researchers did we learn a lot about primates, and about ourselves.

Here is where IoT, data science, and AI can contribute. With sensors measuring positions, interactions, and health, and with cameras observing the actual behavior and expressions of primates, the costs of collecting and analyzing objective, real data are becoming quite low. In fact, the costs will be lower than the cost of obtaining subjective data through surveys and interviews. These new objective measurements will help us understand the complexities of being happy and pursuing happiness.

Our research project Music-As-A-Medicine, performed in close collaboration with Erasmus Medical Center, IMEC, Deloitte, and the Rotterdam Philharmonic Orchestra, shows one of the ways in which sensors may help. Researchers at Erasmus and elsewhere have already established the beneficial relationship between music and health. In a clinical trial, patients undergoing a specific type of surgery get a green card or a red card. Those with the red card just get the surgery; those with the green card get the surgery plus headphones with soothing music (Mozart). Early findings indicate that the patients with headphones recover more quickly and more robustly, with fewer relapses. This confirms studies in other settings: we already know that premature babies in incubators grow more quickly and more healthily with music than without.

But there are things we still do not understand. We do not understand which kind of music is most beneficial. Is it Mozart, Mendelssohn, or Metallica? Or does this depend on the type of person? And are all persons equally receptive, or are some more sensitive to the effects of music than others? To gain a better understanding of these differences we are planning a grand experiment in which the Rotterdam Philharmonic will play 10 different pieces of music to over 2000 persons wearing 3 types of sensors (e.g. blood pressure, heart rate variability) to get a better understanding of the relationship between music and stress. Will this experiment lead to new insights? We will never know until we try.

The experiment described above points to one area where data science may lead to new discoveries: the complex science of personalization. That we are a diverse species is well known by the marketeers of this world. A famous saying in marketing is: “The quickest way to ruin a customer experience is to treat everyone the same”. People are simply not the same. It is not, as Ford used to think, that we all want a black car. No, we want different colors and different types of cars. We want sports cars and SUVs, and we want them in red, blue, and silver — and hundreds of different colors. While this is well known in the world of marketing analytics, in many other fields it looks like we are just beginning to discover our inherently human diversity. Increasingly we know that different people may require different types of medicine depending on their gender, age, environment, and DNA. Here data science may come to the rescue.

To be clear, I am not promoting a “Big Brother is Watching You” kind of scheme. I have great trouble believing in initiatives with lofty goals where technology and engineering alone are touted as the answer. For instance, Dubai aims to become “The happiest city on earth”. While such objectives are quite admirable, I do not feel that big data technologies such as sensors placed throughout the city, integrated with all sorts of feedback mechanisms, will lead to more happiness. On the contrary: what is happening in China with the development of its “social credit” scoring system is as appalling as it is frightening.

The remarks made last week in Brussels by Apple’s CEO Tim Cook on the emerging data industrial complex are spot on. He mentioned that a world where likes & dislikes, fears & wishes, and hopes & dreams are traded like commodities should unsettle us. A world where businesses embrace the data surveillance business model and routinely harvest data to nudge us into appropriate behavior may be technologically possible, but it is not socially desirable. Here Tim Cook tweeted suitably: “Technology is capable of doing great things. But it does not want to do great things. It does not want anything.” If we want technology to help us become better societies, better families, and even better versions of ourselves, we need to take control over technology. Not the other way round.

“algorithms should always contribute to the happiness of society”

To be clear: data surveillance will not make us happier. Cities with cameras that know whether you're happy or not (or whether you're considered a threat) should scare any citizen. I believe that academia should lead by example. I therefore agree wholeheartedly with my colleague Wil van der Aalst in his drive for Responsible Data Science. We need to ensure that our science and the algorithms we develop are fair, accurate, confidential, and transparent. I suggest that we add John Stuart Mill's greatest happiness principle to these existing four criteria of responsible data science. That is: algorithms should always contribute to the happiness of society. Only when data science meets all five characteristics (fairness, accuracy, confidentiality, transparency, and happiness) might we keep and rebuild society's trust in algorithms and other data-powered services.

Given all the above, and coming back to our subject: how can data science lead to more happiness? I think data science will add to our happiness if we follow the great example of Frans de Waal and use objective, measurable data to gain a better understanding of humans. We can use novel measurements to build better theories and to give better, practical advice. And not only in the field of happiness: a focus on measurement will help us understand many things better, from rather mundane issues, such as how to recognize and develop talent, to the really important stuff, such as what causes and cures diseases like Alzheimer's and cancer. Of course we can use the power of statistics to promote happiness and diminish stress and burnout.

But there is a caveat: we should not follow the examples of China or Dubai. We do not believe it is necessary to measure everything all of the time. We must always remember that data is just there to help us answer questions, and it is up to human creativity to ask the right questions. We believe instead that sets of interesting, well-designed experiments will be able to teach us more than any Orwellian state will ever learn. To conclude: data science will lead to more happiness if we do not forget that data science is NOT about IT; it is about the ideas behind technology.



In this document we describe the current methodology behind our predictive model and discuss some interesting ideas and problems with prediction in golf more generally. We have previously written about our first attempt at modelling golf here, which I would recommend reading, although it is not necessary for following this article. This document is a little more technical than the previous one, so if you are struggling to follow along it is probably worth reading the first methodology blog.

The goal of this prediction exercise is to estimate probabilities of certain finish positions in golf tournaments (e.g. winning, finishing in the top 10). We are going to obtain these estimates by specifying a probability distribution for each golfer's scores. With those distributions in hand, the probability of any tournament result can be estimated through simulation. Let's dig into the details.

The estimating sample includes data from 2004 onwards on the PGA, European, and Korn Ferry tours, data from 2010 onwards on the remaining OWGR-sanctioned tours, and amateur data from GolfStat and WAGR since 2010. We use a regression framework to predict a golfer's adjusted score in a tournament-round using only information available up to that date. This seems to be a good fit for our goals with this model (i.e. predicting out-of-sample), while you could maybe argue the model in (1) would be better at describing data in-sample. In this first iteration of the model, the main input to predict strokes-gained is a golfer's historical strokes-gained (seems logical enough, right?). We expect that recent strokes-gained performances are more relevant than performances further in the past, but we will let the data decide whether and to what degree that is the case. For now, suppose we have a weighting scheme: that is, each round a golfer has played moving back in time has been assigned a weight. From this we construct a weighted average and use that to predict a golfer's adjusted strokes-gained in their next tournament-round. Also used to form these predictions are the number of rounds that the weighted average is calculated from, and the number of days since a golfer's last tournament-round. More specifically, predictions are the fitted values from a regression of adjusted strokes-gained in a given round on the set of predictors (weighted average SG up to that point in time, rounds played up to that point in time, days since last tournament-round) and various interactions of these predictors. The figure below summarizes the predictions the model makes: we plot fitted values as a function of how many rounds a golfer has played, for a few different values of the weighted strokes-gained average:
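To make the setup concrete, here is a rough sketch of the two pieces: a decayed weighted average of past adjusted strokes-gained, and a regression of a round's adjusted strokes-gained on that average plus rounds played and days since the last round. The geometric decay rate, the simulated data, and the exact interaction structure are illustrative only, not our actual specification.

```r
# Sketch of the prediction setup; decay rate, data, and model form are illustrative.
weighted_sg <- function(past_sg, decay = 0.98) {
  # past_sg ordered most recent first; weights decay geometrically with round order
  w <- decay ^ (seq_along(past_sg) - 1)
  sum(w * past_sg) / sum(w)
}

set.seed(3)
n <- 2000
rounds <- data.frame(
  wavg_sg    = rnorm(n),                   # weighted average of prior adjusted SG
  n_rounds   = sample(1:200, n, TRUE),     # rounds the average is based on
  days_since = sample(3:120, n, TRUE)      # days since last tournament-round
)
# Simulated outcome: predictions should regress to the mean more when n_rounds is small
rounds$adj_sg <- with(rounds,
  wavg_sg * n_rounds / (n_rounds + 30) - 0.02 * log(days_since) + rnorm(n, 0, 2.5))

fit <- lm(adj_sg ~ wavg_sg * log1p(n_rounds) + days_since, data = rounds)
head(fitted(fit))   # fitted values = predicted adjusted strokes-gained
```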

The second takeaway is the pattern of discounting as a function of the number of rounds played. As you would expect, the smaller the sample of rounds we have at our disposal, the more a golfer's past performance is regressed to the mean. As the number of rounds goes to zero, our predictions converge towards about -2 adjusted strokes-gained. It should also be pointed out that another input to the model is which tour (PGA, Euro, or Web.com) the tournament is a part of: this has an impact on very low-data predictions, as rookies / new players are generally of different quality on different tours.

The predicted values from this regression are our estimates for the player-specific means. What about player-specific variances? These are estimated by analyzing the residuals from the regression model above. The residuals are used because we want to estimate the part of the variance in a golfer's adjusted scores that is not due to their ability changing over time. We won't cover the details of estimating player-specific variances, but will make two general points. First, golfers for whom we have a lot of data have their variance parameter estimated using just their own data, while golfers with less data available have their variance parameters estimated by looking at similar golfers. Second, estimates of variance are not that predictive (i.e. high-variance players in 2017 will tend to have lower variances in 2018). Therefore, we regress our variance estimates towards the tour average (e.g. a golfer who had a standard deviation of 3.0 in 2018 might be given an estimate of 2.88 moving forward).
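As a toy illustration of that kind of shrinkage (the n / (n + k) weight and all the constants are placeholders, not our actual estimator):

```r
# Toy illustration of shrinking a player-specific standard deviation toward the
# tour average; the n / (n + k) weight and the constants are placeholders.
shrunk_sd <- function(player_sd, n_rounds, tour_sd = 2.75, k = 100) {
  w <- n_rounds / (n_rounds + k)
  sqrt(w * player_sd^2 + (1 - w) * tour_sd^2)
}

shrunk_sd(player_sd = 3.0, n_rounds = 300)   # lots of data: stays close to 3.0
shrunk_sd(player_sd = 3.0, n_rounds = 20)    # little data: pulled toward 2.75
```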

With an assumption of normality, along with estimates (or, predictions) of each golfer's mean adjusted strokes-gained and the variance in their adjusted strokes-gained, we can now easily simulate a golf tournament. Each iteration draws a score from each golfer's probability distribution, and through many iterations we can define the probability of some event (e.g. golfer A winning) as the number of times it occurred divided by the number of iterations. (As indicated earlier, in practice we do not actually assume scores are normally distributed — although we do still use the player-specific variance parameters to inform the shape of the distributions we use for simulation.)
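A bare-bones version of that simulation, using normal draws for simplicity (which, as just noted, is not what the production model actually assumes):

```r
# Bare-bones tournament simulation: draw each golfer's 4-round total from their
# own distribution and tally wins across many iterations.
set.seed(11)
field <- data.frame(
  golfer  = c("A", "B", "C", "D"),
  mean_sg = c(2.1, 1.4, 0.6, -0.3),   # predicted adjusted SG per round
  sd_sg   = c(2.7, 2.9, 2.6, 3.1)     # predicted round-to-round standard deviation
)

n_sims <- 20000
wins   <- setNames(numeric(nrow(field)), field$golfer)

for (i in seq_len(n_sims)) {
  totals <- rnorm(nrow(field), 4 * field$mean_sg, 2 * field$sd_sg)  # 4-round totals
  winner <- which.max(totals)                                       # higher SG is better
  wins[winner] <- wins[winner] + 1
}

wins / n_sims   # estimated win probabilities
```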

So what do we find? The coefficients are, roughly, \( \beta_1 = 1.2 \), \( \beta_2 = 1 \), \( \beta_3 = 0.9 \), and \( \beta_4 = 0.6 \). Recall their interpretation: \( \beta_1 \) can be thought of as the predicted increase in total strokes-gained from having a historical average SG:OTT that is 1 stroke higher, holding constant the golfer's historical performance in all other SG categories. Therefore, the fact that \( \beta_1 \) is greater than 1 is very interesting (or worrisome?!). Why would a 1 stroke increase in historical SG:OTT be associated with a greater than 1 stroke increase in future total strokes-gained? We can get an answer by looking at our subregressions: using future SG:OTT as the dependent variable, the coefficient is close to 1 (as we would perhaps expect); using future SG:APP, the coefficient is 0.2; and for the other two categories the coefficients are both roughly 0. So, if you take these estimates seriously (which we do; this is a robust result), this means that historical SG:OTT performance has predictive power not only for future SG:OTT performance, but also for future SG:APP performance. That is interesting. It means that for a golfer who is currently averaging +1 SG:OTT and 0 SG:APP, we should predict their future SG:APP to be something like +0.2. A possible story here is that a golfer's off-the-tee play provides some signal about their general ball-striking ability (which we would define as being useful for both OTT and APP performance). The other coefficients fall in line with our intuition: putting is the least predictive of future performance.

How can we incorporate this knowledge into our predictive model to improve its performance? The main takeaway from the work above is that the strokes-gained categories differ in their predictive power for future strokes-gained performance (with OTT > APP > ARG > PUTT). However, a difficult practical issue is that we only have detailed strokes-gained data for a subset of our sample: namely, PGA Tour events that have ShotLink set up on-site. We incorporate our findings by using a reweighting method for each round that has detailed strokes-gained data available; if the SG categories aren't available, we simply use total strokes-gained. In this reweighting method, if there were two rounds that were both measured as +2 total strokes-gained, with one mainly due to off-the-tee play while the other was mainly due to putting, the former would be increased while the latter would be decreased. To determine which weighting works best, we just evaluate out-of-sample fit (discussed below). That's why prediction is relatively easy, while causal inference is hard.

One thing that becomes clear when testing different parameterizations is how similarly they perform overall despite disagreeing in their predictions quite often. This is troubling if you plan to use your model to bet on golf. For example, suppose you and I both have models that perform pretty similarly overall (i.e. have similar mean-squared prediction error), but also disagree a fair bit on specific predictions. This means that both of our models would find what we perceive to be "value" in betting on some outcome against the other's model. However, in reality, there is not as much value as you think: roughly half of those discrepancies will be cases where your model is "incorrect" (because we know, overall, that the two models fit the data similarly). This is not exactly a deep insight: it simply means that assuming your model's odds are *truth* is an unrealistic best-case scenario for calculating expected profits.

The model that we select through the cross validation exercise has a weighting scheme that I would classify as "medium-term": rounds played 2-3 years ago do receive non-zero weight, but the rate of decay is fairly quick. Compared to our previous models this version responds more to a golfer's recent form. In terms of incorporating the detailed strokes-gained categories, past performance that has been driven more by ball-striking, rather than by short-game and putting, will tend to have less regression to the mean in the predictions of future performance.

To use the output of this model — our pre-tournament estimates of the mean and variance parameters that define each golfer's scoring distribution — to make live predictions as a golf tournament progresses, there are a few challenges to be addressed.

First, we need to convert our round-level scoring estimates to hole-level scoring estimates. This is accomplished using an approximation which takes as input our estimates of a golfer's round-level mean and variance and gives as output the probability of making each score type on a given hole (i.e. birdie, par, bogey, etc.).

The third challenge is updating our estimates of player ability as the tournament progresses. This can be important for the golfers that we had very little data on pre-tournament. For example, if for a specific golfer we only have 3 rounds to make the pre-tournament prediction, then by the fourth round of the tournament we will have doubled our data on this golfer! Updating the estimate of this golfer's ability seems necessary. To do this, we have a rough model that takes 4 inputs: a player's pre-tournament prediction, the number of rounds that this prediction was based off of, their performance so far in the tournament (relative to the appropriate benchmark), and the number of holes played so far in the tournament. The predictions for golfers with a large sample size of rounds pre-tournament will not be adjusted very much: a 1 stroke per round increase in performance during the tournament translates to a 0.02-0.03 stroke increase in their ability estimate (in units of strokes per round). However, for a very low data player, the ability update could be much more substantial (1 stroke per round improvement could translate to 0.2-0.3 stroke updated ability increase).
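A stylized version of that update is sketched below; the weight function and its constants are invented purely so that the outputs land in the ranges quoted above, and are not the actual model.

```r
# Stylized within-tournament ability update. The weight function and constants
# are invented to roughly match the magnitudes quoted in the text
# (about 0.02-0.03 for data-rich players, 0.2-0.3 for very low-data players).
update_ability <- function(pre_ability, n_pre_rounds, perf_so_far, holes_played) {
  rounds_played_now <- holes_played / 18
  w <- rounds_played_now / (rounds_played_now + 0.2 * n_pre_rounds + 2)
  pre_ability + w * (perf_so_far - pre_ability)   # shift toward in-tournament form
}

# Established player: +1 SG/round so far barely moves the estimate
update_ability(pre_ability = 1.0, n_pre_rounds = 200, perf_so_far = 2.0, holes_played = 18)
# Player with only 3 pre-tournament rounds: the same evidence moves it much more
update_ability(pre_ability = 0.0, n_pre_rounds = 3,   perf_so_far = 1.0, holes_played = 18)
```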

With these adjustments made, all of the live probabilities of interest can be estimated through simulation. For this simulation, in each iteration we first draw from the course difficulty distribution to obtain the difficulty of each remaining hole, and then we draw scores from each golfer's scoring distribution taking into account the hole difficulty.

The clear deficiency in earlier versions of our model was that no course-specific elements were taken into account. That is, a given golfer had the same predicted mean (i.e. skill) and variance regardless of the course they were playing. After spending a few months slumped over our computers, we can now happily say that our model incorporates both course fit and course history for PGA Tour events. (For European Tour events, the model only includes course history adjustments.) Further, we now account for differences in course-specific variance, which captures the fact that some courses have more unexplained variance (e.g. TPC Sawgrass) than others (e.g. Kapalua).

This will be a fairly high-level explainer. We'll tackle course fit and then course-specific variance in turn. The approach to course fit that was ultimately successful for us was, ironically, the one we described in a negative light a year ago. For each PGA Tour course in our data we estimate the degree to which golfers with certain attributes under- or over-perform relative to their baselines (where a golfer's baseline is their predicted skill level at a neutral course). The attributes used are driving distance, driving accuracy, strokes-gained approach, strokes-gained around-the-green, and strokes-gained putting. More concretely, we correlate a golfer's performance (i.e. strokes-gained) at a specific course with an estimate of their skill level in each attribute (distance, accuracy, approach, etc.) at that point in time. Attribute-specific skill levels are obtained using methods analogous to those described in an earlier section to obtain golfers' overall skill level. For example, a player's predicted driving distance skill at time t is equal to a weighted average of previous adjusted (for field strength) driving distance performances, with more recent rounds receiving more weight, regressed appropriately depending on how many rounds comprise the average. The specific weighting scheme differs by characteristic: not surprisingly, past driving distance and accuracy are very predictive of future distance and accuracy, and consequently relatively few rounds are required to precisely estimate these skills. Conversely, putting performance is much less predictive, which results in a longer-term weighting scheme and stronger regression to the mean for small samples.

With estimates of golfer-specific attributes in hand, we can now attempt to estimate a course-specific effect for each attribute on performance — for example, the effect of driving distance on performance (relative to baseline) at Bethpage Black. The main problem when attempting to estimate course-specific parameters is overfitting. Despite what certain sections of Golf Twitter would have you believe, attempting to decipher meaningful course fit insights from a single year of data at a course is truly a hopeless exercise. This is true despite the fact that a year's worth of data from a full-field event yields a nominally large sample size of roughly 450 rounds. Performance in golf is mostly noise, so to find a predictive signal requires, at a minimum, big sample sizes (it also requires that your theory makes some sense). To avoid overfitting, we fit a statistical model known as a random effects model. It's possible to understand its benefits without going into the details. Consider estimating the effect of our 5 attributes on performance-to-baseline separately for each course: it's easy to imagine that you might obtain some extreme results due to small sample sizes. Conversely, you could estimate the effect of our 5 golfer attributes on performance-to-baseline by pooling all of the data together: this would be silly as it would just give you an estimate of 0 for all attributes (as we are analyzing performance relative to each golfer's baseline, which has a mean of zero, by definition). The random effects model strikes a happy medium between these two extremes by shrinking the course-specific estimates towards the overall mean estimate, which in this case is 0. This shrinkage will be larger at courses for which we have very little data, effectively keeping their estimates very close to zero unless an extreme pattern is present in the course-specific data. Here is a nice interactive graphic and explainer if you want more intuition on the random effects model. Switching to this class of model is one of the main reasons our course fit efforts were more successful this time around.
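In R, this kind of model can be fit with the lme4 package. The sketch below is a simplified, simulated illustration (two attributes instead of five, made-up column names) intended only to show the shrinkage mechanic; it is not our actual specification.

```r
# Simplified random-effects illustration with lme4: course-specific slopes for
# golfer attributes, shrunk toward the overall (near-zero) estimates.
library(lme4)

set.seed(5)
n <- 6000
rounds <- data.frame(
  course       = factor(sample(paste0("course_", 1:40), n, replace = TRUE)),
  driving_dist = rnorm(n),   # standardized attribute skill estimates
  driving_acc  = rnorm(n)
)
# Simulated performance-to-baseline: a couple of courses genuinely reward accuracy
acc_effect <- ifelse(rounds$course %in% c("course_1", "course_2"), 0.5, 0)
rounds$perf_vs_baseline <- acc_effect * rounds$driving_acc + rnorm(n, 0, 2.8)

fit <- lmer(
  perf_vs_baseline ~ 0 + driving_dist + driving_acc +
    (0 + driving_dist + driving_acc | course),
  data = rounds
)

head(coef(fit)$course)   # per-course slopes, pulled toward the overall estimates
```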

What are the practical effects of incorporating course fit? While in general the differences between the new model, which includes both course fit and course history adjustments, and the previous one (which we'll refer to as the baseline model) are small, there are meaningful differences in many instances. If we consider the differences between the two models in terms of their respective estimated skill levels (i.e. the player-specific means), the 25th and 75th percentiles are -0.07 and +0.07, while the minimum and maximum are -0.93 and +0.95 (units are strokes per round). I can't say I ever thought there would come a day when we would advocate for a 1 stroke adjustment due to course fit! And yet, here we are. Let's look at an example: before the 2011 Mayakoba Classic at El Camaleon Golf Club, we estimated Brian Gay to be 21 yards shorter off the tee and 11 percentage points more accurate (in fairways hit per round) than the PGA Tour average. This made Gay an outlier in both skills, sitting more than 2 standard deviations away from the tour mean. Furthermore, El Camaleon is probably the biggest outlier course on the PGA Tour, with a player's driving accuracy having almost twice as much predictive power on performance as their driving distance (there are only 11 courses in our data where driving accuracy has more predictive power than distance). Therefore, at El Camaleon, Gay's greatest skill (accuracy) is much more important to predicting performance than his greatest weakness (distance). Further, Gay had good course history at El Camaleon, averaging 1.2 strokes above baseline in his 12 previous rounds there. (It's worth pointing out that we estimate the effects of course history and course fit together, to avoid 'double counting'. That is, good course fit will often explain some of a golfer's good course history.) Taken together, this resulted in an upward adjustment of 0.9 strokes per round to Gay's predicted skill level for the 2011 Mayakoba Classic.

When evaluating the performance of this new model relative to the baseline model, it was useful to focus our attention on observations where the two models exhibit large discrepancies. The correlation between the two models' predicted skill levels in the full sample is still 0.99; as a consequence, the difference between their respective prediction errors on the full sample will always be relatively small. However, by focusing on observations where the two models diverge substantially, it becomes clear that the new model is outperforming the baseline model.

As previously alluded to, the second course-specific adjustment we've made to our model is the inclusion of course-specific variance terms. This means that the player-specific variances will all be increased by some amount at certain courses and decreased at others. It's important to note that we are concerned with the variance of 'residual' scores here, which are the deviations of players' actual scores from our model predictions (this is necessary to account for the fact that some courses, like Augusta National, have a higher variance in total scores in part because there is greater variance in the predicted skill levels of the players there). All else equal, adding more unexplained variance — noise — to scores will bring the model's predicted win probabilities (for the tournament, for player-specific matchups, etc.) closer together. That is, Dustin Johnson's win probability at a high residual-variance course will be lower than it is at a low-variance course, against the same field. In estimating course-specific variances, care is again taken to ensure we are not overfitting. Perhaps surprisingly, course-specific variances are quite predictive year-over-year, leading to some meaningful differences in our final course-specific variance estimates. Examples of some of the lowest-variance courses are Kapalua, Torrey Pines (North), and Sea Island (Seaside); some of the highest-variance courses are Muirfield Village, TPC Sawgrass, and Whistling Straits. A subtle point to note here is that a course can simultaneously have high residual variance and also be a course that creates greater separation amongst players' predicted skill levels. For example, at Augusta National, golfers with above-average driving distance, who tend to have higher baseline skill levels, are expected to perform above their baselines; additionally, Augusta National is a course with above-average residual variance. Therefore, whether we would see the distribution of win probabilities narrow or widen at Augusta (relative to a typical PGA Tour course) depends on which of these effects dominates.

There are a few important changes to the 2021 model. First, we are now incorporating a time dimension into our historical strokes-gained weighting scheme. This was an important missing element from earlier versions of the model. For example, when Graham DeLaet returned in early 2020 after a 1-year hiatus from competitive golf, our predictions were mostly driven by his data from 2018 and earlier, even after DeLaet had played a few rounds in 2020. It seems intuitive that more weight should be placed on DeLaet's few rounds from 2020 given the absence of 2019 data (compared to a scenario where he had played a full 2019 season). Using a weighting function that decays with time (e.g. days) achieves this. However, continuing with the DeLaet example, there is still lots of information contained in his pre-2019 rounds. Therefore we use an average of our two weighted averages: the first weights rounds by the sequence in which they were played, ignoring the time between rounds, while the second assigns weights based on how recently the round was played. In DeLaet's case, if he were playing this week (Jan 4, 2021), his time-weighted predicted strokes-gained would be -3.1, while his sequence-weighted predicted strokes-gained would be -1.1. Ultimately we combine these two predictions and end up with a final prediction of -2.5 (we don't just take a simple average of the two; it depends on a couple of parameters). The difference between this value and the sequence-weighted average is what appears in the "timing" column on the skill decomposition page. Have a look at DeLaet's true strokes-gained plot to understand why the different weighting methods cause such a large divergence in estimated skill. For golfers who are playing a steady schedule, there will not be large differences between the time-weighted and sequence-weighted strokes-gained averages. However, for players who play an above-average number of rounds per year (e.g. Sungjae Im), the time-weighting will tend to de-emphasize their most recent data.
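A toy contrast of the two weighting schemes (decay constants and scores made up) for a player whose few most recent rounds are poor but whose older, larger body of work is better:

```r
# Toy contrast of sequence-weighted vs. time-weighted averages; decay constants
# and scores are made up. Most recent round first.
sg       <- c(-3.0, -2.6, -2.9,  0.8,  1.1,  0.6)   # recent poor rounds, better older ones
days_ago <- c(  10,   17,   24,  400,  410,  420)   # long layoff between the two groups

seq_w  <- 0.95 ^ (seq_along(sg) - 1)   # decays with the order rounds were played
time_w <- 0.997 ^ days_ago             # decays with how long ago they were played

weighted.mean(sg, seq_w)    # still leans heavily on the older, better rounds
weighted.mean(sg, time_w)   # dominated by the few recent, poor rounds
```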

A second change to the 2021 model is that we are using yet another method to incorporate the strokes-gained categories into our baseline (i.e. pre-course fit / history) predictions. As we've said elsewhere, it is surprisingly difficult to use the strokes-gained categories in a way that doesn't make your predictions worse. This is because not all PGA Tour and European Tour events have strokes-gained data (which reminds me: another new thing for this season is that we have added European Tour SG category data). Therefore, if you were to leverage the category SG data but (necessarily) only use rounds with detailed SG data, you would be outperformed by a model that only uses total strokes-gained but uses data from all events. This highlights the importance of recent data in predicting golf performance. Our previous strokes-gained category adjustment method involved predicting performance in the SG categories using total SG in rounds where the categories weren't available, and then estimating skill in each SG category using a combination of the real and imputed SG data. This worked reasonably well but had its drawbacks. I'll omit the details on our current method, but it no longer uses imputed SG data. The general goal when incorporating the SG categories is to account for the fact that estimating skill in ARG / PUTT requires a larger sample size than does APP / OTT. Therefore, if a golfer's recent performance is driven by a short-term change in ARG or PUTT, their SG adjustment will be in the opposite direction of that recent change (e.g. a short-term uptick in putting performance from a golfer's long-term baseline results in a negative SG category adjustment). The logic is reversed for OTT: if the recent uptick in performance is driven by OTT play, the SG category adjustment will be positive.

A third model update involves how our predictions are updated between rounds within a tournament. In the past we have been a bit lazy when predicting a golfer's Round 2 (R2) performance given their R1 performance (plus historical pre-tournament data). Now we have explicitly estimated what that update should be, and, interestingly, we also allow the weight applied to a golfer's R1 performance to vary depending on a few factors. These factors include the number of days since a golfer last played an event (this increases the weight on R1 performance when predicting R2, all else equal), the total number of rounds a golfer has played (this really only matters for players with fewer than about 50 rounds), and also a measure of the overall recency of a player's data. For example, as mentioned above, Sungjae Im is a golfer who doesn't take many weeks off; therefore, when predicting his R2 performance, Im's R1 score is weighted less than that of the typical tour player. It should be clear that this is tightly linked to the ideas behind using a time-weighted decay: the further into the past the bulk of a golfer's historical data lies, the more weight their R1 performance will receive when predicting R2. Similar logic is applied when predicting R3 and R4 performance. This has obvious implications for our model when predicting performance after a tournament starts (e.g. for the live model, R2 matchups), but it also matters for pre-tournament predictions. The stronger the correlation between within-tournament performances (e.g. the larger the weight on R1 when predicting R2), the larger is the variance in a golfer's tournament outcomes. Thinking through the simulation of a golf tournament can help clarify this: if a golfer plays well in the first round, their predicted skill for the second-round simulation is increased, while if they play poorly their R2 skill is decreased. The larger the weight on R1 performance, the wider the range of their possible predicted skill levels for R2, which in turn leads to a wider range of potential scores. Compared to our previous model, this change in within-tournament weighting does not make a huge difference for pre-tournament predictions, but it will be noticeably different once the tournament starts and skill levels are actually updated. I'm also excited about how the differential weighting can handle players coming off of long layoffs, as I've always felt these golfers' skill estimates should respond more to their R1/R2/R3 performances. A final point is that we also incorporate the strokes-gained categories into the within-tournament updates when possible (leveraging the fact that, e.g., outperforming one's baseline in OTT in R1 is much more informative than doing the same with putting).

Finally, the European Tour version of our model in 2021 will now include course fit adjustments. Course fit, along with the aforementioned addition of European Tour strokes-gained category data, should bring the quality of our Euro predictions up to the level of our PGA predictions.

There have been enough changes in our model recently to warrant a deviation from our previous schedule of once-a-year written updates. We also think it's important for our users, as we try to add model customizability to the site, to know what our model is taking into account in order to understand the dimensions along which it can be improved. Therefore this is partly an attempt to put more relevant information about our model in one place. Now, to the updates.

First — and this was actually added sometime in late 2020 — the effects of "pressure" are now included in the model. This is an adjustment we make to players' predicted skill in the third and fourth rounds that depends on their position on the leaderboard to start the day. This is not a player-specific adjustment, and does not vary depending on player characteristics either (e.g. elite vs. non-elite golfers). We haven't found meaningful differences in performance relative to expectation when leading across categories of players — it seems everyone performs worse when near the lead. There are a lot more details on the pressure-performance relationship in this blog and on this interactive page.

Second, we recently revisited the question of the optimal decay rates for the various weighting schemes used on a player's historical data. Relative to the market, it seems our model weights longer-term data more heavily. For some context, when using only total strokes-gained to predict future performance, our model places about 70% of the weight on the most recent 50 rounds for a golfer who has been playing a full schedule. This is for the sequence-weighted average described in the section above this one. Also mentioned in that section, and new for 2021, was the time-weighted average. We have now made that weighting scheme slightly more short-term oriented. The interesting, general thing we learned by revisiting these decay rates is that the optimal weighting scheme for a specific weighted average shouldn't be chosen in isolation. For example, if we were only to use a sequence-weighted average to predict performance, then the optimal decay would be larger (i.e. more weight on the short term) than if we use it in conjunction with a time-weighted average. In this specific case, I think that makes sense, as the role of the sequence-weighted average is in part to pick up longer trends in performance when a player hasn't played much recently.
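As a concrete way to see what a statement like "70% of the weight on the most recent 50 rounds" implies for a simple geometric decay, the snippet below solves for the per-round decay rate under an assumed 250-round history. It is only an illustration; our actual weighting scheme is not a pure geometric decay.

```r
# For a geometric sequence weighting w_k = r^k over an assumed 250-round history,
# find the per-round decay rate r that puts ~70% of the weight on the latest 50 rounds.
share_recent <- function(r, recent = 50, total = 250) {
  w <- r ^ (0:(total - 1))          # most recent round first
  sum(w[1:recent]) / sum(w)
}

r <- uniroot(function(r) share_recent(r) - 0.70, interval = c(0.5, 0.9999))$root
r                 # implied decay rate per round
share_recent(r)   # check: approximately 0.70
```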

The other weighting schemes we revisited are those used on the specific strokes-gained categories. Omitting the details, we are also now slightly more short-term focused on all the SG categories (for the same reason specified above — when using the categories together instead of in isolation, it appears that short-term weighting schemes are slightly better). The upshot is that the strokes-gained category adjustments used to be somewhat biased towards longer-term data. That is, even ignoring differential performance in the categories, which is what we want to be driving that adjustment, if a player was performing well in the short term they were likely to be receiving a negative SG adjustment. Going forward this will no longer be an issue. While on the topic of the SG adjustment, it's worth mentioning the European Tour SG category data. As discussed here, because most of the Euro SG category data are event-level averages, we have to impute the round-level values. This is not a huge issue, but it does make it difficult to actually fit a model for predicting strokes-gained categories at the round level on the European Tour. As a result, we have to rely on our PGA Tour models for the SG categories and hope the relationships in that data also hold in the Euro data. The degree to which this works can still ultimately be tested by looking at whether our overall predictions are improved. However, again there are issues: we only have 4 years of European Tour strokes-gained category data, which is not quite enough to get precise answers. We want to determine if we can use the European SG category data in some way to improve over our baseline predictions, which are based off only total SG data; these two estimates of skill will inevitably be very highly correlated, and so over 4 years there actually aren't that many instances where the two methods disagree substantially (allowing for their respective predictive performance to be compared). In any case, the practical takeaway here is that we are slightly decreasing the size of the SG category adjustments we make on the European Tour.

With regards to the overall long-term versus short-term focus of our model, it is useful to consider two recent examples of players that we didn't update on as quickly as the market did: Jordan Spieth and Lee Westwood. They are instructive examples of cases that our model might not handle well, for different reasons. In the case of Spieth, part of the reason the market reacted so quickly was that Spieth had proven himself to be a world-beater early in his career. The idea seemed to be that once Spieth flashed some of his old form we could essentially ignore the data from the two-year slump he was pulling himself out of. While I obviously don't agree with ignoring the slumping-Spieth data, I do think it's important to account for the fact that Spieth used to be a top player; our current model doesn't do this, as rounds from more than a couple of years ago essentially receive no weight. We are working on a few different ideas to handle these cases, but it's a tricky problem to find a systematic solution to. In the case of Lee Westwood, I think the reason the market adjusted so quickly was that he had two massive outlier performances (gaining around 4, and then 3.5, strokes above expectation in back-to-back events). Westwood's performances at the API and The Players were his 1st and 3rd largest positive deviations from our model's expectation since 2015. I think a case could be made for making proportionally larger adjustments for larger deviations from expectation (e.g. adjust more than 2x when a player gains 4 strokes versus 2 strokes). Using some form of Bayesian updating, which uses the likelihood of something occurring to inform the size of the update, might achieve this, but it's not clear how we would go about it. However, it's also important to remember that while it was unlikely for Westwood specifically to have back-to-back outlier weeks like he did, it was not that unlikely that someone would do it. Just as it's exceedingly unlikely for a specific person to win the lottery, but someone has to win it. Given how Westwood has performed since those 2 weeks, it seems like there hasn't been an actual shift in his skill level; Spieth, on the other hand, appears to be back.

Third, it's worth going into a little more detail on the within-tournament updates to player skill levels. We haven't changed anything in our approach since the start of 2021, but it is interesting and we skimmed over it in the previous section. Compared to previous versions of our model, a player's skill level now has the potential to be updated much more as a result of their performance during a tournament. For example, performing 1 stroke above expectation off the tee in round 1 leads to a 0.12 stroke update in projected skill for round 2. (In all of this, there are several adjustments made to ensure that each golfer's strokes-gained category skill estimates add up to our estimate of what their overall skill is. This adjustment can be large if most of a player's data is from events without detailed SG data.) Deviations from expected performance in the other SG categories carry considerably less predictive weight (with APP > ARG > PUTT); however, there are also greater deviations from expectation in approach and putting performance than those seen off the tee. I think the reason off-the-tee performance in a single round can be so predictive is partly just because it is a lower-noise statistic, but also because it provides the most information about course fit. Driving distance and driving accuracy show the largest differences across courses in their course-specific predictive power. Assuming these effects operate mainly through off-the-tee play (although it likely happens through approach as well), it's reasonable to think that first-round performance is providing some information on the golfer's course fit (that we haven't captured in our pre-tournament fit adjustments). This information still comes with a lot of noise, but it does add to the signal we can glean from that single round. When strokes-gained category data are not available we only update skill levels based off of total strokes-gained.
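As a rough sketch of what a linear within-tournament update like the one described might look like: only the 0.12 off-the-tee coefficient is quoted above, the other per-category coefficients are hypothetical placeholders, and the renormalisation that keeps category skills summing to overall skill is omitted.

```python
# Hypothetical per-category update coefficients (strokes of skill update per
# stroke of deviation from expectation in one round). Only the off-the-tee
# figure (0.12) comes from the text; the rest are placeholders.
UPDATE_COEF = {"OTT": 0.12, "APP": 0.08, "ARG": 0.05, "PUTT": 0.03}

def update_category_skills(skill, observed, expected):
    """Return updated per-category skill estimates after one round.

    `skill`, `observed`, and `expected` are dicts keyed by SG category.
    """
    updated = {}
    for cat, coef in UPDATE_COEF.items():
        deviation = observed[cat] - expected[cat]
        updated[cat] = skill[cat] + coef * deviation
    return updated

# Example: a round that was 1 stroke better than expected off the tee.
skill = {"OTT": 0.5, "APP": 0.3, "ARG": 0.1, "PUTT": 0.0}
expected = {"OTT": 0.5, "APP": 0.3, "ARG": 0.1, "PUTT": 0.0}
observed = {"OTT": 1.5, "APP": 0.3, "ARG": 0.1, "PUTT": 0.0}
print(update_category_skills(skill, observed, expected))  # OTT rises by 0.12
```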

Fourth, weather is now explicitly accounted for in the model. We mainly focus on wind forecasts, as the effect of wind on performance is the most straightforward to predict (whereas rain could make a course play easier or harder, depending on the specific course setup). In our pre-tournament simulations each player receives an adjustment to their expected strokes-gained per round based on the projected wind conditions for the time window that they'll be on the course. Simple enough. In our live simulations, we have to use both historical and forecasted weather conditions to predict how the remaining holes will play for each golfer. The historical part is necessary because it's not enough to know, for example, that Brian Harman is expected to face a 10mph wind over his next 9 holes; we also need to know the weather conditions from earlier in the day that led to the current scoring average on those holes. Our approach is to first adjust our historical scoring data to a set of baseline conditions (e.g. 5mph wind). This includes data from past years, rounds, waves, and of course the scoring data from earlier in the day. We then use pre-fit models that take as inputs all data from before the current wave of golfers. Then, we use an explicitly Bayesian update to combine this start-of-wave prediction with the live scoring data from that wave. This is easy to do if we assume our start-of-wave projection is normally distributed and also that golf scores are normal. The result is a live estimate of each hole's difficulty under the baseline weather conditions which, given the normality assumption, is encapsulated in a mean (i.e. the expected scoring average for the hole) and a variance (i.e. the uncertainty around this expectation). Returning to the Harman example, all that is left is to make a scoring adjustment (e.g. 0.05 strokes per 1mph of wind) that accounts for the difference between Harman's forecasted conditions and the baseline conditions. (For some intuition on thinking through how this fits into a simulation, see this FAQ section.)

The final piece of this puzzle was added very recently, and that was factoring in player skill when estimating course difficulty. Just as with weather, we can adjust the observed scoring averages for each hole to reflect what we would expect from some baseline golfer; then, to project Brian Harman's performance, we simply add in an adjustment to account for the difference between Harman's skill and that of our baseline golfer. (Previously, we would assume that observed scoring averages came from a group of players with an average skill level equal to the field's average, which is generally an OK assumption except when there are only a few players on the course.) The impact of these changes to the live model is most noticeable when projecting the cutline, but they also matter for projecting any outcome of a golf tournament when the players involved have differing numbers of holes remaining or will be playing their remaining holes at different times. With these adjustments, our live model is now pretty good at projecting both the expected course conditions and the uncertainty around those conditions.
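Here is a minimal sketch of the normal-normal (conjugate) update described above, combining a start-of-wave projection of a hole's baseline-condition scoring average with live scores from the current wave. The specific numbers (projection mean and variance, per-score variance, wind coefficient) are illustrative, not the model's actual values.

```python
def update_hole_difficulty(prior_mean, prior_var, live_scores, score_var):
    """Combine a start-of-wave projection with live scoring data.

    Assumes the projection and individual hole scores are normal with known
    variances; returns the posterior mean and variance of the hole's expected
    score under baseline conditions.
    """
    n = len(live_scores)
    if n == 0:
        return prior_mean, prior_var
    sample_mean = sum(live_scores) / n
    # Precision-weighted combination (the standard conjugate normal update).
    prior_prec = 1.0 / prior_var
    data_prec = n / score_var
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * sample_mean)
    return post_mean, post_var

# Example: we projected a 4.15 scoring average (sd 0.10) for a hole under
# baseline conditions; 20 players from the current wave have averaged 4.30.
mean, var = update_hole_difficulty(4.15, 0.10 ** 2, [4.3] * 20, 0.75 ** 2)
# Adjust for a golfer facing 5mph more wind than baseline, at 0.05 strokes/mph.
projected_score = mean + 0.05 * 5
print(round(mean, 3), round(var, 5), round(projected_score, 3))
```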

Fifth and finally, we are starting to incorporate shot-level information into the model. One simple adjustment we are currently making relates to hole outs: we decrease a player's strokes-gained in a given round by 1 stroke for every shot they hole out from a distance above some (arbitrarily chosen) threshold. The idea here is that it is entirely down to luck whether a shot ends up an inch from the hole or in it, yet the difference in strokes-gained between these two scenarios is a full stroke. The effect of these adjustments on golfers' predicted skill levels is small, rarely exceeding 0.05 strokes. For a little more detail on this, see the 4th number of this ITN.
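A minimal sketch of the hole-out adjustment, assuming a hypothetical 50-yard cutoff (the text only says the threshold is arbitrarily chosen):

```python
# Hypothetical distance threshold (yards) beyond which a hole-out is treated
# as luck; the actual cutoff used is arbitrary and not specified here.
HOLE_OUT_THRESHOLD = 50

def adjusted_strokes_gained(round_sg, holed_shot_distances):
    """Subtract one stroke for every hole-out from beyond the threshold."""
    lucky_hole_outs = sum(1 for d in holed_shot_distances if d > HOLE_OUT_THRESHOLD)
    return round_sg - 1.0 * lucky_hole_outs

# Example: a +3.0 SG round that included a holed 160-yard approach shot.
print(adjusted_strokes_gained(3.0, [160, 3]))  # -> 2.0
```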


Using ROC Curves

Threshold Selection

It is immediately apparent that a ROC curve can be used to select a threshold for a classifier which maximises the true positives, while minimising the false positives.

However, different types of problems have different optimal classifier thresholds. For a cancer screening test, for example, we may be prepared to put up with a relatively high false positive rate in order to get a high true positive rate, since it is most important to identify possible cancer sufferers.

For a follow-up test after treatment, however, a different threshold might be more desirable, since we want to minimise false negatives: we don't want to tell a patient they're clear if this is not actually the case.
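As a sketch of threshold selection from a ROC curve, here is scikit-learn's roc_curve applied to toy data. Youden's J (maximising TPR minus FPR) is just one common criterion; for a screening test one might instead fix a minimum TPR and accept the resulting FPR.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy ground-truth labels (1 = positive) and classifier scores.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J: pick the threshold that maximises TPR - FPR.
best = np.argmax(tpr - fpr)
print("threshold:", thresholds[best], "TPR:", tpr[best], "FPR:", fpr[best])
```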

Performance Assessment

ROC curves also give us the ability to assess the performance of the classifier over its entire operating range. The most widely-used measure is the area under the curve (AUC). As you can see from Figure 2, the AUC for a classifier with no power, essentially random guessing, is 0.5, because the curve follows the diagonal. The AUC for that mythical being, the perfect classifier, is 1.0. Most classifiers have AUCs that fall somewhere between these two values.

An AUC of less than 0.5 might indicate that something interesting is happening. A very low AUC might indicate that the problem has been set up wrongly: the classifier is finding a relationship in the data which is, essentially, the opposite of that expected. In such a case, inspection of the entire ROC curve might give some clues as to what is going on: have the positives and negatives been mislabelled?
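A quick sketch of computing the AUC on toy data, along with one way to sanity-check a suspiciously low value (scoring with the labels flipped gives 1 minus the AUC):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, as in the threshold-selection sketch above.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

auc = roc_auc_score(y_true, y_score)
print("AUC:", auc)

# An AUC well below 0.5 suggests the scores rank negatives above positives;
# recomputing with flipped labels (which yields 1 - AUC) is a quick check
# for accidentally swapped classes.
if auc < 0.5:
    print("Flipped-label AUC:", roc_auc_score(1 - y_true, y_score))
```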

Classifier Comparison

The AUC can be used to compare the performance of two or more classifiers. A single threshold can be selected and the classifiers’ performance at that point compared, or the overall performance can be compared by considering the AUC.

Most published reports compare AUCs in absolute terms: “Classifier 1 has an AUC of 0.85, and classifier 2 has an AUC of 0.79, so classifier 1 is clearly better”. It is, however, possible to calculate whether differences in AUC are statistically significant. For full details, see Hanley & McNeil (1982).
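As a sketch of such a significance test, the standard-error formula from Hanley & McNeil (1982) can be combined with a z-test for two AUCs estimated on independent samples. If both classifiers are scored on the same cases, a correlation correction (from their 1983 follow-up) is also needed, which this sketch omits; the sample sizes and AUC values below are illustrative.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC via the Hanley & McNeil (1982) formula."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    return math.sqrt(
        (auc * (1 - auc)
         + (n_pos - 1) * (q1 - auc ** 2)
         + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    )

# Example: two classifiers evaluated on independent samples of
# 100 positives and 100 negatives each.
se1 = hanley_mcneil_se(0.85, 100, 100)
se2 = hanley_mcneil_se(0.79, 100, 100)
z = (0.85 - 0.79) / math.sqrt(se1 ** 2 + se2 ** 2)
print("z =", round(z, 2))  # compare against the standard normal distribution
```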

