Posts Tagged ‘Scorecards’

I wrote my layman’s introduction to scoring a while ago now and never delivered the promised more in-depth articles. This is the first in a line of articles correcting that oversight. The team at Scorto has very kindly provided me with a white paper on scorecard building, which I will break into sections and reproduce here. In the first of those articles, I’ll look into reject inference, a topic that has been asked about before.

One of the inherent problems with a scorecard is that while you can test easily test whether you made the right decision in accepting an application, it is less easy to know whether you made the right decision in rejecting an application. In the day-to-day running of a business this might not seem like much of a problem, but it is dangerous in two ways: · it can limit the highly profitable growth opportunities around the cut-off point by hiding any segmenting behaviour a characteristic might have; and · it can lead to a point where the data that is available for creating new scorecards represents only a portion of the population likely to apply. As this portion is disproportionately ‘good’ it can cause future scorecards to under-estimate the risk present in a population. Each application provides a lender with a great deal of characteristic data: age, income, bureau score, etc. That application data is expensive to acquire, but of limited value until it is connected with behavioural data. When an application is approved, that value-adding behavioural data follows as a matter of course and comes cheaply: did the customer of age x and with income of y and a bureau score of z go “bad” or not? Every application that is rejected gets no such data. Unless we go out of our way to get it; and that’s where reject inference comes into play.

The general population in a market will have an average probability of bad that is influenced by various national and economic characteristics, but generally stable. A smaller sub-population will make-up the total population of applicants for any given loan product –the average probability of bad in this total population will rise and fall more easily depending on marketing and product design. It is the risk of that total population of applicants that a scorecard should aim to understand. However, the data from existing customers is not a full reflection of that population. It has been filtered through the approval process it stripped of a lot of its bads. Very often, the key data problem when building a scorecard build is the lack of information on “bad” since that’s what we’re trying to model, the probability an application with a given set of characteristics will end up “bad”. The more conservative the scoring strategy in question, the more the data will become concentrated in the better score quadrants and the weaker it will become for future scorecard builds. Clearly we need a way to bring back that information. Just because the rejected applications were too risky to approve doesn’t mean they’re too risky to add value in this exercise. We do this by combining the application data of the rejected applicants with external data sources or proxies. The main difficulty related to this approach is the unavailability and/ or inconsistency of the data which may make it difficult to classify an outcome as “good” or “bad”. A number of methods can be used to infer the performance of rejected applicants.

Simple Augmentation
Not all rejected applications would have gone bad. We knew this at the time we rejected them, we just knew that too few would stay good to compensate for those that did go bad. So while a segment of applications with a 15% probability of bad might be deemed too risky, 85% of them would still be good accounts. Using that knowledge we can reconsider the rejected applications in the data exercise.

· A base scoring model is built using data from the borrowers whose behavior is known – the previously approved book.
· Using the developed model, the rejected applications are scored and an estimation is made of the percentage of “bad” borrowers and that performance is assigned at random but in proportion across the rejected applications.
· The cut-off point should be set in accordance with the rules of the current lending policy that define the permissible level of bad borrowers.
· Information on the rejected and approved requests is merged and the resulting set is used to build the final scoring model.

Accept/ Reject Augmentation
The basis of this method consists in the correction of the weights of the base scoring model by taking into consideration the likelihood of the request‘s approval.
· The first step is to build a model that evaluates the likelihood of a requests approval or rejection. · The weights of the characteristics are adjusted taking into consideration the likelihood of the request‘s approval or rejection, determined during the previous step. This is done so that the resulting scores are inversely proportional to the likelihood of the request‘s approval. So, for example, if the original approval rate was 50% in a certain cluster then each approved record is replicated to stand in for itself and the one that was rejected.
· This method is preferable to the Simple Augmentation method, but not without its own drawbacks. Two key problems can be created by augmentation: the impact of small and unusual groups can be exaggerated (such as low-side overrides for VIP clients) and then because you’ve only modeled on approved accounts the approval rates will be either 0% or 100% in each node.

Fuzzy Augmentation
The distinguishing feature of this method is that each rejected request is split and used twice, to reflect each of the likelihood of the good and bad outcomes. In other words, if a rejected application has a 15% probability of going bad it is split and 15% of the person is assumed to go bad and 85% assumed to stay good.
· Classification
Evaluation of a set of the rejected requests is performed using a base scoring model that was built based on requests with a known status;
– The likelihood of a default p(bad) and that of the “good” outcome p(good) are determined based on the set cut-off point, defining the required percentage of the “bad” requests (p(bad)+p(good)=1); – Two records that correspond to the likelihood of the “good” and “bad” outcomes are formed for each rejected request;
– Evaluation of the rejected requests is performed taking into consideration the likelihood of the two outcomes. Those accounts that fall under the likelihood of the “good” outcome are assigned with the weight p(good). The accounts that fall under the likelihood of the “bad” outcome are assigned with the weight p(bad).
· Clarification
– The data on the approved requests is merged with the data on the rejected requests and the rating of each request is adjusted taking into consideration the likelihood of the request‘s further approval. For example, the frequency of the “good” outcome for a rejected request is evaluated as the result of the “good” outcome multiplied by the weight coefficient.
– The final scoring model is built based on the combined data set.

Reject inference is no a single silver bullet. Used inexpertly it can lead to less accurate rather than more accurate results. Wherever possible, it is better to augment the exercise with a test-and-learn experiment to understand the true performance of small portions of key rejected segments. Then a new scorecard can be built based on the data from this new test segment alone and the true bad rates from that model can be compared and averaged to those from the reject inference model to get a more reliable bad rate for the rejected population.

Read Full Post »

First things first, I am by no means a scorecard technician. I do not know how to build a scorecard myself, though I have a fair idea of how they are built; if that makes sense. As the title suggests, this article takes a simplistic view of the subject. I will delve into the underlying mathematics at only the highest of levels and only where necessary to explain another point. This article treats scorecards as just another tool in the credit risk process, albeit an important one that enables most of the other strategies discussed on this blog. I have asked a colleague to write a more specialised article covering the technical aspects and will post that as soon as it is available.


Scorecards aim to replace subjective human judgement with objective and statistically valid measures; replacing inconsistent anecdote-based decisions with consistent evidence-based ones. What they do is essentially no different from what a credit assessor would do, they just do it in a more objective and repeatable way. Although this difference may seem small, it enables a large array of new and profitable strategies.

So what is a scorecard?

A scorecard is a means of assigning importance to pieces of data so that a final decision can be made regarding the underlying account’s suitableness for a particular strategy. They do this by separating the data into its individual characteristics and then assigning a score to each characteristic based on its value and the average risk represented by that value.

For example an application for a new loan might be separated into age, income, length of relationship with the bank, credit bureau score, etc. Then the each possible value of those characteristics will be assigned a score based on the degree to which they impact risk. In this example ages between 19 and 24 might be given a score of – 100, ages between 25 and 30 a score of -75 and so on until ages 50 and upwards are given a score of +10. In this scenario young applicants are ‘punished’ while older customers benefit marginally from their age. This implies that risk has been shown to be inversely related to age. The diagram below shows an extract of a possible scorecard:

The score for each of these characteristics is then added to reach a final score. The final score produced by the scorecard is attached to a risk measure; usually something like the probability of an account going 90 days into arrears within the next 12 months. Reviewing this score-to-risk relationship allows a risk manager to set the point at which they will decline applications (the cut-off) and to understand the relative risk of each customer segment on the book. The diagram below shows how this score-to-risk relationship can be used to set a cut-off.

How is a scorecard built?

Basically what the scorecard builder wants to do is identify which characteristics at one point in time are predictive of a given outcome before or at some future point in time. To do this historic data must be structured so that one period can represent the ‘present state’ and the subsequent periods can represent the ‘future state’. In other words, if two years of data is available for analysis (the current month can be called Month 0 and the last Month can be called Month -24) then the most distant six months (from Month -24 to Month -18) will be used to represent the ‘current state’ or, more correctly, the observation period while the subsequent months (Months -17 to 0) represent the known future of those first six months and are called ‘the outcome period’. The type of data used in each of these periods will vary to reflect these differences so that application data (applicant age, applicant income, applicant bureau score, loan size requested, etc.) is important in the observation period and performance data (current balance, current days in arrears, etc.) is important in the outcome period.

With this simple step completed the accounts in the observation period must be defined and sorted based on their performance during the outcome period. To start this process a ‘bad definition’ and ‘good definition’ must first be agreed upon. This is usually something like: ‘to be considered bad, an account must have gone past 90 days in delinquency at least once during the 18 month outcome period’ and ‘to be considered good an account must never have gone past 30 days in delinquency during the same period’. Accounts that meet neither definition are classified as ‘indeterminate’.

Thus separated, the unique characteristics of each group can be identified. The data that was available at the time of application for every ‘good’ and ‘bad’ account is statistically tested and those characteristics with largely similar values within one group but largely varying values across groups are valuable indicators of risk and should be considered for the scorecard. For example if younger customers were shown to have a higher tendency to go ‘bad’ than older customers, then age can be said to be predictive of risk. If on average 5% of all accounts go bad but a full 20% of customers aged between 19 and 25 go bad while only 2% of customers aged over 50 go bad then age can be said to be a strong predictor of risk. There are a number of statistical tools that will identify these key characteristics and the degree to which they influence risk more accurately than this but they won’t be covered here.

Once each characteristic that is predictive of risk has been identified along with its relative importance some cleaning-up of the model is needed to ensure that no characteristics are overly correlated. That is, that no two characteristics are in effect showing the same thing. If this is the case, only the best of the related characteristics will be kept while the other will be discarded to prevent, for want of a better term, double-counting. Many characteristics are correlated in some way, for example the older you are the more likely you are to be married, but this is fine so long as both characteristics add some new information in their own right as is usually the case with age and marital status – an older, married applicant is less risky than a younger, married applicant just as a married, older applicant is less risky than a single, older applicant. However, there are cases where the two characteristics move so closely together that the one does not add any new information and should therefore not be included.

So, once the final characteristics and their relative weightings have been selected the basic scorecard is effectively in place. The final step is to make the outputs of the scorecard useable in the context of the business. This usually involves summarising the scores into a few score bands and may also include the addition of a constant – or some other means of manipulating the scores – so that the new scores match with other existing or previous models.


How do scorecards benefit an organisation?

Scorecards benefit organisations in two major ways: by describing risk in very fine detail they allow lenders to move beyond simple yes/ no decisions and to implement a wide range of segmented strategies; and by formalising the lending decision they provide lenders with consistency and measurability.

One of the major weaknesses of a manual decisioning system is that it seldom does more than identify the applications which should be declined leaving those that remain to be accepted and thereafter treated as being the same. This makes it very difficult to implement risk-segmented strategies. A scorecard, however, prioritises all accounts in order of risk and then declines those deemed too risky. This means that all accepted accounts can still be segmented by risk and this can be used as a basis for risk-based pricing, risk-based limit setting, etc.

The second major benefit comes from the standardisation of decisions. In a manual system the credit policy may well be centrally conceived but the quality of its implementation will be dependent on the branch or staff member actually processing the application. By implementing a scorecard this is no longer the case and the roll-out of a scorecard is almost always accompanied by the reduction in bad rates.

Over-and-above these risk benefits, the roll-out of a scorecard is also almost always accompanied by an increase in acceptance rates. This is because manual reviewers tend to be more conservative than they need to be in cases that vary in some way from the standard. The nature of a single credit policy is such that to qualify for a loan a customer must exceed the minimum requirements for every policy rule. For example, to get a loan the customer must be above the minimum age (say 28), must have been with the bank for more than the minimum period (say 6 months) and must have no adverse remarks on the credit bureau. A client of 26 with a five year history with the bank and a clean credit report would be declined. With a scorecard in place though the relative importance of exceeding one criteria can be weighed against the relative importance of missing another and a more accurate decision can be made; almost always allowing more customers in.


Implementing scorecards

There are three levels of scorecard sophistication and, as with everything else in business, the best choice for any situation will likely involve a compromise between accuracy and cost.

The first option is to create an expert model. This is a manual approximation of a scorecard based on the experience of several experts. Ideally this exercise would be supported by some form of scenario planning tool where the results of various adjustments could be seen for a series of dummy applications – or genuine historic applications if these exist – until the results that meet the expectations of the ‘experts’. This method is better than manual decisioning since it leads to a system that looks at each customer in their entirety and because it enforces a standardised outcome. That said, since it is built upon relatively subjective judgements it should be replaced with a statistically built scorecard as soon as enough data is available to do so.

An alternative to the expert model is a generic scorecard. These are scorecards which have been built statistically but using a pool of similar though not customer-specific data. These scorecards are more accurate than expert models so as long as the data on which they were built reasonably resembles the situation in which they are to be employed. A bureau-level scorecard is probably the purest example of such a scorecard though generic scorecards exist for a range of different products and for each stage of the credit life-cycle.

Ideally, they should first be fine-tuned prior to their roll-out to compensate for any customer-specific quirks that may exist. During a fine-tuning, actual data is run through the scorecard and the results used to make small adjustments to the weightings given to each characteristic in the scorecard while the structure of the scorecard itself is left unchanged. For example, assume the original scorecard assigned the following weightings: -100 for the age group 19 to 24; -75 for the age group 25 to 30; -50 for the age group 31 to 40; and 0 for the age group 41 upwards. This could either be implemented as it is bit if there is enough data to do a fine-tune it might reveal that in this particular case the weightings should actually be as follows: -120 for the age group 19 to 24; -100 for the age group 25 to 30; -50 for the age group 31 to 40; and 10 for the age group 41 upwards. The scorecard structure though, as you can see, does not change.

In a situation where there is no client-specific data and no industry-level data exists, an expert model may be best. However, where there is no client-specific data but where there is industry-level data it is better to use a generic scorecard. In a case where there is both some client-specific data and some industry-level data a fine-tuned generic scorecard will produce the best results.

The most accurate results will always come, however, from a bespoke scorecard. That is a scorecard built from scratch using the client’s own data. This process requires significant levels of good quality data and access to advanced analytical skills and tools but the benefits of a good scorecard will be felt throughout the organisation.

Read Full Post »