I wrote my layman’s introduction to scoring a while ago now and never delivered the promised more in-depth articles. This is the first in a series of articles correcting that oversight. The team at Scorto has very kindly provided me with a white paper on scorecard building, which I will break into sections and reproduce here. In this first article, I’ll look into reject inference, a topic that has been asked about before.
One of the inherent problems with a scorecard is that while you can easily test whether you made the right decision in accepting an application, it is much harder to know whether you made the right decision in rejecting one. In the day-to-day running of a business this might not seem like much of a problem, but it is dangerous in two ways:
· it can limit the highly profitable growth opportunities around the cut-off point by hiding any segmenting behaviour a characteristic might have; and
· it can lead to a point where the data available for creating new scorecards represents only a portion of the population likely to apply. As this portion is disproportionately ‘good’, it can cause future scorecards to under-estimate the risk present in a population.
Each application provides a lender with a great deal of characteristic data: age, income, bureau score, etc. That application data is expensive to acquire, but of limited value until it is connected with behavioural data. When an application is approved, that value-adding behavioural data follows as a matter of course and comes cheaply: did the customer of age x, with an income of y and a bureau score of z, go “bad” or not? Every application that is rejected gets no such data, unless we go out of our way to get it; and that’s where reject inference comes into play.
The general population in a market will have an average probability of bad that is influenced by various national and economic characteristics, but is generally stable. A smaller sub-population will make up the total population of applicants for any given loan product; the average probability of bad in this applicant population will rise and fall more easily depending on marketing and product design. It is the risk of that total population of applicants that a scorecard should aim to understand. However, the data from existing customers is not a full reflection of that population: it has been filtered through the approval process and stripped of a lot of its bads. Very often, the key data problem in a scorecard build is the lack of information on “bad”, since that is what we are trying to model: the probability that an application with a given set of characteristics will end up “bad”. The more conservative the scoring strategy in question, the more the data will become concentrated in the better score quadrants and the weaker it will become for future scorecard builds.
Clearly we need a way to bring back that information. Just because the rejected applications were too risky to approve doesn’t mean they’re too risky to add value in this exercise. We do this by combining the application data of the rejected applicants with external data sources or proxies. The main difficulty with this approach is the unavailability and/or inconsistency of the data, which may make it difficult to classify an outcome as “good” or “bad”. A number of methods can be used to infer the performance of rejected applicants.
Simple Augmentation
Not all rejected applications would have gone bad. We knew this at the time we rejected them; we just knew that too few would stay good to compensate for those that did go bad. So while a segment of applications with a 15% probability of bad might be deemed too risky to approve, 85% of them would still have been good accounts. Using that knowledge, we can bring the rejected applications back into the data exercise.
· A base scoring model is built using data from the borrowers whose behavior is known – the previously approved book.
· Using the developed model, the rejected applications are scored and the percentage of “bad” borrowers among them is estimated; that “bad” performance is then assigned at random, but in proportion to those scores, across the rejected applications.
· The cut-off point should be set in accordance with the rules of the current lending policy that define the permissible level of bad borrowers.
· Information on the rejected and approved requests is merged and the resulting set is used to build the final scoring model.
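To make those steps concrete, here is a minimal sketch of simple augmentation in Python, using scikit-learn’s logistic regression as a stand-in for the scorecard model. The dataframe and column names (approved, rejected, bad, and the characteristic list) are illustrative assumptions rather than anything prescribed by the paper, and the score-weighted random assignment is one reasonable reading of “at random but in proportion”.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative characteristic names; substitute your own application data.
FEATURES = ["age", "income", "bureau_score"]

def simple_augmentation(approved: pd.DataFrame, rejected: pd.DataFrame,
                        seed: int = 42) -> pd.DataFrame:
    """Infer good/bad outcomes for rejects and merge them with the approved book."""
    # 1. Base model built on the previously approved book (known outcomes).
    base = LogisticRegression(max_iter=1000)
    base.fit(approved[FEATURES], approved["bad"])

    # 2. Score the rejects and estimate how many of them should be "bad".
    p_bad = base.predict_proba(rejected[FEATURES])[:, 1]
    n_bad = int(round(p_bad.sum()))

    # 3. Assign "bad" at random, but in proportion to each reject's score,
    #    so riskier rejects are more likely to receive the "bad" label.
    rng = np.random.default_rng(seed)
    bad_idx = rng.choice(rejected.index, size=n_bad, replace=False,
                         p=p_bad / p_bad.sum())
    inferred = rejected.copy()
    inferred["bad"] = 0
    inferred.loc[bad_idx, "bad"] = 1

    # 4. Merge approved and inferred-reject data for the final scorecard build.
    return pd.concat([approved, inferred], ignore_index=True)
```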
Accept/Reject Augmentation
The basis of this method is to correct the weights of the base scoring model by taking into consideration the likelihood of each request‘s approval.
· The first step is to build a model that evaluates the likelihood of a request‘s approval or rejection.
· The weights of the characteristics are then adjusted taking into consideration the likelihood of the request‘s approval or rejection, determined during the previous step. This is done so that the resulting weights are inversely proportional to the likelihood of the request‘s approval. So, for example, if the original approval rate was 50% in a certain cluster, then each approved record is replicated to stand in for itself and the one that was rejected (see the sketch after this list).
· This method is preferable to the Simple Augmentation method, but it is not without its own drawbacks. Two key problems can be created by augmentation: the impact of small and unusual groups can be exaggerated (such as low-side overrides for VIP clients); and, because you have only modelled on approved accounts, the approval rates will be either 0% or 100% in each node.
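Here is a hedged sketch of that weighting step, again in Python with scikit-learn. The dataframe names, the characteristic list and the use of a logistic regression for the accept/reject model are illustrative assumptions, not the paper’s prescribed implementation.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = ["age", "income", "bureau_score"]  # illustrative names

def acceptance_weights(approved: pd.DataFrame,
                       rejected: pd.DataFrame) -> pd.Series:
    """Weight each approved record by the inverse of its estimated
    probability of approval, so it also stands in for similar rejects."""
    # 1. Model the historical approve/reject decision on all applications.
    all_apps = pd.concat([approved.assign(approved=1),
                          rejected.assign(approved=0)], ignore_index=True)
    approval_model = LogisticRegression(max_iter=1000)
    approval_model.fit(all_apps[FEATURES], all_apps["approved"])

    # 2. Inverse-probability weights: a cluster with a 50% approval rate
    #    gives each approved record a weight of roughly 2, standing in
    #    for itself and one similar rejected application.
    p_approve = approval_model.predict_proba(approved[FEATURES])[:, 1]
    return pd.Series(1.0 / p_approve, index=approved.index, name="weight")

# Usage: feed the weights into the final good/bad model, e.g.
# weights = acceptance_weights(approved, rejected)
# final_model = LogisticRegression(max_iter=1000)
# final_model.fit(approved[FEATURES], approved["bad"], sample_weight=weights)
```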
Fuzzy Augmentation
The distinguishing feature of this method is that each rejected request is split and used twice, to reflect the likelihoods of both the good and the bad outcomes. In other words, if a rejected application has a 15% probability of going bad, it is split so that 15% of its weight is assumed to go bad and 85% is assumed to stay good (a sketch follows the steps below).
· Classification
– Evaluation of a set of the rejected requests is performed using a base scoring model that was built on requests with a known status;
– The likelihood of a default, p(bad), and that of the “good” outcome, p(good), are determined based on the set cut-off point, which defines the required percentage of “bad” requests (p(bad) + p(good) = 1);
– Two records, corresponding to the likelihoods of the “good” and “bad” outcomes, are formed for each rejected request;
– Evaluation of the rejected requests is performed taking into consideration the likelihood of the two outcomes: the record that corresponds to the “good” outcome is assigned the weight p(good), and the record that corresponds to the “bad” outcome is assigned the weight p(bad).
· Clarification
– The data on the approved requests is merged with the data on the rejected requests, and the rating of each request is adjusted taking into consideration the likelihood of its approval. For example, the frequency of the “good” outcome for a rejected request is evaluated as the “good” outcome multiplied by its weight coefficient.
– The final scoring model is built based on the combined data set.
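The splitting and weighting can be sketched as follows; as before, the dataframe and column names are illustrative assumptions, and a weighted logistic regression stands in for whatever scorecard-fitting method is actually used.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = ["age", "income", "bureau_score"]  # illustrative names

def fuzzy_augmentation(approved: pd.DataFrame,
                       rejected: pd.DataFrame) -> pd.DataFrame:
    """Split each rejected request into a weighted 'good' and 'bad' record."""
    # 1. Base model built on requests with a known status.
    base = LogisticRegression(max_iter=1000)
    base.fit(approved[FEATURES], approved["bad"])
    p_bad = base.predict_proba(rejected[FEATURES])[:, 1]  # p(good) = 1 - p(bad)

    # 2. Two partial records per reject, weighted by p(good) and p(bad).
    good_half = rejected.copy()
    good_half["bad"] = 0
    good_half["weight"] = 1.0 - p_bad

    bad_half = rejected.copy()
    bad_half["bad"] = 1
    bad_half["weight"] = p_bad

    # 3. Approved records keep their observed outcome with full weight.
    known = approved.copy()
    known["weight"] = 1.0

    # 4. Combined data set for the final scoring model.
    return pd.concat([known, good_half, bad_half], ignore_index=True)

# Usage: fit the final scorecard on the combined, weighted data, e.g.
# combined = fuzzy_augmentation(approved, rejected)
# final_model = LogisticRegression(max_iter=1000)
# final_model.fit(combined[FEATURES], combined["bad"],
#                 sample_weight=combined["weight"])
```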
Reject inference is not a single silver bullet. Used inexpertly, it can lead to less accurate rather than more accurate results. Wherever possible, it is better to augment the exercise with a test-and-learn experiment to understand the true performance of small portions of key rejected segments. A new scorecard can then be built on the data from this test segment alone, and the true bad rates from that model can be compared with, and averaged against, those from the reject inference model to get a more reliable bad rate for the rejected population.