A Machine Learning Approach to Selecting P2P Loans

The growth of P2P lending has brought with it an exciting opportunity for investors to create and deploy their own credit scoring models when building their loan portfolios. The appeal is obvious, due in large part to the investor’s ability to refine and target profitable “pockets” of loans that simple combinations of filters are unlikely to isolate. In this paper, we demonstrate how to build a traditional credit model to predict P2P loan defaults and, along the way, introduce some basic “machine learning” concepts.

Some statistics

The dataset for this example is from the Prosper lending platform and includes information from 2005 (inception of the platform) through 2014. We use the “LoanStatus” indicator to define a default as any loan classified as “Chargedoff”, “Defaulted” or “PastDue”.

We exclude all loans originated after January 2014 (in order to allow for a sufficient performance window).
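For readers following along in R, here is a minimal sketch of how the default flag and date filter could be built. The file name is a placeholder, and the “LoanStatus” / “LoanOriginationDate” column names are taken from the publicly available Prosper export, so they may differ slightly depending on your data vintage.

```r
# A minimal sketch of the default flag and date filter, assuming the raw
# Prosper export has "LoanStatus" and "LoanOriginationDate" columns.
# "ProsperLoanData.csv" is a placeholder file name.
loans <- read.csv("ProsperLoanData.csv")

# Flag a loan as a "default" if its status is charged off, defaulted or past due
loans$Default <- as.integer(grepl("Chargedoff|Defaulted|Past ?Due",
                                  loans$LoanStatus))

# Keep only loans originated through January 2014 so that every loan
# has had enough time to perform (or not)
loans <- loans[as.Date(loans$LoanOriginationDate) < as.Date("2014-02-01"), ]
```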

As shown in the figures below, a simple exploratory analysis shows some relationships which you’d expect (for example, those with a lower FICO score also have a higher propensity to default):

…as well as other relationships which might be more nuanced:


Create train & test datasets
We start by creating a “training” dataset consisting of a randomly selected 60% of the data, drawn across all years. This dataset will be used to build our models of interest, which will then be tested (out of sample) on the remaining 40%. We will evaluate the efficacy of our models using what’s known as the Receiver Operating Characteristic curve (or ROC curve), which simply plots the true positive rate against the false positive rate. A model that is no better than a coin flip will produce a straight line at a 45-degree angle; the more accurate our model, the more the curve will “bend” towards the upper left of the graph.
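In R, the split might look something like this (continuing from the loans data frame sketched above; the seed is arbitrary and only there for reproducibility):

```r
# A sketch of the 60/40 train/test split
set.seed(42)                                              # arbitrary, for reproducibility
train_idx <- sample(nrow(loans), size = floor(0.6 * nrow(loans)))
train <- loans[train_idx, ]    # ~60%: used to fit the models
test  <- loans[-train_idx, ]   # ~40%: held out for out-of-sample evaluation
```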

Using Recursive Partitioning Trees
A “traditional” approach to credit modeling would start with a univariate analysis of each variable (like the figures shown above) in an effort to evaluate its discriminatory power (using Kolmogorov-Smirnov or Chi-square tests, for example). Candidate variables would then be introduced into a logistic regression in a step-wise fashion, with statistical significance assessed at each “step” and the variable either included or excluded. At each stage, the overall predictive power of the model (and the marginal change) is assessed, and the process ends when no remaining variable adds a sufficient increase in explanatory power.

Decision trees are one of the most popular methods for classification problems and will be the focus of this analysis. First, we create a simple decision tree using the rpart (Recursive PARTitioning) package in R. This approach builds a “tree” by identifying the variable that does the “best” job of classifying our training dataset into our two groups of interest (we’ll define “best” later, but suffice it to say this is a large body of work in itself). The same process is then applied to each new subgroup until no improvement can be made (or the subgroups become too “small” – as defined by the user).
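A minimal rpart fit might look like the sketch below. The handful of predictors named in the formula is purely illustrative; the full model draws on all of the candidate variables described next.

```r
library(rpart)

# A minimal classification tree on the training set. The predictor list here
# is illustrative only -- the model discussed in the text considers all of
# the candidate fields.
tree1 <- rpart(factor(Default) ~ CreditScoreRangeMid + BorrowerAPR +
                 EmploymentStatus + EmploymentStatusDuration,
               data = train, method = "class")

print(tree1)                            # text listing of the splits
plot(tree1); text(tree1, use.n = TRUE)  # quick-and-dirty plot of the tree
```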

Our training dataset contains roughly 61k records, of which approximately 18.7% (roughly 11.5k) have defaulted. For each record, we have the following 27 data points which we will use to build a model for classifying defaults:

The Output
The first node in the tree is known as the “root node” and it’s simply giving us information about the population before any splitting decisions have been made. This is telling us that we have roughly 62,000 observations in our dataset (more precisely, 61,703), which represents 100% of the total population, of which 19% are classified into our indicator of interest (default):

The next step in the decision tree’s process is to find the variable (out of our group of 27) that best discriminates between defaulters and non-defaulters. The “best” split is the one that separates the data into two groups that each demonstrate maximum homogeneity. The concept of “homogeneity” of groups (or nodes) can be thought of in the context of “impurity”. When all the members of a particular node belong to the same group (i.e. “defaulters”), the node is said to be “pure”. As the proportion of the other group increases (i.e. “non-defaulters”), the node becomes increasingly “impure”. The “goodness” of a split can be measured by the decrease in impurity that it produces.
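To make the idea concrete, here is a rough sketch of how a node’s impurity and the improvement from a candidate split could be computed by hand, using the Gini index (introduced next) as the impurity measure:

```r
# Gini impurity of a node: 1 - sum of squared class proportions.
gini <- function(labels) {
  p <- prop.table(table(labels))   # class proportions within the node
  1 - sum(p^2)                     # 0 = perfectly pure, higher = more mixed
}

# "Goodness" of a split = parent impurity minus the size-weighted
# impurity of the two child nodes it creates
split_gain <- function(parent, left, right) {
  n <- length(parent)
  gini(parent) - (length(left) / n) * gini(left) -
                 (length(right) / n) * gini(right)
}
```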

There are a number of ways to measure the “impurity” of child nodes and thus many different splitting rules. For simplicity’s sake, we will use the most common one, the Gini splitting rule.1 One of the appeals of the Gini splitting rule is that it works well with noisy data. Using it, our decision tree iterates through all 27 variables and looks for the split of the root node into two child nodes that are each as “pure” as possible. The decrease in impurity is measured for each candidate, and the top 5 variables are shown below:


CreditScoreRangeMid (the midpoint of the borrower’s reported credit score range) was identified as the variable that did the best job of separating our population into defaulters vs. non-defaulters:

The subset of our population with an average credit score below 640 (roughly 9,000 borrowers, or 15% of the total population) went on to default at a rate of 44%. Those in the other group (credit score >= 640) defaulted at a rate of only 15%. As one would expect, credit score does a great job of segregating our population into two relatively homogeneous groups.

Now, the decision tree goes through exactly the same logic on each child node until we have “grown” a full tree:

So, after introducing two additional splitting variables, the tree stopped growing after partitioning our root node into 5 “terminal” nodes. Without getting into details, this stopping behaviour is governed by various “complexity” parameters, which do things like enforce a minimum number of observations in each terminal node (so we don’t end up with 60k terminal nodes containing only 1 borrower each). A sketch of these settings is shown below.
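For reference, these knobs live in rpart.control; the values shown here are the package defaults rather than anything tuned specifically for this dataset:

```r
library(rpart)

# rpart's stopping behaviour is governed by a handful of control parameters;
# the values below are the package defaults.
ctrl <- rpart.control(minsplit  = 20,    # don't try to split nodes smaller than this
                      minbucket = 7,     # minimum observations allowed in a terminal node
                      cp        = 0.01,  # minimum improvement required to keep a split
                      maxdepth  = 30)    # hard cap on the depth of the tree

# tree1 above was fit with these defaults; passing control = ctrl simply makes them explicit
```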

One of the big appeals of decision trees is their ability to find interesting combinations of variables that do a good job of classifying. In this case, we see that even among borrowers whose credit score is 640 or above, there is a subset that still goes on to default at a rate of 29%. Specifically, the group whose employment status was anything other than “Employed” or “Other” AND who got a Prosper loan at a rate GREATER than 14%:

If we now look at how the decision tree performs on the out-of-sample dataset, we see that this simple tree, using only 3 variables, performs significantly better than a random guess:
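A sketch of this out-of-sample scoring, using the pROC package to draw the curve (any ROC utility would do):

```r
library(pROC)  # for ROC curves and AUC

# Score the hold-out set with the tree fit earlier and compare predicted
# default probabilities against the observed outcomes.
test_prob1 <- predict(tree1, newdata = test, type = "prob")[, "1"]
roc1 <- roc(test$Default, test_prob1)
plot(roc1)   # the further the curve bends from the 45-degree line, the better
auc(roc1)
```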

If we relax our “complexity” parameters to allow the tree to “grow” further we can introduce more variables (BorrowerState, InquiriesLast6Months) and more splits. Now our initial population is being classified into 8 terminal nodes instead of only 5:
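In rpart terms, relaxing the tree’s growth just means loosening the control settings and re-fitting; the thresholds below are illustrative rather than the exact values behind the tree shown here:

```r
library(rpart)
library(pROC)

# Re-fit with a relaxed complexity parameter so the tree keeps splitting.
# The cp / minbucket values are illustrative.
tree2 <- rpart(factor(Default) ~ CreditScoreRangeMid + BorrowerAPR +
                 EmploymentStatus + EmploymentStatusDuration +
                 BorrowerState + InquiriesLast6Months,
               data = train, method = "class",
               control = rpart.control(cp = 0.001, minbucket = 500))

test_prob2 <- predict(tree2, newdata = test, type = "prob")[, "1"]
plot(roc(test$Default, test_prob2))  # "version 2" of the tree, out of sample
```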

Observing how version 2 of our decision tree performs out of sample, we see a definite improvement in the ability to discriminate between “defaulters” and “non-defaulters”:

So, allowing the tree to grow a bit more added predictive power without overfitting the training dataset. One drawback of decision trees, though, is that they are unable to “change their mind” as new nodes are created; perhaps we could have increased the power of the model by splitting on BorrowerAPR earlier in the tree. This is where “random forests” come into play.

Using Random Forests
Random forests are what’s known as an ensemble learning method for classification. They work by constructing many decision trees, each built from a random sample of the data and a random subset of the explanatory variables, and then averaging their predictions across the entire group – as opposed to spending a lot of time and effort hand-crafting one “deep” decision tree. The idea is that the errors of the individual trees will largely cancel out, leaving us with a powerful model. Random forests have gained in popularity because they are quick and easy to implement and generally robust across datasets (i.e. not as prone to overfitting as a single “deep” tree).

For this analysis, we create 2000 decision trees using random combinations of our explanatory variables. The results are shown below.
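(A sketch of how such a forest could be fit with the randomForest package, including the variable-importance output discussed next, is below; the predictor list is again illustrative rather than the full 27-variable set.)

```r
library(randomForest)

# randomForest needs categorical predictors as factors and cannot handle
# missing values, so we prepare a complete-case copy of the training data.
rf_vars  <- c("CreditScoreRangeMid", "BorrowerAPR", "EmploymentStatus",
              "EmploymentStatusDuration", "InquiriesLast6Months", "BorrowerState")
rf_train <- na.omit(train[, c("Default", rf_vars)])
rf_train[] <- lapply(rf_train, function(v) if (is.character(v)) factor(v) else v)

rf <- randomForest(x = rf_train[, rf_vars],
                   y = factor(rf_train$Default),
                   ntree = 2000,          # 2000 trees, as in the text
                   importance = TRUE)     # compute both importance measures

importance(rf)   # MeanDecreaseAccuracy and MeanDecreaseGini, per variable
varImpPlot(rf)
```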
The first table measures the accuracy of the models: specifically, how much worse the model performs when the information in each variable is taken away (by randomly permuting its values). So already, we see a huge decrease in accuracy when BorrowerAPR is removed:

The second table reports the Gini measure (our measure of “node impurity” from earlier) and shows the total increase in node purity attributable to splits on each variable. This time, we again see BorrowerAPR at the top of the list, but CreditScore and EmploymentStatusDuration are right behind it in terms of importance:

Finally, we take the predictions across all 2000 of our randomly created decision trees and see how their collective predictive power compares:
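Pulling that together in R might look like the sketch below (continuing from the earlier blocks, and assuming the factor levels in the hold-out sample match those seen in training):

```r
library(pROC)

# Out-of-sample probabilities from the forest are essentially the share of
# the 2000 trees "voting" default for each borrower.
rf_test <- na.omit(test[, c("Default", rf_vars)])
rf_test[] <- lapply(rf_test, function(v) if (is.character(v)) factor(v) else v)

rf_prob <- predict(rf, newdata = rf_test, type = "prob")[, "1"]

# Overlay the forest's ROC curve on the single-tree curve from before
plot(roc(rf_test$Default, rf_prob), col = "blue")    # random forest
lines(roc(test$Default, test_prob2), col = "red")    # single tree (version 2)
legend("bottomright", legend = c("random forest", "decision tree"),
       col = c("blue", "red"), lwd = 1)
```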

As you can see, the result is compelling – and appealing because (without divulging too much) it does a good job of capturing the non-linear interaction effects between variables that are difficult to identify with more traditional methods.

Conclusion
So, as can be seen, there’s a lot of valuable information to be leveraged in the P2P space that can put you leaps and bounds ahead of the game compared with simply filtering on a handful of variables, and the nature of this data lends itself well to more sophisticated methods. So invest with intelligence!