by Fang Zhou, Data Scientist, and Graham Williams, Director of Data Science, both at Microsoft
Credit risk scoring is a classic but increasingly important operation in banking. In an industry known for fierce competition, and in the wake of the global financial crisis, banks have become far more cautious about risk when lending for mortgages, credit card payments, or other commercial purposes. With an accurate credit risk scoring model, a bank can predict the likelihood of default on a transaction. This in turn helps it evaluate the potential risk of lending money to consumers, mitigate losses due to bad debt, determine who qualifies for a loan, at what interest rate and with what credit limit, and even identify which customers are likely to bring in the most revenue across a variety of products.
Many banks are now driving innovation to enhance risk management. For example, one of the largest banks by market capitalization in an Asian country is exploring how to segment its population of a million active credit card customers, to improve risk scoring and then identify opportunities to offer increased limits. Using advanced analytics for credit risk scoring involves traditional scorecard building and modelling and extends to machine learning and ensemble methods. It can also involve innovations such as customer-oriented aggregation of transactions, multi-dimensional customer segmentation, and conceptual clustering to identify multiple segments across which to understand bank customers.
A data-driven credit risk prediction model normally draws on two types of data.
- Transaction data: Transaction records cover transaction id, account id, transaction date, transaction amount, merchant industry, etc. This transaction-level data can be dynamically aggregated to provide transaction statistics and financial behavior information at the account level.
- Demographic and bank account information: This type of data describes the characteristics of an individual customer, account, or credit bureau record, such as age, sex, income, and credit limit. These values are static: they either never change or only increment deterministically over time (as age does).
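The roll-up from transaction level to account level, followed by a join with the static account attributes, can be sketched as follows. The accelerator itself is written in R; this is a minimal Python illustration using only the standard library, and every field name, account id, and value in it is hypothetical rather than taken from the accelerator's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transaction-level records (field names are illustrative only).
transactions = [
    {"transaction_id": 1, "account_id": "A1", "amount": 120.0, "merchant_industry": "grocery"},
    {"transaction_id": 2, "account_id": "A1", "amount": 80.0, "merchant_industry": "travel"},
    {"transaction_id": 3, "account_id": "A2", "amount": 500.0, "merchant_industry": "electronics"},
]

# Hypothetical static, account-level attributes.
accounts = {
    "A1": {"age": 34, "income": 52000, "credit_limit": 5000},
    "A2": {"age": 51, "income": 87000, "credit_limit": 12000},
}

# Statistics used for accounts that have no transactions at all.
EMPTY_STATS = {"n_transactions": 0, "total_amount": 0.0, "mean_amount": 0.0, "max_amount": 0.0}


def aggregate_by_account(rows):
    """Roll transaction rows up to one summary row per account."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row["account_id"]].append(row["amount"])
    return {
        account_id: {
            "n_transactions": len(values),
            "total_amount": sum(values),
            "mean_amount": mean(values),
            "max_amount": max(values),
        }
        for account_id, values in amounts.items()
    }


def build_features(rows, account_info):
    """Join aggregated transaction statistics with static account attributes."""
    stats = aggregate_by_account(rows)
    return {
        account_id: {**static, **stats.get(account_id, EMPTY_STATS)}
        for account_id, static in account_info.items()
    }


features = build_features(transactions, accounts)
```

Each entry of `features` is one row per account, which is the shape a scoring model would typically be trained on. A real pipeline would usually also aggregate over time windows and merchant categories.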
The following graphic shows the data schema and the workflow for credit card fraud risk prediction.
In recent years, R has been gaining in popularity over SAS among statisticians and data scientists for solving a variety of industry problems, including those in financial services and risk management. A data science accelerator for credit risk prediction is now shared in a GitHub repository. This accelerator consists of four R templates, which walk through the process of model development, scale-up and speed-up, deployment, and application development.
- CreditRiskPrediction: Data-driven credit risk prediction in R, covering exploratory analytics, data aggregation, merging and cleansing, feature engineering and, most importantly, model building and evaluation.
- CreditRiskScale: Faster and more scalable credit risk models with Microsoft R Server, using the state-of-the-art machine learning algorithms provided by the MicrosoftML package.
- CreditRiskScale (part 2): Train multiple ML models with hyper-parameter selection in parallel by using rxExec().
- CreditRiskDeploy: Deploy a credit risk model as a web service with R Server Operationalization, leveraging the mrsdeploy package.
- CreditRiskShinyApp: A credit risk application that consumes the model through a REST API, integrated with the Shiny framework.
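The idea behind the second CreditRiskScale template, evaluating several hyper-parameter settings in parallel and keeping the best one, can be illustrated generically. The accelerator does this in R with rxExec(). The sketch below is not that code: it swaps in Python's `concurrent.futures`, a toy one-feature k-nearest-neighbours classifier, and made-up data, all of which are assumptions chosen only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data: (credit utilization, defaulted flag). Values are made up for illustration.
TRAIN = [(0.10, 0), (0.20, 0), (0.30, 0), (0.35, 1),
         (0.65, 0), (0.70, 1), (0.80, 1), (0.90, 1)]
VALID = [(0.12, 0), (0.33, 0), (0.66, 1), (0.88, 1)]


def knn_predict(x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(TRAIN, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def validation_accuracy(k):
    """Score one hyper-parameter setting on the held-out set."""
    hits = sum(knn_predict(x, k) == y for x, y in VALID)
    return hits / len(VALID)


def select_k(candidates):
    """Evaluate every candidate k concurrently and return (best_k, scores)."""
    with ThreadPoolExecutor() as pool:
        scores = dict(zip(candidates, pool.map(validation_accuracy, candidates)))
    # Prefer higher accuracy; break ties in favour of the smaller k.
    best_k = min(scores, key=lambda k: (-scores[k], k))
    return best_k, scores


best_k, scores = select_k([1, 3, 5])
```

Threads keep the sketch simple, but for CPU-bound model training in Python a process pool would usually be the more appropriate choice. The structure stays the same either way: one independent evaluation per hyper-parameter setting, followed by a selection step.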
You can find the data science accelerator for credit risk prediction on GitHub. The R scripts are provided as R Markdown files, which you can use to generate documents in formats including PDF, HTML, and Word. By no means is this the endpoint of the data science journey. The accelerator is under regular revision and improvement, and we welcome feedback via pull requests and issues.
GitHub (Microsoft): Data Science Accelerator - Credit Risk Prediction