Credit Risk Analysis Using Python

Credit risk analysis plays a crucial role in financial institutions, enabling them to assess the likelihood of borrowers defaulting on their loans. With the increasing availability of data and advancements in technology, data science has emerged as a powerful tool for analyzing credit risk. In this article, we will explore how Python, a popular programming language for data analysis, can be leveraged to conduct credit risk analysis efficiently and effectively.

Credit Risk Analysis: Understanding

Credit risk analysis is a critical process in financial institutions, involving the evaluation of borrowers’ creditworthiness. By assessing credit risk, lenders can make informed decisions regarding loan approvals, interest rates, and credit limits. This analysis is crucial for ensuring the stability and profitability of financial institutions.

Credit Risk Analysis: Overview and Importance

To perform credit risk analysis, several factors need to be considered, such as the borrower’s credit history, income, employment stability, and outstanding debts. Traditionally, this analysis was conducted manually, relying on subjective judgments and limited data. However, with the advent of data science and machine learning techniques, credit risk analysis has become more accurate and efficient.

Credit Risk Analysis: Leveraging Python

Python, a versatile programming language, offers a wide range of libraries and tools that facilitate credit risk analysis. Let’s explore some of the key benefits and tools that Python provides for this purpose.

Credit Risk Analysis: Python Benefits

Pandas:

Pandas is a powerful data manipulation library that allows for efficient data preprocessing and analysis. It provides a comprehensive set of functions for data cleaning, transformation, and aggregation, enabling analysts to prepare the data for credit risk modeling.
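
As a minimal sketch, assuming a hypothetical loans.csv file with columns such as income and employ (names chosen only for illustration), Pandas can load, clean, and summarize loan data in a few lines:

import pandas as pd

# Load a hypothetical loan dataset (file and column names are assumptions for illustration)
loans = pd.read_csv('loans.csv')

# Basic cleaning: drop duplicate rows and impute missing numeric values with column means
loans = loans.drop_duplicates()
numeric_cols = loans.select_dtypes(include='number').columns
loans[numeric_cols] = loans[numeric_cols].fillna(loans[numeric_cols].mean())

# Simple aggregation: average income by years of employment
print(loans.groupby('employ')['income'].mean())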

NumPy:

NumPy is a fundamental library for scientific computing in Python. It offers powerful mathematical functions and data structures, essential for numerical operations involved in credit risk analysis, such as calculating statistical measures and performing matrix operations.
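
For instance, NumPy can compute the statistical measures used later in this analysis; the array below is made-up data, purely for illustration:

import numpy as np

# Made-up debt-to-income ratios, purely for illustration
debtinc = np.array([9.3, 17.3, 5.5, 2.9, 10.0, 6.8])

print(np.mean(debtinc))             # average ratio
print(np.std(debtinc))              # standard deviation
print(np.percentile(debtinc, 99))   # 99th percentile, the kind of value used for outlier capping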

Scikit-learn:

Scikit-learn is a popular machine-learning library that provides a wide range of algorithms for classification and regression tasks. It includes algorithms like logistic regression, random forests, and support vector machines, which can be applied to credit risk analysis to build predictive models.
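
A minimal sketch of a scikit-learn default classifier; the data here is synthetic (make_classification) rather than real credit data, so treat it purely as an illustration of the API:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced stand-in for real credit data
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# AUC on held-out data measures how well defaulters are ranked above non-defaulters
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))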

XGBoost:

XGBoost is an efficient gradient-boosting library that excels in handling structured data. It can be used to build high-performance credit risk models by combining the predictions of multiple weak models, ultimately leading to better accuracy and reliability.
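
A minimal sketch using XGBoost's scikit-learn wrapper on the same kind of synthetic data, assuming the xgboost package is installed (pip install xgboost):

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic, imbalanced stand-in for real credit data
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Gradient boosting: each new tree corrects the errors of the previous ones
model = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1, eval_metric='logloss')
model.fit(X_train, y_train)
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))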

TensorFlow:

TensorFlow, an open-source deep learning framework, is useful for complex credit risk analysis tasks that involve neural networks. It provides a flexible architecture for building and training deep learning models, enabling the analysis of intricate patterns and dependencies within the data.

Keras:

Keras is a high-level neural network library that runs on top of TensorFlow. It simplifies the process of building and training deep learning models, making them accessible to data scientists with varying levels of expertise. Keras is particularly beneficial for credit risk analysis involving deep neural networks.
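
A minimal sketch of a Keras feed-forward network for binary default prediction; the data is randomly generated and the layer sizes are arbitrary choices, so this is only an illustration of the workflow:

import numpy as np
from tensorflow import keras

# Synthetic stand-in for real credit features and a toy default flag, purely for illustration
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8)).astype('float32')
y = (X[:, 0] + X[:, 1] > 0).astype('float32')

# Small feed-forward network ending in a sigmoid that outputs a default probability
model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(8, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[keras.metrics.AUC()])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)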

You can also explore our other projects, such as Rainfall Prediction Using Python: Data Science Project.

Data Science Projects for Credit Risk Analysis

To gain practical experience in credit risk analysis using Python, engaging in data science projects can be immensely valuable. Let’s explore a few project ideas that can help you strengthen your skills and understanding in this domain.

Credit Risk Analysis: Case Studies and Implementation

  1. Default Prediction: Develop a predictive model that identifies borrowers at risk of defaulting on their loans. Use historical credit data to train and test the model, evaluating its performance using appropriate evaluation metrics such as accuracy, precision, and recall (a short metrics sketch follows this list).
  2. Credit Score Modeling: Build a credit score model that assigns a numerical value to represent an individual’s creditworthiness. Use a dataset with relevant features such as credit history, income, and outstanding debts. Implement feature engineering techniques and apply various machine learning algorithms to determine the most effective credit scoring model.
  3. Loan Approval Automation: Create an automated system that assesses loan applications and predicts the probability of loan approval based on predefined criteria. Employ machine learning techniques to develop a model that accurately predicts loan approval outcomes, improving the efficiency of the loan approval process.
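
As referenced in project idea 1 above, here is a minimal, illustrative sketch of computing those evaluation metrics with scikit-learn (the label arrays are made up purely for illustration):

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up actual vs. predicted default flags, purely for illustration
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print('Accuracy :', accuracy_score(y_true, y_pred))    # share of correct predictions
print('Precision:', precision_score(y_true, y_pred))   # of predicted defaulters, how many actually defaulted
print('Recall   :', recall_score(y_true, y_pred))      # of actual defaulters, how many were caught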

Credit Risk Analysis Using Python: Code

You can find the dataset and code notebook on GitHub.

1: Importing Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.formula.api as sm
import scipy.stats as stats
# import pandas_profiling   # optional and not used below; the package is now distributed as ydata-profiling

%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()

from matplotlib.backends.backend_pdf import PdfPages

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from statsmodels.stats.outliers_influence import variance_inflation_factor
from patsy import dmatrices

2: Data Pre-Processing

bankloans=pd.read_csv('../input/bankloans.csv')
len(bankloans)
## Generic functions for data explorations
def var_summary(x):
    return pd.Series([x.count(), x.isnull().sum(), x.sum(), x.mean(), x.median(),  x.std(), x.var(), x.min(), x.dropna().quantile(0.01), x.dropna().quantile(0.05),x.dropna().quantile(0.10),x.dropna().quantile(0.25),x.dropna().quantile(0.50),x.dropna().quantile(0.75), x.dropna().quantile(0.90),x.dropna().quantile(0.95), x.dropna().quantile(0.99),x.max()], 
                  index=['N', 'NMISS', 'SUM', 'MEAN','MEDIAN', 'STD', 'VAR', 'MIN', 'P1' , 'P5' ,'P10' ,'P25' ,'P50' ,'P75' ,'P90' ,'P95' ,'P99' ,'MAX'])


def cat_summary(x):
    return pd.Series([x.count(), x.isnull().sum(), x.value_counts()], 
                  index=['N', 'NMISS', 'ColumnsNames'])

def create_dummies( df, colname ):
    col_dummies = pd.get_dummies(df[colname], prefix=colname)
    col_dummies.drop(col_dummies.columns[0], axis=1, inplace=True)
    df = pd.concat([df, col_dummies], axis=1)
    df.drop( colname, axis = 1, inplace = True )
    return df

#Handling outliers
def outlier_capping(x):
    # Cap extreme values at the 1st and 99th percentiles
    # (Series.clip_upper/clip_lower were removed in newer pandas versions, so use clip instead)
    return x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99))

def Missing_imputation(x):
    x = x.fillna(x.mean())
    return x
bankloans.apply(lambda x: var_summary(x)).T
bankloans_existing = bankloans[bankloans.default.notnull()]
bankloans_new = bankloans[bankloans.default.isnull()]
bankloans_existing=bankloans_existing.apply(lambda x: outlier_capping(x))
bankloans_existing=bankloans_existing.apply(lambda x: Missing_imputation(x))
numeric_var_names=[key for key in dict(bankloans.dtypes) if dict(bankloans.dtypes)[key] in ['float64', 'int64', 'float32', 'int32']]
cat_var_names=[key for key in dict(bankloans.dtypes) if dict(bankloans.dtypes)[key] in ['object']]
sns.heatmap(bankloans_existing.corr())

The correlation heatmap above shows a strong correlation between default and employ, address, and income. The debtinc (debt-to-income) variable also shows a strong relationship with income, which is expected because it is derived from income and debt.

Let’s try to understand the impact of each independent variable on the dependent variable with box-and-whisker plots.

bp = PdfPages('BoxPlots with default Split.pdf')

for num_variable in numeric_var_names:
    fig,axes = plt.subplots(figsize=(10,4))
    sns.boxplot(x='default', y=num_variable, data = bankloans_existing)
    bp.savefig(fig)
bp.close()

3: EDA

tstats_df = pd.DataFrame()
for num_variable in bankloans_existing.columns.difference(['default']):
    tstats=stats.ttest_ind(bankloans_existing[bankloans_existing.default==1][num_variable],bankloans_existing[bankloans_existing.default==0][num_variable])
    temp = pd.DataFrame([num_variable, tstats[0], tstats[1]]).T
    temp.columns = ['Variable Name', 'T-Statistic', 'P-Value']
    tstats_df = pd.concat([tstats_df, temp], axis=0, ignore_index=True)
print(tstats_df)

4: Visualization of variable importance

for num_variable in numeric_var_names:
    fig, axes = plt.subplots(figsize=(10,4))
    # sns.distplot was removed in recent seaborn releases, so use histplot with a KDE overlay
    sns.histplot(bankloans_existing[bankloans_existing['default']==0][num_variable], label='Not Default', color='b', kde=True, stat='density')
    sns.histplot(bankloans_existing[bankloans_existing['default']==1][num_variable], label='Default', color='r', kde=True, stat='density')
    plt.xlabel(str("X variable ") + str(num_variable))
    plt.ylabel('Density Function')
    plt.title(str('Default Split Density Plot of ') + str(num_variable))
    plt.legend()

5: Variable Transformation: Bucketing

bp = PdfPages('Transformation Plots.pdf')

for num_variable in bankloans_existing.columns.difference(['default']):
    binned = pd.cut(bankloans_existing[num_variable], bins=10, labels=list(range(1,11)))
    binned = binned.dropna()
    ser = bankloans_existing.groupby(binned)['default'].sum() / (bankloans_existing.groupby(binned)['default'].count()-bankloans_existing.groupby(binned)['default'].sum())
    ser = np.log(ser)
    fig,axes = plt.subplots(figsize=(10,4))
    sns.barplot(x=ser.index,y=ser)
    plt.ylabel('Log Odds Ratio')
    plt.title(str('Logit Plot for identifying if the bucketing is required or not for variable ') + str(num_variable))
    bp.savefig(fig)

bp.close()
print('These variables need bucketing - creddebt, othdebt, debtinc, employ, income ')
bankloans_existing.columns
bankloans_existing[['creddebt', 'othdebt', 'debtinc', 'employ','income' ]].describe(percentiles=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]).T
features = "+".join(bankloans_existing.columns.difference(['default']))
a,b = dmatrices(formula_like='default ~ '+ features, data = bankloans_existing, return_type='dataframe')

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(b.values, i) for i in range(b.shape[1])]
vif["features"] = b.columns

print(vif)

6: Model Building

train_features = bankloans_existing.columns.difference(['default'])
train_X, test_X = train_test_split(bankloans_existing, test_size=0.3, random_state=42)
train_X.columns
logreg = sm.logit(formula='default ~ ' + "+".join(train_features), data=train_X)
result = logreg.fit()
summ = result.summary()
summ
AUC = metrics.roc_auc_score(train_X['default'], result.predict(train_X))

print('AUC is -> ' + str(AUC))
train_gini = 2*metrics.roc_auc_score(train_X['default'], result.predict(train_X)) - 1
print("The Gini Index for the model built on the Train Data is : ", train_gini)

test_gini = 2*metrics.roc_auc_score(test_X['default'], result.predict(test_X)) - 1
print("The Gini Index for the model built on the Test Data is : ", test_gini)
train_predicted_prob = pd.DataFrame(result.predict(train_X))
train_predicted_prob.columns = ['prob']
train_actual = train_X['default']
# making a DataFrame with actual and prob columns
train_predict = pd.concat([train_actual, train_predicted_prob], axis=1)
train_predict.columns = ['actual','prob']

test_predicted_prob = pd.DataFrame(result.predict(test_X))
test_predicted_prob.columns = ['prob']
test_actual = test_X['default']
# making a DataFrame with actual and prob columns
test_predict = pd.concat([test_actual, test_predicted_prob], axis=1)
test_predict.columns = ['actual','prob']

## Intuition behind ROC curve - predicted probability as a tool for separating the '1's and '0's
def cut_off_calculation(result,train_X,train_predict):
    
    roc_like_df = pd.DataFrame()
    train_temp = train_predict.copy()

    for cut_off in np.linspace(0,1,50):
        train_temp['cut_off'] = cut_off
        train_temp['predicted'] = train_temp['prob'].apply(lambda x: 0.0 if x < cut_off else 1.0)
        train_temp['tp'] = train_temp.apply(lambda x: 1.0 if x['actual']==1.0 and x['predicted']==1 else 0.0, axis=1)
        train_temp['fp'] = train_temp.apply(lambda x: 1.0 if x['actual']==0.0 and x['predicted']==1 else 0.0, axis=1)
        train_temp['tn'] = train_temp.apply(lambda x: 1.0 if x['actual']==0.0 and x['predicted']==0 else 0.0, axis=1)
        train_temp['fn'] = train_temp.apply(lambda x: 1.0 if x['actual']==1.0 and x['predicted']==0 else 0.0, axis=1)
        sensitivity = train_temp['tp'].sum() / (train_temp['tp'].sum() + train_temp['fn'].sum())
        specificity = train_temp['tn'].sum() / (train_temp['tn'].sum() + train_temp['fp'].sum())
        roc_like_table = pd.DataFrame([cut_off, sensitivity, specificity]).T
        roc_like_table.columns = ['cutoff', 'sensitivity', 'specificity']
        roc_like_df = pd.concat([roc_like_df, roc_like_table], axis=0)
    return roc_like_df

roc_like_df = cut_off_calculation(result,train_X,train_predict)
## Finding ideal cut-off for checking if this remains same in OOS validation
roc_like_df['total'] = roc_like_df['sensitivity'] + roc_like_df['specificity']
roc_like_df[roc_like_df['total']==roc_like_df['total'].max()]
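
The filtered row above identifies the cut-off with the highest sensitivity + specificity, which is where the 0.24 used below comes from. As an optional extension that is not part of the original notebook, both curves can be plotted against the cut-off to make this choice visible:

# Optional extension: visualize how sensitivity and specificity trade off across cut-offs
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(roc_like_df['cutoff'], roc_like_df['sensitivity'], label='Sensitivity')
ax.plot(roc_like_df['cutoff'], roc_like_df['specificity'], label='Specificity')
ax.set_xlabel('Cut-off probability')
ax.legend()
plt.show()
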
train_predict['predicted'] = train_predict['prob'].apply(lambda x: 1 if x > 0.24 else 0)
sns.heatmap(pd.crosstab(train_predict['actual'], train_predict['predicted']), annot=True, fmt='.0f')
plt.title('Train Data Confusion Matrix')
plt.show()

test_predict['predicted'] = test_predict['prob'].apply(lambda x: 1 if x > 0.24 else 0)
sns.heatmap(pd.crosstab(test_predict['actual'], test_predict['predicted']), annot=True, fmt='.0f')
plt.title('Test Data Confusion Matrix')
plt.show()

# (117+236)/(117+236+120+17)
print("The overall accuracy score for the Train Data is : ", metrics.accuracy_score(train_predict.actual, train_predict.predicted))
print("The overall accuracy score for the Test Data  is : ", metrics.accuracy_score(test_predict.actual, test_predict.predicted))

7: Decile Analysis

train_predict['Deciles']=pd.qcut(train_predict['prob'],10, labels=False)
#test['Deciles']=pd.qcut(test['prob'],10, labels=False)
train_predict.head()
df = train_predict[['Deciles','actual']].groupby(train_predict.Deciles).sum().sort_index(ascending=False)
df
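
To make the rank-ordering easier to read, a small extension (not in the original notebook) is to also compute the default rate within each decile:

# Optional extension: default rate per decile of predicted probability
decile_summary = train_predict.groupby('Deciles')['actual'].agg(['count', 'sum'])
decile_summary['default_rate'] = decile_summary['sum'] / decile_summary['count']
print(decile_summary.sort_index(ascending=False))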

8: Model Implementation

train_features = bankloans_existing.columns.difference(['default'])
train_sk_X,test_sk_X, train_sk_Y ,test_sk_Y = train_test_split(bankloans_existing[train_features],bankloans_existing['default'], test_size=0.3, random_state=42)
train_sk_X.columns
logisticRegr = LogisticRegression()
logisticRegr.fit(train_sk_X, train_sk_Y)
#Predicting the test cases
train_pred = pd.DataFrame({'actual':train_sk_Y,'predicted':logisticRegr.predict(train_sk_X)})
train_pred = train_pred.reset_index()
train_pred.drop(labels='index',axis=1,inplace=True)
# Use predicted probabilities (not hard class labels) so the AUC/Gini reflects ranking quality
train_gini = 2*metrics.roc_auc_score(train_sk_Y, logisticRegr.predict_proba(train_sk_X)[:, 1]) - 1
print("The Gini Index for the model built on the Train Data is : ", train_gini)

test_gini = 2*metrics.roc_auc_score(test_sk_Y, logisticRegr.predict_proba(test_sk_X)[:, 1]) - 1
print("The Gini Index for the model built on the Test Data is : ", test_gini)
predict_proba_df = pd.DataFrame(logisticRegr.predict_proba(train_sk_X))
train_sk_pred = pd.concat([train_pred, predict_proba_df], axis=1)
# prob_0 / prob_1 are the predicted probabilities of the non-default (0) and default (1) classes
train_sk_pred.columns = ['actual', 'predicted', 'prob_0', 'prob_1']
auc_score = metrics.roc_auc_score(train_sk_pred.actual, train_sk_pred.prob_1)
round(float(auc_score), 2)
# Finding the optimal cutoff probability
fpr, tpr, thresholds = metrics.roc_curve(train_sk_pred.actual, train_sk_pred.prob_1, drop_intermediate=False)
plt.figure(figsize=(6, 4))
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
cutoff_prob = thresholds[(np.abs(tpr - 0.72)).argmin()]   # threshold whose true positive rate is closest to 0.72
cutoff_prob
# Classify as default when the predicted probability exceeds the cut-off found above
train_sk_pred['new_labels'] = train_sk_pred['prob_1'].map(lambda x: 1 if x >= cutoff_prob else 0)
print("The overall accuracy score for the Train Data is : ", round(metrics.accuracy_score(train_sk_Y, logisticRegr.predict(train_sk_X)),2))
print("The overall accuracy score for the Test Data is : ", round(metrics.accuracy_score(test_sk_Y, logisticRegr.predict(test_sk_X)),2))

9: Creating a Confusion Matrix

# Creating a confusion matrix

from sklearn import metrics

cm_train = metrics.confusion_matrix(train_sk_pred['actual'],
                                    train_sk_pred['new_labels'], labels=[1, 0])
sns.heatmap(cm_train,annot=True, fmt='.0f')
plt.title('Train Data Confusion Matrix')
plt.show()
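
As an optional extension of the notebook, a per-class precision/recall summary can be printed in one call, reusing the same actual and new_labels columns:

# Optional extension: precision, recall, and F1 per class for the train predictions
print(metrics.classification_report(train_sk_pred['actual'], train_sk_pred['new_labels']))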

Conclusion

Credit risk analysis is a critical aspect of financial decision-making, enabling lenders to assess the likelihood of borrowers defaulting on their loans. With Python and its rich ecosystem of data science tools and libraries, analysts can efficiently conduct credit risk analysis and develop predictive models for improved decision-making.

By leveraging Python’s powerful libraries such as Pandas, NumPy, scikit-learn, XGBoost, TensorFlow, and Keras, analysts can preprocess data, apply various algorithms, and build accurate credit risk models. Engaging in data science projects focused on credit risk analysis allows practitioners to gain hands-on experience and enhance their skills in this domain.

As data science continues to revolutionize the financial industry, mastering credit risk analysis using Python opens up exciting opportunities for professionals to contribute to the stability and success of financial institutions.

Remember, data science projects are not only educational but also a chance to explore your creativity and problem-solving skills. So, dive into the world of credit risk analysis with Python and unlock new possibilities in the field of finance.

