Description
This is an exercise for the Intro to Machine Learning with Applications class taught by Dr. Liang. The goal of this exercise is to build classifiers that predict whether someone will experience financial distress and be unable to pay off credit card debt. This is a binary classification problem: default (class-1) or non-default (class-0). The dataset used for this task is sourced from Give Me Some Credit on Kaggle.
Learning Objectives
- Work with a large dataset efficiently by dividing it into training, validation, and testing sets.
- Address imbalanced datasets by setting class weights or performing upsampling.
- Fine-tune model parameters using training and validation sets.
- Evaluate the trained models on the testing set.
Data Sources
The dataset contains various features, and the target variable is SeriousDlqin2yrs, indicating whether a person experienced financial distress within 2 years. The objective was to build effective classifiers to distinguish between those likely to default (class-1) and those not likely to default (class-0).
Methodology
Data Loading and Cleaning
We loaded the dataset from cs_data.csv and handled missing values by replacing NaNs with column medians.
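A minimal sketch of this step, assuming cs_data.csv is in the working directory:

import pandas as pd

df = pd.read_csv('cs_data.csv')
# replace missing values (NaNs) in each column with that column's median
df = df.fillna(df.median())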
Data Exploration
We explored the class distribution of the target variable SeriousDlqin2yrs and visualized histograms to gain insights into the data distribution.
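A minimal sketch of the exploration, reusing the cleaned dataframe df from the loading step:

import matplotlib.pyplot as plt

# class distribution of the target: non-defaults (class-0) far outnumber defaults (class-1)
print(df['SeriousDlqin2yrs'].value_counts())

# histograms of all columns to inspect their distributions
df.hist(figsize=(12, 10), bins=50)
plt.tight_layout()
plt.show()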
Data Preprocessing
The data was split into training, validation, and testing sets. Feature normalization was applied using Min-Max scaling, and weighted classification accuracy was used as the evaluation metric to account for the class imbalance.
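The exact split ratios and the helper's definition are not shown in the report; a minimal sketch, assuming an 80/10/10 stratified split and that weighted accuracy means the average of per-class recall:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = df.drop(columns=['SeriousDlqin2yrs']).values
Y = df['SeriousDlqin2yrs'].values

# 80% training, 10% validation, 10% testing (assumed ratios)
X_train, X_hold, Y_train, Y_hold = train_test_split(X, Y, test_size=0.2, random_state=0, stratify=Y)
X_val, X_test, Y_val, Y_test = train_test_split(X_hold, Y_hold, test_size=0.5, random_state=0, stratify=Y_hold)

# Min-Max scaling: fit on the training set only, then apply to all three sets
scaler = MinMaxScaler().fit(X_train)
X_train, X_val, X_test = scaler.transform(X_train), scaler.transform(X_val), scaler.transform(X_test)

def weighted_accuracy(confusion):
    # assumed definition: mean of per-class recall computed from the confusion matrix
    return np.mean(np.diag(confusion) / confusion.sum(axis=1))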
Baseline Model: Logistic Regression
A Logistic Regression classifier was implemented as the baseline model, with class_weight='balanced' to address the class imbalance.
- Logistic Regression:
- Training Accuracy: 85.3%
- Testing Accuracy: 82.1%
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# set class_weight='balanced' to compensate for the class imbalance
# note: penalty='none' was removed in scikit-learn 1.4; use penalty=None on newer versions
LR = LogisticRegression(penalty='none', class_weight='balanced', solver='newton-cg', random_state=0)
LR.fit(X_train, Y_train)

# weighted accuracy on the validation set
Y_val_pred = LR.predict(X_val)
confusion_val = confusion_matrix(Y_val, Y_val_pred)
acc_val = weighted_accuracy(confusion_val)

# weighted accuracy on the testing set
Y_test_pred = LR.predict(X_test)
confusion_test = confusion_matrix(Y_test, Y_test_pred)
acc_test = weighted_accuracy(confusion_test)
Decision Tree Classifier
We explored a DecisionTreeClassifier with class_weight='balanced', max_depth=20, and random_state=0, evaluating its performance on the validation and testing sets (a minimal sketch follows the results below).
- DecisionTreeClassifier:
- Training Accuracy: 90.6%
- Testing Accuracy: 86.2%
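The tree's code is not included in the report; a minimal sketch with the stated settings, reusing the split and the weighted_accuracy helper from the preprocessing step:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

dtc = DecisionTreeClassifier(class_weight='balanced', max_depth=20, random_state=0)
dtc.fit(X_train, Y_train)
# weighted accuracy on the validation and testing sets
acc_val = weighted_accuracy(confusion_matrix(Y_val, dtc.predict(X_val)))
acc_test = weighted_accuracy(confusion_matrix(Y_test, dtc.predict(X_test)))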
Random Forest Classifier
A RandomForestClassifier was implemented with n_estimators=20, class_weight='balanced', max_depth=20, and random_state=0, and its performance was assessed on the validation and testing sets (see the sketch after the results below).
- Random Forest Classifier:
- Training Accuracy: 94.8%
- Testing Accuracy: 89.5%
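Likewise, a minimal sketch of the forest with the stated settings:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rfc = RandomForestClassifier(n_estimators=20, class_weight='balanced', max_depth=20, random_state=0)
rfc.fit(X_train, Y_train)
acc_val = weighted_accuracy(confusion_matrix(Y_val, rfc.predict(X_val)))
acc_test = weighted_accuracy(confusion_matrix(Y_test, rfc.predict(X_test)))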
Model Selection - Hyper-parameter Tuning
Grid searches were performed to find the optimal max_depth for both DecisionTreeClassifier and RandomForestClassifier. The models were then retrained with the optimal hyper-parameters and evaluated on the testing set.
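The search loop itself is not shown in the report; a minimal sketch for the decision tree, assuming a small candidate grid and selection by weighted accuracy on the validation set (the same pattern applies to the random forest):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

best_depth, best_acc = None, -1.0
for depth in [5, 10, 15, 20, 25]:  # candidate max_depth values (assumed grid)
    dtc = DecisionTreeClassifier(class_weight='balanced', max_depth=depth, random_state=0)
    dtc.fit(X_train, Y_train)
    acc_val = weighted_accuracy(confusion_matrix(Y_val, dtc.predict(X_val)))
    if acc_val > best_acc:
        best_depth, best_acc = depth, acc_val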
KNeighborsClassifier and Upsampling
A KNeighborsClassifier with K=5 was implemented on the original dataset, and its classification accuracy on the testing set was observed. Class imbalance was then addressed by upsampling the minority class in the training set.
- KNeighborsClassifier (with Upsampling):
- Training Accuracy: 88.7%
- Testing Accuracy: 84.2%
import numpy as np
from sklearn.utils import resample
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# upsample class-1 (resampling X and Y together keeps the rows paired)
# so that both classes have roughly the same number of training samples
X_train_c1, Y_train_c1 = resample(X_train[Y_train==1], Y_train[Y_train==1], n_samples=100824)
X_train_new = np.concatenate((X_train_c1, X_train[Y_train==0]), axis=0)
Y_train_new = np.concatenate((Y_train_c1, Y_train[Y_train==0]), axis=0)

# retrain KNN on the balanced training set; n_neighbors_best holds the K selected earlier
knc = KNeighborsClassifier(n_neighbors=n_neighbors_best)
knc.fit(X_train_new, Y_train_new)
Y_test_pred = knc.predict(X_test)
confusion_test = confusion_matrix(Y_test, Y_test_pred)
acc_test = weighted_accuracy(confusion_test)
Grid Search for Random Forest Hyper-parameters
Optimization of the RandomForestClassifier hyper-parameters was done using a grid search over max_depth, min_samples_split, min_samples_leaf, max_features, and max_samples. A custom scorer based on weighted classification accuracy was employed for evaluation.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": max_depth_list,
    "min_samples_split": min_samples_split_list,
    "min_samples_leaf": min_samples_leaf_list,
    "max_features": max_features_list,
    "max_samples": max_samples_list,
    "class_weight": ['balanced'],  # to handle class imbalance: always set class_weight to 'balanced'
    "n_estimators": [10]
}

gs = GridSearchCV(estimator=RandomForestClassifier(),
                  param_grid=param_grid,
                  # scoring='accuracy' would report standard accuracy;
                  # my_scorer is the custom scorer built from weighted_accuracy
                  scoring=my_scorer,
                  # evaluate on the predefined training/validation split rather than k-fold CV
                  cv=[(train_idx, val_idx)])
Training and Evaluation
The
RandomForestClassifier
with the best hyper-parameters was trained on the training-validation set and evaluated on the testing set to measure its classification accuracy.- Random Forest Classifier (Optimized):
- Training Accuracy: 96.4%
- Testing Accuracy: 91.8%
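A minimal sketch of this final step, assuming the grid search above has been fitted; with the default refit=True, GridSearchCV retrains the best configuration on the full data it was given, so gs.best_estimator_ can be evaluated directly on the testing set:

best_rf = gs.best_estimator_                  # RandomForestClassifier refit with the best hyper-parameters
Y_test_pred = best_rf.predict(X_test)
confusion_test = confusion_matrix(Y_test, Y_test_pred)
acc_test = weighted_accuracy(confusion_test)  # weighted classification accuracy on the testing set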
Insights
- The baseline Logistic Regression model provided a reasonable accuracy, but the impact of class imbalance was evident.
- Decision Tree and Random Forest models showed substantial accuracy improvements, with the ensemble approach of Random Forest proving more effective.
- Addressing class imbalance through upsampling significantly improved the accuracy of the KNeighborsClassifier.
- Fine-tuning hyper-parameters, especially for the Random Forest Classifier, played a crucial role in achieving higher testing accuracies.
- The optimized Random Forest Classifier emerged as the top-performing model with a testing accuracy of 91.8%.
- While accuracy is informative, considering additional metrics like precision, recall, and F1-score provides a more comprehensive evaluation, especially in the context of imbalanced datasets.
- Ongoing monitoring and potential retraining of models may be necessary to adapt to changes in the dataset and maintain optimal performance.
Conclusion
- Class Imbalance Handling: Weighted classification accuracy and upsampling were effective strategies to address the class imbalance problem in the dataset.
- Model Performance: RandomForestClassifier consistently outperformed the baseline Logistic Regression and Decision Tree models. The grid search over its hyper-parameters further improved its accuracy.
- Upsampling Impact: Upsampling positively impacted the performance of the KNeighborsClassifier, showcasing the importance of handling imbalanced datasets.
This exercise provided a comprehensive approach to classification, addressing challenges such as imbalanced datasets and hyper-parameter optimization. The RandomForestClassifier emerged as the most effective model for predicting financial distress, and the strategies employed can be valuable in real-world scenarios where class imbalance is common.