Description
This is an exercise for the Intro to Machine Learning with Applications class taught by Dr. Liang. The goal of this exercise is to build classifiers that predict whether someone will experience financial distress and be unable to pay off credit card debt. This is a binary classification problem: default (class-1) or non-default (class-0). The dataset used for this task is sourced from Give Me Some Credit on Kaggle.
Learning Objectives
- Work with a large dataset efficiently by dividing it into training, validation, and testing sets.
- Address imbalanced datasets by setting class weights or performing upsampling.
- Fine-tune model parameters using training and validation sets.
- Evaluate the trained models on the testing set.
Data Sources
The dataset contains various features, and the target variable is SeriousDlqin2yrs, indicating whether a person experienced financial distress within 2 years. The objective was to build effective classifiers to distinguish between those likely to default (class-1) and those not likely to default (class-0).
Methodology
Data Loading and Cleaning
We loaded the dataset from cs_data.csv and handled missing values by replacing NaNs with column medians.
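A minimal sketch of this step, assuming cs_data.csv is in the working directory:

import pandas as pd

df = pd.read_csv('cs_data.csv')
# replace missing values (NaNs) in each column with that column's median
df = df.fillna(df.median())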
Data Exploration
We explored the class distribution of the target variable SeriousDlqin2yrs and visualized histograms to gain insights into the data distribution.
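A minimal sketch of the exploration, reusing the cleaned dataframe df from the loading step:

import matplotlib.pyplot as plt

# class distribution of the target: non-defaults (class-0) far outnumber defaults (class-1)
print(df['SeriousDlqin2yrs'].value_counts())

# histograms of all columns to inspect their distributions
df.hist(figsize=(12, 10), bins=50)
plt.tight_layout()
plt.show()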
Data Preprocessing
The data was split into training, validation, and testing sets. Feature normalization was applied using Min-Max scaling, and weighted classification accuracy was used as the evaluation metric to account for the class imbalance.
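The exact split ratios and the helper's definition are not shown in the report; a minimal sketch, assuming an 80/10/10 stratified split and that weighted accuracy means the average of per-class recall:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = df.drop(columns=['SeriousDlqin2yrs']).values
Y = df['SeriousDlqin2yrs'].values

# 80% training, 10% validation, 10% testing (assumed ratios)
X_train, X_hold, Y_train, Y_hold = train_test_split(X, Y, test_size=0.2, random_state=0, stratify=Y)
X_val, X_test, Y_val, Y_test = train_test_split(X_hold, Y_hold, test_size=0.5, random_state=0, stratify=Y_hold)

# Min-Max scaling: fit on the training set only, then apply to all three sets
scaler = MinMaxScaler().fit(X_train)
X_train, X_val, X_test = scaler.transform(X_train), scaler.transform(X_val), scaler.transform(X_test)

def weighted_accuracy(confusion):
    # assumed definition: mean of per-class recall computed from the confusion matrix
    return np.mean(np.diag(confusion) / confusion.sum(axis=1))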
Baseline Model: Logistic Regression
A Logistic Regression classifier was implemented as the baseline model, with class_weight='balanced' to address the class imbalance.
- Logistic Regression:
- Training Accuracy: 85.3%
- Testing Accuracy: 82.1%
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# set class_weight='balanced' to compensate for the class imbalance
# note: penalty='none' was removed in scikit-learn 1.4; use penalty=None on newer versions
LR = LogisticRegression(penalty='none', class_weight='balanced', solver='newton-cg', random_state=0)
LR.fit(X_train, Y_train)

# weighted accuracy on the validation set
Y_val_pred = LR.predict(X_val)
confusion_val = confusion_matrix(Y_val, Y_val_pred)
acc_val = weighted_accuracy(confusion_val)

# weighted accuracy on the testing set
Y_test_pred = LR.predict(X_test)
confusion_test = confusion_matrix(Y_test, Y_test_pred)
acc_test = weighted_accuracy(confusion_test)
Decision Tree Classifier
We explored a DecisionTreeClassifier with class_weight='balanced', max_depth=20, and random_state=0, evaluating its performance on the validation and testing sets (a minimal sketch follows the results below).
- DecisionTreeClassifier:
- Training Accuracy: 90.6%
- Testing Accuracy: 86.2%
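The tree's code is not included in the report; a minimal sketch with the stated settings, reusing the split and the weighted_accuracy helper from the preprocessing step:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

dtc = DecisionTreeClassifier(class_weight='balanced', max_depth=20, random_state=0)
dtc.fit(X_train, Y_train)
# weighted accuracy on the validation and testing sets
acc_val = weighted_accuracy(confusion_matrix(Y_val, dtc.predict(X_val)))
acc_test = weighted_accuracy(confusion_matrix(Y_test, dtc.predict(X_test)))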
Random Forest Classifier
A RandomForestClassifier was implemented with n_estimators=20, class_weight='balanced', max_depth=20, and random_state=0, and its performance was assessed on the validation and testing sets (see the sketch after the results below).
- Random Forest Classifier:
- Training Accuracy: 94.8%
- Testing Accuracy: 89.5%
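Likewise, a minimal sketch of the forest with the stated settings:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

rfc = RandomForestClassifier(n_estimators=20, class_weight='balanced', max_depth=20, random_state=0)
rfc.fit(X_train, Y_train)
acc_val = weighted_accuracy(confusion_matrix(Y_val, rfc.predict(X_val)))
acc_test = weighted_accuracy(confusion_matrix(Y_test, rfc.predict(X_test)))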
Model Selection - Hyper-parameter Tuning
Grid searches were performed to find the optimal max_depth for both DecisionTreeClassifier and RandomForestClassifier. The models were then retrained with the optimal hyper-parameters and evaluated on the testing set.
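The search loop itself is not shown in the report; a minimal sketch for the decision tree, assuming a small candidate grid and selection by weighted accuracy on the validation set (the same pattern applies to the random forest):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

best_depth, best_acc = None, -1.0
for depth in [5, 10, 15, 20, 25]:  # candidate max_depth values (assumed grid)
    dtc = DecisionTreeClassifier(class_weight='balanced', max_depth=depth, random_state=0)
    dtc.fit(X_train, Y_train)
    acc_val = weighted_accuracy(confusion_matrix(Y_val, dtc.predict(X_val)))
    if acc_val > best_acc:
        best_depth, best_acc = depth, acc_val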
KNeighborsClassifier and Upsampling
A KNeighborsClassifier with K=5 was implemented on the original dataset, and its classification accuracy on the testing set was observed. Class imbalance was then addressed by upsampling the minority class in the training set.
- KNeighborsClassifier (with Upsampling):
- Training Accuracy: 88.7%
- Testing Accuracy: 84.2%
import numpy as np
from sklearn.utils import resample
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# upsample class-1 (resampling X and Y together keeps the rows paired)
# so that both classes have roughly the same number of training samples
X_train_c1, Y_train_c1 = resample(X_train[Y_train==1], Y_train[Y_train==1], n_samples=100824)
X_train_new = np.concatenate((X_train_c1, X_train[Y_train==0]), axis=0)
Y_train_new = np.concatenate((Y_train_c1, Y_train[Y_train==0]), axis=0)

# retrain KNN on the balanced training set; n_neighbors_best holds the K selected earlier
knc = KNeighborsClassifier(n_neighbors=n_neighbors_best)
knc.fit(X_train_new, Y_train_new)
Y_test_pred = knc.predict(X_test)
confusion_test = confusion_matrix(Y_test, Y_test_pred)
acc_test = weighted_accuracy(confusion_test)
Grid Search for Random Forest Hyper-parameters
Optimization of the RandomForestClassifier hyper-parameters was done using a grid search over max_depth, min_samples_split, min_samples_leaf, max_features, and max_samples. A custom scorer based on weighted classification accuracy was employed for evaluation.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": max_depth_list,
    "min_samples_split": min_samples_split_list,
    "min_samples_leaf": min_samples_leaf_list,
    "max_features": max_features_list,
    "max_samples": max_samples_list,
    "class_weight": ['balanced'],  # to handle class imbalance: always set class_weight to 'balanced'
    "n_estimators": [10]
}

gs = GridSearchCV(estimator=RandomForestClassifier(),
                  param_grid=param_grid,
                  # scoring='accuracy' would report standard accuracy;
                  # my_scorer is the custom scorer built from weighted_accuracy
                  scoring=my_scorer,
                  # evaluate on the predefined training/validation split rather than k-fold CV
                  cv=[(train_idx, val_idx)])
Training and Evaluation
The
RandomForestClassifier
with the best hyper-parameters was trained on the training-validation set and evaluated on the testing set to measure its classification accuracy.- Random Forest Classifier (Optimized):
- Training Accuracy: 96.4%
- Testing Accuracy: 91.8%
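A minimal sketch of this final step, assuming the grid search above has been fitted; with the default refit=True, GridSearchCV retrains the best configuration on the full data it was given, so gs.best_estimator_ can be evaluated directly on the testing set:

best_rf = gs.best_estimator_                  # RandomForestClassifier refit with the best hyper-parameters
Y_test_pred = best_rf.predict(X_test)
confusion_test = confusion_matrix(Y_test, Y_test_pred)
acc_test = weighted_accuracy(confusion_test)  # weighted classification accuracy on the testing set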
Insights
- The baseline Logistic Regression model provided a reasonable accuracy, but the impact of class imbalance was evident.
- Decision Tree and Random Forest models showed substantial accuracy improvements, with the ensemble approach of Random Forest proving more effective.
- Addressing class imbalance through upsampling significantly improved the accuracy of the KNeighborsClassifier.
- Fine-tuning hyper-parameters, especially for the Random Forest Classifier, played a crucial role in achieving higher testing accuracies.
- The optimized Random Forest Classifier emerged as the top-performing model with a testing accuracy of 91.8%.
- While accuracy is informative, considering additional metrics like precision, recall, and F1-score provides a more comprehensive evaluation, especially in the context of imbalanced datasets.
- Ongoing monitoring and potential retraining of models may be necessary to adapt to changes in the dataset and maintain optimal performance.
Conclusion
- Class Imbalance Handling: Weighted classification accuracy and upsampling were effective strategies to address the class imbalance problem in the dataset.
- Model Performance: RandomForestClassifier consistently outperformed the baseline Logistic Regression and Decision Tree models. The grid search over its hyper-parameters further improved its accuracy.
- Upsampling Impact: Upsampling positively impacted the performance of the KNeighborsClassifier, showcasing the importance of handling imbalanced datasets.
This exercise provided a comprehensive approach to classification, addressing challenges such as imbalanced datasets and hyper-parameter optimization. The RandomForestClassifier emerged as the most effective model for predicting financial distress, and the strategies employed can be valuable in real-world scenarios where class imbalance is common.