Predicting-Breast-Cancer-Diagnoses
TjanMichela • Updated May 2, 2024
Contributions by Michela Tjan Effendie and Michelle Manfrini.
Description
Breast cancer is one of the leading causes of death among women, so early screening is important for increasing the chances of successful treatment. A robust model can assist medical professionals in identifying cases and reducing breast cancer risk. This project proposes four breast cancer prediction models based on logistic regression, K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), and Quadratic Discriminant Analysis (QDA).
Framework
- Problem definition
- Data
- Evaluation
- Features
- Modelling
- Experimentation
Research Questions
- How well can a model trained on synthetic data predict whether a clinical patient has breast cancer?
- Can the model accurately predict whether an individual has breast cancer based on the provided predictors?
- Which attributes are significant in distinguishing between healthy and affected individuals?
- How well does the model perform in terms of accuracy and reliability?
- How clinically relevant and applicable is the model? Can healthcare professionals use it for early-stage breast cancer detection?
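The first question implies fitting on the synthetic data and scoring on the clinical observations. A minimal sketch of that setup with scikit-learn is shown here. The arrays are random placeholders standing in for the two datasets (4,000 training rows and 116 test rows, nine predictors, labels 1 and 2), and the scaling step and `n_neighbors=5` are illustrative assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)


def placeholder_data(n_rows):
    """Random stand-in for a nine-predictor table with labels 1 and 2."""
    y = rng.integers(1, 3, size=n_rows)  # 1 = healthy control, 2 = patient
    # Shift the class-2 rows slightly so the two classes are separable.
    X = rng.normal(size=(n_rows, 9)) + 0.5 * (y == 2)[:, None]
    return X, y


# Placeholder "synthetic" training set and "clinical" test set.
X_train, y_train = placeholder_data(4000)
X_test, y_test = placeholder_data(116)

models = {
    "Logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}

# Fit each model on the training set and record accuracy on the test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, accuracy in scores.items():
    print(f"{name}: accuracy = {accuracy:.3f}")
```

With the real data, the placeholder arrays would be replaced by the predictor columns and label column of each dataset. The comparison loop stays the same.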
Data Sources
- Two datasets:
  - The original “Breast Cancer Coimbra” dataset from the UC Irvine Machine Learning Repository, with 116 instances based on clinical observations from 64 patients with breast cancer and 52 healthy controls.
    - Age (years): Represents the age of individuals in the dataset.
    - BMI (kg/m²): Body Mass Index, a measure of body fat based on weight and height.
    - Glucose (mg/dL): Reflects blood glucose levels, a vital metabolic indicator.
    - Insulin (µU/mL): Indicates insulin levels, a hormone associated with glucose regulation.
    - HOMA: Homeostatic Model Assessment, a method assessing insulin resistance and beta-cell function.
    - Leptin (ng/mL): Represents leptin levels, a hormone involved in appetite and energy balance regulation.
    - Adiponectin (µg/mL): Reflects adiponectin levels, a protein associated with metabolic regulation.
    - Resistin (ng/mL): Indicates resistin levels, a protein implicated in insulin resistance.
    - MCP-1 (pg/dL): Reflects Monocyte Chemoattractant Protein-1 levels, a cytokine involved in inflammation.
    - Labels:
      - 1: Healthy controls
      - 2: Patients with breast cancer
  - A derived dataset from Kaggle: 4,000 synthetic instances generated by a deep learning model that was trained on the original dataset.
    - Same nine predictors and labels as the original dataset (listed above).
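HOMA is the one predictor in the list that is derived from two others, fasting glucose and insulin. The sketch below assumes the widely used HOMA-IR formula (glucose in mmol/L × insulin in µU/mL ÷ 22.5) and an approximate conversion of 18 mg/dL per mmol/L. Whether the datasets use exactly this convention is an assumption, and the function name and input values are illustrative.

```python
MG_DL_PER_MMOL_L = 18.0  # approximate glucose unit conversion (assumption)


def homa_ir(glucose_mg_dl, insulin_uU_ml):
    """HOMA-IR from fasting glucose (mg/dL) and fasting insulin (µU/mL)."""
    glucose_mmol_l = glucose_mg_dl / MG_DL_PER_MMOL_L
    return glucose_mmol_l * insulin_uU_ml / 22.5


# Illustrative values only, not taken from either dataset.
example = homa_ir(glucose_mg_dl=90.0, insulin_uU_ml=10.0)
print(f"HOMA-IR = {example:.2f}")
```

Because HOMA is a function of Glucose and Insulin, these three columns are expected to be strongly correlated, which is worth keeping in mind when interpreting which attributes appear significant.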
Features
Results
The results show that none of the models significantly outperforms the others. KNN appears to be the best-performing model across all evaluation metrics, including the confusion matrix, and it does not show a strong tendency to predict one class over the other.
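Whether a model leans toward one class can be read off its confusion matrix by comparing how often it predicts the positive class with how often that class actually occurs. A minimal sketch in plain Python, using the label coding from the datasets (1 = healthy control, 2 = patient); the example labels are made up for illustration and are not the project's results.

```python
def confusion_counts(y_true, y_pred, positive=2):
    """Return (tp, fp, tn, fn), treating `positive` as the cancer class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and class-balance rates."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # A large gap between the next two rates suggests class bias.
        "predicted_positive_rate": (tp + fp) / total,
        "actual_positive_rate": (tp + fn) / total,
    }


# Made-up labels for illustration only.
y_true = [1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 1, 1, 2, 2, 2, 2, 1]
report = summarize(y_true, y_pred)
for metric, value in report.items():
    print(f"{metric}: {value:.2f}")
```

In a screening context, sensitivity is usually the metric to watch most closely, since a missed case (false negative) tends to be costlier than a false alarm.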