PCA on Fashion-MNIST Data

Date: Oct 1, 2023
Tags: Machine Learning, Data Science, Python, scikit-learn, Jupyter Notebook

Description

In this exercise for Dr. Liang's Intro to Machine Learning with Applications class, we apply Principal Component Analysis (PCA) to the Fashion-MNIST dataset. The primary objectives are to visualize the data in 2D and to generate new images by using PCA as a generative model. The implementation uses scikit-learn, in particular its IncrementalPCA class, which fits the model in mini-batches rather than loading the entire dataset into memory at once.

Data Sources

The Fashion-MNIST dataset is used for this task. It consists of 28x28-pixel grayscale images of fashion items from 10 categories.

Methodology

Loading the Dataset

The Fashion-MNIST dataset is loaded with scikit-learn's fetch_openml function. The images are reshaped into a NumPy array, and the class labels are extracted.
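The loading step might look like the following minimal sketch. It assumes the dataset is available on OpenML under the name "Fashion-MNIST" (version 1), where each image arrives as a flat row of 784 pixel values and the labels arrive as strings. The helper names load_fashion_mnist and to_images are illustrative and are not taken from the original notebook.

```python
import numpy as np
from sklearn.datasets import fetch_openml


def load_fashion_mnist():
    """Download Fashion-MNIST from OpenML (needs network access on first call).

    Assumption: the dataset is published on OpenML as "Fashion-MNIST", version 1.
    """
    X, y = fetch_openml("Fashion-MNIST", version=1, return_X_y=True, as_frame=False)
    # Labels arrive as strings ("0".."9"); convert them to integers.
    return X.astype(np.float32), y.astype(int)


def to_images(X, height=28, width=28):
    """Reshape flat pixel rows of shape (n, height*width) into (n, height, width)."""
    return np.asarray(X).reshape(-1, height, width)
```

Calling X, y = load_fashion_mnist() followed by images = to_images(X) would give one 28x28 array per image, ready for plotting, while the flat X is the form PCA expects.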

Data Visualization

A subset of images is displayed using matplotlib, providing an overview of the dataset.
Figure: Displaying one of the images
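One way to display a subset of images is to tile them into a single array and pass it to matplotlib's imshow. The sketch below uses random stand-in data in place of the real dataset, and the helper names make_grid and show_grid are hypothetical.

```python
import numpy as np


def make_grid(images, n_rows, n_cols):
    """Tile the first n_rows * n_cols images into one 2D array for display."""
    images = np.asarray(images)
    n, h, w = n_rows * n_cols, images.shape[1], images.shape[2]
    tiles = images[:n].reshape(n_rows, n_cols, h, w)
    # Reorder to (row, pixel_row, col, pixel_col) so tiles sit side by side.
    return tiles.transpose(0, 2, 1, 3).reshape(n_rows * h, n_cols * w)


def show_grid(images, n_rows=4, n_cols=8):
    """Display a grid of images with matplotlib."""
    # Imported here so make_grid stays usable without a plotting backend.
    import matplotlib.pyplot as plt

    plt.imshow(make_grid(images, n_rows, n_cols), cmap="gray")
    plt.axis("off")
    plt.show()


# Stand-in data: 32 random 28x28 "images" in place of the real dataset.
demo_images = np.random.default_rng(0).integers(0, 256, size=(32, 28, 28))
grid = make_grid(demo_images, n_rows=4, n_cols=8)
```

In a notebook, show_grid(images) with the real 28x28 image array would render the first 32 images in a 4x8 grid.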

PCA Visualization

IncrementalPCA reduces the data to 2 components. The visualizations include a scatter plot of all data points and separate plots for class labels 0 and 1. The plot_components function overlays image thumbnails on the scatter plot, so each region of the 2D space can be matched to the kind of image it contains.
Figure: Plotting the data points in 2D
Figure: Display data points using plot_components
Figure: Display data points with class label 1 using plot_components
Figure: Plotting the data points with class label 0 (target) in 2D
Figure: Display data points with class label 0 using plot_components
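The 2D projection and the per-class selection could be sketched as follows. This is an illustration under assumptions rather than the notebook's code: the batch size, the helper names project_2d and select_class, and the random stand-in data are all hypothetical.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA


def project_2d(X, batch_size=1000):
    """Fit IncrementalPCA in mini-batches and project X onto 2 components."""
    ipca = IncrementalPCA(n_components=2, batch_size=batch_size)
    Z = ipca.fit_transform(X)
    return ipca, Z


def select_class(Z, y, label):
    """Keep only the projected points whose class label equals `label`."""
    return Z[np.asarray(y) == label]


# Stand-in data: random pixel rows and integer labels, not the real dataset.
rng = np.random.default_rng(0)
X_demo = rng.random((300, 784))
y_demo = rng.integers(0, 10, size=300)

ipca, Z = project_2d(X_demo, batch_size=100)
Z0 = select_class(Z, y_demo, 0)  # points with class label 0
Z1 = select_class(Z, y_demo, 1)  # points with class label 1
```

Each of Z, Z0, and Z1 has two columns, so a call such as plt.scatter(Z[:, 0], Z[:, 1], c=y_demo) would produce the kind of scatter plot described above.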
 

Generating New Images

The number of components to keep is chosen by examining the cumulative explained variance ratio. New images are then generated by using random values as weights on the eigenvectors and adding the weighted sum to the mean image.
Figure: Curve of 'percentage of variance explained' (0 - 1) vs n_components (0 - 100)
Figure: One of the eigenvectors, shown as an image
Figure: The mean image from PCA
Figure: The newly generated images
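The component selection and the generation step could be sketched as below. Several details are assumptions, not the notebook's confirmed method: the 0.95 variance threshold, drawing Gaussian weights scaled by each component's standard deviation, clipping to the 0-255 pixel range, and the helper names components_for_variance and generate_images. Random stand-in data replaces the real images.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA


def components_for_variance(pca, threshold=0.95):
    """Return the smallest k whose cumulative explained variance reaches threshold.

    If the fitted components never reach the threshold, all of them are used.
    """
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative


def generate_images(pca, n_images, k, rng):
    """Sample new images as mean + random weighted sum of the top-k eigenvectors.

    Assumption: weights are Gaussian, scaled by each component's standard deviation.
    """
    scale = np.sqrt(pca.explained_variance_[:k])
    weights = rng.standard_normal((n_images, k)) * scale
    flat = pca.mean_ + weights @ pca.components_[:k]
    # Keep pixel values inside the valid grayscale range.
    return np.clip(flat, 0, 255).reshape(n_images, 28, 28)


# Stand-in data: random pixel rows in [0, 255), not the real dataset.
rng = np.random.default_rng(0)
X_demo = rng.random((500, 784)) * 255

pca = IncrementalPCA(n_components=100, batch_size=250).fit(X_demo)
k, cumulative = components_for_variance(pca, threshold=0.95)
new_images = generate_images(pca, n_images=5, k=k, rng=rng)
```

Plotting cumulative against the component index gives the explained-variance curve, and pca.mean_.reshape(28, 28) and pca.components_[0].reshape(28, 28) give the mean image and the first eigenvector as images.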

t-SNE Visualization

To complement PCA, t-SNE (t-Distributed Stochastic Neighbor Embedding) is applied to a subset of the data for additional visualization. t-SNE is a nonlinear method that aims to keep points that are similar in the original space close together in the low-dimensional embedding, which can reveal cluster structure that a linear projection such as PCA misses.
Figure: t-SNE run on a subset of the data, visualized in 2D as a scatter plot
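A minimal sketch of running t-SNE on a random subset follows. The subset size, the perplexity, the PCA initialization, and the helper name tsne_embed_subset are assumptions chosen for illustration, and random stand-in data replaces the real images.

```python
import numpy as np
from sklearn.manifold import TSNE


def tsne_embed_subset(X, n_subset, rng, perplexity=30.0):
    """Embed a random subset of the rows of X into 2D with t-SNE."""
    # Sample without replacement so no row appears twice in the subset.
    idx = rng.choice(len(X), size=n_subset, replace=False)
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    return idx, tsne.fit_transform(X[idx])


# Stand-in data: random pixel rows, not the real dataset.
rng = np.random.default_rng(0)
X_demo = rng.random((400, 784))

idx, embedding = tsne_embed_subset(X_demo, n_subset=150, rng=rng)
```

The returned idx makes it possible to color the scatter plot by class, for example plt.scatter(embedding[:, 0], embedding[:, 1], c=y[idx]) with the real labels y. Working on a subset keeps the runtime manageable, since t-SNE is much slower than PCA on large datasets.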

Conclusion

  • Data Visualization: PCA effectively reduces high-dimensional image data to 2D, providing a visual representation of the dataset's structure.
  • Component Analysis: The eigenvectors (components) extracted by PCA reveal meaningful features in the images.
  • Image Generation: While PCA is not optimal for image generation, the process of combining mean and eigenvectors to create new images is demonstrated.
  • t-SNE Visualization: t-SNE offers an alternative perspective, emphasizing similarities between data points in 2D space.
PCA proves valuable for visualizing high-dimensional image data and extracting meaningful features. However, its limitations in image generation highlight the need for alternative methods, such as those based on neural networks. The addition of t-SNE provides a complementary view of the dataset's structure, enhancing the overall understanding.
 
