Description
In this project for Intro to Machine Learning with Applications class by Dr. Liang, the primary task is to apply k-means clustering to analyze housing data, with a side-task of conducting an image compression. The process involves utilizing k-means clustering, an unsupervised machine learning algorithm, to group similar instances based on certain features.
Framework
- Problem definition
- Data
- Evaluation
- Features
- Modelling
- Experimentation
Project Hypothesis / Research Question
What are the geographical patterns and locations where individuals with high and low incomes predominantly reside, and how can an analysis of housing data contribute to the identification and understanding of these distinctive income-based clusters?
Data Sources
Features
- longitude
- latitude
- housing_median_age
- total_rooms
- total_bedrooms
- population
- households
- median_income
- median_house_value
- ocean_proximity
Progress & Findings
After inspecting the data, there seems to be missing values in the
total_bedrooms
feature. The missing values are not handled since total_bedrooms
feature is not used in this project. The first figure below displays the histogram of all the numeric features of the data which shows the distribution of each of the features. The figure to the right shows a plot of the population (denoted by the size of a point) and median income (denoted by the color of a point) by the location of the housing. These figures give a general sense of the data in hand. The figures below visualize the clusters of data points based on locations to reveal spatial clusters using K-means clustering.
Again, using K-means clustering, the data points were clustered and visualized, revealing six different clusters based on where people live based on their income level.
Side-task: Image Compression
Steps
Step-0: read an image from a bmp file
Step-1: prepare the data matrix X
Step-2: perform k-means on data matrix X
Step-3: compress the image using the cluster centers
- Make a copy of the the data matrix X, call it Y
- Then, modify the data matrix Y, such that every data point is replaced by the corresponding cluster center
- Convert the data matrix Y back to an image Ic
Step-4: visualize the compressed the image
Ic
Step-5: save the compressed image to a bmp file
Conclusion
The analysis of housing data using K-means clustering has shown interesting connections between people's incomes and where they live. One surprising finding is that some people with low incomes live in expensive houses, pointing to possible economic inequalities in certain neighborhoods. On the other hand, the identification of specific groups of well-off individuals helps us understand where the higher-income individuals tend to live. This project not only reveals these living patterns but also highlights how looking at housing data can teach us more about the complex relationships between income and living arrangements. It provides a deeper insight into the dynamics of money and living situations in different areas of California.