Website Scraping
👩‍💻

Date
Sep 9, 2022
Tags
Python
Data Extraction

Description

During my Complete Python Developer course, I built a web scraper in Python for the “Hacker News” website. The script retrieves article titles, links, and vote counts, then displays the most popular articles.

Process

First, I imported the necessary Python libraries to get started. These libraries were essential for making HTTP requests, parsing HTML, and formatting the output:
  • requests: This library allowed me to send HTTP requests to websites and retrieve their HTML content.
  • BeautifulSoup: I used BeautifulSoup for parsing the HTML content and navigating through the HTML document to extract specific elements from the web page.
  • pprint: I used pprint to format and print the final result for better readability.
Next, I sent HTTP GET requests to two pages on the Hacker News website: the main page and the second page. These requests retrieved the HTML content of these pages.
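As a rough sketch of this step (not the exact course code), the two requests could look like the following. The URLs, the page_url and fetch_html helper names, and the timeout value are assumptions made for illustration.

```python
import requests

# Assumed URL of the Hacker News front page; later pages add a ?p=N query.
BASE_URL = "https://news.ycombinator.com/news"


def page_url(page: int) -> str:
    """Build the URL for a given page number (hypothetical helper)."""
    return BASE_URL if page == 1 else f"{BASE_URL}?p={page}"


def fetch_html(page: int, timeout: float = 10.0) -> str:
    """Send a GET request and return the page's HTML as text."""
    response = requests.get(page_url(page), timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


# Usage (performs real network requests, so it is left commented out):
# html_page_1 = fetch_html(1)
# html_page_2 = fetch_html(2)
```

Calling raise_for_status() is optional, but it stops the script early if the site returns an error instead of the expected page.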
With the HTML content in hand, I created two BeautifulSoup objects, named soup and soup2, to parse the HTML content obtained from the two web pages.
I then called select() on soup and soup2 with CSS selectors to pull specific elements from each page: the article titles with their links, and the subtext holding each article's points (votes).
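To illustrate the parsing and selection steps, here is a small sketch that runs on a hand-written HTML snippet instead of the live site. The snippet, the ".titleline > a" and ".subtext" selectors, and the "html.parser" choice are assumptions; the real Hacker News markup may differ and has changed over time.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a Hacker News page (assumed structure).
SAMPLE_HTML = """
<table>
  <tr class="athing"><td class="title">
    <span class="titleline"><a href="https://example.com/a">Story A</a></span>
  </td></tr>
  <tr><td class="subtext"><span class="score">150 points</span></td></tr>
  <tr class="athing"><td class="title">
    <span class="titleline"><a href="https://example.com/b">Story B</a></span>
  </td></tr>
  <tr><td class="subtext"><span class="score">42 points</span></td></tr>
</table>
"""

# Parse the HTML with Python's built-in parser.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() takes a CSS selector and returns every matching element.
links = soup.select(".titleline > a")   # title anchors (text + href)
subtext = soup.select(".subtext")       # row holding the vote count

first_title = links[0].get_text()
first_href = links[0].get("href")
first_score = subtext[0].select(".score")[0].get_text()
```

Each element in links carries both the title text and the href attribute, which is why one selector is enough for titles and links.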
To combine the data from the two pages, I created two lists: mega_links (containing article titles and links) and mega_subtext (containing vote information).
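Combining the pages comes down to list concatenation. The sketch below uses plain strings as stand-ins for the elements returned by select(); the links2 and subtext2 names for the second page's results are assumptions.

```python
# Stand-ins for the select() results of page one and page two.
links = ["Story A", "Story B"]
subtext = ["150 points", "42 points"]
links2 = ["Story C"]
subtext2 = ["300 points"]

# Concatenate page one and page two so each story keeps the same index
# in both combined lists.
mega_links = links + links2
mega_subtext = subtext + subtext2
```

Keeping the same order in both lists matters, because the title at a given index is later matched with the vote information at that same index.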
To make the final output more meaningful, I defined a function called sort_stories_by_vote. This function takes a list of stories (hnlist) and sorts them in descending order of votes.
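A minimal version of such a sorting function might look like this, assuming each story is a dictionary with a "votes" key (the dictionary layout is an assumption).

```python
def sort_stories_by_vote(hnlist):
    """Return the stories ordered from most to fewest votes."""
    return sorted(hnlist, key=lambda story: story["votes"], reverse=True)


stories = [
    {"title": "Story A", "votes": 150},
    {"title": "Story C", "votes": 300},
    {"title": "Story E", "votes": 120},
]
ranked = sort_stories_by_vote(stories)
```

sorted() returns a new list, so the original stories list is left unchanged.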
The core of the scraping process was the create_custom_hn function, which took the lists of article links and vote information and built a list called hn. Inside the function, I kept only articles with more than 99 points and collected their titles, links, and vote counts. The function then returned the list of Hacker News articles, sorted by number of votes in descending order.
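Put together, the filtering logic could be sketched as follows. This is an approximation run on a hand-written HTML snippet, not the exact course code: the selectors, the ".score" class, and the dictionary keys are assumptions.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the combined pages (assumed structure).
SAMPLE_HTML = """
<table>
  <tr><td><span class="titleline"><a href="https://example.com/a">Story A</a></span></td></tr>
  <tr><td class="subtext"><span class="score">150 points</span></td></tr>
  <tr><td><span class="titleline"><a href="https://example.com/b">Story B</a></span></td></tr>
  <tr><td class="subtext"><span class="score">42 points</span></td></tr>
  <tr><td><span class="titleline"><a href="https://example.com/c">Story C</a></span></td></tr>
  <tr><td class="subtext"><span class="score">300 points</span></td></tr>
  <tr><td><span class="titleline"><a href="https://example.com/d">Story D</a></span></td></tr>
  <tr><td class="subtext"><span class="age">2 hours ago</span></td></tr>
</table>
"""


def sort_stories_by_vote(hnlist):
    """Order stories from most to fewest votes."""
    return sorted(hnlist, key=lambda story: story["votes"], reverse=True)


def create_custom_hn(links, subtext):
    """Keep stories with more than 99 points and return them sorted by votes."""
    hn = []
    for item, sub in zip(links, subtext):
        vote = sub.select(".score")
        if not vote:
            continue  # some rows (e.g. job posts) have no score at all
        # "150 points" -> 150; split() also handles the singular "1 point"
        points = int(vote[0].get_text().split()[0])
        if points > 99:
            hn.append({
                "title": item.get_text(),
                "link": item.get("href"),
                "votes": points,
            })
    return sort_stories_by_vote(hn)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
result = create_custom_hn(soup.select(".titleline > a"), soup.select(".subtext"))
```

On this sample, Story B is dropped for having too few points and Story D is skipped because it has no score, leaving Story C followed by Story A.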
Finally, I passed the combined mega_links and mega_subtext lists to create_custom_hn and printed the result with pprint.pprint(). The output is a readable list of Hacker News articles sorted by votes, which makes it easy to spot the most popular articles on the site.
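For the printing step, here is a short sketch with a hard-coded list standing in for the value returned by create_custom_hn. The sample data is made up for illustration, and sort_dicts=False (available since Python 3.8) is an optional extra that keeps the keys in insertion order rather than sorting them alphabetically.

```python
import pprint

# Made-up stand-in for the list returned by create_custom_hn.
stories = [
    {"title": "Story C", "link": "https://example.com/c", "votes": 300},
    {"title": "Story A", "link": "https://example.com/a", "votes": 150},
]

# pprint breaks long structures across lines so each story is easy to read.
pprint.pprint(stories, sort_dicts=False)

# pformat returns the same text as a string instead of printing it.
formatted = pprint.pformat(stories, sort_dicts=False)
```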

Conclusion

This Python script gave me hands-on practice with web scraping: fetching pages, extracting the relevant elements, and organizing the results. It's a valuable skill for tasks like keeping up with popular news articles and for other applications involving web data extraction and analysis.
 

👋🏻 Let’s chat!

If you are interested in working with me, have a project in mind, or just want to say hello, please don't hesitate to contact me.

Find me here 👇🏻

Please do not steal my work. It took uncountable cups of coffee and sleepless nights. Thank you.