Description
During my Complete Python Developer course, I had the opportunity to dive into web scraping with Python by scraping the “Hacker News” website. The resulting script retrieves and displays information about news articles.
Process
First, I imported the necessary Python libraries to get started. These libraries were essential for making HTTP requests, parsing HTML, and formatting the output:
requests: This library allowed me to send HTTP requests to websites and retrieve their HTML content.
BeautifulSoup: I used BeautifulSoup for parsing the HTML content and navigating through the HTML document to extract specific elements from the web page.
pprint: I used pprint to format and print the final result for better readability.
Next, I sent HTTP GET requests to two pages on the Hacker News website: the main page and the second page. These requests retrieved the HTML content of these pages.
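As a minimal sketch of this fetching step: the two URLs below are assumptions (the write-up only says "the main page and the second page"), and fetch_html is a hypothetical helper name, not something from the original script.

```python
import requests

# Assumed URLs for the first and second Hacker News listing pages.
HN_PAGE_1 = "https://news.ycombinator.com/news"
HN_PAGE_2 = "https://news.ycombinator.com/news?p=2"


def fetch_html(url, timeout=10):
    """Send an HTTP GET request and return the response body as text."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx rather than parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html(HN_PAGE_1) and fetch_html(HN_PAGE_2) would give the two HTML strings that are parsed in the next step. The timeout and the raise_for_status() call are optional safeguards added for this sketch.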
With the HTML content in hand, I created two BeautifulSoup objects, named soup and soup2, to parse the HTML content obtained from the two web pages.
I then used the select() method on the soup and soup2 objects to select specific HTML elements from the pages. This allowed me to extract article titles and their links, as well as information about the articles' points (votes).
To combine the data from the two pages, I created two lists: mega_links (containing article titles and links) and mega_subtext (containing vote information).
To make the final output more meaningful, I defined a function called sort_stories_by_vote. This function takes a list of stories (hnlist) and sorts them in descending order of votes.
The core of the scraping process was the create_custom_hn function, which took in the lists of article titles and vote information, processed the data, and appended it to a list called hn. In this function, I kept only articles with more than 99 points and collected their titles, links, and vote counts. The function then returned a list of Hacker News articles, sorted by the number of votes in descending order.
Finally, to present the result in an organized and readable format, I used pprint.pprint() to pretty-print the result of calling the create_custom_hn function on the combined data from mega_links and mega_subtext. This resulted in a list of Hacker News articles sorted by the number of votes, making it easy to identify the most popular articles on the site.
Conclusion
This Python script allowed me to perform web scraping to extract and organize data from web pages. It's a valuable skill for tasks like staying updated on popular news articles or for various applications involving web data extraction and analysis.
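For reference, the parsing, filtering, and sorting steps described under Process could look roughly like the sketch below. It is an illustration under stated assumptions, not the original script: the sample HTML is a hypothetical, heavily simplified stand-in for the downloaded pages, and the CSS selectors (.titleline > a, .subtext, .score) are assumptions about the Hacker News markup, which may differ or change over time.

```python
import pprint

from bs4 import BeautifulSoup

# Hypothetical sample pages so the sketch runs offline; the class names
# are assumptions about the real Hacker News markup.
PAGE_1_HTML = """
<table>
  <tr><td><span class="titleline"><a href="https://example.com/a">Story A</a></span></td></tr>
  <tr><td class="subtext"><span class="score">150 points</span></td></tr>
  <tr><td><span class="titleline"><a href="https://example.com/b">Story B</a></span></td></tr>
  <tr><td class="subtext"><span class="score">42 points</span></td></tr>
</table>
"""
PAGE_2_HTML = """
<table>
  <tr><td><span class="titleline"><a href="https://example.com/c">Story C</a></span></td></tr>
  <tr><td class="subtext"></td></tr>
  <tr><td><span class="titleline"><a href="https://example.com/d">Story D</a></span></td></tr>
  <tr><td class="subtext"><span class="score">300 points</span></td></tr>
</table>
"""


def sort_stories_by_vote(hnlist):
    """Return the stories sorted by vote count, highest first."""
    return sorted(hnlist, key=lambda story: story["votes"], reverse=True)


def create_custom_hn(links, subtext):
    """Pair each title link with its subtext row and keep popular stories."""
    hn = []
    for link, sub in zip(links, subtext):
        score = sub.select(".score")
        if not score:
            # Rows without a score element are skipped.
            continue
        points = int(score[0].get_text().split()[0])  # "150 points" -> 150
        if points > 99:
            hn.append(
                {"title": link.get_text(), "link": link.get("href"), "votes": points}
            )
    return sort_stories_by_vote(hn)


soup = BeautifulSoup(PAGE_1_HTML, "html.parser")
soup2 = BeautifulSoup(PAGE_2_HTML, "html.parser")

# Combine the selected elements from both pages into single lists.
mega_links = soup.select(".titleline > a") + soup2.select(".titleline > a")
mega_subtext = soup.select(".subtext") + soup2.select(".subtext")

top_stories = create_custom_hn(mega_links, mega_subtext)
pprint.pprint(top_stories)
```

With this sample data, Story B (42 points) and Story C (no score) are dropped, and Story D (300 votes) is listed before Story A (150 votes). Pairing links and subtext rows by position relies on the two lists staying aligned, which is an assumption of this sketch.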