It’s wonderful when someone has gone to the effort to expose their site’s data through an API.
But not everyone can (or wants to) do that.
No problem. Let’s code up a web scraper that can automate the process of manually using the site.
Scrape a Wikipedia page of your choosing and record which passages need citations.
In scraper.py, create a function named get_citations_needed_count.
Create a function named get_citations_needed_report.
For example:
The first people to settle in Mexico encountered a climate far milder than the current one. In particular, the Valley of Mexico contained several large paleo-lakes (known collectively as Lake Texcoco) surrounded by dense forest. Deer were found in this area, but most fauna were small land animals and fish and other lacustrine animals were found in the lake region.[citation needed][6] Such conditions encouraged the initial pursuit of a hunter-gatherer existence.
The Mexica people arrived in the Valley of Mexico in 1248 AD. They had migrated from the deserts north of the Rio Grande[citation needed] over a period traditionally said to have been 100 years. They may have thought of themselves as the heirs to the prestigious civilizations that had preceded them.[citation needed] What the Aztec initially lacked in political power, they made up for with ambition and military skill. In 1325, they established the biggest city in the world at that time, Tenochtitlan.
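As a sketch, the two functions might look like the following once the page HTML is in hand. Fetching is left to Playwright (or another client); these functions just take the HTML string. Matching "citation needed" as literal text is a simplifying assumption about Wikipedia's markup, where the marker actually lives inside sup tags.

```python
import re

MARKER = "citation needed"

def get_citations_needed_count(html: str) -> int:
    """Return how many [citation needed] markers appear in the HTML."""
    return html.lower().count(MARKER)

def get_citations_needed_report(html: str) -> str:
    """Return the text of each paragraph containing a marker, one per line."""
    flagged = []
    for para in re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL):
        if MARKER in para.lower():
            # Strip any remaining inline tags for readability.
            flagged.append(re.sub(r"<[^>]+>", "", para).strip())
    return "\n".join(flagged)
```

A production version would use a real HTML parser rather than regexes, but this captures the shape of the assignment: count the markers, then report the passages that contain them.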
Stretch goal: a function named get_citations_needed_by_section.
Repository name: web-scraper.
Code challenge: Implement a Queue using two Stacks.
CHOOSE ONE VIDEO
The next two videos use the requests_html library instead of Playwright, but they are too cool to leave out.
Statement on why this topic matters as it relates to what I’m studying in this module:
Web scraping allows you to extract data from websites, turning unstructured web data into a structured format that can be analyzed and used for various purposes. Python Playwright and XPath help automate the interaction with web pages, making the data extraction process efficient and accurate.
What are the key differences between scraping static and dynamic websites?
“It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.” (Wikipedia)
HTML Source Code: a static site ships its full content in the initial HTML response; a dynamic site often returns a near-empty skeleton that JavaScript fills in afterward.
Javascript Execution: static pages can be scraped with a plain HTTP client; dynamic pages need a tool that actually executes JavaScript (such as Playwright) before the data appears.
Parsing Complexity: static HTML can be parsed directly; dynamic pages may require waiting for elements to render or intercepting background API calls.
Data Accessibility: on static sites the data is right in the page source; on dynamic sites it may exist only in the rendered DOM or in network responses.
Page Load Speed: static pages are fast to fetch; dynamic pages add rendering time, which slows a scraper down.
Caching: static content caches well, so repeated fetches are cheap; dynamic content is often personalized or frequently updated, making caching less reliable.
User Experience: dynamic sites feel more interactive to users, but that same interactivity (infinite scroll, lazy loading) is exactly what complicates scraping.
Explain at least three techniques or best practices that can be employed to avoid getting blocked while scraping websites.
Respect Robots.txt: check the site’s robots.txt file before crawling and only fetch paths it allows; it states which parts of the site the owner permits bots to access.
Throttle Crawling Speed: add delays (ideally randomized) between requests so your scraper does not hammer the server or stand out from normal traffic.
Rotate IPs and User Agents: vary the IP address (for example via proxies) and the User-Agent header across requests so the traffic does not all appear to come from a single automated client.
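A minimal sketch of these practices using only the standard library. The User-Agent strings and delay range are illustrative placeholders, and in production you would fetch robots.txt from the site itself rather than pass its lines in by hand.

```python
import random
import time
from urllib import robotparser

# Illustrative User-Agent strings to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def polite_headers() -> dict:
    """Pick a random User-Agent so requests don't all look identical."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def throttle(delay_range=(1.0, 3.0)) -> None:
    """Sleep a random interval between requests to avoid hammering the server."""
    time.sleep(random.uniform(*delay_range))

def allowed_by_robots(robots_lines, url, agent="*") -> bool:
    """Check robots.txt rules before fetching a URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(agent, url)
```

Calling throttle() between page fetches and allowed_by_robots() before each one covers the first two techniques; IP rotation would additionally require a proxy pool, which is beyond a stdlib sketch.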
What is Playwright, and how does it assist in web scraping tasks? Provide an example of a use case where Playwright would be particularly beneficial.
Playwright is a browser automation tool that can be used for web scraping, testing, and automating browser tasks. It provides a Python API (as well as APIs for other programming languages) to interact with web pages, simulate user actions, and extract information. It assists in web scraping tasks by providing a high-level, user-friendly API to control browsers without the need for external drivers. It handles browser automation more efficiently and allows for easy switching between different browser engines.
A good use case is scraping dynamic websites that rely heavily on JavaScript to render content. Playwright can interact with pages the way a real user would, which lets it handle dynamic content and makes it suitable for scraping modern web applications where traditional scraping methods fall short.
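A minimal sketch of that use case. It assumes the playwright package is installed and browsers have been set up with the playwright install command; the URL passed in is whatever page you want rendered.

```python
def scrape_rendered_page(url: str) -> str:
    """Return the fully rendered HTML of a JavaScript-heavy page."""
    # Imported lazily so the module loads even without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Wait until JS-driven network activity settles before reading the DOM.
        page.wait_for_load_state("networkidle")
        html = page.content()
        browser.close()
    return html
```

The returned HTML reflects the DOM after JavaScript has run, so it can be handed to any parser just as if the site had been static.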
Describe the purpose of using Xpath in web scraping, and provide an example of an Xpath expression to select a specific HTML element from a webpage.
XPath (XML Path Language) is a query language for navigating and selecting nodes in XML documents, and by extension HTML documents.
Example of an XPath expression to select a specific HTML element:
The goal is to select the first <a> element with the class attribute set to “navigation-link” from the provided HTML:
<div class="main-navigation">
<a href="/home" class="navigation-link">Home</a>
<a href="/tutorials" class="navigation-link">Tutorials</a>
<a href="/search" class="navigation-link">Search</a>
</div>
XPath Expression:
//div[@class='main-navigation']/a[@class='navigation-link'][1]
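The expression can be checked offline with Python’s standard library. Note that xml.etree supports only a subset of XPath, so find(), which returns the first match anyway, stands in for the [1] position predicate.

```python
import xml.etree.ElementTree as ET

snippet = """
<body>
  <div class="main-navigation">
    <a href="/home" class="navigation-link">Home</a>
    <a href="/tutorials" class="navigation-link">Tutorials</a>
    <a href="/search" class="navigation-link">Search</a>
  </div>
</body>
"""

root = ET.fromstring(snippet)
# find() returns the first matching element, mirroring the [1] predicate.
first_link = root.find("./div[@class='main-navigation']/a[@class='navigation-link']")
print(first_link.text)  # Home
```

A full XPath engine such as lxml, or Playwright’s own locator("xpath=...") syntax, would accept the original expression verbatim.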
Explanation:
//div[@class='main-navigation']: Locate the <div> element with the class attribute set to “main-navigation.”
/a[@class='navigation-link']: Traverse to the <a> elements with the class attribute set to “navigation-link” within the selected <div>.
[1]: Select the first matching <a> element.
Write a brief reflection on your learning today, or use the prompt below to get started.
For this one, let’s go meta: What have you learned from the process of keeping a learning journal? Consider what you have learned about learning and what you have learned about yourself.