Web Scraping



Web Scraping with Python

What is web scraping? It's exactly what it sounds like: scraping webpages on the internet for data. That definition is vague because web scrapers have many different uses, and there are many different forms of information on the internet that we can collect. Maybe you want to gather news articles about a specific event from multiple news sites into one place. Perhaps you want to scrape and download free PDFs from a website rather than having to click each link to download them.

You can have a lot of fun with web scraping, and Python makes it very easy to do!

This module assumes a basic understanding of programming and Python.

Further information on the requests library. Further information on the Beautiful Soup library.

Disclaimer:

Web scraping can involve legal and ethical considerations. Many websites have terms of service (ToS) that explicitly prohibit scraping. Violating these terms could lead to legal consequences under laws like the Computer Fraud and Abuse Act (CFAA) in the U.S. or equivalent laws in other countries. Additionally, scraping personal or sensitive data may violate privacy laws like GDPR (Europe) or CCPA (California). Always check a website's ToS and ensure compliance with local laws before scraping.


Challenges

GET requests

A GET request is the most common request sent over the internet. Any time you browse to a webpage, say 'www.google.com', your browser sends a GET request to Google's servers. Your browser is essentially asking, "Hey, can I GET the data at www.google.com?" The server then processes that GET request and sends back the data you asked for. This data arrives at your browser as HTML/CSS markup that the browser reads and renders into the famous search page we all know and love.

All of this happens when we type a website into our URL bar and hit enter. That's easy enough, but how do we send a GET request with Python?

Python Requests Library

If you're familiar with Python, you may know that it has access to a large number of 'libraries' that users can integrate into their own code. These libraries provide functionality that extends Python in useful ways. For instance, the 'math' library gives the user access to more advanced mathematical operations. Here we will be using a library called 'requests', which extends Python's capabilities to perform web requests!

Now, the 'math' library comes included with Python when you first install it; 'requests' does not. In a normal environment you would have to install 'requests' yourself by running the command pip install requests. However, in this environment 'requests' is already installed for you!

In this challenge you will have to use Python to send a GET request to http://localhost.

Use this example to perform your GET request:

# Import the requests library so we can use it
import requests

# Create a variable named response to store the data we
# receive from calling requests.get
response = requests.get('ENTER YOUR URL HERE')

# Print the website data we were sent back
print(response.text)

Real HTML Output

In this challenge you will familiarize yourself with what actual website data looks like. In the last challenge you received the flag as plain text; now you will receive real website data and have to look through it to find your flag.

In this challenge you will have to use Python to send a GET request to http://localhost. After that, print the website data and look through it to find the flag.
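One convenient way to pick a flag out of a large HTML response is to search the text with a regular expression. Below is a minimal sketch, assuming the flag follows the common CTF pattern flag{...} (the actual format on your site may differ), with a hard-coded sample standing in for response.text:

```python
import re

# Stand-in for response.text from your GET request
html = """
<html><body>
<p>Welcome to the challenge page.</p>
<p>flag{example_flag_value}</p>
</body></html>
"""

# Search the raw HTML for a flag-shaped string
match = re.search(r"flag\{[^}]+\}", html)
if match:
    print(match.group(0))  # flag{example_flag_value}
else:
    print("No flag found")
```

If the flag format differs, adjust the pattern accordingly; you can also simply print response.text and scan it by eye.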

POST requests

A POST request is one of the most commonly used HTTP methods for sending data to a server. Instead of asking for information (as in a GET request), a POST request sends data to the server to be processed. This is typically used when you submit a form on a website, such as logging in, creating a new account, or posting a comment. For instance, when you fill in your details and hit "Sign Up" on a website, your browser sends a POST request to the server with all your information.

Think of a POST request as saying, "Here’s some data; please handle it." The server receives the data, processes it (maybe saves it in a database), and often sends back a response to confirm that the data was received successfully or to give you more information.

In Python, you specify your POST data using a dictionary. You will also need to use the post method from the requests library, as follows:

# Import the requests library so we can use it
import requests

# Create POST data dictionary
data = {'username': 'wkiffin', 'password': 'Password1!'}

# Create a variable named response to store the data we
# receive from calling requests.post. Pass our data through
# the 'json' argument
response = requests.post('ENTER YOUR URL HERE', json=data)

# Print the website data we were sent back
print(response.text)

In this challenge, you have to send a POST request to http://localhost with any data you want.

Beautifulsoup4

Using the requests library, we can get website data that we can then manipulate in our programs. However, as you witnessed in the third requests challenge, the data we get from websites is filled with HTML tags that clutter the data we really want. Thankfully, a library called beautifulsoup4 allows us to scan through this data, find particular tags, and grab just the data we want.

We will only scratch the surface of Beautiful Soup in these beginner challenges. To dive deeper, visit the Beautiful Soup documentation page.

The Basics

from bs4 import BeautifulSoup

# Create a 'Soup' object that we use to parse HTML
soup = BeautifulSoup(WEBSITE_DATA, 'html.parser')

# We can use this soup object to find all
# occurrences of a particular tag.

links = soup.find_all('a') # This returns a list containing ALL anchor tags.

# Here is how you could go about scraping all
# of the actual links from those anchor tags.
for link in links:
	print(link.get('href'))

NOTE: If Beautiful Soup is not able to find any tags that you specify, find_all returns an empty list (the find method, which returns a single tag, returns None instead). This lets you test whether you found any tags with the following if statement:

if not links:
	# Code that runs if we DON'T find links
else:
	# Code that runs if we DO find links
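Assuming standard beautifulsoup4 behavior, here is a quick self-contained check of what each lookup method returns when nothing matches:

```python
from bs4 import BeautifulSoup

# A page with no anchor tags at all
soup = BeautifulSoup("<p>No links here</p>", "html.parser")

links = soup.find_all("a")  # empty list, not None
first = soup.find("a")      # None

print(links)  # []
print(first)  # None
```

This is why testing the find_all result with a truthiness check works: an empty list is falsy in Python.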

In your first Beautiful Soup challenge you will need to tell me exactly how many links are in the HTML code you receive through a GET request. Then, through a POST request, you will send back the answer as {'answer': <integer>}, where <integer> is the number of 'a' tags.
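The whole flow can be sketched as below, using a hard-coded sample in place of the HTML you will actually receive (swap in response.text from your GET request, and http://localhost for the real URL):

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML you receive; replace with response.text
html = '<a href="/one">1</a><a href="/two">2</a><p>not a link</p>'

soup = BeautifulSoup(html, "html.parser")
count = len(soup.find_all("a"))  # number of anchor tags

answer = {"answer": count}
print(answer)
# Then POST it back, e.g.:
# requests.post("http://localhost", json=answer)
```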

In the last challenge you practiced finding generic HTML tags. In this challenge we will practice getting tags that have a specific "class" or "id" attribute attached to them. This looks very similar to how we found tags before. Now, instead of supplying the particular tag we want to find ('p', 'a', 'span', etc.), we supply the name of the class or id we want to find. However, we need to let Python know whether we're looking for a class or an id, and we do this by using the CSS syntax that identifies them. This means that when we search for a class named "button" we pass the string ".button", where the '.' denotes a class. If we want to find a tag with the id "submit" we pass the string "#submit", where the '#' denotes an id.
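In Beautiful Soup, CSS selector strings like these are passed to select (which returns a list of matches) or select_one (which returns the first match, or None). A small sketch using the hypothetical class and id names from above:

```python
from bs4 import BeautifulSoup

html = """
<div class="button">Click me</div>
<div class="button">Me too</div>
<span id="submit">Send</span>
"""

soup = BeautifulSoup(html, "html.parser")

buttons = soup.select(".button")     # all tags with class "button"
submit = soup.select_one("#submit")  # the tag with id "submit"

print(len(buttons))       # 2
print(submit.get_text())  # Send
```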

In this challenge you will need to find out how many tags belong to the "secret" class. You will also need to get the content of the tag that has an id of "findme". You will then send your answer as a POST request with the following data:

{'answer1': NUMBER OF TAGS WITH SPECIFIC CLASS, 'answer2': STRING FROM TAG WITH SPECIFIED ID}

Get to it!

Now that you know how to use Beautiful Soup to grab desired tags and information from a page, we will test your skills. In this challenge you will be presented with HTML code that represents a table. The table lists many items, each with a name, product category, price, and product number. Using requests to get the HTML code from the webpage and Beautiful Soup to parse that data, you must:

  1. Tell me how many products are within the price range of $20 - $50.
  2. Get the Product Number for the "Kitchen Knife Set".
  3. Tell me what the cheapest product in the table is.

You can format your data as follows:

data = {'answer1': INTEGER, 'answer2': STRING OR INTEGER, 'answer3': STRING}

This challenge is relatively difficult and will test what you've learned about requests, Beautiful Soup, and Python in general. It may take some time to complete, and you may need to use outside resources.

HINT: This Stack Overflow article gives a great example of how to parse out data from a table.
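The general pattern for parsing a table is to walk its rows and cells. Here is a hedged sketch under assumed column names and layout (the real table you receive will differ; adapt the cell order and price format to what you actually see):

```python
from bs4 import BeautifulSoup

# Stand-in for the page's HTML; replace with response.text
html = """
<table>
  <tr><th>Name</th><th>Category</th><th>Price</th><th>Product Number</th></tr>
  <tr><td>Kitchen Knife Set</td><td>Kitchen</td><td>$34.99</td><td>4412</td></tr>
  <tr><td>Desk Lamp</td><td>Office</td><td>$12.50</td><td>7781</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text() for td in tr.find_all("td")]
    name, category, price, number = cells
    rows.append({"name": name,
                 "category": category,
                 "price": float(price.lstrip("$")),
                 "number": number})

in_range = [r for r in rows if 20 <= r["price"] <= 50]
cheapest = min(rows, key=lambda r: r["price"])

print(len(in_range))     # how many products fall in the $20-$50 range
print(cheapest["name"])  # the cheapest product
```

From the rows list you can pull out whichever fields each answer requires, then send them back with requests.post as in the earlier challenges.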

