Sometimes you want to access data on the web that isn't easily available through an API. For that, web scraping is a viable alternative. Web scraping is, in essence, a way to programmatically visit a website as if you were a browser and fetch the data that way. This tutorial will guide you through the process of web scraping using Python 3 and two libraries, BeautifulSoup and Requests. As an example, we will build a command-line application that searches for movies and TV shows on IMDB and prints their ratings.
Project setup
To follow along with the tutorial you will need to have the following:
- Python installed
- Some basic knowledge of Python and some basic HTML/CSS knowledge
Alternatively, you can check out my blog post on using dev containers and follow that guide, but select the Python 3 container spec when creating the dev container. If you can't get the dev container approach to work, you can refer to the repository for this blog post.
Let's begin by creating a folder for our project and installing the required dependencies:
mkdir web-scrape
cd web-scrape
pip3 install beautifulsoup4 requests
- beautifulsoup4 lets us easily access DOM elements programmatically
- requests gives us a nice and easy-to-use interface to make HTTP requests
Researching the application's lifecycle
As stated in the introduction, we're going to search IMDB and show the rating for a selected title. Our first step is to go through these steps in a browser and see what the pages look like.
Visit www.imdb.com in your browser and search for a movie of your choice. I searched for Jurassic Park. You should be greeted with something like the following image:
Once you're on this page, inspect one of the rows in the "Titles" table by right-clicking it and pressing "Inspect element". This varies from browser to browser, but in Firefox it looks like this:
You might also have noticed that something opened up at the bottom of your browser. That section is commonly referred to as the dev tools. We'll use it now to figure out the structure of the data that we're after. At the time of writing, it looks like this:
What's displayed here is the DOM structure. The highlighted row is what I selected when taking the "Inspect element" action earlier. This shows us how all the HTML nodes relate to each other and where in the DOM the data we're after is located. Since we want to list these search results in our application, we need to know where each result sits in that structure.
We can see that the data we're interested in is located inside the <td> element with the class result_text. The name of the title is wrapped in an <a> element, with the href holding the relative link to the title. This is all the data we need from this page.
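To make that structure concrete, here is a minimal sketch that runs the lookup against a hand-written HTML snippet instead of the live site. The snippet and its href values are assumptions made up for illustration; they only mimic the structure described above (a table with the class findList containing td elements with the class result_text), and the real IMDB markup carries more attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the search results markup described above.
# The href values are placeholders, not real IMDB links.
SAMPLE_SEARCH_HTML = """
<table class="findList">
  <tr>
    <td class="primary_photo"><a href="/title/tt-example-1/">poster</a></td>
    <td class="result_text"><a href="/title/tt-example-1/">Jurassic Park</a> (1993)</td>
  </tr>
  <tr>
    <td class="result_text"><a href="/title/tt-example-2/">Jurassic World</a> (2015)</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_SEARCH_HTML, "html.parser")

# Only the td elements with the class result_text are of interest
cells = soup.select("table.findList td.result_text")

# cell.a is the first <a> inside the cell; its href is the relative link
sample_results = [
    {"title": cell.get_text().strip(), "href": cell.a["href"]} for cell in cells
]

for result in sample_results:
    print(result["title"], "->", result["href"])
```

Note that the text of the cell includes everything inside the td, so the year in parentheses comes along with the title.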
Let's follow the link to this title to see what the next page we need to tackle looks like. Click the link to one of the search results and you should end up on the page for the movie or show you picked:
When on this page, let's repeat the step and "Inspect element" on the rating display:
Here we can see that for this page the data we're interested in is located inside a span element wrapped by a strong element that is in turn wrapped by a div element with the class ratingValue.
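The nesting can be tried out on a small hand-written snippet before touching the real page. The markup and the rating value below are made up for illustration and only follow the div, strong, span structure described above; the extra sibling spans are an assumption added to show which span gets picked.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the rating markup described above.
# The rating value is a placeholder, not fetched from IMDB.
SAMPLE_TITLE_HTML = """
<div class="ratingValue">
  <strong>
    <span>8.2</span>
  </strong>
  <span class="grey">/</span><span class="grey">10</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_TITLE_HTML, "html.parser")

# Select by class and take the first match
rating_div = soup.select(".ratingValue")[0]

# .span returns the first <span> descendant in document order,
# which here is the one wrapped by <strong>
sample_rating = rating_div.span.get_text()
print(sample_rating)
```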
Now that we have gathered all the information we need for our app, let's proceed to the coding part.
Searching on IMDB and listing the titles
Create a file named scrape.py. This will be our only file for this project. In it, we will begin by importing the required dependencies and adding search functionality:
# scrape.py
import requests
from bs4 import BeautifulSoup


def search(search_term):
    # Make the search request to IMDB
    response = requests.get(f"https://www.imdb.com/find?q={search_term}")
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    # Find the table with the class findList
    table = soup.find("table", {"class": "findList"})
    # Use CSS selector syntax to get all td elements from the table with the class result_text
    rows = table.select("tr td.result_text")
    # Construct a list with the search results, store the title and the href in dicts
    return [{"title": row.get_text().strip(), "href": row.a['href']} for row in rows]
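One thing to watch for is that the search term goes into the URL exactly as typed. Characters such as & have a meaning of their own in a query string, so a term like "Fast & Furious" would likely be cut off at the ampersand. A minimal sketch of encoding the term with the standard library follows; build_search_url is a hypothetical helper name and not part of the code above.

```python
from urllib.parse import urlencode


def build_search_url(search_term):
    # build_search_url is a hypothetical helper used for illustration only.
    # urlencode escapes spaces and reserved characters in the term.
    query = urlencode({"q": search_term})
    return f"https://www.imdb.com/find?{query}"


print(build_search_url("Jurassic Park"))   # https://www.imdb.com/find?q=Jurassic+Park
print(build_search_url("Fast & Furious"))  # https://www.imdb.com/find?q=Fast+%26+Furious
```

Requests can also do this encoding for you if the term is passed through the params argument, as in requests.get("https://www.imdb.com/find", params={"q": search_term}).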
Let's add another function to this file where we'll place our user interaction code:
def run():
    search_term = input("Search IMDB: ")
    results = search(search_term)
    num_results = len(results)
    print(f"Found {num_results} results:")
    # Use built-in function enumerate to access the index variable i
    for i, result in enumerate(results):
        print(f"({i+1}) {result['title']}")


# Don't forget this line! We have to call the run function or nothing will happen when we run our program.
run()
Save the file and we're ready to try our app to see what it looks like right now. Run it from the terminal:
python3 scrape.py
Enter a search term and something similar should show up:
You might notice that selecting a result does nothing at the moment. That's because we've only implemented half the logic. It's always good to test that what you've got so far is working, and if you're seeing search results in your terminal then you're good to continue.
Printing the rating for the selected title
We need some more logic to request the next page and print the value from the span element we identified earlier. You may have noticed that in the search function we defined, we're returning the hrefs but have yet to use them. Let's incorporate them now.
In scrape.py, add another function:
def get_rating(href):
    # href passed in here should be from what we found earlier, the href from the <a> tag that the title was wrapped in
    response = requests.get(f"https://www.imdb.com{href}")
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    # Select by CSS selector for .ratingValue class and get the first result (index 0), we only expect there to be one
    rating = soup.select(".ratingValue")[0].span.get_text()
    return rating
And let's use it in our run function:
def run():
    # Main interaction
    search_term = input("Search IMDB: ")
    results = search(search_term)
    num_results = len(results)
    print(f"Found {num_results} results:")
    for i, result in enumerate(results):
        print(f"({i+1}) {result['title']}")
    # Convert to int and subtract one to undo the addition to the index in the above loop
    selection = int(input(f"Select by entering a number (1-{num_results}): ")) - 1
    selected_result = results[selection]
    # Pass in the href to the title we want to get the rating for
    rating = get_rating(selected_result["href"])
    print(f"{selected_result['title']} has a rating of {rating}!")


# Again, make sure you call this in scrape.py
run()
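The selection step assumes the user types a valid number. A non-numeric entry raises a ValueError, and entering 0 quietly picks the last result because the index becomes -1. If you want something more forgiving, here is a minimal sketch of a parser that could be called in a loop until it returns an index. parse_selection is a hypothetical helper and not part of the app above.

```python
def parse_selection(raw, num_results):
    """Return a zero-based index for valid input, or None otherwise.

    parse_selection is a hypothetical helper used for illustration only.
    """
    try:
        number = int(raw.strip())
    except ValueError:
        # Input such as "abc" or an empty string is not a number
        return None
    # Only accept numbers that were actually shown in the list
    if 1 <= number <= num_results:
        return number - 1
    return None


print(parse_selection("2", 5))    # 1
print(parse_selection("0", 5))    # None
print(parse_selection("abc", 5))  # None
```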
Running our app now results in the following behavior:
Summary
If you can use an API to get hold of the data that you're after, you should always do so. It's faster and less prone to errors. Scraping the web can get messy depending on the structure of the DOM. As you may have already figured out, if something were to change in the layout of IMDB, it could break our app.
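To soften that kind of breakage, the lookup can check whether the expected element exists before reading from it. The sketch below is one possible approach, not the code from this post. extract_rating is a hypothetical helper, and both HTML strings are made up, with the second standing in for an imagined redesign.

```python
from bs4 import BeautifulSoup


def extract_rating(html):
    # extract_rating is a hypothetical helper used for illustration only.
    # It returns None instead of raising when the expected markup is missing.
    soup = BeautifulSoup(html, "html.parser")
    span = soup.select_one(".ratingValue span")
    if span is None:
        return None
    return span.get_text().strip()


# Made-up markup following the structure found earlier in the dev tools
old_layout = '<div class="ratingValue"><strong><span>8.2</span></strong></div>'
# Made-up markup standing in for a changed layout
new_layout = '<div class="rating-box"><b>8.2</b></div>'

print(extract_rating(old_layout))  # 8.2
print(extract_rating(new_layout))  # None
```

The caller can then print a friendly message when the result is None rather than crashing with an IndexError.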
I hope you learned something new and that you find the information provided here useful. Maybe you were looking to use data from some site in your own project? Go ahead and try it out 😊.
Feel free to ask any questions.
Enjoy! Here is the associated repository for this post.