Python 101 - How to Scrape a Website
The Internet is the host of much of the world’s information, both past and present. You can find history, news, comics, and much more on the Internet. As a software developer, you might want to gain access to the troves of data that exist on the Internet. Some websites provide a free or paid Application Programming Interface (API) that you can use to programmatically access their data. However, most websites do not offer an API, so you must resort to scraping them to gain programmatic access to the information they provide.
Scraping a website means fetching a page's HTML content from the Internet, parsing it, and extracting the bits and pieces that interest you. In this tutorial you will learn about the following:
Rules for Web Scraping
Preparing to Scrape a Website
Scraping a Website
Downloading a File
There are several web scraping packages for Python. The most popular are Beautiful Soup and Scrapy. This tutorial will focus on Beautiful Soup.
Let’s get started!
Rules for Web Scraping
Most websites have rules regarding their content. Sometimes it’s just copyright information that you will need to abide by. Whatever the rules are, here are some tips to keep in mind:
Always check the terms and conditions of a website before you scrape it. Violating the terms can land you in legal trouble!
Commercial websites usually have limits on how often you can scrape and what you can scrape.
Your application can access a site much faster than a human can, so don’t access a site too often in a short amount of time. Doing so can cause the website to slow down or fail, and may be illegal.
Websites change constantly, so you can expect your scraper to fail some day too.
When scraping data, you need to realize that you will get a lot of data you don’t care about. Be prepared to do a lot of data cleaning to extract the information that is relevant for you.
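One simple way to avoid hitting a site too often is to pause between requests. The sketch below shows that idea with a small helper. Throttle is a hypothetical class written for this example, not part of any library, and the interval you pass in is your own choice; check the site's terms for its actual limits.

```python
import time


class Throttle:
    """Hypothetical helper that enforces a minimum delay between requests."""

    def __init__(self, min_interval):
        # Minimum number of seconds between two consecutive requests
        self.min_interval = min_interval
        self._last_request = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        if self._last_request is not None:
            elapsed = time.monotonic() - self._last_request
            remaining = self.min_interval - elapsed
            if remaining > 0:
                time.sleep(remaining)
        # Remember when this request was allowed to proceed
        self._last_request = time.monotonic()
```

You would create one instance, for example `Throttle(1.0)`, and call its `wait()` method right before each download.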
Now let’s get set up so you can start scraping!
Preparing to Scrape a Website
Beautiful Soup is one of the most popular web scraping packages. It is not included with Python, so you will need to install it with pip:
pip install beautifulsoup4
Beautiful Soup needs something to parse, which means you need to have a way to download a web page. You can do that using any of the following:
urllib - comes with Python
requests - a popular 3rd party Python package
httpx - another popular 3rd party Python package
The latter two are easier to use than Python’s own library, but that also means that you have to go through an extra step of installing one or more new packages. For the purposes of this tutorial, you will use urllib. However, if you need to use authentication with a web page before you can download, then you should look at one of those other packages as they will make that much easier.
There is one crucial tip to keep in mind when it comes to scraping a web page: Your web browser can help you. Most web browsers come with developer tools built-in that you can use to inspect websites. The path to open up those tools is slightly different across browsers, though. Let’s look at how that works by opening up my blog in Google Chrome:
https://www.blog.pythonlibrary.org
Then right-click anywhere on the web page and choose the Inspect option.
Alternatively, you can also open up the developer tools through the menu View → Developer → Developer Tools. After choosing to inspect an item on the web page, Google Chrome will open up a sidebar on the right of your browser that shows the page's HTML.
You can now select elements within the sidebar and your browser will highlight the relevant parts of the page on the left.
Mozilla Firefox has a very similar interface except that instead of a sidebar, it appears along the bottom of your browser. Try both of these tools and familiarize yourself with their functions. You will see that developer tools are very similar to each other, no matter which modern browser you are working with. Once you’ve gained some understanding of how your browser’s developer tools work and what they offer, you’ll be able to use them to scrape a website much more effectively.
After understanding what tools you can use to inspect and learn about your website’s structure, you are ready to start scraping it.
Scraping a Website
Let’s pretend that you have been tasked with getting all the current titles and links to my blog. This is a common task when you are building a website that aggregates data from other sites.
The first step is to determine how to download the main page’s HTML. Here is some example code:
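A minimal sketch of such a snippet, using only urllib, might look like the following. The 10-second timeout, the UTF-8 decoding, and the error handling are assumptions added here so that the script fails gracefully; urllib does not require any of them.

```python
import urllib.request

url = "https://www.blog.pythonlibrary.org"

try:
    # urlopen() returns a file-like response object
    with urllib.request.urlopen(url, timeout=10) as response:
        # read() returns bytes, so decode them into a string
        html = response.read().decode("utf-8", errors="replace")
except OSError as error:
    # URLError, HTTPError, and timeouts are all subclasses of OSError
    html = ""
    print(f"Could not download {url}: {error}")
else:
    print(f"Downloaded {len(html)} characters")
```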
This short code snippet will make a request to my server, fetch the website's HTML, and store it in the variable html. This is a nice little script, but not very reusable. You can turn this code snippet into a function. Open up a new file named scraper.py and add the following:
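One way to write that function is sketched below. The name download_html() is the one used later in this tutorial; decoding the response as UTF-8 is an assumption about the page's encoding.

```python
import urllib.request


def download_html(url):
    """Fetch url and return its content as a string."""
    with urllib.request.urlopen(url) as response:
        # read() returns bytes; UTF-8 is assumed here
        return response.read().decode("utf-8")
```

To try it out, call `download_html("https://www.blog.pythonlibrary.org")` and print the result.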
When you run this code, it should return the HTML for my blog's main page. If you get an SSL: CERTIFICATE_VERIFY_FAILED on macOS, then you will need to go to where you installed Python and run the Install Certificates.command to fix that issue. You can read more about resolving this issue at the following link:
If you use the techniques from the previous section to inspect the titles of the articles on my blog, you will see that they are contained in “h1” tags. With that in mind, you can update the scraper.py program to import BeautifulSoup and search for all the “h1” tags:
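The updated scraper.py might look something like the sketch below. It assumes that article links sit inside “h1” tags, as described above, and it uses Python's built-in html.parser as the Beautiful Soup backend. Both assumptions may stop holding if the blog's layout changes.

```python
import urllib.request

from bs4 import BeautifulSoup


def download_html(url):
    """Fetch url and return its content as a string (UTF-8 assumed)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def scraper(url):
    """Print and return a mapping of article links to article titles."""
    html = download_html(url)
    soup = BeautifulSoup(html, "html.parser")

    articles = {}
    # find_all() is the current name; findAll() is an older alias
    for heading in soup.find_all("h1"):
        link = heading.a  # first <a> inside the heading, or None
        if link is not None and link.get("href"):
            # The hyperlink is the key, the title text is the value
            articles[link["href"]] = heading.get_text(strip=True)

    for link, title in articles.items():
        print(f"{title} - {link}")
    return articles
```

Calling `scraper("https://www.blog.pythonlibrary.org")` would then print one line per linked title, provided the page still uses this structure.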
In this code, you add a new import for BeautifulSoup and a new function: scraper(). You use scraper() to call download_html() and then parse the HTML with BeautifulSoup(). Next, you use find_all() (findAll() is an older alias for the same method) to search for all the “h1” tags. This returns a ResultSet object, which you can iterate over.
If an item in the result set has an “a” attribute that is not None, this means that the “h1” title element contains an HTML link element, which looks similar to this: <a href="url">Link Name</a>. Open up your browser’s developer tools and verify that you can see these link elements nested in some of the titles. If your code discovers an “a” element in your title, the element will normally also have an HTML attribute called href. This is the attribute on a link element that contains the URL value that you are interested in.
You can use that information to grab the hyperlink itself and make it into a key for your articles dictionary. Then set the value to the title of the article. Finally, you loop over the articles and print out the title and corresponding link.
Give it a try and see what the output is. If you would like a challenge, try to figure out how to scrape all the links on the page instead of only the hyperlinks that are nested in “h1” headings.
Now let’s move on and learn how to download a file from the Internet!
Downloading a File
In the previous section, you learned how to download the HTML of a web page. However, web pages host much more than HTML. They can also contain other types of content. For example, they can contain images, PDFs, Excel documents, and much more. If your browser can download a file, then there is some way for Python to do so too!
As you know, the Python programming language comes with a module named urllib. You can use urllib for downloading files. If you need to log in to a website before downloading a file, it may be worth looking at a 3rd party module, such as requests or httpx, because they make working with credentials much easier. The urllib library works for these, too, but it takes significantly more code.
Let’s find out how you can use the urllib module to download a binary file:
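A minimal sketch of that approach follows. The function name download_file() is hypothetical, and you have to supply the URL of the file you actually want. Note that the Python documentation lists urlretrieve() under its legacy interface, so it may be deprecated at some point.

```python
import urllib.request


def download_file(url, filename="code.zip"):
    """Download url, save it as filename, and return the saved path."""
    # urlretrieve() returns a (local_filename, headers) tuple
    local_filename, headers = urllib.request.urlretrieve(url, filename)
    return local_filename
```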
In this example, you use urllib.request.urlretrieve() to download the specified URL. This function takes as input the URL to download, as well as the path to save the content. You use code.zip as the name of the output file here.
There is yet another way to download a file using urllib.request.urlopen(). Here’s an example:
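The sketch below shows one way to do this. The function name download_with_urlopen() is hypothetical, and the URL and output filename are up to you. It reads the whole response into memory, which is fine for small files.

```python
import urllib.request


def download_with_urlopen(url, filename):
    """Download url into filename and return the number of bytes written."""
    with urllib.request.urlopen(url) as response:
        # The response is file-like, so read() returns its raw bytes
        data = response.read()

    # "wb" opens the output file for writing in binary mode
    with open(filename, "wb") as output_file:
        output_file.write(data)
    return len(data)
```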
When you open a URL using urllib.request.urlopen(), it returns a file-like object that you can use to read() the file. Then you can create a file on your computer using Python’s open() function and write that file out. Since this is a binary file, you will need to open the file for writing in binary mode (i.e. wb). Because you control the reading yourself, you can also read the response in smaller chunks, which makes this method a good fit when you want to let the user know the download's progress.
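To report progress, you can read the response in chunks instead of all at once. The sketch below is one way to do that. The function name download_in_chunks() and the 8192-byte default chunk size are assumptions made for this example. A server may also omit the Content-Length header, in which case only the running byte count is printed.

```python
import urllib.request


def download_in_chunks(url, filename, chunk_size=8192):
    """Download url to filename in chunks, printing progress as it goes."""
    with urllib.request.urlopen(url) as response:
        # Content-Length is optional, so the total size may be unknown
        length = response.headers.get("Content-Length")
        total = int(length) if length is not None else None

        downloaded = 0
        with open(filename, "wb") as output_file:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break  # an empty chunk means the download is done
                output_file.write(chunk)
                downloaded += len(chunk)
                if total:
                    print(f"{downloaded} of {total} bytes ({downloaded / total:.0%})")
                else:
                    print(f"{downloaded} bytes")
    return downloaded
```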
If you are feeling adventurous, you should try using Beautiful Soup to parse a web page for images or other binary file types and download one or more of them. Just be careful not to overwhelm the target website.
Wrapping Up
Web scraping is a bit of an art. You will need to learn how to navigate a website programmatically to succeed. You will also need to realize up-front that your code will not work forever, as websites change often. As with any skill, it comes down to putting in the time and training to get good at web scraping. In this tutorial, you learned about the following:
Rules for Web Scraping
Preparing to Scrape a Website
Scraping a Website
Downloading a File
When you venture deeper into web scraping, you will encounter websites that make scraping more difficult than others. Most websites today contain JavaScript that dynamically generates their content by executing in your browser. This means that straight-up HTTP requests with Python, as you were using above, won’t be enough to get the content that you are interested in.
In these scenarios, you will need to use tools other than those mentioned here. For example, you might find Selenium useful for automating interactions on a website, and a headless browser such as PhantomJS for scraping a site that gets dynamically generated with JavaScript. Note that PhantomJS is no longer actively developed. Selenium itself can be used for web scraping too.
This tutorial only scratches the surface of what you can do with web scraping and Python. Go out and start practicing some scraping on your own!







