Web-scraping with BeautifulSoup4#

This document covers basic usage of bs4 (Beautiful Soup 4) for scraping a webpage. We will primarily discuss extracting information from one webpage, and leave web-crawling to an advanced class on web scraping.

To scrape or not to scrape#

Unlike APIs, which are designed for programs and applications to interact with data, web scraping works directly with the user-facing websites built for humans.

Web scraping benefits:

  • Any content that can be viewed on a webpage can be scraped.

  • No API needed.

  • No rate-limiting or authentication (usually).

Web scraping challenges:

  • Rarely tailored for researchers.

  • Your IP can be blocked (HTTP 403).

  • Messy, unstructured, inconsistent.

  • Entirely site-dependent.

Rule of thumb: Check if there is an API. If not, then consider scraping.

Ethics of web scraping#

Several considerations before scraping:

  • Read the terms and conditions of data use.

  • Check the site's robots.txt, which lists the paths that automated clients are asked not to access.

  • Self-throttle, as in API usage.

  • Web-scrapers require regular maintenance (best coupled with CI/CD).
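The robots.txt and self-throttling points can be sketched with Python's standard library. This is a minimal sketch under stated assumptions: the robots.txt content, the example.com URLs, and the user-agent name are all hypothetical, and the text is parsed from an inline string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real scraper would download the file
# from the target site instead (RobotFileParser.set_url followed by read).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_delay(parser, user_agent, default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


parser = build_parser(ROBOTS_TXT)
agent = "my-research-bot"  # hypothetical user-agent name

allowed = parser.can_fetch(agent, "https://example.com/pages/")
blocked = parser.can_fetch(agent, "https://example.com/private/data.html")
delay = polite_delay(parser, agent)

print(allowed, blocked, delay)
```

In a scraping loop, one would skip any URL for which `can_fetch` returns `False` and call `time.sleep(delay)` between requests.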

Anatomy of a webpage#

A website is typically built from some combination of a codebase and a database. The front-end product combines HTML, CSS stylesheets, and JavaScript.

Figure: Anatomy of a website (Adobe)

Figure: Anatomy of a website, with CSS styles (Adobe)

Parsing a website#

Retrieving the content of a website is not difficult; extracting exactly the information you need is.

HTML, briefly#

Figure: HTML structure of this Jupyter notebook.

HTML as a tree#

Figure: HTML as a tree. Each branch is an element.
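The tree view can be explored directly in Beautiful Soup. The sketch below uses a hypothetical miniature page written inline, and the helper `element_children` is an illustrative name, not part of bs4. It walks down from a parent to its child elements and back up from a leaf to its ancestors.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, written inline so the example needs no network.
HTML = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Countries</h1>
    <ul>
      <li>Andorra</li>
      <li>Belgium</li>
    </ul>
  </body>
</html>
"""

# "html.parser" ships with Python; the lxml parser would also work here.
soup = BeautifulSoup(HTML, "html.parser")


def element_children(tag):
    """Names of the direct child elements of a tag (text nodes are skipped)."""
    return [child.name for child in tag.find_all(recursive=False)]


top_level = element_children(soup.html)        # branches directly under <html>
body_level = element_children(soup.body)       # branches directly under <body>
parent_of_li = soup.li.parent.name             # one step up from the first <li>
ancestors = [p.name for p in soup.li.parents]  # all the way up to the document

print(top_level, body_level, parent_of_li, ancestors)
```

Here `soup.li` is shorthand for the first `<li>` in the document, and the last entry of `ancestors` is the `BeautifulSoup` object itself, which sits at the root of the tree.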

Three components of HTML (Tags, Attributes, and Content)#

Figure: An example of an HTML element.
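Each of the three components maps onto a Beautiful Soup accessor. The element below is a hypothetical example chosen to show the tag, its attributes, and its content together.

```python
from bs4 import BeautifulSoup

# Hypothetical element used only for illustration.
snippet = '<a href="page.html" class="nav link" id="home">Go to the page</a>'

link = BeautifulSoup(snippet, "html.parser").a

tag_name = link.name         # the tag name
attributes = link.attrs      # dict holding every attribute of the element
href = link.get("href")      # a single attribute; None if it is absent
classes = link["class"]      # multi-valued attributes come back as a list
content = link.get_text()    # the text between the opening and closing tags
missing = link.get("title")  # this element has no title attribute

print(tag_name, href, classes, content, missing)
```

Using `link.get("title")` rather than `link["title"]` avoids a `KeyError` when an attribute is not present, which is common on inconsistent pages.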

Example tags#

Tag                     Meaning
<head>                  document head (metadata, etc.)
<body>                  holds all of the content
<p>                     regular text (paragraph)
<h1>, <h2>, <h3>        heading text, levels 1, 2, 3
<ol>, <ul>, <li>        ordered list, unordered list, list item
<a href="page.html">    link to “page.html”
<table>, <tr>, <td>     table, table row, table cell
<div>, <span>           generic containers (block-level and inline; commonly styled with CSS or manipulated with JavaScript)
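Several of these tags can be pulled out of a page in a few lines. The following is a minimal sketch on a hypothetical inline page with made-up values. It also uses `<th>` (a table header cell), which is not listed in the table above.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with made-up values.
HTML = """
<body>
  <h2>Teams</h2>
  <p>See the <a href="teams.html">full list</a> or the <a href="about.html">about page</a>.</p>
  <table>
    <tr><th>Team</th><th>Wins</th></tr>
    <tr><td>Boston</td><td>44</td></tr>
    <tr><td>Chicago</td><td>49</td></tr>
  </table>
</body>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Every link target on the page, taken from the href attribute of each <a>.
links = [a["href"] for a in soup.find_all("a")]

# Column names from the header cells.
header = [th.get_text(strip=True) for th in soup.find_all("th")]

# One list per data row; rows without any <td> (the header row) are skipped.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
    if tr.find("td") is not None
]

print(links, header, rows)
```

All extracted values are strings, so numeric columns such as the win counts still need to be converted before analysis.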

Example with scrapethissite.com#

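# install the beautifulsoup4 package and the lxml parser (notebook cell)
!pip install beautifulsoup4
!pip install lxml
from bs4 import BeautifulSoup
import requests

url = 'https://www.scrapethissite.com/pages'

r = requests.get(url, timeout=10)
r.raise_for_status()  # stop early on an HTTP error such as 403
# peek at the first 100 bytes of the raw response
r.content[:100]
# parse the HTML with bs4, naming the lxml parser explicitly
soup = BeautifulSoup(r.content, 'lxml')
# select every <h3> element by tag name
h3_list = soup.find_all('h3')
# the first link (<a> tag) inside the first <h3>
h3_list[0].find_all('a')[0]
# neighboring content: the next <a> in document order, then the next <section> after it
# (find_next returns None when nothing matches, so this chain can fail on other pages)
h3_list[0].find_all('a')[0].find_next('a').find_next('section')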
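Chained lookups like the last line raise an `AttributeError` as soon as one step finds nothing, because `find` and `find_next` return `None` in that case. The sketch below shows one defensive pattern. The markup is a hypothetical fragment, not the actual source of scrapethissite.com, and `text_or_none` is an illustrative helper.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second heading has no description after it.
HTML = """
<div>
  <h3 class="page-title"><a href="/pages/simple/">Countries</a></h3>
  <p class="lead">A single page listing countries.</p>
  <h3 class="page-title"><a href="/pages/forms/">Hockey Teams</a></h3>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def text_or_none(tag):
    """Stripped text of a tag, or None when the lookup found nothing."""
    return tag.get_text(strip=True) if tag is not None else None


records = []
for h3 in soup.find_all("h3", class_="page-title"):
    link = h3.find("a")                  # first <a> inside this heading
    blurb = h3.find_next_sibling("p")    # a later sibling <p>, or None
    records.append({
        "title": text_or_none(link),
        "href": link.get("href") if link is not None else None,
        "blurb": text_or_none(blurb),
    })

print(records)
```

Recording `None` for a missing piece keeps the rows aligned, so gaps can be handled later instead of crashing the scraper partway through a page.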

Practice - BeautifulSoup (Item 4 is left as a challenge for you)#

  1. Locate the tags and attributes for the following items:

Figure: Fact card from weather.com.

  2. Create a dataframe with the following columns:

    • DateDay

    • Temperature

    • Rain

    • UV

    • Description

  3. Using BeautifulSoup, populate the table for the first day.

  4. Repeat for the next nine days.
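As a starting pattern for items 2 and 3, the sketch below fills a dataframe from parsed cards. Everything about the markup is an assumption: the class names (`day`, `date`, `temp`, and so on) and the values are invented placeholders, not weather.com's real structure, which you still need to locate in item 1.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup with placeholder values; the real page will differ.
HTML = """
<div class="day">
  <span class="date">Mon 01</span>
  <span class="temp">21</span>
  <span class="rain">10%</span>
  <span class="uv">5</span>
  <span class="desc">Partly cloudy</span>
</div>
<div class="day">
  <span class="date">Tue 02</span>
  <span class="temp">19</span>
  <span class="rain">60%</span>
  <span class="desc">Showers</span>
</div>
"""

# Dataframe column -> hypothetical CSS class that holds the value.
COLUMNS = {
    "DateDay": "date",
    "Temperature": "temp",
    "Rain": "rain",
    "UV": "uv",
    "Description": "desc",
}


def parse_day(card):
    """Turn one day card into a dict with one entry per dataframe column."""
    row = {}
    for column, css_class in COLUMNS.items():
        cell = card.find("span", class_=css_class)
        row[column] = cell.get_text(strip=True) if cell is not None else None
    return row


soup = BeautifulSoup(HTML, "html.parser")
rows = [parse_day(card) for card in soup.find_all("div", class_="day")]
df = pd.DataFrame(rows, columns=list(COLUMNS))

print(df)
```

The second placeholder card deliberately omits the UV value to show that a missing field becomes an empty cell rather than an error.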

Further reference#

Read "The Legalities and Ethics of Web Scraping" [mitchell2018web] for a brief discussion of web-scraping ethics.