Web scraping Yell with Python, Beautiful Soup, and Requests
Published March 01, 2021 by Danny Moran
Introduction
One of my goals this year is to improve my programming knowledge, especially with Python. When a friend told me he was manually gathering data from Yell, I realised I could use the situation to sharpen my skills while helping him out.
The scope of the project
Create a small app into which a search term and location can be entered, and which returns the company name, phone number, address, and website for each result, where these are available.
The tooling used
I decided that Python, paired with the Beautiful Soup and Requests libraries, would be able to accomplish this. I also used the standard-library json module to format the output and the time module to slow down the web scrape so as not to get blocked by Yell.
The Python script
from bs4 import BeautifulSoup
import requests
import json
import time

def extract(pages):
    url = f"https://www.yell.com/ucs/UcsSearchAction.do?keywords={searchTerm}&location={searchLocation}&pageNum={pages}"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup

def transform(soup):
    article = soup.find_all("div", class_="row businessCapsule--mainRow")
    for item in article:
        try:
            companyName = item.find("h2", itemprop="name").text
            companyNumber = item.find("span", itemprop="telephone").text.strip()
            companyAddress = item.find("span", itemprop="address").text.strip().replace("\n", "").replace(",", "")
            companyWebsite = item.find("a", rel="nofollow noopener", class_="btn btn-yellow businessCapsule--ctaItem")["href"]
        except (AttributeError, TypeError):
            # find() returns None when an element is missing, so any absent
            # field raises here and the whole record falls back to "None"
            companyName = "None"
            companyNumber = "None"
            companyAddress = "None"
            companyWebsite = "None"
        companyInfo = {
            "companyName": companyName,
            "companyNumber": companyNumber,
            "companyAddress": companyAddress,
            "companyWebsite": companyWebsite
        }
        joblist.append(companyInfo)
    return

searchTerm = "enter+search+term+here"
searchLocation = "enter+location+here"
searchPages = 2
pages = searchPages + 1

joblist = []
for i in range(1, pages, 1):
    print(f"Started page: {i}")
    e = extract(i)
    transform(e)
    print(f"Finished page: {i}")
    time.sleep(3)

print(json.dumps(joblist, indent=1))
Prerequisites for being able to run the script
You need to have the following installed in your Python 3 environment:
- Beautiful Soup 4: pip install beautifulsoup4
- Requests: pip install requests
- json and time are part of the Python standard library, so no extra installation is needed for them.
Breakdown of how the script works
from bs4 import BeautifulSoup
import requests
import json
import time
This imports the libraries that are used within the script.
def extract(pages):
    url = f"https://www.yell.com/ucs/UcsSearchAction.do?keywords={searchTerm}&location={searchLocation}&pageNum={pages}"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup
This function takes the search variables defined later in the script, builds the Yell search URL from them, fetches the page with the requests library (sending a browser-like User-Agent header), and returns the parsed response as a BeautifulSoup object called soup.
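Because the search variables are interpolated straight into the URL, they have to be pre-encoded with + in place of spaces. As an alternative sketch, the standard library's urllib.parse.urlencode can build the query string from plain-text values; the parameter names match the Yell URL above, while the search values here are only illustrative:

```python
from urllib.parse import urlencode

def build_search_url(term, location, page):
    """Build a Yell search URL from human-readable inputs.

    urlencode escapes spaces (as +) and other special characters,
    so the inputs do not need to be pre-encoded by hand.
    """
    query = urlencode({"keywords": term, "location": location, "pageNum": page})
    return f"https://www.yell.com/ucs/UcsSearchAction.do?{query}"

print(build_search_url("pet shops", "greater london", 1))
```

With this helper, the extract function could accept the term and location as plain arguments instead of relying on pre-encoded globals.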
def transform(soup):
    article = soup.find_all("div", class_="row businessCapsule--mainRow")
    for item in article:
        try:
            companyName = item.find("h2", itemprop="name").text
            companyNumber = item.find("span", itemprop="telephone").text.strip()
            companyAddress = item.find("span", itemprop="address").text.strip().replace("\n", "").replace(",", "")
            companyWebsite = item.find("a", rel="nofollow noopener", class_="btn btn-yellow businessCapsule--ctaItem")["href"]
        except (AttributeError, TypeError):
            # find() returns None when an element is missing, so any absent
            # field raises here and the whole record falls back to "None"
            companyName = "None"
            companyNumber = "None"
            companyAddress = "None"
            companyWebsite = "None"
        companyInfo = {
            "companyName": companyName,
            "companyNumber": companyNumber,
            "companyAddress": companyAddress,
            "companyWebsite": companyWebsite
        }
        joblist.append(companyInfo)
    return
This function takes the previously returned soup object, parses out the companyName, companyNumber, companyAddress, and companyWebsite where available, saves the information into a dictionary called companyInfo, and appends that dictionary to joblist.
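One side effect of wrapping all four lookups in a single try/except is that if any one field is missing, every field for that business is reset to "None", including values that were successfully found. A minimal sketch of a per-field fallback, using a stand-in class in place of a real bs4 Tag so the example runs offline:

```python
def text_or_default(tag, default="None"):
    """Return the stripped text of a tag-like object, or a default if it is missing.

    Works with anything exposing a .text attribute (such as a bs4 Tag);
    find() returns None when nothing matches, which triggers the fallback.
    """
    return tag.text.strip() if tag is not None else default

class FakeTag:
    """Minimal stand-in for a bs4 Tag, used only for this offline example."""
    def __init__(self, text):
        self.text = text

print(text_or_default(FakeTag("  Example Ltd  ")))  # a found tag yields its text
print(text_or_default(None))                        # a missing tag yields the default
```

In transform, each lookup would then become something like companyName = text_or_default(item.find("h2", itemprop="name")), so a missing phone number no longer wipes out the name.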
searchTerm = "enter+search+term+here"
searchLocation = "enter+location+here"
searchPages = 2
pages = searchPages + 1
These are the variables that can be modified to change the results that are returned.
- searchTerm: This is the term that will be searched for. If you need to have a space in the search term, use + as a separator instead of a space.
- searchLocation: This is the location the searchTerm will be applied to. If you need a space in the location, use + as a separator instead of a space.
- searchPages: This is the number of pages the script will run through before finishing. The maximum number you can use is 10 as this is all Yell goes up to.
- pages: Don't modify this; it just adds 1 to the value of searchPages because Python's range() stops one short of its end value, so the +1 ensures the last page is included.
joblist = []
for i in range(1, pages, 1):
    print(f"Started page: {i}")
    e = extract(i)
    transform(e)
    print(f"Finished page: {i}")
    time.sleep(3)
This loop runs extract and transform for each page, pausing for 3 seconds after each iteration so as not to get blocked by Yell for sending too many requests.
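The fixed 3-second pause works, but perfectly even request spacing is easy for a site to fingerprint. One small variation (a sketch, not anything Yell documents) is to randomise the delay with the standard library's random module:

```python
import random
import time

def polite_sleep(low=2.0, high=5.0):
    """Sleep for a random interval between low and high seconds.

    Randomised gaps make the request timing less regular than a
    fixed sleep, while still rate-limiting the scraper.
    """
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

In the loop above, time.sleep(3) would simply become polite_sleep().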
print(json.dumps(joblist, indent=1))
This prints the collected results to the console in JSON format.
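If printing JSON is not the end goal, the same joblist could instead be written out with the standard library's csv module. A sketch assuming the dictionary keys used above, building the CSV in memory so it runs without touching the filesystem:

```python
import csv
import io

def to_csv(joblist):
    """Serialise a list of companyInfo dictionaries to CSV text."""
    fieldnames = ["companyName", "companyNumber", "companyAddress", "companyWebsite"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(joblist)
    return buffer.getvalue()
```

To write straight to disk instead, pass open("results.csv", "w", newline="") to csv.DictWriter in place of the StringIO buffer.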