Web scraping Yell with Python, Beautiful Soup, and Requests
Published March 01, 2021 by Danny Moran
Introduction
One of my goals this year is to improve my programming knowledge, especially with Python. When a friend told me he was manually gathering data from Yell, I realised I could use the situation to sharpen my skills while helping him out.
The scope of the project
Create a small app into which a search term and location can be entered, and which returns the company name, phone number, address, and website for each result, where these are available.
The tooling used
I decided that Python, paired with the Beautiful Soup and Requests libraries, would be able to accomplish this. I also used the standard-library json module to format the output and the time module to slow down the web scrape so as not to get blocked by Yell.
The Python script
from bs4 import BeautifulSoup
import requests
import json
import time

def extract(pages):
    url = f"https://www.yell.com/ucs/UcsSearchAction.do?keywords={searchTerm}&location={searchLocation}&pageNum={pages}"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup

def transform(soup):
    article = soup.find_all("div", class_="row businessCapsule--mainRow")
    for item in article:
        try:
            companyName = item.find("h2", itemprop="name").text
            companyNumber = item.find("span", itemprop="telephone").text.strip()
            companyAddress = item.find("span", itemprop="address").text.strip().replace("\n", "").replace(",", "")
            companyWebsite = item.find("a", rel="nofollow noopener", class_="btn btn-yellow businessCapsule--ctaItem")["href"]
        except (AttributeError, TypeError):
            # find() returns None when an element is missing, so any absent
            # field raises here and the whole record falls back to "None"
            companyName = "None"
            companyNumber = "None"
            companyAddress = "None"
            companyWebsite = "None"
        companyInfo = {
            "companyName": companyName,
            "companyNumber": companyNumber,
            "companyAddress": companyAddress,
            "companyWebsite": companyWebsite
        }
        joblist.append(companyInfo)
    return

searchTerm = "enter+search+term+here"
searchLocation = "enter+location+here"
searchPages = 2
pages = searchPages + 1

joblist = []
for i in range(1, pages, 1):
    print(f"Started page: {i}")
    e = extract(i)
    transform(e)
    print(f"Finished page: {i}")
    time.sleep(3)

print(json.dumps(joblist, indent=1))
Prerequisites for being able to run the script
You need to have the following installed in your Python 3 environment:
- Beautiful Soup 4: pip install beautifulsoup4
- Requests: pip install requests
- json and time are part of the Python standard library, so no extra installation is needed for them.
Breakdown of how the script works
from bs4 import BeautifulSoup
import requests
import json
import time
This imports the libraries that are used within the script.
def extract(pages):
    url = f"https://www.yell.com/ucs/UcsSearchAction.do?keywords={searchTerm}&location={searchLocation}&pageNum={pages}"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup
This function takes the search variables defined later in the script, builds the Yell search URL from them, fetches the page with the requests library (sending a browser-like User-Agent header), and returns the parsed response as a BeautifulSoup object called soup.
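Because the search variables are interpolated straight into the URL, they have to be pre-encoded with + in place of spaces. As an alternative sketch, the standard library's urllib.parse.urlencode can build the query string from plain-text values; the parameter names match the Yell URL above, while the search values here are only illustrative:

```python
from urllib.parse import urlencode

def build_search_url(term, location, page):
    """Build a Yell search URL from human-readable inputs.

    urlencode escapes spaces (as +) and other special characters,
    so the inputs do not need to be pre-encoded by hand.
    """
    query = urlencode({"keywords": term, "location": location, "pageNum": page})
    return f"https://www.yell.com/ucs/UcsSearchAction.do?{query}"

print(build_search_url("pet shops", "greater london", 1))
```

With this helper, the extract function could accept the term and location as plain arguments instead of relying on pre-encoded globals.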
def transform(soup):
    article = soup.find_all("div", class_="row businessCapsule--mainRow")
    for item in article:
        try:
            companyName = item.find("h2", itemprop="name").text
            companyNumber = item.find("span", itemprop="telephone").text.strip()
            companyAddress = item.find("span", itemprop="address").text.strip().replace("\n", "").replace(",", "")
            companyWebsite = item.find("a", rel="nofollow noopener", class_="btn btn-yellow businessCapsule--ctaItem")["href"]
        except (AttributeError, TypeError):
            # find() returns None when an element is missing, so any absent
            # field raises here and the whole record falls back to "None"
            companyName = "None"
            companyNumber = "None"
            companyAddress = "None"
            companyWebsite = "None"
        companyInfo = {
            "companyName": companyName,
            "companyNumber": companyNumber,
            "companyAddress": companyAddress,
            "companyWebsite": companyWebsite
        }
        joblist.append(companyInfo)
    return
This function takes the previously returned soup object, parses out the companyName, companyNumber, companyAddress, and companyWebsite where available, saves the information into a dictionary called companyInfo, and appends that dictionary to joblist.
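One side effect of wrapping all four lookups in a single try/except is that if any one field is missing, every field for that business is reset to "None", including values that were successfully found. A minimal sketch of a per-field fallback, using a stand-in class in place of a real bs4 Tag so the example runs offline:

```python
def text_or_default(tag, default="None"):
    """Return the stripped text of a tag-like object, or a default if it is missing.

    Works with anything exposing a .text attribute (such as a bs4 Tag);
    find() returns None when nothing matches, which triggers the fallback.
    """
    return tag.text.strip() if tag is not None else default

class FakeTag:
    """Minimal stand-in for a bs4 Tag, used only for this offline example."""
    def __init__(self, text):
        self.text = text

print(text_or_default(FakeTag("  Example Ltd  ")))  # a found tag yields its text
print(text_or_default(None))                        # a missing tag yields the default
```

In transform, each lookup would then become something like companyName = text_or_default(item.find("h2", itemprop="name")), so a missing phone number no longer wipes out the name.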
searchTerm = "enter+search+term+here"
searchLocation = "enter+location+here"
searchPages = 2
pages = searchPages + 1
These are the variables that can be modified to change the results that are returned.
- searchTerm: This is the term that will be searched for. If you need to have a space in the search term, use + as a separator instead of a space.
- searchLocation: This is the location the searchTerm will be applied to. If you need a space in the location, use + as a separator instead of a space.
- searchPages: This is the number of pages the script will run through before finishing. The maximum number you can use is 10 as this is all Yell goes up to.
- pages: Don't modify this; it just adds 1 to the value of searchPages because Python's range() stops one short of its end value, so the +1 ensures the last page is included.
joblist = []
for i in range(1, pages, 1):
    print(f"Started page: {i}")
    e = extract(i)
    transform(e)
    print(f"Finished page: {i}")
    time.sleep(3)
This loop runs extract and transform for each page, pausing for 3 seconds after each iteration so as not to get blocked by Yell for sending too many requests.
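The fixed 3-second pause works, but perfectly even request spacing is easy for a site to fingerprint. One small variation (a sketch, not anything Yell documents) is to randomise the delay with the standard library's random module:

```python
import random
import time

def polite_sleep(low=2.0, high=5.0):
    """Sleep for a random interval between low and high seconds.

    Randomised gaps make the request timing less regular than a
    fixed sleep, while still rate-limiting the scraper.
    """
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

In the loop above, time.sleep(3) would simply become polite_sleep().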
print(json.dumps(joblist, indent=1))
This prints the collected results to the console in JSON format.
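If printing JSON is not the end goal, the same joblist could instead be written out with the standard library's csv module. A sketch assuming the dictionary keys used above, building the CSV in memory so it runs without touching the filesystem:

```python
import csv
import io

def to_csv(joblist):
    """Serialise a list of companyInfo dictionaries to CSV text."""
    fieldnames = ["companyName", "companyNumber", "companyAddress", "companyWebsite"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(joblist)
    return buffer.getvalue()
```

To write straight to disk instead, pass open("results.csv", "w", newline="") to csv.DictWriter in place of the StringIO buffer.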