Web Scraping Using Python: A Beginner's Guide With Steps


Of course, data is the new oil. Ask any data scientist and you will hear how meaningful data is to them. Top US companies like Amazon, Netflix, and Google use their users' profile data to improve the user experience, but they will not simply give you access to that data. Sites like Facebook, Twitter, etc. provide APIs for developers, which you can use to build your apps. But what about sites that don't have APIs? In this tutorial, you will learn how to do web scraping in Python with BeautifulSoup and the requests library.

What is BeautifulSoup?

BeautifulSoup is a Python library for parsing HTML or XML files and extracting content from them. You will generally use it to extract data or HTML attributes such as links, the title, the content of a post, headings, etc. You will find two major versions: BeautifulSoup 3 and BeautifulSoup 4. BeautifulSoup 3 only works with Python 2, while BeautifulSoup 4 supports Python 3. I recommend you use BeautifulSoup 4 for your projects.

How to Install BeautifulSoup?

BeautifulSoup 4 can easily be installed if you are using a Debian-based operating system such as Debian or Ubuntu. Type the following command in the terminal.

sudo apt-get install python3-bs4

You can also install it using pip, Python's package manager.

pip install beautifulsoup4

Step by Step Web Scraping using Python

Step 1: Import the necessary Python packages

After installation, you will import BeautifulSoup and requests using import statements. These are the packages you will use to download the page and extract information from it.

The following Python libraries will be imported.

from bs4 import BeautifulSoup 
import requests

Step 2: Download the page by passing the URL to requests

Keep in mind that BeautifulSoup does not download the webpage itself, so you will use the requests library to download it. The result will be stored in a response variable.

response = requests.get(url) 

Inside the requests.get() function, you put the URL of the webpage to download, and the result will be stored in the response variable.
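Before parsing, it is worth confirming that the download actually succeeded. A minimal sketch, assuming network access and using https://example.com as a stand-in URL:

```python
import requests

# Download a page and check the HTTP status before parsing it.
url = "https://example.com"  # stand-in URL for illustration
response = requests.get(url, timeout=10)

if response.status_code == 200:
    html = response.content  # raw bytes of the page, ready for BeautifulSoup
else:
    html = None
    print(f"Request failed with status {response.status_code}")
```

A status code of 200 means the page was served successfully; anything else (404, 500, etc.) means the HTML you would parse is probably an error page.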

Step 3: Pass the response to BeautifulSoup with a parser

After getting the response from the URL, you will pass the downloaded content to the BeautifulSoup constructor along with the features argument. Here we pass "html.parser" as the parser for the HTML.

soup = BeautifulSoup(response.content ,features="html.parser")
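Note that BeautifulSoup accepts any HTML string, not just a downloaded page, which makes it easy to experiment without a network connection. A small self-contained sketch:

```python
from bs4 import BeautifulSoup

# Parse a literal HTML snippet instead of a downloaded page.
html = "<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"
soup = BeautifulSoup(html, features="html.parser")

# Tags are accessed as attributes; .text gives the inner text.
print(soup.title.text)  # → Demo
print(soup.h1.text)     # → Hello
```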

Step 4: Use filters to extract specific content

After the step above, the soup variable stores the parsed HTML of the URL you provided. But you rarely want to print the whole HTML code; suppose instead you want to find the number of images and videos at that URL. For this, you will use filters. BeautifulSoup has a built-in method find_all(). It finds every element matching a specific tag or attribute in the HTML and returns the matches as a list.

Suppose you want to find the number of images in our blog post Top 5 Shows for Data Scientist. Then you will use the following syntax.

images_list = soup.find_all('img')
print(len(images_list)) # print the number of images
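The same pattern works on any HTML, so here is a self-contained sketch on a literal snippet, also showing that each returned tag exposes its attributes dict-style (the tag names and URLs are made up for illustration):

```python
from bs4 import BeautifulSoup

# find_all() returns a list of matching tags.
html = """
<div>
  <img src="a.png"><img src="b.png">
  <a href="https://example.com">link</a>
</div>
"""
soup = BeautifulSoup(html, features="html.parser")

images = soup.find_all('img')
print(len(images))                      # → 2
print([img['src'] for img in images])   # → ['a.png', 'b.png']

# You can also filter by attribute presence.
links = soup.find_all('a', href=True)
print(links[0]['href'])                 # → https://example.com
```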

You can also filter on an HTML element's class to find the information inside that specific class. Let's take another example: you have to extract the title of the post. The title is generally inside an h1 tag, so soup.h1 selects the first h1 element, and soup.h1.text gives the text inside it.

title = soup.h1.text
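Filtering by class was mentioned above but not shown; here is a minimal sketch using a made-up class name post-content (any class from your target page works the same way). Note the keyword is class_ with a trailing underscore, because class alone is a reserved word in Python:

```python
from bs4 import BeautifulSoup

# Select an element by its CSS class using the class_ keyword.
html = """
<div class="sidebar">ads</div>
<div class="post-content">The actual article text.</div>
"""
soup = BeautifulSoup(html, features="html.parser")

post = soup.find('div', class_='post-content')
print(post.text)  # → The actual article text.
```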

Full code for the above examples.

Finding the Number of Images

from bs4 import BeautifulSoup
import requests

# Function for finding the number of images
def findNumberOfImage(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content ,features="html.parser")
    images_list = soup.find_all('img')
    return len(images_list)  # return the number of images




Find the Title of the Post

from bs4 import BeautifulSoup
import requests

# Function for finding the title of the post
def findTitleOfPost(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content ,features="html.parser")
    return soup.h1.text





Web scraping using Python is a good way to extract webpage information when you don't have API access to the page. But you will get only limited data, that is, only the data available in the HTML. If a site offers API access, I recommend you use it instead. Sites with API access generally provide data in JSON format, which is far easier to work with than parsing raw HTML.
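To illustrate why JSON responses are convenient, here is a minimal sketch that parses a literal payload the same way you would handle an API response (the field names are made up; with requests you would get the same dictionary via response.json()):

```python
import json

# A literal payload standing in for an API response body.
payload = '{"title": "Web Scraping using Python", "images": 2}'

# json.loads turns the JSON text into a plain Python dictionary,
# so no tag hunting is needed.
data = json.loads(payload)
print(data["title"])   # → Web Scraping using Python
print(data["images"])  # → 2
```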

This basic tutorial on web scraping using Python is only a quick overview of how to use BeautifulSoup. If you want to add any suggestions, just contact us; we will be happy to support you.


Meet Sukesh ( Chief Editor ), a passionate and skilled Python programmer with a deep fascination for data science, NumPy, and Pandas. His journey in the world of coding began as a curious explorer and has evolved into a seasoned data enthusiast.