How to Use Beautiful Soup to Parse HTML (html.parser)

Beautiful Soup is a Python package for web scraping. It parses HTML and XML documents and builds a parse tree from which you can extract specific elements. This tutorial shows, step by step, how to use Beautiful Soup with Python's built-in html.parser.

Steps to Implement the Beautiful Soup HTML Parser

This section covers the basic steps for using Beautiful Soup with html.parser. Working through them should answer most introductory questions about parsing HTML.

Step 1: Import the necessary libraries

The first step is to import the required library. This basic example needs only Beautiful Soup, so make sure it is already installed on your system (the package name on PyPI is beautifulsoup4). Import it with the following statement.

from bs4 import BeautifulSoup
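If you are not sure whether the package is installed, a quick check like the sketch below can help. It only assumes that the package was installed under its PyPI name, beautifulsoup4 (for example with pip install beautifulsoup4).

```python
# Check that Beautiful Soup is importable and report its version.
try:
    import bs4
except ImportError:
    raise SystemExit(
        "Beautiful Soup is missing; install it with: pip install beautifulsoup4"
    )

print("Beautiful Soup version:", bs4.__version__)
```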

Step 2: Create Sample Data

In this step, I create a sample HTML document to parse. You could instead fetch a page from a URL with the requests Python package, but for simplicity I use a hard-coded HTML string.
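For readers who do want to start from a live page, here is a minimal sketch of downloading and parsing in one step. It is an illustration under stated assumptions, not part of the original example: the helper name fetch_soup is hypothetical, and it uses the standard library's urllib.request in place of requests so that it runs without an extra install. With requests, the download step would be requests.get(url).text.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download the page at `url` and return it as a parse tree.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return BeautifulSoup(html, "html.parser")


# Example call (needs network access; the URL is a placeholder):
# soup = fetch_soup("http://example.com/")
# print(soup.title)
```

The rest of this tutorial does not depend on this helper; it continues with the hard-coded string.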

data = """
<html>
<head>
<title>Data Science Learner</title>
</head>

<body>
<p class="title" id="title"><b>Data Science Learner Links</b></p>
<p class="links">Links
<a href="http://example.com/dsl1" class="element" id="link1">1</a>
<a href="http://example.com/dsl2" class="element" id="link2">2</a>
<a href="http://example.com/dsl3" class="avatar" id="link3">3</a>
</p>
<p> line ends</p>
</body>
</html>

"""

Step 3: Parse the HTML Document

The next step is to parse the document. Because the sample is HTML, pass "html.parser" as the second argument to the BeautifulSoup() constructor. To parse an XML document, pass "xml" instead, which requires the lxml package to be installed. The line below creates the parse tree for the HTML document.

soup = BeautifulSoup(data, "html.parser")
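To see what the parser produced, the short sketch below builds a tree and prints it with prettify(), which returns the tree as an indented string. The sample variable is a small stand-in introduced here so the snippet runs on its own; it is not the tutorial's full document.

```python
from bs4 import BeautifulSoup

# Small stand-in document so this snippet is self-contained.
sample = (
    "<html><head><title>Data Science Learner</title></head>"
    "<body><p>Links</p></body></html>"
)

soup = BeautifulSoup(sample, "html.parser")

# The result is a BeautifulSoup object that wraps the whole parse tree.
print(type(soup).__name__)

# prettify() renders the tree as an indented string, one tag per line.
print(soup.prettify())
```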

Step 4: Get any text

The last step is to extract information from the parsed HTML document. You can navigate to a tag with the dot (.) operator, for example soup.title. There are also methods such as find() and find_all() that let you search for elements by tag name, class, or other attributes.
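As a rough illustration of the difference between dot access, find(), and find_all(), consider the sketch below. The sample string is a trimmed, cleaned-up copy of the body from Step 2, repeated here only so the snippet runs on its own.

```python
from bs4 import BeautifulSoup

sample = """
<p class="title" id="title"><b>Data Science Learner Links</b></p>
<p class="links">Links
<a href="http://example.com/dsl1" class="element" id="link1">1</a>
<a href="http://example.com/dsl2" class="element" id="link2">2</a>
<a href="http://example.com/dsl3" class="avatar" id="link3">3</a>
</p>
"""
soup = BeautifulSoup(sample, "html.parser")

# Dot access returns the first tag with that name.
first_paragraph = soup.p

# find() returns the first match, or None if nothing matches.
title_paragraph = soup.find("p", class_="title")

# find_all() returns every match; class_ filters on the CSS class.
element_links = soup.find_all("a", class_="element")

# Other attributes can be used as filters too, for example id.
third_link = soup.find(id="link3")

print(title_paragraph.text)  # Data Science Learner Links
print(len(element_links))    # 2
print(third_link["href"])    # http://example.com/dsl3
```

Note that the keyword is class_ with a trailing underscore, because class is a reserved word in Python.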

But for the sake of simplicity, I am extracting some basic information.

Example 1: Extracting the Head Title

soup.head.title

This outputs the title element, including its tags. If you want only the title text, append .text to the expression above.

soup.head.title.text

Output

[Image: heading title text extracted from the HTML document]
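For reference, here is a self-contained sketch of both expressions with the values they print. The sample string is a reduced copy of the Step 2 document that keeps only the head, an assumption made so the snippet can run alone.

```python
from bs4 import BeautifulSoup

sample = (
    "<html><head><title>Data Science Learner</title></head>"
    "<body></body></html>"
)
soup = BeautifulSoup(sample, "html.parser")

# The tag itself, including its markup.
print(soup.head.title)       # <title>Data Science Learner</title>

# Only the text inside the tag.
print(soup.head.title.text)  # Data Science Learner

# soup.title is a shortcut for the first <title> tag in the tree.
print(soup.title.string)     # Data Science Learner
```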

Example 2: Extracting All the Links

You can also find all the links in the HTML document with the find_all() method. It returns every matching <a> tag in a list-like ResultSet.

Run the line below.

soup.find_all("a")

Output

[Image: all links found in the document]
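In practice you often want the URL of each link rather than the whole tag. A minimal sketch of that follow-up step is below; the sample string repeats the three links from Step 2 so the snippet runs on its own, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

sample = """
<p class="links">Links
<a href="http://example.com/dsl1" class="element" id="link1">1</a>
<a href="http://example.com/dsl2" class="element" id="link2">2</a>
<a href="http://example.com/dsl3" class="avatar" id="link3">3</a>
</p>
"""
soup = BeautifulSoup(sample, "html.parser")

links = soup.find_all("a")

# Each result is a Tag; attributes are read like dictionary keys.
for link in links:
    print(link["id"], link.get("href"), link.text)

# Collect only the URLs.
hrefs = [link["href"] for link in links]
print(hrefs)
```

Using link.get("href") instead of link["href"] returns None for a tag without that attribute rather than raising a KeyError.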

Conclusion

These are the basic steps for parsing HTML with Beautiful Soup and html.parser. This tutorial covers only a basic implementation; Beautiful Soup can extract much more, such as attributes, nested tags, and elements matched by class or id. I will keep updating this content, and you can contact us if you want more information.

Source:

Beautiful Soup Documentation