Scraping HTML Text using BeautifulSoup

This section shows how to pull specific pieces of text out of a web page with BeautifulSoup, using the sample HTML page below.

Sample HTML File

Below is the HTML file used in the examples that follow. Save it as gfg.html, the file name the Python examples open.

HTML

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>GFG </title>
</head>
<body>
    <a href="https://www.w3wiki.org/">Geeks For Geeks</a>
    <a href="Dummy Check Text">Geeks For Geeks</a>
    <a href="Dummywebsite.com">Dummy Text</a>
 
    <h1>Hello</h1>
    <h1>Python Program</h1>
 
   <span class = true>Geeks For Geeks</span>
   <span class = false>Geeks For Geeks</span>
 
   <li class = 1 >Python Program</li>
   <li class = 2 >Python Code</li>
 
   <table>
       <tr>GFG Website</tr>
   </table>
 
</body>
</html>

Finding Anchor Tag Containing Particular Text

This example uses BeautifulSoup to parse the contents of an HTML file named gfg.html and searches it for an anchor tag (<a>) whose text is exactly “Geeks For Geeks”. The first matching tag is printed to the console.

Methods Used

open(filename, mode): Opens the given file in the specified mode.
find(): Returns the first tag that matches the given filters, or None if nothing matches.
find_all(): Returns a list of all tags that match the given filters.

Python3

from bs4 import BeautifulSoup
 
# Reading the content of gfg.html
with open("gfg.html", "r") as file:
    content = file.read()
 
soup = BeautifulSoup(content, 'html.parser')
 
# Finding an anchor tag containing the text "Geeks For Geeks"
# (string= is the current name for the older text= argument, Beautiful Soup 4.4.0+)
anchor_tag = soup.find('a', string='Geeks For Geeks')
 
print(anchor_tag)

Output:

<a href="https://www.w3wiki.org/">Geeks For Geeks</a>
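
Once a tag has been found, the usual next step is to read its text or attributes. The following is a minimal sketch, assuming an inline HTML string in place of gfg.html; the names html_doc, link_text, link_href, and missing are illustrative and not part of the original example. Because find returns None when nothing matches, the result is checked before it is used.

```python
from bs4 import BeautifulSoup

# Inline stand-in for gfg.html (assumption: two of the sample anchors)
html_doc = """
<a href="https://www.w3wiki.org/">Geeks For Geeks</a>
<a href="Dummywebsite.com">Dummy Text</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find returns the first matching tag, or None if there is no match
anchor_tag = soup.find("a", string="Geeks For Geeks")
if anchor_tag is not None:
    link_text = anchor_tag.get_text()   # text between <a> and </a>
    link_href = anchor_tag["href"]      # value of the href attribute
    print(link_text, link_href)

# A search that matches nothing yields None rather than raising an error
missing = soup.find("a", string="No Such Text")
print(missing)
```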

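Finding Any Tag Containing the Text

This example passes a function to BeautifulSoup’s find method to search the gfg.html content for the first tag of any type whose string contains “Geeks For Geeks”. The function checks tag.string rather than tag.text because tag.text also includes the text of nested tags, so enclosing tags such as <html> and <body> would match before the anchor. The first matching tag is printed to the console.

Python3

# Finding the first tag whose string contains "Geeks For Geeks"
text_tag = soup.find(lambda tag: tag.string is not None and "Geeks For Geeks" in tag.string)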
 
print(text_tag)

Output:

<a href="https://www.w3wiki.org/">Geeks For Geeks</a>
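
When a function is used as the filter, it matters whether it inspects tag.text or tag.string. The sketch below, assuming a small inline document (html_doc, by_text, and by_string are illustrative names), compares the two: tag.text concatenates the text of all descendants, while tag.string is None for a tag with more than one child.

```python
from bs4 import BeautifulSoup

# Small inline document (assumption: a cut-down version of the sample page)
html_doc = "<html><body><a href='#'>Geeks For Geeks</a><h1>Hello</h1></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# .text includes descendant text, so the enclosing tags match as well
by_text = soup.find_all(lambda tag: "Geeks For Geeks" in tag.text)

# .string is None for tags with several children, so only the anchor matches
by_string = soup.find_all(
    lambda tag: tag.string is not None and "Geeks For Geeks" in tag.string
)

print([tag.name for tag in by_text])    # ['html', 'body', 'a']
print([tag.name for tag in by_string])  # ['a']
```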

Finding the First 3 Anchor Tags

This example uses BeautifulSoup’s find_all method to locate the first three anchor tags (<a>) in the gfg.html content. The limit=3 argument stops the search after three matches, so at most three tags are returned. Each tag is then printed to the console.

Python3

# Finding first 3 anchor tags
limited_tags = soup.find_all('a', limit=3)
 
for tag in limited_tags:
    print(tag)

Output:

<a href="https://www.w3wiki.org/">Geeks For Geeks</a>
<a href="Dummy Check Text">Geeks For Geeks</a>
<a href="Dummywebsite.com">Dummy Text</a>
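
The effect of limit is easier to see when the document has more matches than the limit allows. This is a minimal sketch, assuming an inline HTML string with three links (html_doc, first_two, and first_only are illustrative names). It also shows that find behaves like find_all with limit=1, except that it returns the tag itself instead of a list.

```python
from bs4 import BeautifulSoup

# Inline document with three anchors (assumption, not the sample file)
html_doc = '<a href="a.html">One</a><a href="b.html">Two</a><a href="c.html">Three</a>'
soup = BeautifulSoup(html_doc, "html.parser")

# limit=2 returns only the first two of the three anchors
first_two = soup.find_all("a", limit=2)
print([tag.get_text() for tag in first_two])  # ['One', 'Two']

# limit=1 still returns a list; find returns the single tag directly
first_only = soup.find_all("a", limit=1)
print(first_only[0] == soup.find("a"))  # True
```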

Find an HTML Tag that Contains Certain Text

In this example, BeautifulSoup is used to search gfg.html for specific text patterns in different HTML tags, and the found tags are printed to the console.

Python3

# Importing library
from bs4 import BeautifulSoup

# Opening and reading the html file; the with block closes it afterwards
with open("gfg.html", "r") as file:
    contents = file.read()

soup = BeautifulSoup(contents, 'html.parser')

# Text to search for
# (string= is the current name for the older text= argument, Beautiful Soup 4.4.0+)
pattern = 'Geeks For Geeks'

# Anchor tag
text1 = soup.find_all('a', string=pattern)
print(text1)

# Span tag
text2 = soup.find_all('span', string=pattern)
print(text2)

# Text to search for
pattern2 = 'Python Program'

# Heading tag
text3 = soup.find_all('h1', string=pattern2)
print(text3)

# List tag
text4 = soup.find_all('li', string=pattern2)
print(text4)

# Text to search for
pattern3 = 'GFG Website'

# Table(row) tag
text5 = soup.find_all('tr', string=pattern3)
print(text5)

Output:

[<a href="https://www.w3wiki.org/">Geeks For Geeks</a>, <a href="Dummy Check Text">Geeks For Geeks</a>]
[<span class="true">Geeks For Geeks</span>, <span class="false">Geeks For Geeks</span>]
[<h1>Python Program</h1>]
[<li class="1">Python Program</li>]
[<tr>GFG Website</tr>]
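
The searches above match a tag only when its text equals the pattern exactly. For partial or case-insensitive matches, a compiled regular expression can be passed instead of a plain string. The following is a minimal sketch, assuming an inline HTML string based on the sample page (html_doc, exact, partial, and any_case are illustrative names).

```python
import re
from bs4 import BeautifulSoup

# Inline stand-in for part of gfg.html (assumption)
html_doc = """
<h1>Hello</h1>
<h1>Python Program</h1>
<li class="1">Python Program</li>
<li class="2">Python Code</li>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# A plain string must equal the tag's text exactly, so this finds nothing
exact = soup.find_all("li", string="Python")
print(exact)  # []

# A compiled regex matches anywhere in the text
partial = soup.find_all("li", string=re.compile("Python"))
print([tag.get_text() for tag in partial])  # ['Python Program', 'Python Code']

# A list of tag names searches several tag types; re.IGNORECASE ignores case
any_case = soup.find_all(["h1", "li"], string=re.compile("python program", re.IGNORECASE))
print([tag.name for tag in any_case])  # ['h1', 'li']
```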

How to find an HTML tag that contains certain text using BeautifulSoup?

BeautifulSoup is a Python library for web scraping that simplifies parsing HTML and XML documents. One common task is finding an HTML tag that contains specific text. This article walks through several ways to do that with BeautifulSoup.

Required Python Package

pip install beautifulsoup4
