Scraping HTML Text using BeautifulSoup
This article shows how to extract specific pieces of text from a web page with BeautifulSoup, using a sample HTML page for every example.
Sample HTML File
Below is the HTML file, saved as gfg.html, that the examples search for tags containing certain text.
HTML
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>GFG </title>
</head>
<body>
    <a href="https://www.w3wiki.org/">Geeks For Geeks</a>
    <a href="Dummy Check Text">Geeks For Geeks</a>
    <a href="Dummywebsite.com">Dummy Text</a>
    <h1>Hello</h1>
    <h1>Python Program</h1>
    <span class="true">Geeks For Geeks</span>
    <span class="false">Geeks For Geeks</span>
    <li class="1">Python Program</li>
    <li class="2">Python Code</li>
    <table>
        <tr>GFG Website</tr>
    </table>
</body>
</html>
Finding Anchor Tag Containing Particular Text
In this example, we use BeautifulSoup to parse the contents of an HTML file named gfg.html. We then search the file for an anchor tag (<a>) whose text is exactly “Geeks For Geeks”. Once the tag is found, it is printed to the console.
Methods Used
- open(filename, mode): Opens the given file in the specified mode and returns a file object.
- find( ): Returns the first tag that matches the given filters, or None if nothing matches.
- find_all( ): Returns a list of all tags that match the given filters.
Python3
from bs4 import BeautifulSoup

# Reading the content of gfg.html
with open("gfg.html", "r") as file:
    content = file.read()

soup = BeautifulSoup(content, 'html.parser')

# Finding an anchor tag containing the text "Geeks For Geeks"
# (string= is the current name of the older text= argument)
anchor_tag = soup.find('a', string='Geeks For Geeks')
print(anchor_tag)
Output:
<a href="https://www.w3wiki.org/">Geeks For Geeks</a>
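An exact string only matches a tag whose entire text equals the search value. For a partial or case-insensitive match, a compiled regular expression can be passed instead. The following is a minimal sketch on an inline snippet rather than gfg.html; the html string and the variable names are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical inline snippet standing in for gfg.html
html = """
<a href="Dummy Check Text">Geeks For Geeks</a>
<a href="Dummywebsite.com">Dummy Text</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match: the tag's whole string must equal the value
exact = soup.find("a", string="Geeks For Geeks")

# Partial, case-insensitive match via a compiled pattern
partial = soup.find_all("a", string=re.compile("dummy", re.IGNORECASE))

# No anchor has exactly this string, so find() returns None
missing = soup.find("a", string="Geeks")

print(exact)
print(partial)
print(missing)
```

Because find returns None when nothing matches, it is worth checking the result before accessing attributes on it.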
Finding Any Tag Containing the Text
In this example, we use BeautifulSoup’s find method with a custom filter function to search the gfg.html content for the first tag of any type whose text contains “Geeks For Geeks”. Once the tag is located, it is printed to the console.
Python3
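# Finding a tag containing the text "Geeks For Geeks".
# tag.string is used instead of tag.text because tag.text also includes
# the text of nested tags, so <html> itself would be the first match.
text_tag = soup.find(lambda tag: tag.string is not None and "Geeks For Geeks" in tag.string)
print(text_tag)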
Output:
<a href="https://www.w3wiki.org/">Geeks For Geeks</a>
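A filter function is called once for every tag, so the attribute it inspects matters. The sketch below, which assumes a small inline snippet and a hypothetical helper named has_own_text, contrasts tag.string (the tag's own single string) with tag.text (the combined text of all descendants) and uses find_all to collect every match rather than only the first.

```python
from bs4 import BeautifulSoup

# Hypothetical inline snippet standing in for gfg.html
html = """
<body>
<a href="Dummy Check Text">Geeks For Geeks</a>
<span class="true">Geeks For Geeks</span>
<h1>Hello</h1>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

def has_own_text(tag, needle="Geeks For Geeks"):
    # tag.string is None when a tag has several children,
    # so container tags such as <body> are skipped
    return tag.string is not None and needle in tag.string

# Every tag whose own string contains the phrase
own = soup.find_all(has_own_text)

# tag.text joins the text of all descendants, so <body> matches too
inherited = soup.find_all(lambda tag: "Geeks For Geeks" in tag.text)

print([t.name for t in own])        # ['a', 'span']
print([t.name for t in inherited])  # ['body', 'a', 'span']
```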
Finding the First 3 Anchor Tags
In this example, we use BeautifulSoup’s find_all method to locate the first three anchor tags (<a>) in the gfg.html content. The limit=3 parameter stops the search after three matches, so at most three tags are returned. Each of these tags is then printed to the console.
Python3
# Finding first 3 anchor tags
limited_tags = soup.find_all('a', limit=3)
for tag in limited_tags:
    print(tag)
Output:
<a href="https://www.w3wiki.org/">Geeks For Geeks</a>
<a href="Dummy Check Text">Geeks For Geeks</a>
<a href="Dummywebsite.com">Dummy Text</a>
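The limit argument is an upper bound, not a guarantee: if fewer tags exist, find_all simply returns what it found. As a rough illustration on a made-up inline snippet, the sketch below also pulls the href attribute and visible text out of each anchor instead of printing the whole tag.

```python
from bs4 import BeautifulSoup

# Hypothetical inline snippet with only two anchors
html = """
<a href="Dummy Check Text">Geeks For Geeks</a>
<a href="Dummywebsite.com">Dummy Text</a>
"""
soup = BeautifulSoup(html, "html.parser")

# limit is an upper bound: only two anchors exist, so two come back
tags = soup.find_all("a", limit=3)

# Pull out the link target and visible text of each anchor
links = [(tag.get("href"), tag.get_text(strip=True)) for tag in tags]

for href, label in links:
    print(href, "->", label)
```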
Find an HTML Tag that Contains Certain Text
In this example, BeautifulSoup is used to search gfg.html for specific text in several different HTML tags, and the matching tags are printed to the console.
Python3
# Importing library
from bs4 import BeautifulSoup

# Opening and reading the html file
with open("gfg.html", "r") as file:
    contents = file.read()

soup = BeautifulSoup(contents, 'html.parser')

# Finding a pattern (certain text)
pattern = 'Geeks For Geeks'

# Anchor tag
text1 = soup.find_all('a', string=pattern)
print(text1)

# Span tag
text2 = soup.find_all('span', string=pattern)
print(text2)

# Finding a pattern (certain text)
pattern2 = 'Python Program'

# Heading tag
text3 = soup.find_all('h1', string=pattern2)
print(text3)

# List tag
text4 = soup.find_all('li', string=pattern2)
print(text4)

# Finding a pattern (certain text)
pattern3 = 'GFG Website'

# Table (row) tag
text5 = soup.find_all('tr', string=pattern3)
print(text5)
Output:
[<a href="https://www.w3wiki.org/">Geeks For Geeks</a>, <a href="Dummy Check Text">Geeks For Geeks</a>]
[<span class="true">Geeks For Geeks</span>, <span class="false">Geeks For Geeks</span>]
[<h1>Python Program</h1>]
[<li class="1">Python Program</li>]
[<tr>GFG Website</tr>]
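The repeated find_all calls above follow one pattern, so they could be folded into a small helper. The following is only a sketch of that idea: find_by_text is a hypothetical function name, and the inline html string is a cut-down stand-in for gfg.html.

```python
from bs4 import BeautifulSoup

# Hypothetical inline snippet standing in for gfg.html
html = """
<h1>Hello</h1>
<h1>Python Program</h1>
<li class="1">Python Program</li>
<li class="2">Python Code</li>
"""
soup = BeautifulSoup(html, "html.parser")

def find_by_text(soup, tag_names, pattern):
    """Return {tag name: matching tags} for an exact text pattern."""
    return {name: soup.find_all(name, string=pattern) for name in tag_names}

results = find_by_text(soup, ["h1", "li", "span"], "Python Program")
for name, tags in results.items():
    # Tag names without a match map to an empty list
    print(name, tags)
```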
How to Find an HTML Tag that Contains Certain Text using BeautifulSoup?
BeautifulSoup, a powerful Python library for web scraping, simplifies the process of parsing HTML and XML documents. One common task is to find an HTML tag that contains specific text. In this article, we’ll explore how to achieve this using BeautifulSoup, providing a step-by-step guide.
Required Python Package
pip install beautifulsoup4
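After installing, a quick check like the one below can confirm that the package imports and parses a document. It is a minimal sketch that relies only on html.parser, the parser bundled with Python, so no extra parser package is assumed.

```python
import bs4
from bs4 import BeautifulSoup

# Confirm the package imports and report the installed version
print(bs4.__version__)

# Parse a tiny document with the parser bundled with Python
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)
```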