Web scraping is one of the most important skills for Python learners, data analysts, MIS professionals, and anyone who wants to create datasets for machine learning or fine-tuning models like Tejas AI.
Among all web scraping tools in Python, BeautifulSoup (from the bs4 library) is the most popular because it is simple, powerful, and beginner-friendly.

Let’s start!
1. What is BeautifulSoup?
BeautifulSoup is a Python library used to parse (read) HTML and extract the data you need from web pages.
It takes HTML code like this:
<h1 class="title">Welcome to Smart Tutorials</h1>
And makes it easy to search:
soup.find("h1")
or
soup.find("h1", class_="title")
BeautifulSoup is not a web downloader.
It only helps you parse HTML.
To download websites, we use:
- requests (most common)
- urllib
- browser automation tools like Selenium (for dynamic JavaScript sites)
2. Installing BeautifulSoup
Install with pip:
pip install beautifulsoup4
Optional (but recommended):
pip install requests
Now you are ready to begin.
Read More: Requests Module in Python
3. Understanding the Structure of an HTML Page
BeautifulSoup works best when you know how HTML is structured.
A simple webpage looks like:
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<h1>Heading</h1>
<p class="description">This is a paragraph.</p>
<a href="https://smarttejas.com">Visit Smart Tejas</a>
</body>
</html>
Key things you will scrape:
- tags (h1, p, a)
- class attributes
- id attributes
- text
- links (href)
- images (src)
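Here is a quick sketch pulling each of these from the sample page above:
from bs4 import BeautifulSoup

html = """
<html>
<head><title>Sample Page</title></head>
<body>
<h1>Heading</h1>
<p class="description">This is a paragraph.</p>
<a href="https://smarttejas.com">Visit Smart Tejas</a>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").text)                       # tag text: Heading
print(soup.find("p", class_="description").text)  # matched by class attribute
print(soup.find("a")["href"])                     # link: https://smarttejas.com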
4. Creating Your First BeautifulSoup Object
Example:
from bs4 import BeautifulSoup
html = """
<html>
<body>
<h1>Hello World</h1>
<p>This is a test.</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")
Now soup stores the whole HTML in a structured form.
Try printing it:
print(soup.prettify())
5. Parsing and Selecting HTML Elements
BeautifulSoup provides several functions for selecting elements:
5.1 find()
Returns the first matching element.
soup.find("h1")
5.2 find_all()
Returns all matching elements.
soup.find_all("p")
5.3 Find by class
soup.find("p", class_="description")
5.4 Find by id
soup.find("div", id="main")
5.5 CSS selectors (select and select_one)
soup.select("div.article h2.title")
CSS selectors are very powerful and widely used in advanced scraping.
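Here is a short sketch exercising all of these lookups on one small document:
from bs4 import BeautifulSoup

html = """
<div class="article" id="main">
<h2 class="title">First Post</h2>
<p class="description">Intro text.</p>
<p>Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("p").text)                           # first <p>: Intro text.
print(len(soup.find_all("p")))                       # 2
print(soup.find("p", class_="description").text)     # by class: Intro text.
print(soup.find("div", id="main")["class"])          # by id -> ['article']
print(soup.select_one("div.article h2.title").text)  # CSS selector: First Post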
6. Extracting Text and Attributes
6.1 Extract text
title = soup.find("h1").text
or
title = soup.find("h1").get_text(strip=True)
6.2 Extract attribute (like href)
link = soup.find("a")["href"]
Example:
<img src="image.jpg" alt="Smart Tutorials Logo">
soup.find("img")["src"]
7. Web Scraping with Requests + BeautifulSoup
This is the most common pattern:
import requests
from bs4 import BeautifulSoup
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
Extract the page title:
title = soup.find("title").text
print(title)
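In practice, it helps to add a timeout and a status check before parsing. Here is a slightly hardened sketch of the same pattern:
import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url, timeout=10)  # fail fast instead of hanging forever
response.raise_for_status()               # raise an error on 4xx/5xx responses

soup = BeautifulSoup(response.text, "html.parser")
print(soup.find("title").text)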
8. Full Practical Example: Scraping Articles
Imagine you want to scrape articles from a blog:
import requests
from bs4 import BeautifulSoup
url = "https://smarttejas.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
articles = soup.find_all("h2", class_="post-title")
for item in articles:
    print(item.text)
9. Extracting Links from a Page
links = soup.find_all("a")
for link in links:
    print(link.text, " -> ", link.get("href"))
10. Scraping Tables (Important for MIS & Data Analysis)
Most MIS reports online appear as HTML tables.
Example:
table = soup.find("table")
rows = table.find_all("tr")
for row in rows:
    cols = row.find_all("td")
    cols = [c.text.strip() for c in cols]
    print(cols)
This is extremely useful for:
- Government data scraping
- Policy scraping
- Sales tables
- Incentive data
- Dashboard automation
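Going one step further, you can pair header cells with data cells. This sketch assumes a simple table with a single header row of <th> cells, continuing from the soup object above:
table = soup.find("table")
headers = [th.text.strip() for th in table.find_all("th")]

records = []
for row in table.find_all("tr")[1:]:  # skip the header row
    cells = [td.text.strip() for td in row.find_all("td")]
    if cells:
        records.append(dict(zip(headers, cells)))

print(records)  # e.g. [{'Region': 'North', 'Sales': '120'}, ...]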
11. Scraping Pagination (Multiple Pages)
Many websites show:
- Page 1
- Page 2
- Page 3
Example:
base = "https://example.com/page="
for page in range(1, 6):
    url = base + str(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.find_all("div", class_="product")
    for item in items:
        print(item.text)
Pagination scraping is essential when:
- Collecting training datasets
- Scraping multiple articles
- Gathering product reviews
- Making datasets for fine-tuning Tejas AI
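A politer variant of the loop above pauses between requests and stops when a page returns no items (a sketch; the URL pattern and class name are placeholder assumptions):
import time
import requests
from bs4 import BeautifulSoup

base = "https://example.com/page="
for page in range(1, 6):
    response = requests.get(base + str(page), timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.find_all("div", class_="product")
    if not items:      # no products found -> likely past the last page
        break
    for item in items:
        print(item.text)
    time.sleep(1)      # pause between requests to be gentle with the server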
12. Saving Scraped Data to CSV
import csv
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Link"])
    for a in soup.find_all("a"):
        writer.writerow([a.text, a.get("href")])
13. Handling Errors and Exceptions
Try/Except
try:
    price = soup.find("span", class_="price").text
except AttributeError:  # find() returned None: the element was not found
    price = None
Read More: Error Handling in Python
Common errors:
| Error | Meaning |
|---|---|
| NoneType has no attribute | item not found |
| requests.exceptions.ConnectionError | website offline |
| requests.exceptions.Timeout | page taking too long |
| Blocked by website | need headers or Selenium |
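Here is a sketch of a helper that catches the network errors from the table above and returns None instead of crashing:
import requests
from bs4 import BeautifulSoup

def fetch(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return BeautifulSoup(response.text, "html.parser")
    except requests.exceptions.Timeout:
        print("Page took too long:", url)
    except requests.exceptions.ConnectionError:
        print("Website unreachable:", url)
    except requests.exceptions.HTTPError as e:
        print("Bad status code (possibly blocked):", e)
    return None

soup = fetch("https://example.com")
if soup is not None:
    print(soup.find("title").text)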
14. Adding Headers to Avoid Blocks
Many websites block bots.
Add browser-like headers:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}
response = requests.get(url, headers=headers)
15. When NOT to Use BeautifulSoup
On its own, BeautifulSoup cannot scrape:
- JavaScript-rendered sites
- Dynamic content
- Sites behind login
- Sites blocking bots
For such cases, you must use:
- Selenium
- Playwright
- API endpoints
16. Creating Datasets for Fine-Tuning Tejas AI Using BeautifulSoup
You can build your own dataset like this:
Step 1: Scrape website or text
Step 2: Clean text
Step 3: Convert to JSONL
Step 4: Feed into fine-tuning pipeline
BeautifulSoup is perfect for:
- Scraping your Smart Tutorials articles
- Scraping coding examples
- Scraping MIS explanations
- Scraping Excel formula questions
- Scraping FAQ pages
Dataset Format Example (JSONL)
{"prompt":"What is VLOOKUP?","response":"VLOOKUP is used to lookup a value vertically..."}
{"prompt":"Write Python code for requests module.","response":"Here is a simple code example..."}
BeautifulSoup helps you extract:
- Titles
- Content
- Code blocks
- Tags
- Summaries
Then you prepare this content for fine-tuning.
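As a sketch of Step 3, the standard json module can write prompt/response pairs to a JSONL file (the pairs list here is purely illustrative):
import json

pairs = [
    {"prompt": "What is VLOOKUP?",
     "response": "VLOOKUP is used to lookup a value vertically..."},
]

with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")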
Read More: JSON Module in Python
17. Real Example: Scraping Your Smart Tutorials Website
You can use this to build the Tejas AI dataset:
import requests
from bs4 import BeautifulSoup
url = "https://smarttejas.com"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
posts = soup.select("h2.entry-title a")
for p in posts:
    print("Title:", p.text)
    print("URL:", p["href"])
Then, for each URL, scrape (see the sketch after this list):
- Title
- Main content
- Code
- Summary
- Tags
This becomes your fine-tuning dataset.
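A hedged continuation of the code above: visit each post URL and pull the pieces listed. The selectors ("h1.entry-title", "div.entry-content") are assumptions about a typical WordPress theme; adjust them to match the real pages.
for p in posts:
    r = requests.get(p["href"], timeout=10)
    article = BeautifulSoup(r.text, "html.parser")
    title = article.select_one("h1.entry-title")
    content = article.select_one("div.entry-content")
    if title and content:
        print(title.get_text(strip=True))
        print(content.get_text(strip=True)[:200])  # preview first 200 characters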
18. Cleaning Scraped Content
You may want to remove:
- Ads
- Navigation menus
- Sidebar
- Footer
BeautifulSoup makes this easy with decompose():
for unwanted in soup.select("header, footer, nav"):
    unwanted.decompose()
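Once those elements are removed, the remaining text can be pulled in one call (a sketch, continuing from the soup object above):
clean_text = soup.get_text(separator="\n", strip=True)
print(clean_text[:300])  # preview the cleaned article text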
19. Conclusion
BeautifulSoup is one of the most powerful and beginner-friendly tools for web scraping in Python. With just a few lines of code, you can extract:
- Text
- Titles
- Tables
- Images
- Links
- Article content
It is perfect for:
- Creating datasets for fine-tuning Tejas AI
- Scraping content for Smart Tutorials
- MIS data extraction
- Data analysis and Excel reports
- Collecting JSONL training files
You now know the essentials, from basic parsing to advanced scraping and dataset creation. With this skill, you can automate content collection, build AI-ready datasets, and fully prepare for your fine-tuning pipeline.
What’s Next?
In the next post, we'll learn about OOP (Object-Oriented Programming) in Python.