BeautifulSoup Module in Python

Web scraping is one of the most important skills for Python learners, data analysts, MIS professionals, and anyone who wants to create datasets for machine learning or fine-tuning models like Tejas AI.
Among the web scraping tools in Python, BeautifulSoup (from the bs4 library) is one of the most popular because it is simple, powerful, and beginner-friendly.

Let’s start!

1. What is BeautifulSoup?

BeautifulSoup is a Python library used to parse (read) HTML and extract the data you need from web pages.

It takes HTML code like this:

<h1 class="title">Welcome to Smart Tutorials</h1>

And makes it easy to search:

soup.find("h1")

or

soup.find("h1", class_="title")

BeautifulSoup is not a web downloader; it only parses HTML that you have already fetched.

To download the pages themselves, we use one of the following (a minimal standard-library sketch follows this list):

  • requests (most common)
  • urllib
  • browser automation tools like Selenium (for dynamic JavaScript sites)
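
For example, here is a minimal download-then-parse sketch using only urllib from the standard library (the URL is just a placeholder):

from urllib.request import urlopen
from bs4 import BeautifulSoup

# urllib downloads the raw HTML; BeautifulSoup then parses it
html = urlopen("https://example.com").read()
soup = BeautifulSoup(html, "html.parser")
print(soup.find("h1"))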

2. Installing BeautifulSoup

Install with pip:

pip install beautifulsoup4

Optional (but recommended):

pip install requests

Now you are ready to begin.

Read More: Requests Module in Python


3. Understanding the Structure of an HTML Page

BeautifulSoup works best when you know how HTML is structured.

A simple webpage looks like:

<html>
  <head>
    <title>Sample Page</title>
  </head>

  <body>
    <h1>Heading</h1>
    <p class="description">This is a paragraph.</p>
    <a href="https://smarttejas.com">Visit Smart Tejas</a>
  </body>
</html>

Key things you will scrape (see the quick preview right after this list):

  • tags (h1, p, a)
  • class attributes
  • id attributes
  • texts
  • links (href)
  • images (src)
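
As a quick preview, here is how each of these maps to a BeautifulSoup lookup on the sample page above (the individual functions are explained in the next sections):

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Heading</h1>
    <p class="description">This is a paragraph.</p>
    <a href="https://smarttejas.com">Visit Smart Tejas</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("h1").text)                       # tag text -> Heading
print(soup.find("p", class_="description").text)  # element found by class
print(soup.find("a")["href"])                     # link (href attribute)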

4. Creating Your First BeautifulSoup Object

Example:

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Hello World</h1>
    <p>This is a test.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

Now soup holds the entire HTML document as a searchable parse tree.

Try printing it:

print(soup.prettify())

5. Parsing and Selecting HTML Elements

BeautifulSoup provides several functions for finding elements:

5.1 find()

Returns the first matching element.

soup.find("h1")

5.2 find_all()

Returns all matching elements.

soup.find_all("p")

5.3 Find by class

soup.find("p", class_="description")

5.4 Find by id

soup.find("div", id="main")

5.5 CSS selectors (select and select_one)

soup.select("div.article h2.title")

CSS selectors are extremely powerful and widely used in advanced scraping. select() returns a list of all matching elements, while select_one() returns only the first match (or None).
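
Here is a self-contained sketch showing the difference (the class names match the selector above and are just illustrative):

from bs4 import BeautifulSoup

html = """
<div class="article">
  <h2 class="title">First Post</h2>
  <h2 class="title">Second Post</h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every matching element
for heading in soup.select("div.article h2.title"):
    print(heading.text)

# select_one() returns only the first match, or None if nothing matches
print(soup.select_one("div.article h2.title").text)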


6. Extracting Text and Attributes

6.1 Extract text

title = soup.find("h1").text

or

title = soup.find("h1").get_text(strip=True)

6.2 Extract attribute (like href)

link = soup.find("a")["href"]

Example:

<img src="image.jpg" alt="Smart Tutorials Logo">

soup.find("img")["src"]
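
Note that the ["href"] / ["src"] syntax raises a KeyError if the attribute is missing (and an AttributeError if the tag itself is not found). A safer pattern uses .get():

link = soup.find("a")
if link is not None:
    print(link.get("href"))  # .get() returns None instead of raising KeyError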

7. Web Scraping with Requests + BeautifulSoup

This is the most common pattern:

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

Extract Title

title = soup.find("title").text
print(title)
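
In real scripts it is worth checking that the request actually succeeded before parsing. A minimal sketch (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com", timeout=10)
response.raise_for_status()  # raises an HTTPError for 4xx/5xx status codes

soup = BeautifulSoup(response.text, "html.parser")
print(soup.find("title").text)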

8. Full Practical Example: Scraping Articles

Imagine you want to scrape articles from a blog:

import requests
from bs4 import BeautifulSoup

url = "https://smarttejas.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

articles = soup.find_all("h2", class_="post-title")

for item in articles:
    print(item.text)

9. Extracting Links from a Page

links = soup.find_all("a")

for link in links:
    print(link.text, " -> ", link.get("href"))

10. Scraping Tables (Important for MIS & Data Analysis)

Most MIS reports online appear as HTML tables.

Example:

table = soup.find("table")

rows = table.find_all("tr")

for row in rows:
    cols = row.find_all("td")
    cols = [c.text.strip() for c in cols]
    print(cols)
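
Header rows usually use <th> cells instead of <td>, so a slightly more complete sketch reads both:

table = soup.find("table")

for row in table.find_all("tr"):
    # header rows use <th>, data rows use <td>
    cells = row.find_all(["td", "th"])
    print([c.text.strip() for c in cells])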

This is extremely useful for:

  • Government data scraping
  • Policy scraping
  • Sales tables
  • Incentive data
  • Dashboard automation

11. Scraping Pagination (Multiple Pages)

Many websites show:

  • Page 1
  • Page 2
  • Page 3

Example:

base = "https://example.com/products?page="

for page in range(1, 6):
    url = base + str(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    
    items = soup.find_all("div", class_="product")
    
    for item in items:
        print(item.text)
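
When looping over many pages, it is also polite (and lowers the risk of being blocked) to pause between requests. One small addition to the loop above:

import time
import requests
from bs4 import BeautifulSoup

base = "https://example.com/products?page="

for page in range(1, 6):
    response = requests.get(base + str(page))
    soup = BeautifulSoup(response.text, "html.parser")

    for item in soup.find_all("div", class_="product"):
        print(item.text)

    time.sleep(1)  # wait one second before requesting the next page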

Pagination scraping is essential when:

  • Collecting training datasets
  • Scraping multiple articles
  • Gathering product reviews
  • Making datasets for fine-tuning Tejas AI

12. Saving Scraped Data to CSV

import csv

with open("data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Link"])

    for a in soup.find_all("a"):
        writer.writerow([a.text, a.get("href")])

13. Handling Errors and Exceptions

Try/Except

try:
    price = soup.find("span", class_="price").text
except AttributeError:
    # find() returned None, so the element was not on the page
    price = None

Read More: Error Handling in Python

Common errors:

  • AttributeError: 'NoneType' object has no attribute ... -> the element was not found on the page
  • requests.exceptions.ConnectionError -> the website is offline or unreachable
  • requests.exceptions.Timeout -> the page is taking too long to respond
  • Blocked by the website -> you need browser headers or Selenium
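
Here is a small sketch that guards against the network errors above (the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

try:
    response = requests.get("https://example.com", timeout=10)
    response.raise_for_status()
except requests.exceptions.ConnectionError:
    print("Website appears to be offline or unreachable")
except requests.exceptions.Timeout:
    print("The page took too long to respond")
else:
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.find("title").text)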

14. Adding Headers to Avoid Blocks

Many websites block bots.
Add browser-like headers:

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

response = requests.get(url, headers=headers)

15. When NOT to Use BeautifulSoup

BeautifulSoup (paired with requests) cannot scrape:

  • JavaScript-rendered sites
  • Dynamic content
  • Sites behind login
  • Sites blocking bots

For such cases, you must use:

  • Selenium
  • Playwright
  • API endpoints

16. Creating Datasets for Fine-Tuning Tejas AI Using BeautifulSoup

You can build your own dataset like this:

Step 1: Scrape website or text

Step 2: Clean text

Step 3: Convert to JSONL

Step 4: Feed into fine-tuning pipeline

BeautifulSoup is perfect for:

  • Scraping your Smart Tutorials articles
  • Scraping coding examples
  • Scraping MIS explanations
  • Scraping Excel formula questions
  • Scraping FAQ pages

Dataset Format Example (JSONL)

{"prompt":"What is VLOOKUP?","response":"VLOOKUP is used to lookup a value vertically..."}
{"prompt":"Write Python code for requests module.","response":"Here is a simple code example..."}

BeautifulSoup helps you extract:

  • Titles
  • Content
  • Code blocks
  • Tags
  • Summaries

Then you prepare this content for fine-tuning.
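
Here is a minimal sketch of the JSONL conversion step, assuming you have already collected prompt/response pairs (the field names follow the format shown above):

import json

# In practice these pairs come from your BeautifulSoup scraping
records = [
    {"prompt": "What is VLOOKUP?", "response": "VLOOKUP is used to lookup a value vertically..."},
    {"prompt": "Write Python code for requests module.", "response": "Here is a simple code example..."},
]

with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")  # one JSON object per line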

Read More: JSON Module in Python


17. Real Example: Scraping Your Smart Tutorials Website

You can use this to build the Tejas AI dataset:

import requests
from bs4 import BeautifulSoup

url = "https://smarttejas.com"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

posts = soup.select("h2.entry-title a")

for p in posts:
    print("Title:", p.text)
    print("URL:", p["href"])

Then for each URL, scrape:

  • Title
  • Main content
  • Code
  • Summary
  • Tags

This becomes your fine-tuning dataset.
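
Continuing the code above, here is a sketch of that second pass (the selectors h1.entry-title and div.entry-content are assumptions that depend on your WordPress theme):

for p in posts:
    post_url = p["href"]
    post_soup = BeautifulSoup(requests.get(post_url).text, "html.parser")

    title = post_soup.select_one("h1.entry-title")       # assumed selector
    content = post_soup.select_one("div.entry-content")  # assumed selector

    if title and content:
        print(title.get_text(strip=True))
        print(content.get_text(strip=True)[:200])  # preview of the main content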


18. Cleaning Scraped Content

You may want to remove:

  • Ads
  • Navigation menus
  • Sidebar
  • Footer

BeautifulSoup handles this with decompose(), which deletes an element from the parse tree:

for unwanted in soup.select("header, footer, nav"):
    unwanted.decompose()
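
After removing those elements, you can pull out just the readable text:

clean_text = soup.get_text(separator="\n", strip=True)
print(clean_text)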

19. Conclusion

BeautifulSoup is one of the most powerful and beginner-friendly tools for web scraping in Python. With just a few lines of code, you can extract:

  • Text
  • Titles
  • Tables
  • Images
  • Links
  • Article content

It is perfect for:

  • Creating datasets for fine-tuning Tejas AI
  • Scraping content for Smart Tutorials
  • MIS data extraction
  • Data analysis and Excel reports
  • Collecting JSONL training files

You now know the full workflow, from basic parsing to advanced scraping and dataset creation. With this skill, you can automate content collection, build AI-ready datasets, and fully prepare for your fine-tuning pipeline.

What’s Next?

In the next post, we’ll learn about OOP (Object-Oriented Programming) in Python.
