Data Cleaning in SQL

Getting your Trinity Audio player ready...

If you work with:

  • Sales reports
  • Customer databases
  • MIS dashboards
  • Government project data
  • FMCG company systems

You already know this truth:

Real-world data is messy.

You will face:

  • NULL values
  • Wrong date formats
  • Extra spaces
  • Mixed text cases
  • Incorrect data types
  • Blank strings
  • Duplicates
  • Logical inconsistencies

Before analysis, you must clean data.

This article will teach you how to clean data directly inside SQL using:

  • IFNULL / COALESCE
  • CASE statement
  • NULLIF
  • TRIM
  • CAST
  • Date conversion
  • String formatting
  • Conditional corrections

We will use the FMCG company dataset:

  • customers
  • products
  • employees
  • sales

Understanding NULL in SQL

NULL does NOT mean zero.
NULL does NOT mean empty string.
NULL means “Unknown”.

Example:

SELECT customer_name, email
FROM customers;

Some customers may have:

  • email = NULL
  • email = ” (empty string)

These are different.

Also Read: Select Data in SQL


Handling NULL Values

COALESCE

COALESCE returns the first non-null value.

Example: Replace NULL Email

SELECT 
    customer_name,
    COALESCE(email, 'Not Provided') AS email_status
FROM customers;

If email is NULL → show “Not Provided”

This works in both PostgreSQL and MySQL.


IFNULL (MySQL Only)

In MySQL:

SELECT 
    customer_name,
    IFNULL(email, 'Not Provided') AS email_status
FROM customers;

⚠ PostgreSQL does NOT support IFNULL.
Use COALESCE instead.

Since PostgreSQL is primary for Smart Tutorials → always prefer COALESCE.


CASE Statement

CASE is used for conditional logic.

Think of it as SQL’s version of:

if-else


Example: Categorizing Customers by Sales

SELECT 
    customer_id,
    SUM(sale_amount) AS total_sales,
    CASE
        WHEN SUM(sale_amount) > 100000 THEN 'Premium'
        WHEN SUM(sale_amount) > 50000 THEN 'Gold'
        ELSE 'Regular'
    END AS customer_category
FROM sales
GROUP BY customer_id;

Now you’re applying business logic directly in SQL.

This works in PostgreSQL and MySQL.


Removing Extra Spaces (TRIM)

Sometimes names are stored like:

‘ Rahul Sharma ‘

That creates reporting problems.


TRIM Example

SELECT 
    TRIM(customer_name) AS cleaned_name
FROM customers;

You can also use:

LTRIM(customer_name)
RTRIM(customer_name)

Works in both PostgreSQL and MySQL.


Fixing Text Case – UPPER, LOWER

Data may look like:

  • raipur
  • RAIPUR
  • RaIpUr

We want consistency.


PostgreSQL + MySQL

SELECT 
    UPPER(city) AS city_upper,
    LOWER(city) AS city_lower,
    INITCAP(city) -- PostgreSQL only
FROM customers;

⚠ INITCAP works only in PostgreSQL.

MySQL alternative:

SELECT 
    CONCAT(UPPER(LEFT(city,1)), LOWER(SUBSTRING(city,2))) AS city_proper
FROM customers;

NULLIF (Convert Value to NULL)

NULLIF converts a specific value into NULL.

Example:

If empty string should be treated as NULL.

SELECT 
    NULLIF(email, '') AS cleaned_email
FROM customers;

If email = ” → becomes NULL.

This is extremely useful before using COALESCE.


Converting Data Types (CAST)

Sometimes data is stored incorrectly.

Example:

price stored as TEXT instead of NUMERIC.


PostgreSQL

SELECT 
    CAST(price AS NUMERIC) AS numeric_price
FROM products;

Or:

price::NUMERIC

MySQL

SELECT 
    CAST(price AS DECIMAL(10,2)) AS numeric_price
FROM products;

Cleaning Date Formats

Real-life problem:

Date stored as text:

’05-Mar-2025 10:30 AM’


MySQL Conversion

SELECT 
    STR_TO_DATE(order_date, '%d-%b-%Y %h:%i %p') AS clean_date
FROM sales;

PostgreSQL Conversion

SELECT 
    TO_TIMESTAMP(order_date, 'DD-Mon-YYYY HH12:MI AM') AS clean_date
FROM sales;

This is extremely important in MIS reporting.


Handling Blank Strings vs NULL

Sometimes:

email = ”
instead of
email = NULL

Correct approach:

SELECT 
    COALESCE(NULLIF(email, ''), 'Not Provided') AS clean_email
FROM customers;

First:

NULLIF(email, ”) → converts blank to NULL
Then COALESCE replaces NULL.

Professional-level cleaning.


Removing Duplicates

If duplicate customers exist:

PostgreSQL

SELECT DISTINCT *
FROM customers;

If deleting duplicates:

DELETE FROM customers c1
USING customers c2
WHERE c1.customer_id > c2.customer_id
AND c1.email = c2.email;

MySQL

DELETE c1
FROM customers c1
JOIN customers c2
ON c1.email = c2.email
AND c1.customer_id > c2.customer_id;

Be careful while deleting duplicates.


Practical FMCG Cleaning Scenario

Imagine sales table has:

  • sale_amount as TEXT
  • blank city names
  • NULL quantity
  • inconsistent date format

Let’s clean in one query.


PostgreSQL Example

SELECT 
    sale_id,
    customer_id,
    CAST(sale_amount AS NUMERIC) AS clean_sale_amount,
    COALESCE(quantity, 0) AS clean_quantity,
    TRIM(UPPER(city)) AS clean_city,
    TO_TIMESTAMP(order_date, 'DD-Mon-YYYY HH12:MI AM') AS clean_order_date
FROM sales;

Now the dataset is analysis-ready.


Using CASE for Business Fixes

Example:

If quantity is NULL or 0 → mark as invalid.

SELECT 
    sale_id,
    quantity,
    CASE
        WHEN quantity IS NULL OR quantity = 0 THEN 'Invalid'
        ELSE 'Valid'
    END AS quantity_status
FROM sales;

This helps create data quality reports.


Handling Negative Values

Sometimes sale_amount becomes negative due to error.

SELECT 
    sale_id,
    ABS(sale_amount) AS corrected_amount
FROM sales;

ABS works in both PostgreSQL and MySQL.


Creating Clean Reporting View

Instead of cleaning every time, create a View.

CREATE VIEW clean_sales AS
SELECT 
    sale_id,
    CAST(sale_amount AS NUMERIC) AS sale_amount,
    COALESCE(quantity, 0) AS quantity,
    TRIM(UPPER(city)) AS city
FROM sales;

Now use:

SELECT * FROM clean_sales;

This is how professionals build reporting layers.


PostgreSQL vs MySQL Differences Summary

FunctionPostgreSQLMySQL
COALESCE
IFNULL
INITCAP
STR_TO_DATE
TO_TIMESTAMP
CAST
NULLIF

PostgreSQL is stricter and more powerful for analytics.


Common Data Cleaning Mistakes

❌ Treating NULL as zero without checking
❌ Not trimming text
❌ Ignoring blank strings
❌ Not converting date types
❌ Running analysis on TEXT numbers
❌ Deleting duplicates without backup

Always clean carefully.


Why Data Cleaning is Critical for Data Analysts

In real jobs:

  • 60–70% time goes in cleaning
  • 20% in analysis
  • 10% in reporting

If your data is wrong → your insights are wrong.

And wrong insights → wrong business decisions.


Final Summary

Data Cleaning in SQL is the process of preparing raw, messy data for accurate analysis. Using functions like COALESCE, CASE, NULLIF, TRIM, CAST, and date conversion tools, analysts can handle NULL values, fix inconsistent text, convert incorrect data types, remove duplicates, and apply business rules.

Both PostgreSQL and MySQL provide strong data cleaning capabilities, though syntax may differ slightly. Mastering data cleaning is essential for building reliable MIS reports, dashboards, and analytical queries. Clean data ensures accurate reporting, better insights, and professional-level SQL skills.


What’s Next?

👉 Views in SQL

Spread the love

Leave a Comment

Your email address will not be published. Required fields are marked *

Translate »
Scroll to Top