Data Cleaning in SQL - Smart Tutorials

Getting your Trinity Audio player ready...

If you work with:

Sales reports
Customer databases
MIS dashboards
Government project data
FMCG company systems

You already know this truth:

Real-world data is messy.

You will face:

NULL values
Wrong date formats
Extra spaces
Mixed text cases
Incorrect data types
Blank strings
Duplicates
Logical inconsistencies

Before analysis, you must clean data.

This article will teach you how to clean data directly inside SQL using:

IFNULL / COALESCE
CASE statement
NULLIF
TRIM
CAST
Date conversion
String formatting
Conditional corrections

We will use the FMCG company dataset:

customers
products
employees
sales

Understanding NULL in SQL

NULL does NOT mean zero.
NULL does NOT mean empty string.
NULL means “Unknown”.

Example:

SELECT customer_name, email
FROM customers;

SELECT customer_name, email
FROM customers;

Some customers may have:

email = NULL
email = ” (empty string)

These are different.

Also Read: Select Data in SQL

Handling NULL Values

COALESCE

COALESCE returns the first non-null value.

Example: Replace NULL Email

SELECT 
    customer_name,
    COALESCE(email, 'Not Provided') AS email_status
FROM customers;

SELECT 
    customer_name,
    COALESCE(email, 'Not Provided') AS email_status
FROM customers;

If email is NULL → show “Not Provided”

This works in both PostgreSQL and MySQL.

IFNULL (MySQL Only)

In MySQL:

SELECT 
    customer_name,
    IFNULL(email, 'Not Provided') AS email_status
FROM customers;

SELECT 
    customer_name,
    IFNULL(email, 'Not Provided') AS email_status
FROM customers;

⚠ PostgreSQL does NOT support IFNULL.
Use COALESCE instead.

Since PostgreSQL is primary for Smart Tutorials → always prefer COALESCE.

CASE Statement

CASE is used for conditional logic.

Think of it as SQL’s version of:

if-else

Example: Categorizing Customers by Sales

SELECT 
    customer_id,
    SUM(sale_amount) AS total_sales,
    CASE
        WHEN SUM(sale_amount) > 100000 THEN 'Premium'
        WHEN SUM(sale_amount) > 50000 THEN 'Gold'
        ELSE 'Regular'
    END AS customer_category
FROM sales
GROUP BY customer_id;

SELECT 
    customer_id,
    SUM(sale_amount) AS total_sales,
    CASE
        WHEN SUM(sale_amount) > 100000 THEN 'Premium'
        WHEN SUM(sale_amount) > 50000 THEN 'Gold'
        ELSE 'Regular'
    END AS customer_category
FROM sales
GROUP BY customer_id;

Now you’re applying business logic directly in SQL.

This works in PostgreSQL and MySQL.

Removing Extra Spaces (TRIM)

Sometimes names are stored like:

‘ Rahul Sharma ‘

That creates reporting problems.

TRIM Example

SELECT 
    TRIM(customer_name) AS cleaned_name
FROM customers;

SELECT 
    TRIM(customer_name) AS cleaned_name
FROM customers;

You can also use:

LTRIM(customer_name)
RTRIM(customer_name)

LTRIM(customer_name)
RTRIM(customer_name)

Works in both PostgreSQL and MySQL.

Fixing Text Case – UPPER, LOWER

Data may look like:

raipur
RAIPUR
RaIpUr

We want consistency.

PostgreSQL + MySQL

SELECT 
    UPPER(city) AS city_upper,
    LOWER(city) AS city_lower,
    INITCAP(city) -- PostgreSQL only
FROM customers;

SELECT 
    UPPER(city) AS city_upper,
    LOWER(city) AS city_lower,
    INITCAP(city) -- PostgreSQL only
FROM customers;

⚠ INITCAP works only in PostgreSQL.

MySQL alternative:

SELECT 
    CONCAT(UPPER(LEFT(city,1)), LOWER(SUBSTRING(city,2))) AS city_proper
FROM customers;

SELECT 
    CONCAT(UPPER(LEFT(city,1)), LOWER(SUBSTRING(city,2))) AS city_proper
FROM customers;

NULLIF (Convert Value to NULL)

NULLIF converts a specific value into NULL.

Example:

If empty string should be treated as NULL.

SELECT 
    NULLIF(email, '') AS cleaned_email
FROM customers;

SELECT 
    NULLIF(email, '') AS cleaned_email
FROM customers;

If email = ” → becomes NULL.

This is extremely useful before using COALESCE.

Converting Data Types (CAST)

Sometimes data is stored incorrectly.

Example:

price stored as TEXT instead of NUMERIC.

PostgreSQL

SELECT 
    CAST(price AS NUMERIC) AS numeric_price
FROM products;

SELECT 
    CAST(price AS NUMERIC) AS numeric_price
FROM products;

Or:

price::NUMERIC

price::NUMERIC

MySQL

SELECT 
    CAST(price AS DECIMAL(10,2)) AS numeric_price
FROM products;

SELECT 
    CAST(price AS DECIMAL(10,2)) AS numeric_price
FROM products;

Cleaning Date Formats

Real-life problem:

Date stored as text:

’05-Mar-2025 10:30 AM’

MySQL Conversion

SELECT 
    STR_TO_DATE(order_date, '%d-%b-%Y %h:%i %p') AS clean_date
FROM sales;

SELECT 
    STR_TO_DATE(order_date, '%d-%b-%Y %h:%i %p') AS clean_date
FROM sales;

PostgreSQL Conversion

SELECT 
    TO_TIMESTAMP(order_date, 'DD-Mon-YYYY HH12:MI AM') AS clean_date
FROM sales;

SELECT 
    TO_TIMESTAMP(order_date, 'DD-Mon-YYYY HH12:MI AM') AS clean_date
FROM sales;

This is extremely important in MIS reporting.

Handling Blank Strings vs NULL

Sometimes:

email = ”
instead of
email = NULL

Correct approach:

SELECT 
    COALESCE(NULLIF(email, ''), 'Not Provided') AS clean_email
FROM customers;

SELECT 
    COALESCE(NULLIF(email, ''), 'Not Provided') AS clean_email
FROM customers;

First:

NULLIF(email, ”) → converts blank to NULL
Then COALESCE replaces NULL.

Professional-level cleaning.

Removing Duplicates

If duplicate customers exist:

PostgreSQL

SELECT DISTINCT *
FROM customers;

SELECT DISTINCT *
FROM customers;

If deleting duplicates:

DELETE FROM customers c1
USING customers c2
WHERE c1.customer_id > c2.customer_id
AND c1.email = c2.email;

DELETE FROM customers c1
USING customers c2
WHERE c1.customer_id > c2.customer_id
AND c1.email = c2.email;

MySQL

DELETE c1
FROM customers c1
JOIN customers c2
ON c1.email = c2.email
AND c1.customer_id > c2.customer_id;

DELETE c1
FROM customers c1
JOIN customers c2
ON c1.email = c2.email
AND c1.customer_id > c2.customer_id;

Be careful while deleting duplicates.

Practical FMCG Cleaning Scenario

Imagine sales table has:

sale_amount as TEXT
blank city names
NULL quantity
inconsistent date format

Let’s clean in one query.

PostgreSQL Example

SELECT 
    sale_id,
    customer_id,
    CAST(sale_amount AS NUMERIC) AS clean_sale_amount,
    COALESCE(quantity, 0) AS clean_quantity,
    TRIM(UPPER(city)) AS clean_city,
    TO_TIMESTAMP(order_date, 'DD-Mon-YYYY HH12:MI AM') AS clean_order_date
FROM sales;

SELECT 
    sale_id,
    customer_id,
    CAST(sale_amount AS NUMERIC) AS clean_sale_amount,
    COALESCE(quantity, 0) AS clean_quantity,
    TRIM(UPPER(city)) AS clean_city,
    TO_TIMESTAMP(order_date, 'DD-Mon-YYYY HH12:MI AM') AS clean_order_date
FROM sales;

Now the dataset is analysis-ready.

Using CASE for Business Fixes

Example:

If quantity is NULL or 0 → mark as invalid.

SELECT 
    sale_id,
    quantity,
    CASE
        WHEN quantity IS NULL OR quantity = 0 THEN 'Invalid'
        ELSE 'Valid'
    END AS quantity_status
FROM sales;

SELECT 
    sale_id,
    quantity,
    CASE
        WHEN quantity IS NULL OR quantity = 0 THEN 'Invalid'
        ELSE 'Valid'
    END AS quantity_status
FROM sales;

This helps create data quality reports.

Handling Negative Values

Sometimes sale_amount becomes negative due to error.

SELECT 
    sale_id,
    ABS(sale_amount) AS corrected_amount
FROM sales;

SELECT 
    sale_id,
    ABS(sale_amount) AS corrected_amount
FROM sales;

ABS works in both PostgreSQL and MySQL.

Creating Clean Reporting View

Instead of cleaning every time, create a View.

CREATE VIEW clean_sales AS
SELECT 
    sale_id,
    CAST(sale_amount AS NUMERIC) AS sale_amount,
    COALESCE(quantity, 0) AS quantity,
    TRIM(UPPER(city)) AS city
FROM sales;

CREATE VIEW clean_sales AS
SELECT 
    sale_id,
    CAST(sale_amount AS NUMERIC) AS sale_amount,
    COALESCE(quantity, 0) AS quantity,
    TRIM(UPPER(city)) AS city
FROM sales;

Now use:

SELECT * FROM clean_sales;

SELECT * FROM clean_sales;

This is how professionals build reporting layers.

PostgreSQL vs MySQL Differences Summary

Function	PostgreSQL	MySQL
COALESCE	✅	✅
IFNULL	❌	✅
INITCAP	✅	❌
STR_TO_DATE	❌	✅
TO_TIMESTAMP	✅	❌
CAST	✅	✅
NULLIF	✅	✅

PostgreSQL is stricter and more powerful for analytics.

Common Data Cleaning Mistakes

❌ Treating NULL as zero without checking
❌ Not trimming text
❌ Ignoring blank strings
❌ Not converting date types
❌ Running analysis on TEXT numbers
❌ Deleting duplicates without backup

Always clean carefully.

Why Data Cleaning is Critical for Data Analysts

In real jobs:

60–70% time goes in cleaning
20% in analysis
10% in reporting

If your data is wrong → your insights are wrong.

And wrong insights → wrong business decisions.

Final Summary

Data Cleaning in SQL is the process of preparing raw, messy data for accurate analysis. Using functions like COALESCE, CASE, NULLIF, TRIM, CAST, and date conversion tools, analysts can handle NULL values, fix inconsistent text, convert incorrect data types, remove duplicates, and apply business rules.

Both PostgreSQL and MySQL provide strong data cleaning capabilities, though syntax may differ slightly. Mastering data cleaning is essential for building reliable MIS reports, dashboards, and analytical queries. Clean data ensures accurate reporting, better insights, and professional-level SQL skills.

What’s Next?

👉 Views in SQL

Spread the love

Understanding NULL in SQL

Handling NULL Values

COALESCE

Example: Replace NULL Email

IFNULL (MySQL Only)

CASE Statement

Example: Categorizing Customers by Sales

Removing Extra Spaces (TRIM)

TRIM Example

Fixing Text Case – UPPER, LOWER

PostgreSQL + MySQL

NULLIF (Convert Value to NULL)

Converting Data Types (CAST)

PostgreSQL

MySQL

Cleaning Date Formats

MySQL Conversion

PostgreSQL Conversion

Handling Blank Strings vs NULL

Removing Duplicates

PostgreSQL

MySQL

Practical FMCG Cleaning Scenario

PostgreSQL Example

Using CASE for Business Fixes

Handling Negative Values

Creating Clean Reporting View

PostgreSQL vs MySQL Differences Summary

Common Data Cleaning Mistakes

Why Data Cleaning is Critical for Data Analysts

Final Summary

What’s Next?

Leave a Comment Cancel Reply