📃 Pandas Data Cleaning
Last Updated : 30th August 2025
Data cleaning is the process of preparing data before it is used for analysis. It involves removing duplicates, handling missing values, and ensuring data quality. Pandas provides several built-in functions for this:
isnull(): Check for missing values.
notnull(): Opposite of isnull(); check for non-missing values.
isna(): Alias of isnull(); check for missing (NaN) values.
dropna(): Remove rows with missing values.
fillna(): Replace missing values with a specified value.
ffill(): Forward-fill missing values.
bfill(): Backward-fill missing values.
duplicated(): Check for duplicate rows.
drop_duplicates(): Remove duplicate rows.
replace(): Replace specific values with another value.
where(): Keep values where a condition is True and replace the rest.
mask(): Replace values where a condition is True and keep the rest.
Suppose the data looks like this:
import pandas as pd
data = {
"Name": ["Amit", "Neha", "Raj", None, "Amit"],
"Age": [25, None, 30, 22, 25],
"City": ["Delhi", "Mumbai", None, "Chennai", "Delhi"]
}
df = pd.DataFrame(data)
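Before cleaning, it can help to look at what pandas actually stored. The sketch below rebuilds the same sample data and checks its shape and the dtype of the Age column. Because a numeric column cannot hold None, pandas stores it as NaN and the column becomes float64.

```python
import pandas as pd

data = {
    "Name": ["Amit", "Neha", "Raj", None, "Amit"],
    "Age": [25, None, 30, 22, 25],
    "City": ["Delhi", "Mumbai", None, "Chennai", "Delhi"],
}
df = pd.DataFrame(data)

print(df.shape)         # (number of rows, number of columns)
print(df["Age"].dtype)  # None in a numeric column is stored as NaN, so Age is float64
print(df)
```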
📊 Missing data
Check for missing (NaN) values. isnull() returns a DataFrame of the same shape, with True wherever a value is missing.
df.isnull()
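The boolean result of isnull() is usually summarized rather than read cell by cell. As a minimal sketch on the same sample data, the code below counts missing values per column and selects the rows that contain at least one missing value. The variable names missing_per_column and rows_with_missing are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Neha", "Raj", None, "Amit"],
    "Age": [25, None, 30, 22, 25],
    "City": ["Delhi", "Mumbai", None, "Chennai", "Delhi"],
})

# True counts as 1 when summed, so this gives the number of missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# any(axis=1) is True for a row if at least one of its columns is missing
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)

# notnull() is the element-wise opposite of isnull()
print(df.notnull().equals(~df.isnull()))
```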
Filling and removing missing values
# Drop rows with missing values
print(df.dropna())
# Fill missing values with 0 (note: this fills the missing values in all columns at once)
print(df.fillna(0))
# Fill missing values in a specific column only
print(df.fillna({"Age": 0}))
# Fill missing values with a different default for each column
print(df.fillna({"Name": "Unknown", "Age": 0, "City": "Unknown"}))
# Forward fill (fills each missing value from the previous row)
print(df.ffill())
# Backward fill (fills each missing value from the next row)
print(df.bfill())
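The calls above return a new DataFrame and leave df unchanged, so the result has to be assigned to keep it. As a hedged sketch of two common variations, the code below fills the missing Age with the column mean and drops only the rows where Name is missing. Using the mean is just one possible choice, and the names age_mean, filled, and dropped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Neha", "Raj", None, "Amit"],
    "Age": [25, None, 30, 22, 25],
    "City": ["Delhi", "Mumbai", None, "Chennai", "Delhi"],
})

# mean() skips NaN by default, so this is the mean of the four known ages
age_mean = df["Age"].mean()

# Fill only the Age column; the result is a new DataFrame
filled = df.fillna({"Age": age_mean})
print(filled)

# Drop a row only when Name is missing; gaps in other columns are kept
dropped = df.dropna(subset=["Name"])
print(dropped)

# The original frame still contains its missing Age
print(df["Age"].isna().sum())
```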
Handling duplicates
# Check for duplicates
print(df.duplicated())
# Remove duplicates
print(df.drop_duplicates())
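By default, duplicated() marks only the second and later copies of a row, and drop_duplicates() keeps the first copy. The sketch below shows the keep and subset parameters on the same sample data, where the first and last rows are identical. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Neha", "Raj", None, "Amit"],
    "Age": [25, None, 30, 22, 25],
    "City": ["Delhi", "Mumbai", None, "Chennai", "Delhi"],
})

# Default: the first occurrence is not flagged, later copies are
dups = df.duplicated()
print(dups)

# keep=False flags every copy, which is handy for inspecting duplicates
all_copies = df[df.duplicated(keep=False)]
print(all_copies)

# keep="last" drops the earlier copy and keeps the later one
last_kept = df.drop_duplicates(keep="last")
print(last_kept)

# subset compares only the listed columns when looking for duplicates
unique_names = df.drop_duplicates(subset=["Name"]).reset_index(drop=True)
print(unique_names)
```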
Replacing values
# Replace Delhi → New Delhi
print(df.replace("Delhi", "New Delhi"))
# Replace "Amit" in the Name column and 25 in the Age column with "Unknown"
print(df.replace({"Name": "Amit", "Age": 25}, "Unknown"))
# Replace different values in different columns using a nested dictionary
print(df.replace({"Name": {"Amit": "Unknown"}, "Age": {25: 0}}))
# Replace missing values (fillna() is the more idiomatic way to do this)
print(df.replace({None: "Unknown"}))
# Keep the rows where Age > 25 and replace every value in the other rows with "Unknown"
print(df.where(df["Age"] > 25, "Unknown"))
# Replace every value in the rows where Age > 25 with "Unknown" and keep the other rows
print(df.mask(df["Age"] > 25, "Unknown"))
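where() and mask() are mirror images of each other, which is easier to see on a single column. The sketch below applies both to Age with a numeric replacement value of 0, chosen only for illustration. A missing Age compares as False against 25, so where() replaces it and mask() leaves it as NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Neha", "Raj", None, "Amit"],
    "Age": [25, None, 30, 22, 25],
    "City": ["Delhi", "Mumbai", None, "Chennai", "Delhi"],
})

age = df["Age"]
cond = age > 25  # NaN > 25 evaluates to False

# where(): keep values where cond is True, replace the rest with 0
kept = age.where(cond, 0)
print(kept)

# mask(): replace values where cond is True with 0, keep the rest
hidden = age.mask(cond, 0)
print(hidden)

# where(cond) gives the same result as mask(~cond)
print(kept.equals(age.mask(~cond, 0)))
```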