📃 Pandas Data Cleaning

Last Updated : 30th August 2025


Data cleaning is the process of preparing data before it is used for analysis. It involves removing duplicates, handling missing values, and otherwise ensuring data quality. Pandas provides several built-in functions for this:

  • isnull(): Check for missing values (None or NaN).
  • notnull(): Opposite of isnull(); True where a value is present.
  • isna(): Same as isnull(); the two are aliases.
  • dropna(): Remove rows (or columns) with missing values.
  • fillna(): Replace missing values with a specified value.
  • ffill(): Forward-fill missing values from the previous row.
  • bfill(): Backward-fill missing values from the next row.
  • duplicated(): Flag duplicate rows.
  • drop_duplicates(): Remove duplicate rows.
  • replace(): Replace specific values with another value.
  • where(): Keep values where a condition is True and replace the rest.
  • mask(): Replace values where a condition is True and keep the rest.

Suppose the data looks like this:

import pandas as pd

data = {
    "Name": ["Amit", "Neha", "Raj", None, "Amit"],
    "Age": [25, None, 30, 22, 25],
    "City": ["Delhi", "Mumbai", None, "Chennai", "Delhi"]
}
df = pd.DataFrame(data)

📊 Missing data

Check for missing (NaN) values. The result is a DataFrame of booleans that is True wherever a value is missing.

print(df.isnull())
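A boolean table is hard to read on larger data, so it is common to summarise it. The sketch below rebuilds the sample DataFrame and counts the gaps per column; the variable names (missing_per_column, rows_with_missing and so on) are illustrative and not part of pandas.

```python
import pandas as pd

data = {
    "Name": ["Amit", "Neha", "Raj", None, "Amit"],
    "Age": [25, None, 30, 22, 25],
    "City": ["Delhi", "Mumbai", None, "Chennai", "Delhi"],
}
df = pd.DataFrame(data)

# Number of missing values in each column (True counts as 1 when summed)
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)

# isna() is an alias of isnull(); notnull() is the element-wise inverse
same_as_isna = df.isnull().equals(df.isna())
inverse_of_notnull = df.isnull().equals(~df.notnull())
print(same_as_isna, inverse_of_notnull)
```

With this data each column has exactly one gap, and the gaps fall in three different rows.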

Filling and removing missing values

# Drop rows with missing values
print(df.dropna())

# Fill every missing value with 0 (this applies to all columns at once)
print(df.fillna(0))

# Fill missing values in a specific column only
print(df.fillna({"Age": 0}))

# Fill missing values with a different default for each column
print(df.fillna({"Name": "Unknown", "Age": 0, "City": "Unknown"}))

# Forward fill: copy the last valid value from an earlier row
print(df.ffill())

# Backward fill: copy the next valid value from a later row
print(df.bfill())
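Dropping every row that has any gap can discard more data than intended. As a rough sketch, dropna() can be narrowed with its subset and how parameters, and a numeric column can be filled with a statistic instead of a constant. Using the mean here is only one possible choice, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Neha", "Raj", None, "Amit"],
    "Age": [25, None, 30, 22, 25],
    "City": ["Delhi", "Mumbai", None, "Chennai", "Delhi"],
})

# Drop a row only when Age is missing; gaps in other columns are tolerated
age_known = df.dropna(subset=["Age"])
print(age_known)

# Drop a row only when every column is missing (no such row in this data)
not_empty = df.dropna(how="all")
print(len(not_empty))

# Fill missing ages with the mean of the known ages
age_filled = df["Age"].fillna(df["Age"].mean())
print(age_filled)
```

None of these calls modify df; each returns a new object, so assign the result if you want to keep it.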

Handling duplicates

# Flag rows that repeat an earlier row (True = duplicate)
print(df.duplicated())

# Remove duplicate rows, keeping the first occurrence
print(df.drop_duplicates())
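By default only the later copies of a row are flagged. The following sketch shows how the keep parameter changes which copies count as duplicates; it rebuilds the sample data, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Neha", "Raj", None, "Amit"],
    "Age": [25, None, 30, 22, 25],
    "City": ["Delhi", "Mumbai", None, "Chennai", "Delhi"],
})

# Default (keep="first"): the first copy is kept, later copies are flagged
later_copies = df.duplicated()
print(later_copies)

# keep=False flags every copy of a repeated row, including the first
all_copies = df.duplicated(keep=False)
print(df[all_copies])

# keep="last" drops the earlier copy instead of the later one
keep_last = df.drop_duplicates(keep="last")
print(keep_last)

# The original index labels survive; reset them if a clean 0..n-1 index is wanted
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```

In this data the last row repeats the first, so the default call removes row 4 while keep="last" removes row 0.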

Replacing values

# Replace Delhi → New Delhi
print(df.replace("Delhi", "New Delhi"))

# Replace "Amit" in the Name column and 25 in the Age column with "Unknown"
print(df.replace({"Name": "Amit", "Age": 25}, "Unknown"))

# Replace different values per column using a nested dict
print(df.replace({"Name": {"Amit": "Unknown"}, "Age": {25: 0}}))

# Replace missing values
print(df.replace({None: "Unknown"}))

# where() keeps rows where Age > 25 and replaces every value in the other rows with "Unknown"
print(df.where(df["Age"] > 25, "Unknown"))

# mask() replaces every value in rows where Age > 25 with "Unknown" and keeps the other rows
print(df.mask(df["Age"] > 25, "Unknown"))
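The two calls above apply the condition to whole rows. In practice where() and mask() are often applied to a single column instead, as in this sketch. The threshold 25 and the replacement values are arbitrary, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Amit", "Neha", "Raj", None, "Amit"],
    "Age": [25, None, 30, 22, 25],
    "City": ["Delhi", "Mumbai", None, "Chennai", "Delhi"],
})

age = df["Age"]
over_25 = age > 25  # a missing age compares as False

# where(): keep ages where the condition is True, replace the rest with 0
kept = age.where(over_25, 0)
print(kept)

# mask(): replace ages where the condition is True, here capping them at 25
capped = age.mask(over_25, 25)
print(capped)

# where(cond, x) gives the same result as mask(~cond, x)
equivalent = age.where(over_25, 0).equals(age.mask(~over_25, 0))
print(equivalent)
```

Because a missing age fails the comparison, where() replaces it with 0 while mask() leaves it missing.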