EDA & Feature Engineering

Last Updated : 11th September 2025


What is EDA?

Exploratory Data Analysis (EDA) is the statistical examination of a dataset to understand its structure and characteristics, including its size, the distribution of its values, and the relationships between its variables.

Consider the following sample data:

import pandas as pd

df = pd.DataFrame({
    "Age": [20, 22, 21, 19, 23],
    "Marks": [85, 90, 78, 88, 92]
})
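
Before computing any statistics, it usually helps to check the size and structure of the data mentioned in the definition above. The following is a minimal sketch on the same sample data; the variable names `n_rows`, `n_cols`, and `missing` are only illustrative.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    "Age": [20, 22, 21, 19, 23],
    "Marks": [85, 90, 78, 88, 92]
})

n_rows, n_cols = df.shape      # (number of rows, number of columns)
print(n_rows, n_cols)

print(df.dtypes)               # data type of each column

missing = df.isna().sum()      # count of missing values per column
print(missing)

print(df.head())               # first rows, for a quick visual check
```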

Summary Statistics

# Basic statistics
print(df.mean())      # mean per column
print(df.median())    # median per column
print(df.min())       # min
print(df.max())       # max
print(df.std())       # sample standard deviation (ddof=1 by default)
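
The individual calls above can also be collected in one table. As a sketch, `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum of every numeric column; the name `summary` is just an illustrative variable.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    "Age": [20, 22, 21, 19, 23],
    "Marks": [85, 90, 78, 88, 92]
})

# One row per statistic, one column per numeric column
summary = df.describe()
print(summary)

# Individual values can be read back from the table
print(summary.loc["mean", "Marks"])
```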

Value Distribution

print(df["Marks"].value_counts())   # frequency of each score
print(df["Marks"].value_counts(normalize=True)) # proportion of each score (multiply by 100 for a percentage)

Correlation

# Correlation between numeric columns
print(df.corr())
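
By default, `corr()` computes the Pearson correlation coefficient. To make clear what that number means, here is a small sketch that computes it by hand from the deviations of each column and compares it with the pandas result. The names `age_dev`, `marks_dev`, `r_manual`, and `r_pandas` are illustrative.

```python
import math
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    "Age": [20, 22, 21, 19, 23],
    "Marks": [85, 90, 78, 88, 92]
})

# Deviation of every value from its column mean
age_dev = df["Age"] - df["Age"].mean()
marks_dev = df["Marks"] - df["Marks"].mean()

# Pearson r = sum of products of deviations
#             / sqrt(product of the sums of squared deviations)
r_manual = (age_dev * marks_dev).sum() / math.sqrt(
    (age_dev ** 2).sum() * (marks_dev ** 2).sum()
)

# The same coefficient computed by pandas
r_pandas = df["Age"].corr(df["Marks"])

print(r_manual, r_pandas)
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one.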

Group-wise Statistics

df2 = pd.DataFrame({
    "Department": ["IT","HR","IT","Finance","HR"],
    "Salary": [50000,45000,52000,60000,47000]
})

# Average salary per department
print(df2.groupby("Department")["Salary"].mean())
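
When more than one statistic per group is needed, `agg()` accepts a list of aggregation names. A short sketch on the same department data follows; `stats` is an illustrative variable name.

```python
import pandas as pd

# Same department data as above
df2 = pd.DataFrame({
    "Department": ["IT", "HR", "IT", "Finance", "HR"],
    "Salary": [50000, 45000, 52000, 60000, 47000]
})

# Several statistics per department in a single table
stats = df2.groupby("Department")["Salary"].agg(["mean", "min", "max", "count"])
print(stats)
```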

Outlier Detection

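Using the IQR Method

# First and third quartiles of the Marks column
Q1 = df["Marks"].quantile(0.25)
Q3 = df["Marks"].quantile(0.75)
IQR = Q3 - Q1   # interquartile range

# Rows lying more than 1.5 * IQR below Q1 or above Q3 are treated as outliers
outliers = df[(df["Marks"] < Q1 - 1.5*IQR) | (df["Marks"] > Q3 + 1.5*IQR)]
print(outliers)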
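
The sample marks contain no outliers, so the snippet above prints an empty DataFrame. To see the rule flag a value, here is a sketch that wraps the same logic in a helper and applies it to the marks with one added extreme score. The helper name `iqr_outliers`, its parameter `k`, and the extra value 30 are assumptions made for illustration only.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

# Original marks: nothing is flagged
marks = pd.Series([85, 90, 78, 88, 92])
print(iqr_outliers(marks))

# Hypothetical data with one extreme score added
marks_with_extreme = pd.Series([85, 90, 78, 88, 92, 30])
print(iqr_outliers(marks_with_extreme))   # flags only the value 30
```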

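Using the Z-Score Method

# Z-score: distance of each value from the mean, in standard deviations
z_scores = (df["Marks"] - df["Marks"].mean()) / df["Marks"].std()

# Values more than 3 standard deviations from the mean are treated as outliers
outliers = df[z_scores.abs() > 3]
print(outliers)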
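
One caveat with the z-score rule on very small samples: when the sample standard deviation (ddof=1, the pandas default) is used, no z-score in a sample of n values can exceed (n - 1) / sqrt(n) in absolute value. For n = 5 that limit is about 1.79, so a threshold of 3 can never flag anything. The sketch below illustrates this with a hypothetical data-entry error of 1000; the data and variable names are assumptions for illustration.

```python
import math
import pandas as pd

# Hypothetical data: the last mark is an obvious data-entry error
marks = pd.Series([85, 90, 78, 88, 1000])

# Z-scores using the sample standard deviation (ddof=1)
z = (marks - marks.mean()) / marks.std()

n = len(marks)
max_possible = (n - 1) / math.sqrt(n)   # upper bound on |z| for n values

print(z.abs().max(), max_possible)

# Even the value 1000 is not flagged with a threshold of 3
flagged = marks[z.abs() > 3]
print(flagged)   # empty
```

On small datasets, the IQR method or a lower z-score threshold is therefore usually more informative.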

Feature Engineering

Feature engineering is the process of creating new features from existing ones in order to improve the performance of a machine learning model.

# Creating a new column: 1 if Marks >= 80, otherwise 0 (vectorized comparison)
df["Pass"] = (df["Marks"] >= 80).astype(int)

# Label encoding with an explicit mapping (categories missing from the mapping become NaN)
df2["Dept_Code"] = df2["Department"].map({"IT":1,"HR":2,"Finance":3})

# One-hot encoding
df_encoded = pd.get_dummies(df2, columns=["Department"])
print(df_encoded)
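
Depending on the pandas version, `get_dummies` may return boolean columns, and the full set of dummy columns is redundant because any one of them can be inferred from the others. As a sketch, the `dtype` and `drop_first` parameters address both points; whether dropping a column is appropriate depends on the model being trained, so treat this as one option rather than the recommended setting.

```python
import pandas as pd

# Same department data as above
df2 = pd.DataFrame({
    "Department": ["IT", "HR", "IT", "Finance", "HR"],
    "Salary": [50000, 45000, 52000, 60000, 47000]
})

# dtype=int gives 0/1 columns; drop_first=True drops the first category
# (alphabetically "Finance"), which is then implied when all others are 0
encoded = pd.get_dummies(df2, columns=["Department"], dtype=int, drop_first=True)
print(encoded)
```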