EDA & Feature Engineering
Last Updated : 11th September 2025
What is EDA?
Exploratory Data Analysis (EDA) is a statistical analysis of a dataset carried out to understand its structure and characteristics, including its size, the distribution of its values, and the relationships between its variables.
Suppose the data looks like this:
import pandas as pd
df = pd.DataFrame({
    "Age": [20, 22, 21, 19, 23],
    "Marks": [85, 90, 78, 88, 92]
})
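Before computing any statistics, it usually helps to check the size and structure mentioned in the definition. The following is a minimal sketch, assuming the same sample data as above (rebuilt here so the block runs on its own):

```python
import pandas as pd

# Same sample data as above, rebuilt so this block is self-contained
df = pd.DataFrame({
    "Age": [20, 22, 21, 19, 23],
    "Marks": [85, 90, 78, 88, 92]
})

n_rows, n_cols = df.shape      # size of the dataset (rows, columns)
print(n_rows, n_cols)
print(df.dtypes)               # data type of each column
missing = df.isna().sum()      # number of missing values per column
print(missing)
```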
Summary Statistics
# Basic statistics
print(df.mean()) # mean per column
print(df.median()) # median per column
print(df.min()) # min
print(df.max()) # max
print(df.std()) # standard deviation
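The individual calls above can also be collected into a single table. This sketch assumes the same sample data; `describe()` and `agg()` are standard pandas methods, while the choice of statistics passed to `agg()` is only an illustration:

```python
import pandas as pd

# Same sample data as above, rebuilt so this block is self-contained
df = pd.DataFrame({
    "Age": [20, 22, 21, 19, 23],
    "Marks": [85, 90, 78, 88, 92]
})

# count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()
print(summary)

# Only the statistics of interest, one row per statistic
selected = df.agg(["mean", "median", "std"])
print(selected)
```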
Value Distribution
print(df["Marks"].value_counts()) # frequency of each score
print(df["Marks"].value_counts(normalize=True)) # relative frequency (proportion between 0 and 1; multiply by 100 for a percentage)
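When every score is distinct, as in this sample, counting exact values gives a frequency of 1 for each. Grouping the scores into ranges is often more informative. The sketch below uses `pd.cut`; the bin edges are hypothetical and chosen only for illustration:

```python
import pandas as pd

marks = pd.Series([85, 90, 78, 88, 92], name="Marks")

# Hypothetical bin edges (an assumption): (70, 80], (80, 90], (90, 100]
bins = [70, 80, 90, 100]
binned = pd.cut(marks, bins=bins)

# Count how many scores fall into each range, in bin order
counts = binned.value_counts().sort_index()
print(counts)
```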
Correlation
# Correlation between numeric columns
print(df.corr())
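By default `df.corr()` computes the Pearson correlation coefficient. As a rough check of what that number means, the sketch below recomputes it by hand for the two columns of the sample data and compares it with the pandas result:

```python
import pandas as pd

# Same sample data as above, rebuilt so this block is self-contained
df = pd.DataFrame({
    "Age": [20, 22, 21, 19, 23],
    "Marks": [85, 90, 78, 88, 92]
})

# Deviations of each value from its column mean
age_dev = df["Age"] - df["Age"].mean()
marks_dev = df["Marks"] - df["Marks"].mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (age_dev * marks_dev).sum() / (
    (age_dev ** 2).sum() * (marks_dev ** 2).sum()
) ** 0.5

r_pandas = df["Age"].corr(df["Marks"])
print(round(r_manual, 4), round(r_pandas, 4))
```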
Group-wise Statistics
df2 = pd.DataFrame({
    "Department": ["IT", "HR", "IT", "Finance", "HR"],
    "Salary": [50000, 45000, 52000, 60000, 47000]
})
# Average salary per department
print(df2.groupby("Department")["Salary"].mean())
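Several statistics per group can be computed in one pass with `agg()`. This is a small sketch on the same department data; the particular list of statistics is an arbitrary choice:

```python
import pandas as pd

# Same department data as above, rebuilt so this block is self-contained
df2 = pd.DataFrame({
    "Department": ["IT", "HR", "IT", "Finance", "HR"],
    "Salary": [50000, 45000, 52000, 60000, 47000]
})

# One row per department, one column per statistic
dept_stats = df2.groupby("Department")["Salary"].agg(["mean", "min", "max", "count"])
print(dept_stats)
```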
Outlier Detection
Using IQR Method
Q1 = df["Marks"].quantile(0.25)
Q3 = df["Marks"].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df["Marks"] < Q1 - 1.5*IQR) | (df["Marks"] > Q3 + 1.5*IQR)]
print(outliers)
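On the sample marks the IQR rule returns an empty result, because every score lies inside the fences. The sketch below wraps the same rule in a helper (the name `iqr_outliers` and the added score of 30 are made up for illustration) to show what a flagged value looks like:

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of `series` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]

original = pd.Series([85, 90, 78, 88, 92])
# Hypothetical data: the same marks plus one made-up extreme score
with_extreme = pd.Series([85, 90, 78, 88, 92, 30])

print(iqr_outliers(original))      # empty: no value is outside the fences
print(iqr_outliers(with_extreme))  # flags the score of 30
```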
Using Z-Score Method
z_scores = (df["Marks"] - df["Marks"].mean()) / df["Marks"].std()
outliers = df[abs(z_scores) > 3]
print(outliers)
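One caveat with the z-score rule is sample size. With the sample standard deviation, the largest possible |z| among n values is (n - 1) / sqrt(n), which is about 1.79 for n = 5, so a threshold of 3 can never be exceeded on five rows. The sketch below illustrates this with made-up data; the helper name `z_outliers` is hypothetical:

```python
import pandas as pd

def z_outliers(series, threshold=3.0):
    """Return values whose absolute z-score exceeds `threshold`."""
    z = (series - series.mean()) / series.std()  # std() uses ddof=1 by default
    return series[z.abs() > threshold]

# Made-up data: an extreme value of 1000 in a 5-row and a 21-row sample
small = pd.Series([85, 90, 78, 88, 1000])
large = pd.Series(list(range(80, 90)) * 2 + [1000])

print(z_outliers(small))  # empty: |z| cannot reach 3 with only 5 values
print(z_outliers(large))  # flags 1000
```

For very small samples, the IQR rule is usually the more practical of the two methods.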
Feature Engineering
Feature engineering is the process of creating new features from existing ones in order to improve the performance of a machine learning model.
# Creating new columns
df["Pass"] = (df["Marks"] >= 80).astype(int) # 1 if Marks >= 80, else 0
# Label encoding: map each category to an integer code
df2["Dept_Code"] = df2["Department"].map({"IT":1,"HR":2,"Finance":3})
# One-hot encoding
df_encoded = pd.get_dummies(df2, columns=["Department"])
print(df_encoded)
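Two variations on the encoding steps above may be useful. A hand-written mapping passed to `map()` yields NaN for any category missing from the dictionary, so codes can instead be derived from the data with the `category` dtype. For one-hot encoding, `dtype=int` produces 0/1 columns and `drop_first=True` drops one redundant indicator. This is a sketch on the same department data, not the only way to encode it:

```python
import pandas as pd

# Same department data as above, rebuilt so this block is self-contained
df2 = pd.DataFrame({
    "Department": ["IT", "HR", "IT", "Finance", "HR"],
    "Salary": [50000, 45000, 52000, 60000, 47000]
})

# Integer codes derived from the data; string categories are sorted
# alphabetically, so Finance=0, HR=1, IT=2
codes = df2["Department"].astype("category").cat.codes
print(codes.tolist())

# One-hot encoding with 0/1 integers; the first category (Finance) is
# dropped and is implied when both remaining indicators are 0
encoded = pd.get_dummies(df2, columns=["Department"], dtype=int, drop_first=True)
print(encoded)
```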