Advance Data Manipulation
Last Updated : 31th August 2025
MultiIndexing
Used to access data in a DataFrame or Series based on multiple index levels.
Creating a MultiIndex From DataFrame
import pandas as pd
data = {
"Department": ["IT", "IT", "HR", "HR", "Finance", "Finance"],
"Employee": ["Amit", "Rahul", "Priya", "Karan", "Sneha", "Megha"],
"Salary": [50000, 52000, 45000, 47000, 60000, 58000]
}
df = pd.DataFrame(data)
# Set MultiIndex
df_multi = df.set_index(["Department", "Employee"])
print(df_multi)
# Access data
print(df_multi.loc["IT"])
print(df_multi.loc[("HR","Priya")])
Creating MultiIndex From array
import pandas as pd
# Example sales data by region & product
arrays = [
["East", "East", "West", "West"],
["Pen", "Book", "Pen", "Book"]
]
index = pd.MultiIndex.from_arrays(arrays, names=("Region", "Product"))
df = pd.DataFrame({"Sales": [120, 200, 150, 300]}, index=index)
print(df)
# Access data
# Select sales for East region
print(df.loc["East"])
# Select only Pen sales across regions, .xs() (cross section) lets you filter by one index level.
print(df.xs("Pen", level="Product"))
Windows Functions
It allows you to perform complex calculations on a group of rows within a DataFrame or Series.
Rolling Windows
👉 Works on fixed window size.
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 4, 5]})
# Try to Get Moving Average
df["B"] = df["A"].rolling(window=2).mean()
print(df)
# Output
A B
0 1 NaN
1 2 1.5
2 3 2.5
3 4 3.5
4 5 4.5
# Try to Get Moving Sum
df["C"] = df["A"].rolling(window=2).sum()
print(df)
# Output
A B C
0 1 NaN NaN
1 2 1.5 3.0
2 3 2.5 5.0
3 4 3.5 7.0
4 5 4.5 9.0
Expanding & Cumulative Functions
👉 Includes all previous rows till current row.
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 4, 5]})
# Try to Get Expanding Sum
df["B"] = df["A"].expanding().sum()
print(df)
# Output
A B
0 1 1
1 2 3
2 3 6
3 4 10
4 5 15
# Try to Get Cumulative Sum
df["C"] = df["A"].cumsum()
print(df)
# Output
A B C
0 1 1 1
1 2 3 3
2 3 6 6
3 4 10 10
4 5 15 15
# Try to Get Cumulative Product
df["D"] = df["A"].cumprod()
print(df)
# Output
A B C D
0 1 1 1 1
1 2 3 3 2
2 3 6 6 6
3 4 10 10 24
4 5 15 15 120
Categorical Data
👉 Useful when a column has limited categories (e.g., gender, grades, regions). It reduces memory usage and speeds up operations.
df = pd.DataFrame({
"Grade": ["A","B","A","C","B","A"]
})
# Convert to category dtype
df["Grade"] = df["Grade"].astype("category")
print(df["Grade"].dtype)
# Reordering categories
df["Grade"] = df["Grade"].cat.reorder_categories(["C","B","A"], ordered=True)
print(df["Grade"].cat.codes) # numeric codes for analysis