Advance Data Manipulation

Last Updated : 31th August 2025


MultiIndexing

Used to access data in a DataFrame or Series based on multiple index levels.

Creating a MultiIndex From DataFrame

import pandas as pd

data = {
    "Department": ["IT", "IT", "HR", "HR", "Finance", "Finance"],
    "Employee": ["Amit", "Rahul", "Priya", "Karan", "Sneha", "Megha"],
    "Salary": [50000, 52000, 45000, 47000, 60000, 58000]
}

df = pd.DataFrame(data)
# Set MultiIndex
df_multi = df.set_index(["Department", "Employee"])
print(df_multi)

# Access data
print(df_multi.loc["IT"])
print(df_multi.loc[("HR","Priya")])

Creating MultiIndex From array

import pandas as pd

# Example sales data by region & product
arrays = [
    ["East", "East", "West", "West"],
    ["Pen", "Book", "Pen", "Book"]
]
index = pd.MultiIndex.from_arrays(arrays, names=("Region", "Product"))

df = pd.DataFrame({"Sales": [120, 200, 150, 300]}, index=index)
print(df)


# Access data
# Select sales for East region
print(df.loc["East"])

# Select only Pen sales across regions, .xs() (cross section) lets you filter by one index level.
print(df.xs("Pen", level="Product"))

Windows Functions

It allows you to perform complex calculations on a group of rows within a DataFrame or Series.

Rolling Windows

👉 Works on fixed window size.

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5]})

# Try to Get Moving Average
df["B"] = df["A"].rolling(window=2).mean()
print(df)

# Output
  A    B
0  1  NaN
1  2  1.5
2  3  2.5
3  4  3.5
4  5  4.5
# Try to Get Moving Sum
df["C"] = df["A"].rolling(window=2).sum()
print(df)

# Output
  A    B    C
0  1  NaN   NaN
1  2  1.5   3.0
2  3  2.5   5.0
3  4  3.5   7.0
4  5  4.5   9.0

Expanding & Cumulative Functions

👉 Includes all previous rows till current row.

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5]})

# Try to Get Expanding Sum
df["B"] = df["A"].expanding().sum()
print(df)

# Output
  A    B
0  1   1
1  2   3
2  3   6
3  4  10
4  5  15
# Try to Get Cumulative Sum
df["C"] = df["A"].cumsum()
print(df)

# Output
  A    B    C
0  1   1   1
1  2   3   3
2  3   6   6
3  4  10  10
4  5  15  15
# Try to Get Cumulative Product
df["D"] = df["A"].cumprod()
print(df)

# Output
  A    B    C    D
0  1   1   1   1
1  2   3   3   2
2  3   6   6   6
3  4  10  10  24
4  5  15  15  120

Categorical Data

👉 Useful when a column has limited categories (e.g., gender, grades, regions). It reduces memory usage and speeds up operations.

df = pd.DataFrame({
    "Grade": ["A","B","A","C","B","A"]
})

# Convert to category dtype
df["Grade"] = df["Grade"].astype("category")
print(df["Grade"].dtype)

# Reordering categories
df["Grade"] = df["Grade"].cat.reorder_categories(["C","B","A"], ordered=True)
print(df["Grade"].cat.codes)  # numeric codes for analysis