Pandas Categorical Data

Categoricals are a pandas data type that corresponds to categorical variables in statistics. Such variables take on a fixed, limited number of possible values.
Examples include grades, gender, and blood type.
pandas provides the category dtype (backed by pd.Categorical) to reduce memory usage and improve performance when working with repetitive text-based data.

Why Use Categorical Data?
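Saves memory by storing each value as a small integer code that points to a category, instead of storing the same string repeatedly.
Often speeds up operations such as sorting, filtering, and grouping compared to object dtype.
Can attach a meaningful order to the categories (for example, small < medium < large).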
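To make the memory point concrete, here is a minimal sketch that compares the same data stored as object dtype and as category dtype. The data is made up for illustration, and the exact byte counts depend on your pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical data: 3 distinct strings repeated many times.
values = ['small', 'medium', 'large'] * 10_000

as_object = pd.Series(values, dtype=object)
as_category = as_object.astype('category')

# deep=True also counts the Python string objects, not just the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")

# With only 3 categories, each value is stored as a one-byte integer code.
print(as_category.cat.codes.dtype)
```

The saving comes from the repetition: the three strings are stored once, and every row holds only a code.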

Creating Categorical Data

Converting an Existing Column

    import pandas as pd

    df = pd.DataFrame({
        'Category': ['A', 'B', 'A', 'C', 'B', 'A']
    })

    # Convert the string column to the category dtype
    df['Category'] = df['Category'].astype('category')

    print(df.dtypes)

✅ The Category column is now of type category. On columns with many repeated values, this reduces memory usage.
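By default, astype('category') infers the categories from the values that happen to be present. If the set of allowed values is known up front, one option is to pass a pd.CategoricalDtype instead. The sketch below assumes a hypothetical extra label 'D' that does not occur in the data yet.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B', 'A']})

# Categories inferred from the data
inferred = df['Category'].astype('category')
print(inferred.cat.categories.tolist())   # ['A', 'B', 'C']

# Categories declared explicitly; 'D' is allowed even though it is not used yet
explicit_dtype = pd.CategoricalDtype(categories=['A', 'B', 'C', 'D'])
explicit = df['Category'].astype(explicit_dtype)
print(explicit.cat.categories.tolist())   # ['A', 'B', 'C', 'D']
```

Note that values missing from the declared categories would become NaN during the conversion, so the list should cover every label in the column.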

Creating from Scratch

    # pd.Categorical returns a Categorical array, not a Series
    colors = pd.Categorical(['red', 'blue', 'green', 'red', 'blue'])
    print(colors)

Categorical Data with Defined Categories

    categories = ['small', 'medium', 'large']
    sizes = pd.Categorical(['small', 'large', 'medium', 'small'], categories=categories, ordered=True)
    print(sizes)

✅ Using ordered=True allows comparison (small < medium < large).
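As a small illustration of what ordered=True enables, the sketch below compares each value against one category and takes the minimum and maximum. It rebuilds the same sizes array so it can run on its own.

```python
import pandas as pd

sizes = pd.Categorical(
    ['small', 'large', 'medium', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,
)

# Element-wise comparison against a single category uses the declared order.
is_at_least_medium = sizes >= 'medium'
print(is_at_least_medium)          # [False  True  True False]

# min() and max() follow the category order, not alphabetical order.
print(sizes.min(), sizes.max())    # small large
```

Alphabetically, 'large' would sort before 'small'; the declared order is what makes 'large' the maximum here.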

Operations on Categorical Data

Accessing Categories & Codes

    print(sizes.categories)  # ['small', 'medium', 'large']
    print(sizes.codes)       # [0 2 1 0] -> internal integer representation
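The relationship between codes and categories can be sketched as follows: each code is a position in the categories index, and a missing value is stored as -1. The with_missing array is a hypothetical example added only to show that case.

```python
import pandas as pd

sizes = pd.Categorical(
    ['small', 'large', 'medium', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,
)

# Each code is a position in sizes.categories.
decoded = [sizes.categories[code] for code in sizes.codes]
print(decoded)   # ['small', 'large', 'medium', 'small']

# from_codes builds a Categorical directly from codes and categories.
rebuilt = pd.Categorical.from_codes(
    sizes.codes, categories=sizes.categories, ordered=True
)
print(rebuilt.equals(sizes))   # True

# Missing values have no category, so they get the code -1.
with_missing = pd.Categorical(['small', None], categories=['small', 'medium', 'large'])
print(with_missing.codes.tolist())   # [0, -1]
```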

Sorting

    sorted_sizes = sizes.sort_values()
    print(sorted_sizes)

Filtering

    filtered_sizes = sizes[sizes > 'small']  # Keeps 'medium' and 'large'
    print(filtered_sizes)
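The same kind of filter can be applied to a DataFrame column. The sketch below uses a made-up Size/Price table; on a Series, the categorical attributes are reached through the .cat accessor rather than directly.

```python
import pandas as pd

df = pd.DataFrame({
    'Size': pd.Categorical(
        ['small', 'large', 'medium', 'small'],
        categories=['small', 'medium', 'large'],
        ordered=True,
    ),
    'Price': [10, 30, 20, 15],
})

# The ordered comparison yields a boolean mask that selects rows.
larger = df[df['Size'] > 'small']
print(larger)

# On a Series, categories and codes live under .cat
print(df['Size'].cat.categories.tolist())   # ['small', 'medium', 'large']
print(df['Size'].cat.codes.tolist())        # [0, 2, 1, 0]
```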

Changing Categories

    sizes = sizes.rename_categories(['S', 'M', 'L'])
    print(sizes)

✅ Renames 'small' → 'S', 'medium' → 'M', etc.

Adding & Removing Categories

    sizes = sizes.add_categories(['XL'])
    # 'small' was renamed to 'S' above; values in a removed category become NaN
    sizes = sizes.remove_categories(['S'])
    print(sizes)
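Two behaviors of these methods are easy to miss, and the following sketch illustrates them on a fresh array: removing a category that is in use turns its values into NaN, and assigning a label that is not a declared category is rejected until the category has been added. The exception type caught here is an assumption that covers different pandas versions.

```python
import pandas as pd

sizes = pd.Categorical(['S', 'L', 'M', 'S'], categories=['S', 'M', 'L'], ordered=True)

# Removing a category that is in use turns those values into NaN.
without_s = sizes.remove_categories(['S'])
print(without_s.isna())   # [ True False False  True]

# Assigning a label that is not a declared category is rejected.
try:
    sizes[0] = 'XL'
    assignment_failed = False
except (TypeError, ValueError):
    assignment_failed = True
print(assignment_failed)  # True

# After add_categories, the same assignment is accepted.
extended = sizes.add_categories(['XL'])
extended[0] = 'XL'
print(list(extended))     # ['XL', 'L', 'M', 'S']
```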

Use Case: Grouping & Aggregation

    df = pd.DataFrame({
        'Size': pd.Categorical(['small', 'large', 'medium', 'small', 'large'],
                               categories=['small', 'medium', 'large'], ordered=True),
        'Price': [10, 30, 20, 15, 35]
    })

    # Pass observed explicitly; observed=False keeps every declared category
    grouped = df.groupby('Size', observed=False).mean()
    print(grouped)

✅ Efficient grouping, with the result sorted by category order (small, medium, large) rather than alphabetically.
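When grouping by a categorical column, the observed parameter of groupby controls whether categories with no rows appear in the result. Its default has differed between pandas releases, so passing it explicitly is the safer choice. A minimal sketch, using made-up data in which 'medium' never occurs:

```python
import pandas as pd

df = pd.DataFrame({
    'Size': pd.Categorical(
        ['small', 'large', 'small', 'large'],
        categories=['small', 'medium', 'large'],
        ordered=True,
    ),
    'Price': [10, 30, 15, 35],
})

# observed=False keeps every declared category, including the unused 'medium'.
all_groups = df.groupby('Size', observed=False)['Price'].mean()

# observed=True keeps only the categories that actually occur in the data.
seen_groups = df.groupby('Size', observed=True)['Price'].mean()

print(all_groups)    # 'medium' appears with NaN
print(seen_groups)   # only 'small' and 'large'
```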

When to Use?
Use categorical data when:
The column contains a fixed number of possible values (e.g., gender, product sizes, regions).
You need ordered categories (e.g., low < medium < high).
Memory efficiency and performance improvements matter.
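One rough way to apply these criteria is to check how many distinct values a column has relative to its length and convert only the repetitive ones. The sketch below is a heuristic, not a rule from pandas: the column names, the data, and the MAX_UNIQUE_RATIO cutoff are all hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['north', 'south', 'north', 'east', 'south', 'north'],
    'order_id': ['a1', 'b2', 'c3', 'd4', 'e5', 'f6'],
})

# Hypothetical cutoff; tune it for your own data.
MAX_UNIQUE_RATIO = 0.5

for column in df.columns:
    unique_ratio = df[column].nunique() / len(df)
    # Convert only columns whose values repeat often enough.
    if unique_ratio <= MAX_UNIQUE_RATIO:
        df[column] = df[column].astype('category')

print(df.dtypes)   # region becomes category; order_id is left unchanged
```

A column such as order_id, where nearly every value is unique, gains little from the conversion because there is almost one category per row.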