Pandas Categorical Data
-
Categoricals are a pandas data type that corresponds to categorical variables in statistics. Such a variable takes on a limited, and usually fixed, number of possible values.
-
Examples include grades, gender, and blood group.
-
Pandas provides the Categorical data type (pd.Categorical) to reduce memory usage and improve performance when dealing with repetitive text-based data.
Why Use Categorical Data?
Saves memory by storing categories as integer codes instead of strings.
Often speeds up operations such as sorting, filtering, and grouping compared to object dtype.
Lets you define a logical order for the values (e.g., low < medium < high), which need not be alphabetical.
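To make the memory claim concrete, here is a minimal sketch that compares the same data stored as object dtype and as category dtype. The data is made up for illustration (three labels repeated many times), and the exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical data: 150,000 rows drawn from only three distinct strings.
as_object = pd.Series(['low', 'medium', 'high'] * 50_000, dtype=object)
as_category = as_object.astype('category')

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")

# With so few categories, each row is stored as a one-byte integer code.
print(as_category.cat.codes.dtype)
```

The saving comes from repetition: each distinct string is stored once, and every row holds only a small integer. On a column where almost every value is unique, the conversion may save little or nothing.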
Creating Categorical Data
-
Converting an Existing Column
import pandas as pd
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'C', 'B', 'A']
})
df['Category'] = df['Category'].astype('category')
print(df.dtypes)
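✅ The Category column now has dtype category. On columns with many repeated values this typically reduces memory usage; on a tiny example like this one the difference is negligible.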
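Once a column has been converted, its categorical properties are reached through the Series .cat accessor. The sketch below repeats the conversion on the same small example and inspects the result.

```python
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B', 'A']})
df['Category'] = df['Category'].astype('category')

# Categories are inferred from the data and sorted.
categories = list(df['Category'].cat.categories)
print(categories)

# Each row stores the position of its value in the category list.
codes = df['Category'].cat.codes.tolist()
print(codes)
```

When categories are inferred this way the result is unordered, so comparisons such as < and > are not available until an order is declared.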
-
Creating from Scratch
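# pd.Categorical returns a Categorical array, not a Series;
# wrap it in pd.Series(...) if you need Series methods or an index.
colors = pd.Categorical(['red', 'blue', 'green', 'red', 'blue'])
print(colors)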
-
Categorical Data with Defined Categories
categories = ['small', 'medium', 'large']
sizes = pd.Categorical(['small', 'large', 'medium', 'small'], categories=categories, ordered=True)
print(sizes)
✅ Using ordered=True allows comparison (small < medium < large).
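The effect of ordered=True is easiest to see next to an unordered Categorical built from the same values. This is a small sketch; the variable names are illustrative only.

```python
import pandas as pd

levels = ['small', 'medium', 'large']
ordered = pd.Categorical(['small', 'large', 'medium'], categories=levels, ordered=True)
unordered = pd.Categorical(['small', 'large', 'medium'], categories=levels)

# Ordered: element-wise comparison follows the declared category order.
mask = ordered > 'small'
print(mask)
print(ordered.min(), ordered.max())

# Unordered: only == and != are allowed; < and > raise TypeError.
try:
    unordered > 'small'
    raised = False
except TypeError:
    raised = True
print(raised)
```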
-
Operations on Categorical Data
-
Accessing Categories & Codes
print(sizes.categories) # ['small', 'medium', 'large']
print(sizes.codes) # [0, 2, 1, 0] -> Internal integer representation
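The codes are positions into the category list, so the original values can be rebuilt from the two pieces. The sketch below also shows what happens to a value that is not among the declared categories ('huge' is a made-up example).

```python
import pandas as pd

levels = ['small', 'medium', 'large']
sizes = pd.Categorical(['small', 'large', 'medium', 'small'],
                       categories=levels, ordered=True)

# Look each code up in the category index to recover the values.
rebuilt = list(sizes.categories.take(sizes.codes))
print(rebuilt)

# A value outside the declared categories becomes NaN, stored as code -1.
with_missing = pd.Categorical(['small', 'huge'], categories=levels)
print(with_missing.codes)
```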
-
Sorting
sorted_sizes = sizes.sort_values()
print(sorted_sizes)
-
Filtering
filtered_sizes = sizes[sizes > 'small'] # Keeps 'medium' and 'large'
print(filtered_sizes)
-
Changing Categories
sizes = sizes.rename_categories(['S', 'M', 'L'])
print(sizes)
✅ Renames 'small' → 'S', 'medium' → 'M', etc.
-
Adding & Removing Categories
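sizes = sizes.add_categories(['extra-large'])  # appended after the existing categories
# 'small' was renamed to 'S' above, so 'S' is the name to remove;
# values that belonged to the removed category become NaN.
sizes = sizes.remove_categories(['S'])
print(sizes)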
-
Use Case: Grouping & Aggregation
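df = pd.DataFrame({
    'Size': pd.Categorical(['small', 'large', 'medium', 'small', 'large'],
                           categories=['small', 'medium', 'large'], ordered=True),
    'Price': [10, 30, 20, 15, 35]
})
# Pass observed explicitly: recent pandas versions warn when it is omitted for
# categorical keys, because the default is changing from False to True.
grouped = df.groupby('Size', observed=False).mean()
print(grouped)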
✅ Efficient grouping, with groups listed in category order (small, medium, large) rather than alphabetical order.
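One detail worth knowing when grouping on a categorical column is the observed parameter of groupby. The sketch below uses a made-up frame in which 'medium' is declared as a category but never appears in the data.

```python
import pandas as pd

df = pd.DataFrame({
    'Size': pd.Categorical(['small', 'large', 'small'],
                           categories=['small', 'medium', 'large'], ordered=True),
    'Price': [10, 30, 20],
})

# observed=False: one row per declared category, NaN where there is no data.
all_levels = df.groupby('Size', observed=False)['Price'].mean()
print(all_levels)

# observed=True: only categories that actually occur in the data.
seen_only = df.groupby('Size', observed=True)['Price'].mean()
print(seen_only)
```

Keeping every category is handy for reports that need a fixed set of rows; observed=True is usually the better choice when grouping by several categorical columns at once, since the number of empty combinations can grow quickly.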
When to Use?
Use categorical data when:
The column contains a small, fixed set of possible values that repeat often (e.g., gender, product sizes, regions).
You need ordered categories (e.g., low < medium < high).
Memory efficiency and performance improvements matter.
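These criteria can be turned into a rough rule of thumb. The helper below is a hypothetical sketch, not a pandas API: convert_low_cardinality and its 0.5 threshold are assumptions chosen for illustration, and the right cutoff depends on your data.

```python
import pandas as pd

def convert_low_cardinality(df, max_unique_ratio=0.5):
    """Hypothetical helper: convert object columns whose share of distinct
    values is at most max_unique_ratio to category dtype."""
    out = df.copy()
    for col in out.select_dtypes(include=['object']).columns:
        unique_ratio = out[col].nunique() / max(len(out), 1)
        if unique_ratio <= max_unique_ratio:
            out[col] = out[col].astype('category')
    return out

# Made-up example: 'region' repeats, 'order_id' is unique per row.
# dtype=object is forced so the sketch behaves the same across pandas versions.
df = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south', 'north', 'south'],
    'order_id': ['a1', 'b2', 'c3', 'd4', 'e5', 'f6'],
}, dtype=object)

converted = convert_low_cardinality(df)
print(converted.dtypes)
```

Here region is converted because only two of its six values are distinct, while order_id is left alone because every value is unique and a categorical would save nothing.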