Improving Pandas performance is essential for handling large datasets efficiently.
Here are some key optimization techniques:
🚀 1. Optimize Data Types
Using the correct data types reduces memory usage and speeds up operations.
import pandas as pd
import numpy as np
# Sample Data
df = pd.DataFrame({
    'id': np.arange(1, 1000001, dtype=np.int32),   # 1,000,000 ids, matching the length of the other columns
    'category': np.random.choice(['A', 'B', 'C'], 1000000),
    'value': np.random.rand(1000000) * 100
})
# Convert category to categorical type
df['category'] = df['category'].astype('category')
# Convert float64 to float32
df['value'] = df['value'].astype(np.float32)
df.info() # Check dtypes and memory usage (info() prints its summary directly)
✅ Best Practice: Convert columns to the smallest dtype that still fits your data (category, int8, float32/float16), as long as the reduced range and precision are acceptable.
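To see the impact on your own data, compare memory footprints before and after the conversion; a minimal sketch using memory_usage(deep=True):

# Memory before vs. after dtype optimization (sketch)
raw = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C'], 1000000),
    'value': np.random.rand(1000000) * 100
})
optimized = raw.copy()
optimized['category'] = optimized['category'].astype('category')
optimized['value'] = optimized['value'].astype(np.float32)
print(raw.memory_usage(deep=True).sum(), "bytes before")
print(optimized.memory_usage(deep=True).sum(), "bytes after")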
⚡ 2. Avoid Loops – Use Vectorized Operations
Pandas columns are backed by NumPy arrays, so whole-column (vectorized) operations run in compiled code, while apply() calls a Python function once per row.
❌ Slow: Using apply()
df['value_squared'] = df['value'].apply(lambda x: x ** 2)
✅ Fast: Vectorized Approach
df['value_squared'] = df['value'] ** 2
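If you want to verify the gap on your own machine, a rough timing sketch with time.perf_counter:

import time
start = time.perf_counter()
df['value'].apply(lambda x: x ** 2)   # calls a Python lambda once per row
print("apply:     ", time.perf_counter() - start)
start = time.perf_counter()
df['value'] ** 2                      # one call into compiled NumPy code
print("vectorized:", time.perf_counter() - start)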
📂 3. Use Efficient File Formats
Instead of CSV, use Parquet or Feather: they are faster to read and write, preserve dtypes, and take up less space on disk.
df.to_parquet("data.parquet") # Save
df = pd.read_parquet("data.parquet") # Load
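Feather works the same way (both formats assume pyarrow is installed), and because these formats are columnar you can load only the columns you need:

df.to_feather("data.feather")         # Save
df = pd.read_feather("data.feather")  # Load
values_only = pd.read_parquet("data.parquet", columns=['value'])  # Read a subset of columns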
🔥 4. Efficient Filtering & Selection
For simple boolean filters, index the DataFrame directly; the extra .loc[] layer adds nothing here.
❌ Slow:
df.loc[df['value'] > 50]
✅ Fast:
df[df['value'] > 50]
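If you reuse the same filter several times, one option (not part of the original example, just a common trick) is to build the boolean mask once as a plain NumPy array:

mask = df['value'].to_numpy() > 50   # plain NumPy boolean array
high = df[mask]                      # reuse `mask` for any further selections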
🏎 5. Optimize Merging & Joining
merge() combines DataFrames on explicit key columns and is the workhorse for large joins; join() is an index-based convenience wrapper around it.
df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
df2 = pd.DataFrame({'id': [1, 2, 3], 'score': [100, 200, 300]})
df_merged = df1.merge(df2, on='id', how='left') # Efficient merging
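The same pattern scales to the large frame from step 1; a sketch that attaches a small lookup table to it (the weight values are made up for illustration):

lookup = pd.DataFrame({'category': ['A', 'B', 'C'], 'weight': [1.0, 0.5, 2.0]})
lookup['category'] = lookup['category'].astype(df['category'].dtype)  # match the categorical key dtype
df_weighted = df.merge(lookup, on='category', how='left')             # broadcast 3 rows onto 1M rows

Matching the key dtypes on both sides avoids a silent conversion during the merge.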
📊 6. Process Large Files in Chunks
For large files, read in chunks instead of loading everything at once.
chunk_size = 10000
chunks = pd.read_csv("large_file.csv", chunksize=chunk_size)
for chunk in chunks:
    process(chunk)  # 'process' is a placeholder for your per-chunk logic (aggregate, filter, write out, ...)
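Since process() above is just a placeholder, here is a concrete sketch that builds a running aggregate across chunks (it assumes large_file.csv has a numeric 'value' column):

total = 0.0
rows = 0
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
    total += chunk['value'].sum()   # aggregate each chunk on its own...
    rows += len(chunk)
print("mean value:", total / rows)  # ...then combine the partial results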
🔄 7. Use numba for Fast Computation
numba's JIT compiler speeds up custom numeric functions that can't be expressed as a single vectorized operation.
from numba import jit
@jit(nopython=True)
def fast_function(x):
    return x ** 2
df['fast_value'] = fast_function(df['value'].values) # Much faster!
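Note: numba compiles fast_function to machine code on its first call, so that call includes compilation overhead; the speedup shows up on repeated calls and large arrays.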
🏋️ 8. Parallelize with dask
For datasets that are too large to handle comfortably in memory, dask provides a pandas-like DataFrame that splits the work into partitions and evaluates lazily.
import dask.dataframe as dd
ddf = dd.read_csv("large_file.csv") # Dask DataFrame (lazy loading)
df = ddf.compute() # Materialize the result as a pandas DataFrame when needed
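Dask only builds a task graph until you call .compute(); a sketch of a lazy aggregation (the 'category' and 'value' column names are assumptions about large_file.csv):

result = ddf.groupby('category')['value'].mean().compute()  # executes the whole graph in parallel
print(result)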
✅ Summary: Key Performance Tips
✔ Convert columns to smaller data types (category, int8, float16)
✔ Use vectorized operations instead of loops
✔ Read/write in Parquet or Feather (avoid CSV for big data)
✔ Combine large DataFrames with merge() on explicit key columns
✔ Process large files in chunks
✔ Speed up calculations with numba
✔ Use Dask for parallel processing on large datasets