Improving Pandas performance is essential for handling large datasets efficiently.
Here are some key optimization techniques:
🚀 1. Optimize Data Types
Using the correct data types reduces memory usage and speeds up operations.
import pandas as pd
import numpy as np
# Sample Data
df = pd.DataFrame({
    'id': np.arange(1, 1000001, dtype=np.int32),   # 1,000,000 ids, matching the length of the other columns
    'category': np.random.choice(['A', 'B', 'C'], 1000000),
    'value': np.random.rand(1000000) * 100
})
# Convert category to categorical type
df['category'] = df['category'].astype('category')
# Convert float64 to float32
df['value'] = df['value'].astype(np.float32)
df.info() # Check dtypes and memory usage (info() prints its summary directly)
✅ Best Practice: Convert columns to the smallest dtype that still fits your data (category, int8, float32/float16), as long as the reduced range and precision are acceptable.
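To see the impact on your own data, compare memory footprints before and after the conversion; a minimal sketch using memory_usage(deep=True):

# Memory before vs. after dtype optimization (sketch)
raw = pd.DataFrame({
    'category': np.random.choice(['A', 'B', 'C'], 1000000),
    'value': np.random.rand(1000000) * 100
})
optimized = raw.copy()
optimized['category'] = optimized['category'].astype('category')
optimized['value'] = optimized['value'].astype(np.float32)
print(raw.memory_usage(deep=True).sum(), "bytes before")
print(optimized.memory_usage(deep=True).sum(), "bytes after")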
⚡ 2. Avoid Loops – Use Vectorized Operations
Pandas columns are backed by NumPy arrays, so whole-column (vectorized) operations run in compiled code, while apply() calls a Python function once per row.
❌ Slow: Using apply()
df['value_squared'] = df['value'].apply(lambda x: x ** 2)
✅ Fast: Vectorized Approach
df['value_squared'] = df['value'] ** 2
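If you want to verify the gap on your own machine, a rough timing sketch with time.perf_counter:

import time
start = time.perf_counter()
df['value'].apply(lambda x: x ** 2)   # calls a Python lambda once per row
print("apply:     ", time.perf_counter() - start)
start = time.perf_counter()
df['value'] ** 2                      # one call into compiled NumPy code
print("vectorized:", time.perf_counter() - start)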
📂 3. Use Efficient File Formats
Instead of CSV, use Parquet or Feather: they are faster to read and write, preserve dtypes, and take up less space on disk.
df.to_parquet("data.parquet") # Save
df = pd.read_parquet("data.parquet") # Load
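Feather works the same way (both formats assume pyarrow is installed), and because these formats are columnar you can load only the columns you need:

df.to_feather("data.feather")         # Save
df = pd.read_feather("data.feather")  # Load
values_only = pd.read_parquet("data.parquet", columns=['value'])  # Read a subset of columns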
🔥 4. Efficient Filtering & Selection
For simple boolean filters, index the DataFrame directly; the extra .loc[] layer adds nothing here.
❌ Slow:
df.loc[df['value'] > 50]
✅ Fast:
df[df['value'] > 50]
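If you reuse the same filter several times, one option (not part of the original example, just a common trick) is to build the boolean mask once as a plain NumPy array:

mask = df['value'].to_numpy() > 50   # plain NumPy boolean array
high = df[mask]                      # reuse `mask` for any further selections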
🏎 5. Optimize Merging & Joining
merge() combines DataFrames on explicit key columns and is the workhorse for large joins; join() is an index-based convenience wrapper around it.
df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
df2 = pd.DataFrame({'id': [1, 2, 3], 'score': [100, 200, 300]})
df_merged = df1.merge(df2, on='id', how='left') # Efficient merging
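The same pattern scales to the large frame from step 1; a sketch that attaches a small lookup table to it (the weight values are made up for illustration):

lookup = pd.DataFrame({'category': ['A', 'B', 'C'], 'weight': [1.0, 0.5, 2.0]})
lookup['category'] = lookup['category'].astype(df['category'].dtype)  # match the categorical key dtype
df_weighted = df.merge(lookup, on='category', how='left')             # broadcast 3 rows onto 1M rows

Matching the key dtypes on both sides avoids a silent conversion during the merge.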
📊 6. Process Large Files in Chunks
For large files, read in chunks instead of loading everything at once.
chunk_size = 10000
chunks = pd.read_csv("large_file.csv", chunksize=chunk_size)
for chunk in chunks:
    process(chunk)  # 'process' is a placeholder for your per-chunk logic (aggregate, filter, write out, ...)
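Since process() above is just a placeholder, here is a concrete sketch that builds a running aggregate across chunks (it assumes large_file.csv has a numeric 'value' column):

total = 0.0
rows = 0
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
    total += chunk['value'].sum()   # aggregate each chunk on its own...
    rows += len(chunk)
print("mean value:", total / rows)  # ...then combine the partial results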
🔄 7. Use numba for Fast Computation
numba's JIT compiler speeds up custom numeric functions that can't be expressed as a single vectorized operation.
from numba import jit
@jit(nopython=True)
def fast_function(x):
    return x ** 2
df['fast_value'] = fast_function(df['value'].values) # Much faster!
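Note: numba compiles fast_function to machine code on its first call, so that call includes compilation overhead; the speedup shows up on repeated calls and large arrays.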
🏋️ 8. Parallelize with dask
For datasets that are too large to handle comfortably in memory, dask provides a pandas-like DataFrame that splits the work into partitions and evaluates lazily.
import dask.dataframe as dd
ddf = dd.read_csv("large_file.csv") # Dask DataFrame (lazy loading)
df = ddf.compute() # Materialize the result as a pandas DataFrame when needed
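Dask only builds a task graph until you call .compute(); a sketch of a lazy aggregation (the 'category' and 'value' column names are assumptions about large_file.csv):

result = ddf.groupby('category')['value'].mean().compute()  # executes the whole graph in parallel
print(result)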
✅ Summary: Key Performance Tips
✔ Convert columns to smaller data types (category, int8, float16)
✔ Use vectorized operations instead of loops
✔ Read/write in Parquet or Feather (avoid CSV for big data)
✔ Combine large DataFrames with merge() on explicit key columns
✔ Process large files in chunks
✔ Speed up calculations with numba
✔ Use Dask for parallel processing on large datasets