Pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory somewhat tricky. Even datasets that are a sizable fraction of memory become unwieldy, as some pandas operations need to make intermediate copies.
When working with large datasets in pandas, performance can become a major bottleneck. Here are some strategies to scale pandas effectively:
Use Efficient Data Types
- Convert object columns to categorical (or pandas' dedicated string dtype)
- Use float32/int32 instead of float64/int64 where the range and precision allow
- Parse date columns with pd.to_datetime() so they are stored as datetime64 rather than strings
df['category_column'] = df['category_column'].astype('category')
df['int_column'] = df['int_column'].astype('int32')
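To check how much the conversions above actually save, pandas can report per-column memory itself (a minimal sketch using the same df):
print(df.memory_usage(deep=True))                      # per-column bytes; deep=True counts object contents too
print(df.memory_usage(deep=True).sum() / 1e6, "MB")    # total after conversion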
Load Data Efficiently
a) Read Only Required Columns
df = pd.read_csv("large_file.csv", usecols=['col1', 'col2'])
b) Load Data in Chunks
chunksize = 100000 # Process 100K rows at a time
for chunk in pd.read_csv("large_file.csv", chunksize=chunksize):
    process(chunk)  # user-defined function applied to each chunk
c) Use dtype Parameter to Reduce Memory
df = pd.read_csv("large_file.csv", dtype={'col1': 'int32', 'col2': 'float32'})
Use modin.pandas for Parallel Processing
Replace:
import pandas as pd
With:
import modin.pandas as pd
Modin parallelizes most pandas operations across all available CPU cores (using Ray or Dask under the hood), usually with no further code changes.
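A minimal sketch of the swap in context (assuming Modin and one of its engines, Ray or Dask, is installed); everything after the import is ordinary pandas code:
import modin.pandas as pd                       # drop-in replacement for the pandas import

df = pd.read_csv("large_file.csv")              # parsed in parallel across cores
result = df.groupby('col1')['col2'].mean()      # same pandas API, executed by Modin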
Use Dask for Out-of-Core Processing
Dask splits a DataFrame into partitions and evaluates operations lazily, so it can process data that never fits in memory all at once.
import dask.dataframe as dd
ddf = dd.read_csv("large_file.csv")
ddf.groupby('col1').mean().compute() # Compute when needed
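A slightly fuller sketch: Dask builds a lazy task graph over the partitions, so nothing is actually read until .compute() is called:
import dask.dataframe as dd

ddf = dd.read_csv("large_file.csv", dtype={'col1': 'int32', 'col2': 'float32'})
filtered = ddf[ddf['col2'] > 0]                              # lazy: no data loaded yet
result = filtered.groupby('col1')['col2'].mean().compute()   # triggers the actual work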
Use SQLite or Parquet Instead of CSV
a) Store & Query Large Datasets with SQLite
import sqlite3
conn = sqlite3.connect("database.db")
df = pd.read_sql_query("SELECT col1, col2 FROM table_name", conn)  # select only the columns you need; filters can be pushed into SQL too
b) Use Parquet for Faster I/O
df.to_parquet("large_data.parquet")
df = pd.read_parquet("large_data.parquet")
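Putting both ideas together, as a sketch (table_name and the column names are the same placeholders as above): load the data once, then let SQLite or the Parquet reader do the heavy lifting so only a small slice reaches pandas:
import sqlite3
import pandas as pd

conn = sqlite3.connect("database.db")
df.to_sql("table_name", conn, if_exists="replace", index=False)          # one-time load into SQLite
small = pd.read_sql_query(
    "SELECT col1, AVG(col2) AS mean_col2 FROM table_name GROUP BY col1", conn
)                                                                        # only the aggregate comes back

one_col = pd.read_parquet("large_data.parquet", columns=['col1'])        # read a single column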
Vectorized Operations & NumPy
Always prefer vectorized operations over loops.
Bad:
for i in range(len(df)):  # Python-level loop: processes one row at a time
    df.loc[i, 'new_col'] = df.loc[i, 'col1'] * 2
Better:
df['new_col'] = df['col1'] * 2  # vectorized: operates on the whole column at once
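The same principle extends to NumPy, which the heading mentions; for example, a vectorized conditional instead of a row-by-row if/else (reusing the placeholder col1):
import numpy as np

# evaluates the condition on the whole column at once, no Python loop
df['flag'] = np.where(df['col1'] > 0, 'positive', 'non-positive')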
Drop Unused Data (Garbage Collection)
import gc
del df          # drop the reference to the DataFrame
gc.collect()    # ask Python to reclaim the memory right away
Cloud & Distributed Solutions
- Google BigQuery + pandas: pandas-gbq
- Spark with pandas: pyspark.pandas
- Snowflake, DuckDB, ClickHouse for large-scale data processing (see the DuckDB sketch below)
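For the last item, a minimal DuckDB sketch (assuming the Parquet file written earlier): DuckDB scans the file itself and only the aggregated result comes back as a pandas DataFrame:
import duckdb

df = duckdb.sql(
    "SELECT col1, AVG(col2) AS mean_col2 FROM 'large_data.parquet' GROUP BY col1"
).df()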
Final Thought
For datasets under roughly 1 GB, optimizing pandas itself (efficient dtypes, vectorization, Modin) is usually enough.
For datasets larger than that, or larger than memory, reach for Dask, SQLite, or Spark.