How can you filter large CSV files in Python?

pythonsebvf
Filtering large CSV files efficiently in Python comes down to memory-conscious techniques that avoid loading the whole file at once. Here are two effective methods:

1. Using pandas with chunksize (Best for Large Files)
The chunksize parameter makes read_csv return the file in smaller DataFrame chunks, so the whole file never has to fit in memory at once.
import pandas as pd

# Define the filter condition; example: keep rows where column_name > 50
def filter_chunk(chunk):
    return chunk[chunk["column_name"] > 50]

# Process the file in chunks so only one chunk is read into memory at a time
chunksize = 10000  # Adjust based on available memory
filtered_data = pd.concat(
    filter_chunk(chunk) for chunk in pd.read_csv("large_file.csv", chunksize=chunksize)
)

# Save the filtered data (note: the concatenated result must fit in memory)
filtered_data.to_csv("filtered_file.csv", index=False)
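
Note that pd.concat still builds the entire filtered result in memory. If even the filtered subset is too large, a minimal streaming variant (a sketch, assuming the same example file and column name as above) appends each chunk directly to the output instead:

import pandas as pd

chunksize = 10000
first_chunk = True  # Write the header only for the first chunk

for chunk in pd.read_csv("large_file.csv", chunksize=chunksize):
    filtered = chunk[chunk["column_name"] > 50]
    # Append each filtered chunk; only one chunk is ever held in memory
    filtered.to_csv("filtered_file.csv",
                    mode="w" if first_chunk else "a",
                    header=first_chunk, index=False)
    first_chunk = False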


2. Using the csv Module (Lightweight, Line-by-Line Streaming)
The built-in csv module reads the file one row at a time, so memory use stays low and constant regardless of file size.
import csv

input_file = "large_file.csv"
output_file = "filtered_file.csv"

with open(input_file, mode="r", newline="") as infile, open(output_file, mode="w", newline="") as outfile:
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)

    writer.writeheader()  # Write the column headers once
    for row in reader:
        # Example filter; int() assumes the column is always numeric
        if int(row["column_name"]) > 50:
            writer.writerow(row)
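
In practice, large CSVs often contain blank or otherwise non-numeric cells, which would make the int() call above raise ValueError. A small, hedged helper (still using the hypothetical column_name) that skips rows it cannot parse:

def keep_row(row):
    """Return True if the row passes the filter; skip rows that fail to parse."""
    try:
        return int(row["column_name"]) > 50
    except (ValueError, TypeError):
        return False  # Blank or non-numeric cell: skip the row

The loop body then becomes: if keep_row(row): writer.writerow(row)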
