Pandas is a powerful python library for data manipulation and analysis. I’ve learned some interesting pandas techniques that I thought is worth documenting for others and also as a personal learning.

Working with parquet files

Important to install right dependencies like pyarrow or fastparquet for handling such files in pandas.

Monitoring loading of large data files

Integrating tqdm in the code to see a progress bar
Using tqdm with pandas for progress tracking especially when iterating over row or performing transformations on large dataframes.


from tqdm import tqdm
import pandas as pd

# Enable tqdm for pandas
tqdm.pandas()

file_path = "large_data.csv"
chunk_size = 100000  # Process in chunks if needed

# Process with a progress bar
df = pd.concat(
    pd.read_csv(file_path, chunksize=chunk_size, iterator=True).progress_apply(lambda x: x)
)
print(df.info())

Using tqdm for DataFrame iteration

Iterating through rows or applying transformations on large datasets can be monitored via a progress bar using tqdm.

Grouping without aggregation

This one is relatively simple concept but handy to know.
Pandas makes it easy to group data by multiple columns w/o applying any aggregation using groupby() and as_index=False.
For example:


grouped_df = df.groupby(['column1', 'column2'], as_index=False).apply(lambda x: x)