Pandas is a powerful Python library for data manipulation and analysis. I’ve learned some interesting pandas techniques that I think are worth documenting, both for others and for my own reference.

Working with parquet files

  • pandas reads and writes parquet through a separate engine, so install pyarrow or fastparquet first; without one of them, read_parquet() and to_parquet() raise an ImportError.
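
A minimal sketch of the round trip, assuming at least one of the two engines is installed. The DataFrame contents and the file name sample.parquet are made up for illustration; the script picks whichever engine it finds and prints an install hint otherwise.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Check which parquet engines are importable in this environment.
available = [
    name for name in ("pyarrow", "fastparquet")
    if importlib.util.find_spec(name) is not None
]

# Hypothetical sample data, only for illustration.
df_in = pd.DataFrame({"id": [1, 2, 3], "city": ["Oslo", "Lima", "Pune"]})
df_out = None

if available:
    engine = available[0]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.parquet")
        # Write without the index, then read back only the columns we need.
        df_in.to_parquet(path, engine=engine, index=False)
        df_out = pd.read_parquet(path, engine=engine, columns=["id", "city"])
    print(f"Round trip via {engine}:")
    print(df_out)
else:
    print("No parquet engine found. Try: pip install pyarrow")
```

Passing columns= to read_parquet() is worth the habit on wide files, since parquet is columnar and unrequested columns are never read from disk.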

Monitoring loading of large data files

  • Integrating tqdm into the loading code shows a progress bar while a large file is read.
  • tqdm also works with pandas for progress tracking, which is especially useful when iterating over rows or performing transformations on large DataFrames.

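from tqdm import tqdm
import pandas as pd

file_path = "large_data.csv"
chunk_size = 100_000  # Number of rows read per chunk

# With chunksize set, read_csv returns an iterator of DataFrames instead of
# one DataFrame. Wrapping that iterator in tqdm advances the bar once per chunk.
# The total is unknown up front, so tqdm shows a running count and rate.
chunks = pd.read_csv(file_path, chunksize=chunk_size)

# Concatenate the chunks into a single DataFrame with a fresh index
df = pd.concat(tqdm(chunks, desc="Loading", unit="chunk"), ignore_index=True)

# info() prints its summary directly and returns None
df.info()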

Using tqdm for DataFrame iteration

  • Calling tqdm.pandas() once adds progress_apply() to DataFrames and Series; it behaves like apply() but shows a progress bar. For explicit row loops, wrap the iterator in tqdm() instead.
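
Both patterns can be sketched as follows. The price and qty columns and the bar descriptions are made-up illustration values, not part of any real dataset.

```python
import pandas as pd
from tqdm import tqdm

# Register progress_apply / progress_map on pandas objects.
tqdm.pandas(desc="Transforming rows")

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Same semantics as df.apply(..., axis=1), plus a progress bar.
df["total"] = df.progress_apply(lambda row: row["price"] * row["qty"], axis=1)

# For an explicit loop, wrap the iterator and pass total= so tqdm
# can show a percentage (a generator has no length of its own).
expensive = []
for row in tqdm(df.itertuples(index=False), total=len(df), desc="Iterating rows"):
    if row.total > 15:
        expensive.append(row.price)

print(df)
print(expensive)
```

The bar adds a small per-row overhead, so it pays off on long-running transformations rather than on tiny frames like this one.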

Grouping without aggregation

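  • This one is a relatively simple concept, but handy to know.
  • groupby() does not aggregate anything by itself: it returns a lazy GroupBy object, so you can group by multiple columns and get the rows back unsummarized. Passing as_index=False keeps the group keys out of the result's index.
  • For example, applying an identity function returns every row of each group (the exact index and row order of the result vary between pandas versions):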

grouped_df = df.groupby(['column1', 'column2'], as_index=False).apply(lambda x: x)
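
An alternative that avoids apply() altogether is to work with the GroupBy object directly. This is a minimal sketch of that swapped-in approach; the sample values are made up, and column1 and column2 match the names used above.

```python
import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({
    "column1": ["a", "a", "b", "b"],
    "column2": ["x", "y", "x", "x"],
    "value": [1, 2, 3, 4],
})

# No aggregation happens here; this only defines the grouping.
grouped = df.groupby(["column1", "column2"])

# Iterating yields (key, sub-DataFrame) pairs; with two grouping
# columns each key is a tuple such as ("a", "x").
group_sizes = {}
for key, group in grouped:
    group_sizes[key] = len(group)

# Pull out a single group by its key, with all original rows intact.
bx = grouped.get_group(("b", "x"))

print(group_sizes)
print(bx)
```

Iterating or calling get_group() leaves each group's rows and columns untouched, which makes it a predictable choice when the goal is to inspect or process groups one at a time.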