
Pandas to_csv progress bar with tqdm

As the title suggests, I am trying to display a progress bar while performing pandas.to_csv.
I have the following script:

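import pandas as pd
from tqdm import tqdm

def filter_pileup(pileup, output, lists):
    # BAR_DEFAULT_VIEW is a format string defined elsewhere in the script (not shown)
    tqdm.pandas(desc='Reading, filtering, exporting', bar_format=BAR_DEFAULT_VIEW)
    # Reading files (sep must be passed by keyword in recent pandas versions)
    pileup_df = pd.read_csv(pileup, sep='\t', header=None).progress_apply(lambda x: x)
    lists_df = pd.read_csv(lists, sep='\t', header=None).progress_apply(lambda x: x)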
    # Filtering pileup
    intersection = pd.merge(pileup_df, lists_df, on=[0, 1]).progress_apply(lambda x: x)
    intersection.columns = [i for i in range(len(intersection.columns))]
    intersection = intersection.loc[:, 0:5]
    # Exporting filtered pileup
    intersection.to_csv(output, header=None, index=None, sep='\t')

For the first few steps I found a way to integrate a progress bar (progress_apply with an identity function), but this approach doesn't work for the last line, the to_csv call. How can I show a progress bar for the export as well?

Asked Nov 05 '20 10:11 by Eliran Turgeman



1 Answer

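You can split the dataframe into chunks of n rows and write it to the CSV file chunk by chunk, using mode='w' for the first chunk (which creates the file and writes the header) and mode='a' for the rest (which append without a header). Wrapping the loop over chunks in tqdm then gives a progress bar that advances once per chunk written: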

Example:

import numpy as np
import pandas as pd
from tqdm import tqdm

df = pd.DataFrame(data=[i for i in range(0, 10000000)], columns = ["integer"])

print(df.head(10))

chunks = np.array_split(df.index, 100) # 100 chunks (100,000 rows each here)

for i, subset in enumerate(tqdm(chunks)):
    if i == 0: # first chunk: create the file and write the header
        df.loc[subset].to_csv('data.csv', mode='w', index=True)
    else: # later chunks: append without repeating the header
        df.loc[subset].to_csv('data.csv', header=False, mode='a', index=True)

Output:

   integer
0        0
1        1
2        2
3        3
4        4
5        5
6        6
7        7
8        8
9        9

100%|██████████| 100/100 [00:12<00:00,  8.12it/s]
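
The same idea can be wrapped in a reusable function so that it drops into the filter_pileup script from the question. This is a minimal sketch, not a pandas or tqdm feature: the helper name to_csv_with_progress and its chunk_rows parameter are hypothetical, and it assumes the dataframe fits in memory and that any extra keyword arguments are valid for DataFrame.to_csv.

```python
import pandas as pd
from tqdm import tqdm


def to_csv_with_progress(df, path, chunk_rows=100_000, **to_csv_kwargs):
    """Write df to path in row chunks, advancing a tqdm bar after each chunk.

    Hypothetical helper: `mode` is managed internally, and `header`
    is only applied to the first chunk.
    """
    # to_csv writes a header by default; remember the caller's choice.
    header = to_csv_kwargs.pop("header", True)
    n_rows = len(df)
    with tqdm(total=n_rows, unit="rows", desc="Exporting") as bar:
        # max(n_rows, 1) makes the loop run once for an empty dataframe,
        # so a header-only (or empty) file is still created.
        for start in range(0, max(n_rows, 1), chunk_rows):
            chunk = df.iloc[start:start + chunk_rows]
            first = start == 0
            chunk.to_csv(
                path,
                mode="w" if first else "a",          # create, then append
                header=header if first else False,   # header only once
                **to_csv_kwargs,
            )
            bar.update(len(chunk))
```

In the question's function, the last line would then become something like to_csv_with_progress(intersection, output, sep='\t', header=False, index=False). Smaller values of chunk_rows give a smoother bar at the cost of more to_csv calls, so the value is worth tuning for the data size.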
Answered Oct 17 '22 15:10 by Chicodelarose