As the title suggests, I am trying to display a progress bar while performing pandas.to_csv.
I have the following script:
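import pandas as pd
from tqdm import tqdm

def filter_pileup(pileup, output, lists):
    # BAR_DEFAULT_VIEW is a bar format string defined elsewhere in the script (not shown)
    tqdm.pandas(desc='Reading, filtering, exporting', bar_format=BAR_DEFAULT_VIEW)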
    # Reading files
    pileup_df = pd.read_csv(pileup, sep='\t', header=None).progress_apply(lambda x: x)
    lists_df = pd.read_csv(lists, sep='\t', header=None).progress_apply(lambda x: x)
    # Filtering pileup
    intersection = pd.merge(pileup_df, lists_df, on=[0, 1]).progress_apply(lambda x: x)
    intersection.columns = list(range(len(intersection.columns)))
    intersection = intersection.loc[:, 0:5]
    # Exporting filtered pileup
    intersection.to_csv(output, header=None, index=None, sep='\t')
For the first few lines I found a way to integrate a progress bar, but this method doesn't work for the last line (the to_csv call). How can I achieve that?
The pandas DataFrame.to_csv() method converts a DataFrame into CSV data. Passing a path or file object writes the CSV data to that file; otherwise, the CSV data is returned as a string.
Does pandas to_csv overwrite? With the default mode='w', if the file already exists, it will be overwritten. If no path is given, the DataFrame is serialized into a string, and that string is returned.
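A minimal sketch of both behaviours described above, assuming a small throwaway DataFrame and a temporary file path (both invented here for illustration):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# No path given: the CSV text is returned as a string.
csv_text = df.to_csv(index=False)

# Writing twice to the same path with the default mode='w'
# replaces the earlier contents instead of appending to them.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
df.to_csv(path, index=False)
df.head(1).to_csv(path, index=False)

with open(path) as handle:
    on_disk = handle.read()
```

After the second call the file should hold only the header and the first row; passing mode='a' is what makes to_csv append instead.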
A tqdm instance is constructed as soon as trange is called: trange(n) is shorthand for tqdm.tqdm(range(n)). I'm not sure if there's any way around that. However, you can delay the construction by keeping a temporary range(n) object.
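One way to read that suggestion, as a rough sketch: hold on to the plain range and only wrap it in tqdm at the point where the loop actually runs. The variable names are made up for this example.

```python
from tqdm import tqdm, trange

n = 5

# Bar is constructed immediately: trange(n) is tqdm(range(n)).
total_now = sum(i for i in trange(n))

# Bar is constructed later: keep the plain range around first.
pending = range(n)
# ... other setup work can happen here, with no tqdm instance yet ...
total_later = sum(i for i in tqdm(pending))
```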
tqdm shows the progress bar, number of iterations, time taken to run the loop, and frequency of iterations per second. In this tutorial, we will learn to customize the bar and integrate it with the pandas dataframe.
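For the pandas integration mentioned above, a small sketch: tqdm.pandas() registers progress_apply on pandas objects, and keyword arguments such as desc customize the bar. The column names and the lambda are assumptions chosen for the example.

```python
import pandas as pd
from tqdm import tqdm

# Registers progress_apply on Series and DataFrame; desc labels the bar.
tqdm.pandas(desc="Squaring")

df = pd.DataFrame({"x": range(1000)})

# Same result as .apply, but a progress bar advances as elements are processed.
df["x_squared"] = df["x"].progress_apply(lambda v: v * v)
```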
to_csv itself offers no progress hook, but you can divide the dataframe into chunks of rows and save it to a csv chunk by chunk, using mode='w' for the first chunk and mode='a' for the rest:
Example:
import numpy as np
import pandas as pd
from tqdm import tqdm
df = pd.DataFrame(data=[i for i in range(0, 10000000)], columns = ["integer"])
print(df.head(10))
chunks = np.array_split(df.index, 100)  # split the index into 100 chunks
for chunk, subset in enumerate(tqdm(chunks)):
    if chunk == 0:  # first chunk: create the file and write the header
        df.loc[subset].to_csv('data.csv', mode='w', index=True)
    else:  # later chunks: append without repeating the header
        df.loc[subset].to_csv('data.csv', header=False, mode='a', index=True)
Output:
   integer
0        0
1        1
2        2
3        3
4        4
5        5
6        6
7        7
8        8
9        9
100%|██████████| 100/100 [00:12<00:00,  8.12it/s]
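The same chunking idea can be wrapped in a reusable function. This is only a sketch: to_csv_with_progress and its chunk_rows parameter are hypothetical names, not part of pandas or tqdm, and it assumes the caller does not pass mode in the extra keyword arguments.

```python
import os
import tempfile

import pandas as pd
from tqdm import tqdm


def to_csv_with_progress(df, path, chunk_rows=100_000, **to_csv_kwargs):
    """Write df to path in row chunks, advancing a tqdm bar per chunk.

    Hypothetical helper; extra keyword arguments go to DataFrame.to_csv.
    """
    # Take header out so it can be suppressed on the appended chunks.
    header = to_csv_kwargs.pop("header", True)

    if len(df) == 0:
        # Nothing to chunk: a single call still creates the file.
        df.to_csv(path, header=header, **to_csv_kwargs)
        return

    with tqdm(total=len(df), unit="rows", desc="Exporting") as bar:
        for start in range(0, len(df), chunk_rows):
            chunk = df.iloc[start:start + chunk_rows]
            first = start == 0
            chunk.to_csv(
                path,
                mode="w" if first else "a",  # create, then append
                header=header if first else False,
                **to_csv_kwargs,
            )
            bar.update(len(chunk))


# Usage with throwaway data and a temporary output path.
df = pd.DataFrame({"integer": range(250_000)})
out_path = os.path.join(tempfile.mkdtemp(), "data.csv")
to_csv_with_progress(df, out_path, chunk_rows=100_000, index=False)
round_trip = pd.read_csv(out_path)
```

In the script from the question, the last line could then become something like to_csv_with_progress(intersection, output, header=False, index=False, sep='\t'). Smaller chunks give a smoother bar at the cost of more to_csv calls.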