How do I make a progress bar for loading pandas DataFrame from a large xlsx file?

This example is from https://pypi.org/project/tqdm/:

import pandas as pd
import numpy as np
from tqdm import tqdm

df = pd.DataFrame(np.random.randint(0, 100, (100000, 6)))
tqdm.pandas(desc="my bar!")
df.progress_apply(lambda x: x**2)

I took this code and edited it so that the DataFrame is created with read_excel rather than from random numbers:

import pandas as pd
from tqdm import tqdm
import numpy as np

filename="huge_file.xlsx"
df = pd.DataFrame(pd.read_excel(filename))
tqdm.pandas()
df.progress_apply(lambda x: x**2)

This gave me an error, so I changed df.progress_apply to this:

df.progress_apply(lambda x: x)

Here is the final code:

import pandas as pd
from tqdm import tqdm
import numpy as np

filename="huge_file.xlsx"
df = pd.DataFrame(pd.read_excel(filename))
tqdm.pandas()
df.progress_apply(lambda x: x)

This produces a progress bar, but it doesn't show any actual progress: the bar appears, and when the operation is done it jumps straight to 100%, which defeats the purpose.

My question is this: How do I make this progress bar work?
What does the function inside of progress_apply actually do?
Is there a better approach? Maybe an alternative to tqdm?

Any help is greatly appreciated.

asked Sep 06 '18 17:09 by user2303336


2 Answers

This will not work. pd.read_excel blocks until the whole file has been read, and it offers no way to get information about its progress while it is running. By the time progress_apply is called, the slow part (loading the file) is already over.
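The jump to 100% can be reproduced without any file. The sketch below is only an illustration (the helper `square` and the `calls` list are made up for the demo): it records which columns the function passed to progress_apply receives. With the default axis=0 the function is called per column, so the bar counts columns of an already loaded DataFrame, not rows read from disk.

```python
import numpy as np
import pandas as pd
from tqdm import tqdm

# Register progress_apply on pandas objects.
tqdm.pandas(desc="apply")

df = pd.DataFrame(np.random.randint(0, 100, (1000, 6)))

calls = []  # hypothetical bookkeeping, only to show what gets passed in


def square(column):
    # With the default axis=0, each call receives one column as a Series.
    calls.append(column.name)
    return column ** 2


result = df.progress_apply(square)
print("columns seen by the function:", sorted(set(calls)))
```

For a DataFrame with only a handful of columns, the bar therefore finishes almost instantly, whatever the size of the file it came from.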

A progress bar would work for read operations that can be done chunk-wise, like

chunks = []
for chunk in pd.read_csv(..., chunksize=1000):
    update_progressbar()
    chunks.append(chunk)

But as far as I understand, tqdm needs the total number of chunks in advance to show a percentage (without a total it only shows a running count and rate), so for a proper progress report you would need to go through the full file once first.
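A minimal sketch of the chunked approach for CSV files, assuming the file has a single header line and no quoted fields with embedded newlines. The function name `read_csv_with_progress` is made up for this example. It does a cheap first pass to count lines so tqdm has a total, then advances the bar by the number of rows in each chunk.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from tqdm import tqdm


def read_csv_with_progress(path, chunksize=1000):
    """Read a CSV in chunks, advancing a tqdm bar by the rows in each chunk."""
    # First pass: count data rows (all lines minus the header line).
    # Assumption: no quoted field contains a newline.
    with open(path, "rb") as fh:
        total_rows = sum(1 for _ in fh) - 1

    chunks = []
    with tqdm(total=total_rows, desc="Loading CSV", unit="rows") as bar:
        for chunk in pd.read_csv(path, chunksize=chunksize):
            chunks.append(chunk)
            bar.update(len(chunk))
    return pd.concat(chunks, ignore_index=True)


if __name__ == "__main__":
    # Build a small throwaway file so the sketch can be run as-is.
    demo = pd.DataFrame(np.random.randint(0, 100, (10_000, 6)))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.csv")
        demo.to_csv(path, index=False)
        df = read_csv_with_progress(path)
    print(df.shape)
```

Counting lines is much cheaper than parsing, so the extra pass usually costs little compared with the load itself. This only helps if the data can be converted to CSV; pd.read_excel has no chunksize option.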

answered Apr 27 '23 22:04 by rocksportrocker


This might help people with a similar problem. The bar can be advanced once per read_excel call.

For example:

import pandas as pd
from tqdm import tqdm

# One bar step per sheet; the bar advances each time a read_excel call returns.
sheets = [0, "Sheet1", "Sheet2"]
frames = [pd.read_excel("some_file.xlsx", sheet_name=s, header=None)
          for s in tqdm(sheets, ncols=100, desc="Loading data..")]
df, LC_data, FC_data = frames
print("------Loading is completed ------")
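A bar that moves once per sheet is coarse when a single sheet holds most of the data. One possible alternative, sketched here under assumptions, is to bypass pd.read_excel and stream rows with openpyxl in read-only mode, wrapping the row iterator in tqdm. The function name `read_xlsx_with_progress` is made up, the first row is assumed to be a header, and column types are whatever openpyxl returns rather than what pd.read_excel would infer.

```python
import pandas as pd
from openpyxl import load_workbook
from tqdm import tqdm


def read_xlsx_with_progress(path, sheet_name=None):
    """Stream one worksheet row by row into a DataFrame, with a tqdm bar."""
    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        ws = wb[sheet_name] if sheet_name else wb.worksheets[0]
        rows = ws.iter_rows(values_only=True)
        header = next(rows)  # assumption: the first row holds column names
        # max_row comes from the sheet's stored dimensions and can be None in
        # read-only mode; tqdm then shows a plain counter without a percentage.
        total = ws.max_row - 1 if ws.max_row else None
        data = [row for row in tqdm(rows, total=total,
                                    desc="Loading xlsx", unit="rows")]
    finally:
        wb.close()
    return pd.DataFrame(data, columns=list(header))
```

Usage would look like `df = read_xlsx_with_progress("huge_file.xlsx")`. Whether this is faster or slower than pd.read_excel depends on the file, so it is worth timing on real data before adopting it.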
answered Apr 27 '23 22:04 by sardor mirzaev