Fastest Way To Run Through 50k Lines of Excel File in OpenPYXL

I'm using openpyxl in Python, and I'm trying to run through 50k lines, grab data from each row, and place it into a file. However, what I'm finding is that it runs incredibly slowly the farther I get into it. The first 1k lines go super fast, less than a minute, but after that it takes longer and longer to do the next 1k lines.

I was opening a .xlsx file. I wonder whether it would be faster to open a .txt file as CSV, or to read a JSON file, or to convert the data somehow into something that reads faster?
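One way I could test the CSV idea, I suppose, is to dump the sheet to a .csv once and then read that on later runs. A rough sketch of what I mean (the file names and the choice of the active sheet here are just placeholders, not my real setup):

import csv

import openpyxl

# One-off conversion: read the workbook once and dump it to CSV.
wb = openpyxl.load_workbook("data.xlsx", read_only=True)  # read_only streams rows lazily
ws = wb.active

with open("data.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row in ws.iter_rows(values_only=True):
        writer.writerow(row)

# Later runs can iterate over the CSV instead of the workbook.
with open("data.csv", newline="") as f:
    for record in csv.reader(f):
        pass  # process each row here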

I have 20 unique names in a given column, and each row pairs one of those names with some value. For each unique name I'm trying to build a comma-separated string of all the values that belong to it, for example:

Value1: 1243,345,34,124, Value2: 1243,345,34,124, etc, etc

I run through the list of names and check whether a file for that name already exists. If it does, I append the new value to it; if it doesn't, I create the file and then open it for appending. I keep a dictionary of the open append-mode file handles, so whenever I want to write something I look the file up by name in the dict and write to it, rather than opening a new file handle every time.

The first 1k took less than a minute, but now I'm in the 4k-5k range and it has already been running for 5 minutes. It seems to take longer and longer as the record count goes up, and I'm wondering how to speed it up. It's not printing to the console at all.

import os
import time
import datetime

theDict = {}   # cache of open append-mode file handles, keyed by name
counter = 1    # cell counter, used to build the "B<row>" reference

for row in ws.iter_rows(rowRange):
    for cell in row:
        # grab the value from column B for the current counter position
        theStringValueLocation = "B" + str(counter)
        theValue = ws[theStringValueLocation].value
        theName = cell.value
        textfilename = theName + ".txt"

        if os.path.isfile(textfilename):
            # the file exists, so an append handle for it is already cached
            listToAddTo = theDict[theName]
            listToAddTo.write("," + theValue)
            if counter % 1000 == 0:
                # progress timestamp every 1000 cells
                print counter
                st = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
                print st
        else:
            # first time this name is seen: create the file, write the first
            # value, then cache an append handle for later rows
            writeFileName = open(textfilename, 'w')
            writeFileName.write(theValue)
            writeFileName.close()
            theDict[theName] = open(textfilename, 'a')
        # note: counter advances once per cell, not once per row
        counter = counter + 1

I added some timestamps to the code above, and you can see the output below. The problem I'm seeing is that each 1k run takes longer and longer: 2 minutes the first time, then 3 minutes, then 5 minutes, then 7 minutes. By the time it hits 50k, I'm worried it's going to take an hour or more, which will be far too long.

1000
2016-02-25 15:15:08
2000
2016-02-25 15:17:07
3000
2016-02-25 15:20:52
2016-02-25 15:25:28
4000
2016-02-25 15:32:00
5000
2016-02-25 15:40:02
6000
2016-02-25 15:51:34
7000
2016-02-25 16:03:29
8000
2016-02-25 16:18:52
9000
2016-02-25 16:35:30
10000

Some things I should make clear: I don't know the names of the values ahead of time. Maybe I should run through and grab those in a separate Python script first to make this go faster?

Second, I need a string of all the values separated by commas; that's why I put them into a text file to grab later. I was thinking of doing it with a list, as was suggested to me, but I'm wondering if that will have the same problem. I suspect the problem has to do with reading from Excel. If there's another way to get a comma-separated string out of it, I'm happy to do it differently.

Or maybe I could use try/except instead of checking whether the file exists every time, and if there's an error, assume I need to create a new file? Maybe the "does the file exist" lookup on every iteration is what's making it so slow?
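Something like this is what I have in mind for the try/except version; the names here are just placeholders, and I'm not sure the isfile() check is really the bottleneck:

# EAFP version of the lookup: the dict itself tells us whether we have
# already opened a file for this name, so no isfile() call is needed.
open_files = {}

def append_value(name, value):
    try:
        f = open_files[name]
    except KeyError:
        # first time we see this name: open (and create) the file once
        f = open(str(name) + ".txt", "a")
        open_files[name] = f
    f.write("," + str(value))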

This question is a continuation of my original one here, where I took some of the suggestions: What is the fastest performance tuple for large data sets in python?

asked Feb 25 '16 by king

2 Answers

I think what you're trying to do is get a key out of column B of the row, and use that for the filename to append to. Let's speed it up a lot:

from collections import defaultdict

Value_entries = defaultdict(list)  # dict of lists of row data, keyed by the column-B value

for row in ws.iter_rows(rowRange):
    key = row[1].value  # column B (index 1) holds the name used as the file key

    Value_entries[key].extend([cell.value for cell in row])

# All done. Now write files:
for key in Value_entries.keys():
    with open(key + '.txt', 'w') as f:
        # str() guards against non-string cell values such as numbers
        f.write(','.join(str(v) for v in Value_entries[key]))
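For context, here is a rough sketch of how this might be wired up end to end. The workbook path, sheet choice, and column bounds are placeholder assumptions, and read_only mode is simply an optional openpyxl setting that avoids loading the whole sheet into memory at once:

from collections import defaultdict

import openpyxl

# Placeholder workbook and layout: column B is assumed to hold the key,
# column C the values to collect for that key.
wb = openpyxl.load_workbook("data.xlsx", read_only=True)
ws = wb.active

value_entries = defaultdict(list)

for row in ws.iter_rows(min_row=1, max_row=ws.max_row, min_col=2, max_col=3):
    key = row[0].value                      # column B
    values = [c.value for c in row[1:] if c.value is not None]
    value_entries[key].extend(values)

# Write each key's values once, as a comma-separated string.
for key, values in value_entries.items():
    with open(str(key) + ".txt", "w") as f:
        f.write(",".join(str(v) for v in values))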
answered by aghast

It looks like you only want cells from the B-column. In this case you can use ws.get_squared_range() to restrict the number of cells to look at.

import os

for row in ws.get_squared_range(min_col=2, max_col=2, min_row=1, max_row=ws.max_row):
    for cell in row:  # each row is always a sequence
        filename = cell.value
        if os.path.isfile(filename):
            …

It's not clear what's happening in the else branch of your code, but you should probably be closing any files you open as soon as you have finished with them.
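For example, one way to fill in that else branch while making sure every handle gets closed, assuming (this isn't stated in the question) that column B holds the name and column C the value to append:

import os

# Open each output file once in append mode, cache the handle in a dict,
# and close everything when the loop is done.
handles = {}
try:
    for row in ws.get_squared_range(min_col=2, max_col=3,
                                    min_row=1, max_row=ws.max_row):
        name, value = row[0].value, row[1].value
        if name is None or value is None:
            continue
        if name not in handles:
            handles[name] = open(str(name) + ".txt", "a")
        handles[name].write("," + str(value))
finally:
    for f in handles.values():
        f.close()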

answered by Charlie Clark