
Processing a large .txt file in python efficiently

I am quite new to Python and programming in general, but I am trying to run a "sliding window" calculation over a tab-delimited .txt file that contains about 7 million lines. What I mean by sliding window is that it will run a calculation over, say, 50,000 lines, report the number, then move up, say, 10,000 lines and perform the same calculation over another 50,000 lines. I have the calculation and the "sliding window" working correctly, and it runs well when I test it on a small subset of my data. However, if I try to run the program over my entire data set it is incredibly slow (it has now been running for about 40 hours). The math is quite simple, so I don't think it should be taking this long.

The way I am reading my .txt file right now is with the csv.DictReader module. My code is as follows:

import csv

file1 = '/Users/Shared/SmallSetbee.txt'
newfile = open(file1, 'rb')
reader = csv.DictReader((line.replace('\0', '') for line in newfile), delimiter="\t")

I believe that this is making a dictionary out of all 7 million lines at once, which I suspect could be the reason it slows down so much for the larger file.

Since I am only interested in running my calculation over "chunks" or "windows" of data at a time, is there a more efficient way to read in only the specified lines, perform the calculation, and then repeat with a new "chunk" or "window" of lines?
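For reference, here is a minimal sketch of the kind of windowed reader this describes, assuming the window and step sizes from the question; sliding_windows is a hypothetical helper, and the calculation itself is left as a placeholder:

import csv
from collections import deque

def sliding_windows(path, window=50000, step=10000):
    # Yield overlapping windows of rows from a tab-delimited file.
    # The same deque object is yielded each time, so consume it
    # before advancing to the next window.
    with open(path, newline="") as f:
        reader = csv.reader((line.replace("\0", "") for line in f), delimiter="\t")
        buf = deque(maxlen=window)
        # Fill the first window.
        for row in reader:
            buf.append(row)
            if len(buf) == window:
                break
        yield buf
        # Slide forward `step` rows at a time; the deque drops old rows itself.
        added = 0
        for row in reader:
            buf.append(row)
            added += 1
            if added == step:
                yield buf
                added = 0
        if added:
            yield buf  # final, partially advanced window

for window in sliding_windows('/Users/Shared/SmallSetbee.txt'):
    pass  # run the calculation over `window` here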

Asked Nov 15 '12 by flz416

People also ask

How do you process a large text file in Python?

We can use the file object as an iterator. The iterator returns each line one by one, and each line can be processed as it is read. This does not read the whole file into memory, which makes it suitable for reading large files in Python.
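For example, looping over the file object directly yields one line at a time (the filename here is a placeholder):

with open("big_file.txt") as f:  # placeholder path
    for line in f:               # the file object yields one line at a time
        fields = line.rstrip("\n").split("\t")
        # ... do the per-line work on `fields` here ...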

How do I read large text files in pandas?

We can read data from a text file using read_table() in pandas. This function reads a general delimited file into a DataFrame object. It is essentially the same as read_csv(), but with delimiter='\t' as the default instead of a comma.
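A rough sketch of chunked reading with pandas (the filename and chunk size are placeholders; passing chunksize makes read_table return an iterator of DataFrames instead of one big frame):

import pandas as pd

for chunk in pd.read_table("data.txt", chunksize=50000):  # placeholder path
    # Each `chunk` is a DataFrame of up to 50,000 rows.
    print(chunk.mean(numeric_only=True))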


2 Answers

A collections.deque is an ordered collection of items that can take a maximum size. When you add an item to one end, one falls off the other end. This means that to iterate over a "window" of your csv, you just need to keep appending rows to the deque, and it will discard the oldest rows automatically.

import collections
import csv

dq = collections.deque(maxlen=50000)
with open(...) as csv_file:
    reader = csv.DictReader((line.replace("\0", "") for line in csv_file), delimiter="\t")

    # initial fill: load the first 50,000-row window
    for _ in range(50000):
        dq.append(next(reader))

    # repeatedly compute, then slide the window forward by 10,000 rows
    try:
        while True:
            compute(dq)
            for _ in range(10000):
                dq.append(next(reader))
    except StopIteration:
        compute(dq)  # final (possibly partial) window
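Note that with a window of 50,000 rows and a step of 10,000, consecutive windows overlap by 40,000 rows; the deque evicts the oldest rows automatically as new ones are appended, so no manual bookkeeping is needed.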
Answered Sep 30 '22 by Katriel


Don't use csv.DictReader; use csv.reader instead. It takes longer to create a dictionary for each row than it does to create a list. Additionally, it is marginally faster to access a list by index than to access a dictionary by key.
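As a rough sketch of the switch (the file path comes from the question; the column index is only an illustration):

import csv

with open('/Users/Shared/SmallSetbee.txt', newline="") as f:
    reader = csv.reader((line.replace('\0', '') for line in f), delimiter="\t")
    for row in reader:
        value = row[2]  # access fields by position, e.g. the third column
        # ... use `value` in the calculation ...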

I timed iteration over a 300,000-line, 4-column csv file using the two csv readers: csv.DictReader took seven times longer than csv.reader.

Combine this with katrielalex's suggestion to use collections.deque and you should see a nice speedup.

Additionally, profile your code to pinpoint where you are spending most of your time.
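For example, the standard-library cProfile and pstats modules can show where the time goes (run_analysis() is a hypothetical entry point for your script):

import cProfile
import pstats

cProfile.run("run_analysis()", "profile.out")  # profile the whole run
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(10)  # ten most expensive calls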

Answered Sep 30 '22 by Steven Rumbalski