I am quite new to Python and programming in general, but I am trying to run a "sliding window" calculation over a tab-delimited .txt file that contains about 7 million lines. What I mean by sliding window is that it will run a calculation over, say, 50,000 lines, report the result, then move up say 10,000 lines and perform the same calculation over the next 50,000 lines. I have the calculation and the "sliding window" working correctly, and it runs well when I test it on a small subset of my data. However, if I try to run the program over my entire data set it is incredibly slow (it has now been running for about 40 hours). The math is quite simple, so I don't think it should be taking this long.
The way I am reading my .txt file right now is with the csv.DictReader class. My code is as follows:
import csv

file1 = '/Users/Shared/SmallSetbee.txt'
newfile = open(file1, 'rb')
reader = csv.DictReader((line.replace('\0', '') for line in newfile), delimiter="\t")
I believe that this is making a dictionary out of all 7 million lines at once, which I'm thinking could be the reason it slows down so much for the larger file.
Since I am only interested in running my calculation over "chunks" or "windows" of data at a time, is there a more efficient way to read in only the specified lines, perform the calculation, and then repeat with a new "chunk" or "window" of lines?
For reading large text files in Python, you can use the file object as an iterator. The iterator returns each line one by one, so each line can be processed as it is read. This does not load the whole file into memory, which makes it suitable for large files.
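For instance, a minimal sketch of this approach (the file path is taken from the question; the per-line handling is an assumption):

# Iterate the tab-delimited file one line at a time without loading it all into memory.
# The path comes from the question; splitting on tabs and the processing step are assumptions.
with open('/Users/Shared/SmallSetbee.txt') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        # ... perform your per-line calculation on `fields` here ...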
You can also read data from a text file using read_table() in pandas. This function reads a general delimited file into a DataFrame object; it is essentially the same as read_csv(), but with a tab delimiter instead of a comma by default.
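If pandas is available, its chunksize parameter can stream the file in pieces rather than building one huge DataFrame. A rough sketch, assuming a 50,000-row chunk and a placeholder calculation; note that chunks do not overlap, so a true sliding window would still need extra bookkeeping:

import pandas as pd

# Read the tab-delimited file 50,000 rows at a time; each chunk is a DataFrame.
# The path and window size mirror the question; the mean() call is a placeholder calculation.
for chunk in pd.read_csv('/Users/Shared/SmallSetbee.txt', sep='\t', chunksize=50000):
    print(chunk.mean(numeric_only=True))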
A collections.deque is an ordered collection of items which can take a maximum size. When you add an item to one end, one falls off the other end. This means that to iterate over a "window" on your csv, you just need to keep adding rows to the deque and it will handle discarding the old ones automatically.
import collections
import csv

dq = collections.deque(maxlen=50000)

with open(...) as csv_file:
    reader = csv.DictReader((line.replace("\0", "") for line in csv_file), delimiter="\t")

    # initial fill
    for _ in range(50000):
        dq.append(next(reader))

    # repeated compute
    try:
        while True:
            compute(dq)
            for _ in range(10000):
                dq.append(next(reader))
    except StopIteration:
        compute(dq)
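compute above stands in for whatever calculation you run over each window; a minimal sketch, assuming the calculation is an average of a hypothetical numeric column named "count":

def compute(window):
    # Average a numeric field over the current window of rows from the DictReader.
    # The column name "count" is hypothetical; substitute your own field.
    values = [float(row["count"]) for row in window]
    print(sum(values) / len(values))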
Don't use csv.DictReader; instead use csv.reader. It takes longer to create a dictionary for each row than it takes to create a list for each row. Additionally, it is marginally faster to access a list by an index than it is to access a dictionary by a key.
I timed iteration over a 300,000-line, 4-column csv file using the two csv readers. csv.DictReader took seven times longer than csv.reader.
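A rough way to reproduce that comparison yourself (the file name is hypothetical and the exact ratio will vary by machine):

import csv
import time

def time_reader(make_reader, path='test.csv'):
    # Time one full pass over the file with the given reader class.
    with open(path) as f:
        start = time.time()
        for _ in make_reader(f):
            pass
        return time.time() - start

print('csv.reader:    ', time_reader(csv.reader))
print('csv.DictReader:', time_reader(csv.DictReader))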
Combine this with katrielalex's suggestion to use collections.deque and you should see a nice speedup.
Additionally, profile your code to pinpoint where you are spending most of your time.
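The standard library's cProfile is one way to do that; for example (where main() is a placeholder for your script's entry point):

import cProfile

# Print a profile of main() sorted by cumulative time to see where the hours are going.
cProfile.run('main()', sort='cumulative')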