I have some large files (more than 30 GB) with pieces of information on which I need to do some calculations, like averaging. The pieces I mention are slices of the file, and I know the beginning line number and the count of following lines for each slice.
So I have a dictionary whose keys are beginning line numbers and whose values are the counts of following rows, and I use this dictionary to loop through the file and take slices of it. For each slice, I create a table, do some conversions and averaging, create a new table and convert it into a dictionary. I use islice for slicing and a pandas DataFrame to create a table from each slice.
However, over time the process gets slower and slower, even though the slices are more or less the same size:
First 1k slices - processed in 1h
Second 1k slices - processed in 4h
Third 1k slices - processed in 8h
Fourth 1k slices - processed in 17h
And I am waiting for days for the process to complete.
Right now I am doing this on a Windows 10 machine with a 1 TB SSD and 32 GB RAM. Previously I also tried it on a Linux machine (Ubuntu 18.04) with a 250 GB SSD and 8 GB RAM + 8 GB virtual RAM. Both gave more or less the same result.
What I noticed on Windows is that 17% of the CPU and 11% of the memory are being used, but disk usage is at 100%. I do not fully understand what disk usage means or how I can improve it.
As a part of the code I was also importing data into MongoDB while working on Linux, and I thought maybe the slowdown was because of indexing in MongoDB. But when I printed the processing time and the import time, I noticed that almost all the time is spent on processing; the import takes a few seconds.
Also, to save time, I am now doing the processing part on a stronger Windows machine and writing the docs as txt files. I expect that writing to disk slows the process down a bit, but the txt files are not larger than 600 KB.
Below is the piece of code showing how I read the file:
from itertools import islice

with open(infile) as inp:
    for i in range(0, len(seg_ids)):
        inp.seek(0)
        segment_slice = islice(inp, list(seg_ids.keys())[i],
                               list(seg_ids.keys())[i] + list(seg_ids.values())[i] + 1)
        segment = list(segment_slice)
        for _, line in enumerate(segment[1:]):
            # create dataframe and perform calculations
So I want to know if there is a way to improve the processing time. I suppose my code reads the whole file from the beginning for each slice, and as the slices get closer to the end of the file, reading takes longer and longer.
As a note, because of the time constraints, I started with the most important slices first, so the remaining ones will be more random slices across the files. The solution should therefore work for random slices as well, if there is one (I hope).
I am not experienced in scripting so please forgive me if I am asking a silly question, but I really could not find any answer.
The problem here is you are rereading a HUGE unindexed file multiple times line by line from the start of the file. No wonder this takes HOURS.
Each islice in your code starts from the very beginning of the file -- every time -- and reads and discards all the lines from the file before it even reaches the start of the data of interest. That is very slow and inefficient.
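To make that concrete, here is a small sketch (with a made-up file name and line numbers): even when you only want ten lines, islice still has to read and throw away every line before the start index.

from itertools import islice

with open('big.txt') as f:                    # 'big.txt' is a made-up example file
    # islice cannot jump ahead: it reads and discards the first
    # 10,000,000 lines before yielding the ten lines requested --
    # and a loop that seeks back to 0 pays that cost on every pass.
    wanted = list(islice(f, 10_000_000, 10_000_010))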
The solution is to create a poor man's index to that file and then read smaller chunks for each slice.
Let's create a test file:
from pathlib import Path

p = Path('/tmp/file')
with open(p, 'w') as f:
    for i in range(1024*1024*500):
        f.write(f'{i+1}\n')
print(f'{p} is {p.stat().st_size/(1024**3):.2f} GB')
That creates a file of roughly 4.78 GB. Not as large as 30 GB but big enough to be SLOW if you are not thoughtful.
Now try reading that entire file line by line with the Unix utility wc to count the total lines (wc being the fastest way to count lines, in general):
$ time wc -l /tmp/file
524288000 file
real 0m3.088s
user 0m2.452s
sys 0m0.636s
Compare that with the speed for Python3 to read the file line by line and print the total:
$ time python3 -c 'with open("file","r") as f: print(sum(1 for l in f))'
524288000
real 0m53.880s
user 0m52.849s
sys 0m0.940s
Python is almost 18x slower than wc at reading the file line by line.
Now do a further comparison. Check out the speed of the Unix utility tail at printing the last n lines of a file:
$ time tail -n 3 file
524287998
524287999
524288000
real 0m0.007s
user 0m0.003s
sys 0m0.004s
The tail utility is 445x faster than wc (and roughly 8,000x faster than Python) at getting to the last three lines of the file because it uses a windowed buffer; i.e., tail reads some number of bytes at the end of the file and then gets the last n lines from the buffer it read.
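For intuition, here is a rough Python sketch of that windowed-buffer idea (the block size is chosen arbitrarily, and it assumes the last block holds at least n complete lines):

import os

def tail_lines(path, n=3, block_size=1024*1024):
    # Read only the final block of the file and take the last n lines
    # from it -- roughly what tail does instead of scanning from line 1.
    with open(path, 'rb') as f:
        size = os.fstat(f.fileno()).st_size
        f.seek(max(0, size - block_size))
        block = f.read()
    return [line.decode() for line in block.splitlines()[-n:]]

print(tail_lines('/tmp/file'))   # ['524287998', '524287999', '524288000']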
It is possible to apply the same tail approach in your application.
Consider this photo:
[photo: a rack of magnetic tape reels from the 1950s]
The approach you are using is the equivalent of reading every tape on that rack to find the data that is on only two of the middle tapes -- and doing it over and over and over...
In the 1950s (the era of the photo), each tape was roughly indexed for what it held. Computers would call for a specific tape in the rack -- not ALL the tapes in the rack.
The solution to your issue (in overview) is to build a tape-like indexing scheme:

- Instead of inp.seek(0) before every read, you would seek to the block that contains the line number of interest (just like tail does) and then use islice with the offset adjusted to where that block's starting line number is in relation to the start of the file.
- In the tape era, data larger than one tape could hold was split across tapes labeled x, y, z, ...; you only need to find x -- the one which contains the starting line number of interest.

Correctly implemented (and this is not terribly hard to do), it would speed up the read performance by 100x or more.
Constructing a useful index to a text file by line offset might be as easy as something like this:
def index_file(p, delimiter=b'\n', block_size=1024*1024):
    # maps a line number -> byte offset of the start of that line,
    # with one entry recorded per block read
    index = {0: 0}
    total_lines, cnt = (0, 0)
    with open(p, 'rb') as f:
        while buf := f.raw.read(block_size):
            cnt = buf.count(delimiter)
            idx = buf.rfind(delimiter)        # last newline in this block
            key = cnt + total_lines           # number of the line starting just after it
            index[key] = f.tell() - (len(buf) - idx) + len(delimiter)
            total_lines += cnt
    return index

# this index is created in about 4.9 seconds on my computer...
# with a 4.8 GB file, there are ~4,800 index entries
That constructs an index correlating the line number at the start of each block with its byte offset from the beginning of the file:
>>> idx=index_file(p)
>>> idx
{0: 0, 165668: 1048571, 315465: 2097150, 465261: 3145722,
...
524179347: 5130682368, 524284204: 5131730938, 524288000: 5131768898}
Then if you want access to lines[524179347:524179500], you don't need to read 4.5 GB to get there; you can just do f.seek(5130682368) and start reading instantly.
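Putting the pieces together, here is a sketch of how you might read one of your slices with that index (read_slice and its arguments are illustrative, not a finished API): use bisect to find the largest indexed line number at or below the slice start, seek to its byte offset, and let islice skip only the few remaining lines inside that block.

from bisect import bisect_right
from itertools import islice

def read_slice(path, index, start_line, n_lines):
    keys = sorted(index)                          # indexed line numbers, one per block
    block_line = keys[bisect_right(keys, start_line) - 1]
    with open(path, 'rb') as f:
        f.seek(index[block_line])                 # jump straight to that block
        skip = start_line - block_line            # lines left to skip inside the block
        return [ln.decode() for ln in islice(f, skip, skip + n_lines)]

idx = index_file(p)
lines = read_slice(p, idx, 524179347, 153)        # the 153 lines starting at line 524179347

Because the lookup works for any starting line number, this scheme handles random slices just as well as sequential ones: each read now touches only one block plus the slice itself.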
A couple of things come to mind.
First, if you bring the data into a pandas DataFrame, there is a chunksize argument for importing large data. It allows you to process / dump what you need / don't need, while providing information such as df.describe, which will give you summary stats.
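For example, a running average over one numeric column might look roughly like this (the file name and the 'value' column are placeholders):

import pandas as pd

total, count = 0.0, 0
# Read the file in 1-million-row chunks rather than loading 30 GB at once.
for chunk in pd.read_csv('bigfile.txt', chunksize=1_000_000):
    total += chunk['value'].sum()                 # 'value' is a placeholder column name
    count += len(chunk)
print(total / count)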
Also, I hear great things about dask. It is a scalable platform for parallel, multi-core, multi-machine processing and is almost as simple to use as pandas and numpy, with very little management of resources required.
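A rough equivalent with dask might look like this (same placeholder names; dask partitions the file and evaluates lazily):

import dask.dataframe as dd

df = dd.read_csv('bigfile.txt')                   # splits the file into partitions
print(df['value'].mean().compute())               # computed in parallel across cores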
Use pandas or dask and pay attention to the options for read_csv(), mainly: chunksize, nrows, skiprows, usecols, engine (use C), low_memory, memory_map.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
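In particular, skiprows plus nrows maps directly onto your start-line / row-count dictionary; something like this (placeholder file name and example values) reads one slice into a DataFrame, although pandas still has to scan past the skipped rows each time:

import pandas as pd

seg_ids = {1000: 250, 50000: 300}                 # start line -> row count (example values)
for start, count in seg_ids.items():
    df = pd.read_csv('bigfile.txt', skiprows=start, nrows=count,
                     header=None, engine='c', memory_map=True)
    # ...conversions and averaging on df here...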