
Performance when reading a file line by line vs reading the whole file

Is there a noticeable difference (in theory) when reading a file line by line compared to reading the whole file in one go?

Reading the whole file does have a negative impact on the amount of memory used, but does it work faster?

I need to read a file and process each line. I don't know whether I should read one line at a time and process it, or read the whole file, process all of it, then write to output.

I've already set up the program to read line by line, and I want to know whether it is worth the effort to change it to read the whole file (not easy given my setup).

Thanks,

RushK asked Oct 10 '11 03:10



3 Answers

Reading the whole file will be slightly faster -- but not much!

But be careful: reading the whole file is not scalable, because you are limited by the memory available on the system. Once the file size exceeds the RAM available to your program, it will start using swap space, which is much slower. If the file size exceeds the available virtual memory, your program will crash.
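To make the trade-off concrete, here is a minimal Python sketch of the two approaches. The `process` function and the file paths are hypothetical stand-ins for whatever the real program does; the question does not say which language or processing step is involved.

```python
def process(line: str) -> str:
    # Hypothetical per-line work; stands in for the real processing step.
    return line.upper()


def line_by_line(src: str, dst: str) -> None:
    # Holds only the current line (plus the I/O buffers) in memory.
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(process(line))


def whole_file(src: str, dst: str) -> None:
    # Holds the entire input and the entire output in memory at once.
    with open(src, encoding="utf-8") as fin:
        lines = fin.read().splitlines(keepends=True)
    result = "".join(process(line) for line in lines)
    with open(dst, "w", encoding="utf-8") as fout:
        fout.write(result)
```

Both functions produce the same output file; the difference is that the memory used by `whole_file` grows with the size of the input, while `line_by_line` stays roughly constant.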

James Anderson answered Nov 15 '22 12:11



Like others, I believe doing bigger reads will improve the performance of your application somewhat, but don't expect miracles. I/O is already buffered at the OS layer, so you'll only gain by reducing the overhead of making too many read calls. Reading the whole file in one go is dangerous unless you know the maximum possible size of your input files. A more reasonable approach is to read the file in large blocks.
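As a rough illustration of block reads, the Python sketch below reads a file in fixed-size blocks and still hands out complete lines, carrying any partial line over to the next block. The function name `lines_from_blocks` and the 1 MiB default are assumptions made for the example, not something from the question. Note that most runtimes (Python included) already buffer line-by-line reads internally, so the gain from doing this by hand may be small.

```python
from typing import Iterator


def lines_from_blocks(path: str, block_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield complete lines while reading the file block_size bytes at a time."""
    leftover = b""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            pieces = (leftover + block).split(b"\n")
            # The last piece is an incomplete line (empty if the block ended on "\n").
            leftover = pieces.pop()
            for piece in pieces:
                yield piece + b"\n"
    if leftover:
        # Final line that has no trailing newline.
        yield leftover
```

Memory use is bounded by the block size plus the length of the longest line, regardless of how large the file is.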

If you want to improve performance even further, consider overlapping the I/O with the processing. Let's say you read the input file in blocks of 128MB. On your main thread you read the first 128MB block and then pass it on to a worker thread for processing. While the worker thread gets to work, the main thread reads the second 128MB block. From that point on, while the worker thread is processing block N, the main thread is reading block N+1 from disk.
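The scheme described above could be sketched in Python roughly as follows, with the main thread reading and a single worker thread processing. The names `overlapped_read` and `handle_block` are hypothetical, and the bounded queue is one possible way to hand blocks over; it is a sketch under those assumptions, not a tuned implementation.

```python
import queue
import threading
from typing import Callable, List


def overlapped_read(path: str, handle_block: Callable[[bytes], None],
                    block_size: int = 128 * 1024 * 1024) -> None:
    # maxsize=1: at most one block waits in the queue while another is processed.
    blocks = queue.Queue(maxsize=1)
    errors: List[BaseException] = []

    def worker() -> None:
        while True:
            block = blocks.get()
            if block is None:  # sentinel: the reader is done
                return
            if errors:
                continue  # after a failure, just drain so the reader never blocks
            try:
                handle_block(block)
            except Exception as exc:
                errors.append(exc)

    t = threading.Thread(target=worker)
    t.start()
    try:
        with open(path, "rb") as f:
            while not errors:
                block = f.read(block_size)  # reads block N+1 while N is processed
                if not block:
                    break
                blocks.put(block)  # waits here if the worker is still behind
    finally:
        blocks.put(None)
        t.join()
    if errors:
        raise errors[0]
```

With this layout up to three blocks can be alive at once (one being read, one queued, one being processed), so a 128MB block size implies a peak of roughly 384MB. Blocks are cut at arbitrary byte offsets, so line-oriented processing would still need to handle lines that straddle two blocks.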

Miguel answered Nov 15 '22 13:11



I think it depends on the needs of your application (like most things, I know). In Node.js, reading a 1 MB file with fs.readFile() is ~3-4x faster than using a readable stream or a line reader, as far as just reading the file goes. Streams may offer some additional performance if the file is very large and you are processing input on the fly. They may also be ideal if your application is already consuming a lot of memory, as a Node process has a ~1.5 GB memory limit on 64-bit systems. Processing chunks as they come in may also be more performant if the source of the data is slow relative to how fast the CPU can process it (archives on HDD or tape, network connections like TCP). As for why reading a file into memory in one go beats streaming it into memory, my guess is that the function call overhead of emitting data events and switching to the processing callback slows the process down.

user137717 answered Nov 15 '22 12:11
