
Erlang File I/O: Large binary files and gzip streaming

I have two questions about Erlang file I/O. What is the best way to do each of the following in Erlang?

  1. reading large binary files (many gigabytes) without copying the whole file into memory
  2. reading a gzipped binary file as a decompressed stream

Thanks!

Erlang Avatar asked Sep 27 '10 20:09




2 Answers

  1. See file:read/2 for sequential block access and file:pread/2,3 for random access.
  2. See the compressed option in file:open/2.
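
As a rough illustration of both points, here is a minimal sketch. The module name chunk_reader and its function names are hypothetical and chosen only for this example; the calls it relies on (file:open/2, file:read/2, file:pread/3, file:close/1) are the ones named above. It assumes the caller supplies a fold function that processes one block at a time, so that only one block needs to be held in memory.

```erlang
-module(chunk_reader).
-export([fold_chunks/4, fold_gzip/4, read_at/3]).

%% Fold Fun over a file in blocks of at most ChunkSize bytes.
%% Fun(Block, Acc) is called once per block, in file order.
fold_chunks(Path, ChunkSize, Fun, Acc0) ->
    fold_file(Path, [read, raw, binary], ChunkSize, Fun, Acc0).

%% Same as fold_chunks/4, but the 'compressed' option makes
%% file:read/2 return decompressed data from a gzip file.
fold_gzip(Path, ChunkSize, Fun, Acc0) ->
    fold_file(Path, [read, binary, compressed], ChunkSize, Fun, Acc0).

%% Random access: read Length bytes starting at byte Offset.
%% Returns {ok, Data}, eof, or {error, Reason}.
read_at(Path, Offset, Length) ->
    case file:open(Path, [read, raw, binary]) of
        {ok, Fd} ->
            try file:pread(Fd, Offset, Length)
            after file:close(Fd)
            end;
        {error, _} = Error ->
            Error
    end.

%% Open the file, run the read loop, and always close the descriptor.
fold_file(Path, Modes, ChunkSize, Fun, Acc0) ->
    case file:open(Path, Modes) of
        {ok, Fd} ->
            try loop(Fd, ChunkSize, Fun, Acc0)
            after file:close(Fd)
            end;
        {error, _} = Error ->
            Error
    end.

%% Read one block at a time until eof.
loop(Fd, ChunkSize, Fun, Acc) ->
    case file:read(Fd, ChunkSize) of
        {ok, Data} -> loop(Fd, ChunkSize, Fun, Fun(Data, Acc));
        eof -> {ok, Acc};
        {error, _} = Error -> Error
    end.
```

For example, chunk_reader:fold_chunks(Path, 65536, fun(B, N) -> N + byte_size(B) end, 0) would count the bytes in a file without ever holding more than one 64 KB block. The block size is an arbitrary choice for the example, not a recommendation.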
Hynek -Pichi- Vychodil Avatar answered Sep 20 '22 02:09



In my experience, file:read/2 on its own is very slow when called frequently for small amounts of data, even with the read_ahead and raw options. You need to implement a binary buffer on top of it. If that is what is meant by block-oriented processing, then I agree.

I'm talking about runtimes of a few hours (with file:read/2 only) vs. 2 minutes (with buffering implemented in pure Erlang).

Here are my measurements for reading a few tens of bytes at a time:

%% Bufsize vs. runtime [ns]
%% 50       169369703
%% 100      118288832
%% 1000      70187233
%% 10000     64615506
%% 100000    65087411
%% 1000000   64747497

In this example, performance does not improve noticeably beyond a buffer size of about 10 KB, because by then the relative overhead of the file:read/2 calls is small enough.
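
The buffering described above could look roughly like the following. This is only a sketch under my own assumptions, not the code used for the measurements: the module name buffered_reader, the reader record, and the function names are hypothetical. The idea is to call file:read/2 with a large block size and serve small reads from an in-memory binary.

```erlang
-module(buffered_reader).
-export([open/2, read/2, close/1]).

%% fd: raw file descriptor, bufsize: block size requested from
%% file:read/2, buf: bytes already read but not yet consumed.
-record(reader, {fd, bufsize, buf = <<>>}).

%% Open Path for buffered reading.
open(Path, BufSize) when is_integer(BufSize), BufSize > 0 ->
    case file:open(Path, [read, raw, binary]) of
        {ok, Fd} -> {ok, #reader{fd = Fd, bufsize = BufSize}};
        {error, _} = Error -> Error
    end.

%% Return {ok, Bytes, Reader1} with up to N bytes, or eof when no
%% data is left. Fewer than N bytes are returned only at end of file.
read(#reader{buf = Buf} = R, N)
  when is_integer(N), N >= 0, byte_size(Buf) >= N ->
    %% Fast path: the request is served from memory, no file call.
    <<Bytes:N/binary, Rest/binary>> = Buf,
    {ok, Bytes, R#reader{buf = Rest}};
read(#reader{fd = Fd, bufsize = BufSize, buf = Buf} = R, N)
  when is_integer(N), N >= 0 ->
    %% Slow path: refill with at least one full block, then retry.
    case file:read(Fd, max(BufSize, N - byte_size(Buf))) of
        {ok, Data} ->
            read(R#reader{buf = <<Buf/binary, Data/binary>>}, N);
        eof when Buf =:= <<>> ->
            eof;
        eof ->
            {ok, Buf, R#reader{buf = <<>>}};
        {error, _} = Error ->
            Error
    end.

%% Close the underlying file.
close(#reader{fd = Fd}) ->
    file:close(Fd).
```

The reader state is threaded through each call, so a typical loop would be {ok, R0} = buffered_reader:open(Path, 10000) followed by repeated {ok, Bytes, R1} = buffered_reader:read(R0, 20) until eof is returned. The 10000-byte buffer mirrors the point in the table where the runtime levels off; the right value for another workload would need to be measured.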

tzp Avatar answered Sep 19 '22 02:09
