I have data stored on disk in files that are far too big to store in main memory.
I want to stream this data from the disk into a data processing pipeline via iconv, like this:
zcat myfile | iconv -f L1 -t UTF-8 | # rest of the pipeline goes here
Unfortunately, I'm seeing iconv buffer the entire file in memory until the input is exhausted before outputting any data. This means that I'm using up all of my main memory on a blocking operation in a pipeline whose memory footprint is otherwise minimal.
I've tried calling iconv like this:
stdbuf -o 0 iconv -f L1 -t UTF-8
But it looks like iconv is managing the buffering internally; it's nothing to do with the Linux pipe buffer.
I'm seeing this with the binary that's packaged with glibc 2.6 and 2.7 in Arch Linux, and I've replicated it with glibc 2.5 in Debian.
Is there some way around this? I know that streaming character conversions are not simple, but I'd have thought that such a commonly used Unix tool would work on streams; it's not at all rare to work with files that won't fit in main memory. Would I have to roll my own binary linked against libiconv?
Consider the iconv(3) call together with iconv_open(3): hook a simple C routine up to those two calls, reading from stdin and writing to stdout. Have a read of this example:
http://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html
This example is explicitly meant to handle what you are describing: it avoids "stateful" waits for data.
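For illustration, here is a minimal sketch of such a routine, modelled loosely on the glibc example linked above. The chunk size, the hard-coded ISO-8859-1 to UTF-8 pair, and the simplistic error handling are my assumptions, not part of the original answer; the point is only that each chunk is converted and written out before the next one is read, so memory use stays bounded.

/* stream_iconv.c: convert Latin-1 on stdin to UTF-8 on stdout, chunk by chunk */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <iconv.h>

#define BUFSZ 4096

int main(void)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t) -1) {
        perror("iconv_open");
        return 1;
    }

    char inbuf[BUFSZ];
    char outbuf[BUFSZ * 4];   /* UTF-8 output may be larger than the Latin-1 input */
    size_t leftover = 0;      /* bytes of an incomplete sequence carried to the next chunk */

    for (;;) {
        size_t nread = fread(inbuf + leftover, 1, sizeof inbuf - leftover, stdin);
        if (nread == 0 && leftover == 0)
            break;            /* EOF and nothing pending */

        char *inptr = inbuf;
        size_t inleft = leftover + nread;
        char *outptr = outbuf;
        size_t outleft = sizeof outbuf;

        if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t) -1
            && errno != EINVAL) {   /* EINVAL: incomplete sequence at end of this chunk */
            perror("iconv");
            return 1;
        }

        /* write this chunk's output immediately instead of accumulating it */
        fwrite(outbuf, 1, sizeof outbuf - outleft, stdout);

        /* keep any trailing incomplete sequence for the next iteration */
        leftover = inleft;
        memmove(inbuf, inptr, leftover);

        if (nread == 0)
            break;            /* EOF reached with an unconvertible tail; stop */
    }

    iconv_close(cd);
    return 0;
}

Compile with something like cc -o stream_iconv stream_iconv.c (add -liconv on systems where iconv is not in libc) and drop it into the pipeline in place of the iconv invocation.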