 

How do I perform a streaming character conversion? [closed]

I have data stored on disk in files that are far too big to store in main memory.

I want to stream this data from the disk into a data processing pipeline via iconv, like this:

zcat myfile | iconv -f L1 -t UTF-8 | # rest of the pipeline goes here

Unfortunately, I'm seeing iconv buffer the entire file in memory until it's exhausted before outputting any data. This means that I'm using up all of my main memory on a blocking operation in a pipeline whose memory footprint is otherwise minimal.

I've tried calling iconv like this:

stdbuf -o 0 iconv -f L1 -t UTF-8

But it looks like iconv is managing the buffering internally itself - it's nothing to do with the Linux pipe buffer.

I'm seeing this with the binary that's packaged with glibc 2.6 and 2.7 on Arch Linux, and I've reproduced it with glibc 2.5 on Debian.

Is there some way around this? I know that streaming character conversions are not simple, but I'd have thought that such a commonly used unix tool would work in streams; it's not at all rare to work with files that won't fit in main memory. Would I have to roll my own binary linked to libiconv?

Asked Nov 03 '22 by Cera

1 Answer

Consider using the iconv(3) call together with iconv_open(3) -- hook a simple C routine onto those two calls: read from stdin, write to stdout. Have a read of this example:

http://www.gnu.org/software/libc/manual/html_node/iconv-Examples.html

This example is explicitly meant to handle what you are describing -- it avoids "stateful" waits for data.

Answered Nov 09 '22 by jim mcnamara