Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Multi stream iterators c++

The purpose of my program is to open a text file of m lines of the same length n, read the file column by column and print each column.

For example, for this text file

abcd
efgh 
jklm

I would like to print

a e j
b f k
c g l
d h m

As one line length can be 200 000 000 and the column length can be more than 10 000, I can't open all the file in memory in a matrix.

Theoretically, I would like to have a program that use O(m) in space and O(m*n) in time.

At the beginning, I had to think about these solutions:

  • if I see all the file for each column the complexity is O(m*n²),
  • If I use seekg and an array of positions and jump from position to position, the complexity is O(mnlog(n)).

Last point, for some server problems, I need to use only the STL.

My last idea is to create an array of iterators of a file and initialized these iterators at the beginning of each line. After that, to see the next column, I only need to increase each iterator. This is my code

ifstream str2;
str2.open ("Input/test.data", ifstream::in);

int nbline = 3;
int nbcolumn = 4;
int x = 0;

istreambuf_iterator<char> istart (str2);
istreambuf_iterator<char> iend ;

istreambuf_iterator<char>* iarray;
iarray = new istreambuf_iterator<char>[nbline];


while (istart != iend){
    if (x % nbcolumn == 0){
        iarray[x/nbcolumn] = istart;
    }
    istart++;
    x++;
}

for (int j = 0; j<nbcolumn;j++){
    for (int i = 0; i<nbline;i++){
        cout  << *iarray[i] << "\t";
        iarray[i]++;
    }
    cout << endl;
}

Sadly, it does not work, and I have this thing as output

a       e       f       
�       �       �       
�       �       �       
�       �       �       

I think the problem is that the array of iterators iarray are not independent of istart, how I can do that?

like image 404
B. Hel Avatar asked Oct 22 '18 10:10

B. Hel


2 Answers

You can break the task into chunks, then process each chunk before moving on to the next.

You'd need a buffer for each line (the larger this is the better the performance will be) and the seek position for that row. You may also need to make an initial pass thru the file to get the correct offsets for each row.

Read B bytes into the buffer for each row (using tellg to save the position in each row), then loop over those and generate your output. Go back and read the next B bytes from each row (using seekg to set the file position beforehand, and tellg to remember it afterwards) and generate the output. Repeat until you're done, being careful with the last chunk (or with small inputs) to not go past the end of the line.

Using your example, you have 3 rows to keep track of. Using a B size of 2, you'd read in ab, ef, and jk into your 3 buffers. Looping over those you'd output aej and bfk. Go back and read the next chunks: cd, gh, and lm. This gives cgl and dhm as output.

like image 111
1201ProgramAlarm Avatar answered Oct 21 '22 08:10

1201ProgramAlarm


I would do this like this:

  1. Open the source file.
  2. Measure line size
  3. Measure line count (file size / (line size + size of EOL)). Note EOL can be 2 bytes.
  4. calculate result file size. Open result file and force it to have it desired size, so later you can seek to any part of the file.
  5. peak some size of square which is memory manageable. For example 1024x1024
  6. Now you should load square part of the matrix. 1024 elements for rows of 1024 constitutive rows.
  7. Transpose square
  8. Write it to destination file by seeking to proper column of each part of row you are writing. (you can reduce memory consumption in previous point by transposing one column and then write it as a row, instead transposing whole square at once)
  9. Iterate square over whole file matrix

IMO you can't do it better. Most critical will be how to select size of the square. Big power of 2 is recommended.

like image 5
Marek R Avatar answered Oct 21 '22 06:10

Marek R