rearranging data from multiple data files

I have 40,000 data files. Each file contains 1445 lines of floating-point numbers in a single column. Now I need to rearrange the data in a different order.

The first number from each data file needs to be collected and dumped into a new file (let's say abc1.dat). This particular file (abc1.dat) will contain 40,000 numbers.

The second number from each data file needs to be extracted and dumped into another new file (let's say abc2.dat). This new file will also contain 40,000 numbers, but only the second number from each data file.

At the end of this operation I should have 1445 files (abc1.dat, abc2.dat, ..., abc1445.dat), each containing 40,000 numbers. In effect this is a transpose: line j of input file i becomes line i of abcj.dat.

How can this be achieved? (Using Ubuntu Linux 11.10, 64-bit.)

I'd appreciate any help. Thanks in advance.

Vijay


2 Answers

40,000 * 1445 values is not that many; it should fit into memory. So, in Perl (untested):

#!/usr/bin/perl
use strict;
use warnings;

my @nums;
# Reading: slurp every input file into a two-dimensional array.
for my $file (0 .. 39_999) {
    open my $IN, '<', "file-$file" or die $!;
    while (<$IN>) {
        chomp;
        $nums[$file][$.-1] = $_;
    }
    close $IN;
}

# Writing: one output file per line position (abc1.dat .. abc1445.dat).
for my $line (0 .. 1444) {
    open my $OUT, '>', 'abc' . ($line + 1) . '.dat' or die $!;
    for my $file (0 .. 39_999) {
        print $OUT $nums[$file][$line], "\n";
    }
    close $OUT;
}
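
One caveat: 40,000 * 1445 is roughly 58 million Perl scalars, and each scalar carries some per-value overhead on top of the string itself, so this all-in-memory approach can easily need several GB of RAM in practice.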
choroba


If you can open all 1445 output files at once, this is pretty easy in Python:

paths = ['abc{}.dat'.format(i) for i in range(1, 1446)]
files = [open(path, 'w') for path in paths]
for inpath in ('input{}.dat'.format(i) for i in range(40000)):
    with open(inpath, 'r') as infile:
        for linenum, line in enumerate(infile):
            files[linenum].write(line)
for f in files:
    f.close()
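
Note that this needs 1445-odd file descriptors open at once, and the default per-process limit is often 1024 on Ubuntu, so you may have to raise it first (for example with ulimit -n 2048 in the shell you run the script from).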

If you can fit everything into memory (it sounds like this should be about 0.5-5.0 GB of data, which may be fine for a 64-bit machine with 8GB of RAM…), you can do it this way:

data = [[] for _ in range(1445)]
for inpath in ('input{}.dat'.format(i) for i in range(40000)):
    with open(inpath, 'r') as infile:
        for linenum, line in enumerate(infile):
            data[linenum].append(line)
for i, contents in enumerate(data):
    with open('abc{}.dat'.format(i + 1), 'w') as outfile:
        outfile.write(''.join(contents))

If neither of these is appropriate, you may want some kind of hybrid. For example, if you can keep 250 output files open at once, do 6 batches, skipping over batchnum * 250 lines in each input file on each pass; a sketch of that idea follows.
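
A minimal sketch of that batched hybrid (assuming the same input{}.dat and abc{}.dat naming as above) might look like this:

import itertools

NUM_LINES = 1445    # lines per input file = number of output files
NUM_FILES = 40000   # number of input files
BATCH = 250         # output files we can keep open at once

for start in range(0, NUM_LINES, BATCH):
    stop = min(start + BATCH, NUM_LINES)
    # one output file per line position handled in this batch
    outfiles = [open('abc{}.dat'.format(n), 'w') for n in range(start + 1, stop + 1)]
    for i in range(NUM_FILES):
        with open('input{}.dat'.format(i), 'r') as infile:
            # skip the lines already handled by earlier batches,
            # then copy this batch's lines to their output files
            for linenum, line in enumerate(itertools.islice(infile, start, stop)):
                outfiles[linenum].write(line)
    for f in outfiles:
        f.close()

The downside is that every input file gets re-read from the top on every batch, which is what the seek/tell refinement below avoids.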

If re-reading makes the batch solution too slow, stash infile.tell() at the end of each batch for every input file, and when you come back to that file in the next batch, use infile.seek() to jump straight back there. Something like this:

seekpoints = [0 for _ in range(40000)]
for batch in range(6):
    start = batch * 250
    stop = min(start + 250, 1445)
    paths = ['abc{}.dat'.format(i) for i in range(start + 1, stop + 1)]
    files = [open(path, 'w') for path in paths]
    for infilenum, inpath in enumerate('input{}.dat'.format(i) for i in range(40000)):
        with open(inpath, 'r') as infile:
            # jump back to where the previous batch left off in this file
            infile.seek(seekpoints[infilenum])
            # read only this batch's lines; readline() is used instead of
            # iterating the file so that tell() stays accurate
            for linenum in range(stop - start):
                files[linenum].write(infile.readline())
            seekpoints[infilenum] = infile.tell()
    for f in files:
        f.close()
abarnert