I have 40,000 data files. Each file contains 1445 lines of floating-point numbers in a single column. Now I need to rearrange the data in a different order.
The first number from each data file needs to be collected and dumped into a new file (let's say abc1.dat). This particular file (abc1.dat) will contain 40,000 numbers.
Then the second number from each data file needs to be extracted and dumped into another new file (let's say abc2.dat). This new file will also contain 40,000 numbers, but only the second number from each data file.
At the end of this operation I am supposed to have 1445 files (abc1.dat, abc2.dat, ..., abc1445.dat), each containing 40,000 numbers.
How can this be achieved? (Using Linux Ubuntu 11.10, 64-bit.)
Any help is appreciated. Thanks in advance.
40,000 * 1445 values is not that many; it should fit into memory. So, in Perl (untested):
#!/usr/bin/perl
use strict;
use warnings;

my @nums;

# Reading: slurp every input file into a 2-D array,
# indexed as $nums[$file][$line].
for my $file (0 .. 39_999) {
    open my $IN, '<', "file-$file" or die $!;
    while (<$IN>) {
        chomp;
        $nums[$file][$. - 1] = $_;
    }
}

# Writing: one output file per line number, containing
# that line from every input file.
for my $line (0 .. 1444) {
    open my $OUT, '>', "abc$line.dat" or die $!;
    for my $file (0 .. 39_999) {
        print $OUT $nums[$file][$line], "\n";
    }
}
If you can open all 1445 output files at once, this is pretty easy:
paths = ['abc{}.dat'.format(i) for i in range(1445)]
files = [open(path, 'w') for path in paths]
for inpath in ('input{}.dat'.format(i) for i in range(40000)):
    with open(inpath, 'r') as infile:
        for linenum, line in enumerate(infile):
            files[linenum].write(line)
for f in files:
    f.close()
If you can fit everything into memory (it sounds like this should be about 0.5-5.0 GB of data, which may be fine for a 64-bit machine with 8GB of RAM…), you can do it this way:
data = [[] for _ in range(1445)]
for inpath in ('input{}.dat'.format(i) for i in range(40000)):
    with open(inpath, 'r') as infile:
        for linenum, line in enumerate(infile):
            data[linenum].append(line)
for i, contents in enumerate(data):
    with open('abc{}.dat'.format(i), 'w') as outfile:
        outfile.write(''.join(contents))
If neither of these is appropriate, you may want some kind of hybrid. For example, if you can do 250 files at once, do 6 batches, and skip over batchnum * 250 lines in each infile.
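A minimal sketch of that batched approach, keeping the hypothetical input{}.dat / abc{}.dat naming from the snippets above, might look like this:
for batch in range(6):
    start = batch * 250
    stop = min(start + 250, 1445)
    paths = ['abc{}.dat'.format(i) for i in range(start, stop)]
    files = [open(path, 'w') for path in paths]
    for inpath in ('input{}.dat'.format(i) for i in range(40000)):
        with open(inpath, 'r') as infile:
            # Skip the lines that earlier batches already handled.
            for _ in range(start):
                infile.readline()
            # Copy this batch's lines, one per output file.
            for outfile in files:
                outfile.write(infile.readline())
    for f in files:
        f.close()
Note that this rereads the first batchnum * 250 lines of every input file on each pass, which is exactly what the tell()/seek() refinement below avoids.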
If the batch solution is too slow, at the end of each batch in each file, stash infile.tell(), and when you come back to the file again, use infile.seek() to get back there. Something like this:
seekpoints = [0 for _ in range(40000)]
for batch in range(6):
    start = batch * 250
    stop = min(start + 250, 1445)
    paths = ['abc{}.dat'.format(i) for i in range(start, stop)]
    files = [open(path, 'w') for path in paths]
    for infilenum, inpath in enumerate('input{}.dat'.format(i) for i in range(40000)):
        with open(inpath, 'r') as infile:
            infile.seek(seekpoints[infilenum])
            # Write one line to each of this batch's output files; using
            # readline() (rather than iterating the file) keeps tell() accurate.
            for outfile in files:
                outfile.write(infile.readline())
            # Remember where we stopped in this input file for the next batch.
            seekpoints[infilenum] = infile.tell()
    for f in files:
        f.close()