I have a large gzipped file (5000 columns × 1M lines) consisting of 0's and 1's:
0 1 1 0 0 0 1 1 1....(×5000)
0 0 0 1 0 1 1 0 0
....(×1M)
I want to transpose it, but using numpy or similar methods loads the whole table into RAM, and I only have 6 GB at my disposal.
For this reason, I wanted to use a method that writes each transposed line to an open file instead of storing it in RAM. I came up with the following code:
import gzip

with open("output.txt", "w") as out:
    with gzip.open("file.txt", "rt") as file:
        number_of_columns = len(file.readline().split())
        # iterate over number of columns (~5000)
        for column in range(number_of_columns):
            # in each iteration, go back to the top line to start again
            file.seek(0)
            # initiate list storing the ith column's elements that will form the transposed column
            transposed_column = []
            # iterate over lines (~1M), storing the ith element in the list
            for line in file:
                transposed_column.append(line.split()[column])
            # write the transposed column as a line to the output file
            out.write(" ".join(transposed_column) + "\n")
However, this is very slow. Can anybody suggest another solution? Is there any way to append a list as a column (instead of as a line) to an existing open file? Pseudocode:
with open("output.txt", w) as out:
with gzip.open("file.txt", rt) as file:
for line in file:
transposed_line = line.transpose()
out.write(transposed_line, as.column)
UPDATE
The answer of user7813790 led me to this code:
import numpy as np
import random

# create example array and write to file
with open("array.txt", "w") as out:
    num_columns = 8
    num_lines = 24
    for i in range(num_lines):
        line = []
        for column in range(num_columns):
            line.append(str(random.choice([0, 1])))
        out.write(" ".join(line) + "\n")

# iterate over chunks of dimensions num_columns×num_columns, transpose them, and append to file
with open("array.txt", "r") as array:
    with open("transposed_array.txt", "w") as out:
        for chunk_start in range(0, num_lines, num_columns):
            # get chunk and transpose
            chunk = np.genfromtxt(array, max_rows=num_columns, dtype=int).T
            # write out chunk
            out.seek(chunk_start + num_columns, 0)
            np.savetxt(out, chunk, fmt="%s", delimiter=' ', newline='\n')
It takes a matrix like:
0 0 0 1 1 0 0 0
0 1 1 0 1 1 0 1
0 1 1 0 1 1 0 0
1 0 0 0 0 1 0 1
1 1 0 0 0 1 0 1
0 0 1 1 0 0 1 0
0 0 1 1 1 1 1 0
1 1 1 1 1 0 1 1
0 1 1 0 1 1 1 0
1 1 0 1 1 0 0 0
1 1 0 1 1 0 1 1
1 0 0 1 1 0 1 0
0 1 0 1 0 1 0 0
0 0 1 0 0 1 0 0
1 1 1 0 0 1 1 1
1 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1
1 1 1 1 0 1 0 1
1 0 1 1 1 0 0 0
0 1 0 1 1 1 1 1
1 1 1 1 1 1 0 1
0 0 1 1 0 1 1 1
0 1 1 0 1 1 0 1
0 0 1 0 1 1 0 1
and iterates over 2D chunks with both dimensions equal to the number of columns (8 in this case), transposing them and appending them to an output file.
1st chunk transposed:
[[0 0 0 1 1 0 0 1]
[0 1 1 0 1 0 0 1]
[0 1 1 0 0 1 1 1]
[1 0 0 0 0 1 1 1]
[1 1 1 0 0 0 1 1]
[0 1 1 1 1 0 1 0]
[0 0 0 0 0 1 1 1]
[0 1 0 1 1 0 0 1]]
2nd chunk transposed:
[[0 1 1 1 0 0 1 1]
[1 1 1 0 1 0 1 0]
[1 0 0 0 0 1 1 0]
[0 1 1 1 1 0 0 0]
[1 1 1 1 0 0 0 0]
[1 0 0 0 1 1 1 0]
[1 0 1 1 0 0 1 0]
[0 0 1 0 0 0 1 0]]
etc.
I am trying to append each new chunk to the output file as columns, using out.seek(). As far as I understand, seek() takes as its first argument the offset from the beginning of the file (i.e. the column), and 0 as the second argument means to start from the first row again. So I would have guessed that the following line would do the trick:
out.seek(chunk_start+num_columns, 0)
But instead, it does not continue at that offset along the following rows. Also, it adds n = num_columns spaces at the beginning of the first row. Output:
0 0 0 1 0 1 1 1 0 1 1 0 1 0 0 0
1 1 0 1 1 0 1 0
1 1 1 0 1 1 1 1
1 1 1 1 1 1 0 0
1 0 1 1 1 0 1 1
1 1 0 1 1 1 1 1
1 0 0 1 0 1 0 0
1 1 0 1 1 1 1 1
Any insight on how to use seek() properly for this task? i.e. to generate this:
0 0 0 1 1 0 0 1 0 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0
0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 0
0 1 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 1 1 1 1
1 0 0 0 0 1 1 1 0 1 1 1 1 0 0 0 1 1 1 1 1 1 0 0
1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 1 0 1 1
0 1 1 1 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0
0 1 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 1 0 1 1 1 1 1
Please note that this is just a dummy test matrix; the actual matrix is 5008 columns × >1M lines.
UPDATE 2
I have figured out how to make this work; it can also use chunks of any dimensions.
import numpy as np
import random

# create example array and write to file
num_columns = 4
num_lines = 8
with open("array.txt", "w") as out:
    for i in range(num_lines):
        line = []
        for column in range(num_columns):
            line.append(str(random.choice([0, 1])))
        out.write(" ".join(line) + "\n")

# iterate over chunks of dimensions num_columns×chunk_length, transpose them, and append to file
chunk_length = 7
with open("array.txt", "r") as array:
    with open("transposed_array.txt", "w") as out:
        for chunk_start in range(0, num_lines, chunk_length):
            # get chunk and transpose
            chunk = np.genfromtxt(array, max_rows=chunk_length, dtype=str).T
            # write out chunk
            empty_line = 2 * (num_lines - (chunk_length + chunk_start))
            for i, line in enumerate(chunk):
                new_pos = 2 * num_lines * i + 2 * chunk_start
                out.seek(new_pos)
                out.write(f"{' '.join(line)}{' ' * empty_line}\n")
In this case, it takes an array like this:
1 1 0 1
0 0 1 0
0 1 1 0
1 1 1 0
0 0 0 1
1 1 0 0
0 1 1 0
0 1 1 1
and transposes it using chunks of 4 columns × 7 lines, so the 1st chunk would be
1 0 0 1 0 1 0
1 0 1 1 0 1 1
0 1 1 1 0 0 1
1 0 0 0 1 0 0
it is written to file, deleted from memory, and then the 2nd chunk is
0
1
1
1
and again it is appended to file, so the final result is:
1 0 0 1 0 1 0 0
1 0 1 1 0 1 1 1
0 1 1 1 0 0 1 1
1 0 0 0 1 0 0 1
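To spell out the byte arithmetic behind new_pos above (an illustrative check, assuming every cell occupies exactly two bytes, i.e. the value plus a following space or the final newline): each output row is 2 * num_lines bytes long, so output row i begins at byte 2 * num_lines * i, and the values contributed by a chunk starting at input line chunk_start begin 2 * chunk_start bytes into that row.

num_lines = 8                 # input lines = values per output row
row_bytes = 2 * num_lines     # 16 bytes per output row in this example

# first chunk (chunk_start = 0): output row 2 starts at byte 32
print(2 * num_lines * 2 + 2 * 0)    # 32
# second chunk (chunk_start = 7): its single value for row 2 lands at byte 46
print(2 * num_lines * 2 + 2 * 7)    # 46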
In your working but slow solution, you are reading the input file 5,000 times -- that won't be fast, but the only easy way to minimize the reads is to read it all in memory.
You could try some compromise where you read, say, fifty columns at a time into memory (~50MB), and write them into the file as rows. This way you would read the file "only" 100 times. Try a few different combinations to get the performance/memory compromise you're satisfied with.
You would do this over three nested loops:

1. Loop over the groups of columns (about 100 iterations of 50 columns each).
2. Loop over the lines of the input file.
3. Loop over the columns within the current group.

In the inner-most loop you collect the column values as rows of a two-dimensional array, one row per column of the current group. In the outer-most loop, you clear the array before entering the inner loops, and print it out to the file as rows afterwards. For each iteration of loop 1 you will have written fifty rows of a million columns.
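For instance, a minimal sketch of this approach (the file names and the group size of 50 are placeholders; memory use grows with the group size, so tune it to what fits):

import gzip

GROUP_SIZE = 50  # number of input columns (= output rows) handled per pass

# count the columns once
with gzip.open("file.txt", "rt") as f:
    num_columns = len(f.readline().split())

with open("output.txt", "w") as out:
    for group_start in range(0, num_columns, GROUP_SIZE):        # loop 1: column groups
        group = range(group_start, min(group_start + GROUP_SIZE, num_columns))
        rows = [[] for _ in group]                               # one output row per column in the group
        with gzip.open("file.txt", "rt") as f:
            for line in f:                                       # loop 2: lines of the input file
                values = line.split()
                for i, col in enumerate(group):                  # loop 3: columns in the group
                    rows[i].append(values[col])
        for row in rows:
            out.write(" ".join(row) + "\n")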
You can't really insert in the middle of a normal file without loading the whole target file into memory -- you need to shift the trailing bytes forward manually. Since you know your exact file size, however, you could pre-allocate it and always seek to the position when writing each byte; probably not very fast to do 5 billion seeks, either... If your ones and zeroes are fairly evenly distributed, you could initialize the file with all-zeroes, and then only write ones (or the other way around) to halve the number of seeks.
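A minimal sketch of that pre-allocation idea (the file name is a placeholder; it assumes two bytes per cell, i.e. the value plus a space or newline, and Unix line endings; with the full 5000 × 1M matrix this creates a roughly 10 GB file):

num_rows = 5000           # rows of the transposed output (input columns)
num_cols = 1000000        # columns of the transposed output (input lines)
row_bytes = 2 * num_cols  # each cell is 2 bytes; the last one ends in "\n" instead of " "

# pre-allocate the output filled with zeros, one row at a time
with open("transposed.txt", "wb") as out:
    zero_row = b"0 " * (num_cols - 1) + b"0\n"
    for _ in range(num_rows):
        out.write(zero_row)

# later, overwrite only the cells that should be 1, e.g. row 2, column 10:
with open("transposed.txt", "r+b") as out:
    out.seek(2 * row_bytes + 2 * 10)
    out.write(b"1")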
Edit: Added details on how chunking could be implemented.
If your numbers are all 0 or 1, then every line has the same length (in bytes), so you can use file.seek to move around in the file (rather than reading in and ignoring data). However, this might not be so efficient with a gzipped input file. Since you are writing an uncompressed file, you can also use seek to jump around in the output.
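For example, with space-separated 0/1 values and Unix newlines, every cell occupies exactly two bytes, so the byte offset of row r, column c can be computed directly (cell_offset is just an illustrative helper, not part of any library):

def cell_offset(r, c, num_columns):
    # byte offset of row r, column c in a fixed-width 0/1 text file
    line_bytes = 2 * num_columns   # each value is "x " except the last, which is "x\n"
    return r * line_bytes + 2 * c

print(cell_offset(3, 10, 5000))    # 30020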
A more efficient way to transpose an array is to read in a chunk which fits into RAM (e.g. 1000×1000), use numpy.transpose to transpose the chunk, then write the chunk into its location in the transposed array. With your array, which is 5000 columns but 1M rows, it will probably be easiest to use 5000×5000 chunks, i.e. read 5000 complete rows of the input matrix at a time. This avoids having to seek around in the compressed input file. Then you have to write this chunk into the output file, leaving blank space for the columns which come from the subsequent rows of the input.
More detail of how to write chunks to the 5000×N output file (as requested in comment), where N is the number of input lines; assuming two bytes per value, each output row is 2 * N bytes long.

To write the first 5000×5000 chunk: for each row i of the transposed chunk (0 ≤ i < 5000), seek to byte i * (2 * N) of the output file and write that row's 5000 values, padding the rest of the output row with spaces and a final newline so that later chunks have somewhere to seek into.

To write the second chunk: for each row i, seek to byte i * (2 * N) + 2 * 5000, i.e. just past the values written by the first chunk in that row, and write the next 5000 values. Repeat for the remaining chunks, shifting the within-row offset by 2 * 5000 each time.
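A minimal sketch of this chunked transpose, using a small uncompressed test input so it stays runnable (the dimensions and file names are placeholders; for the real data you would read the input through gzip.open, use 5000-row chunks, and keep the two-bytes-per-cell assumption in mind):

import numpy as np

num_rows = 24       # lines in the input (1M in the real case)
num_cols = 8        # columns in the input (5000 in the real case)
chunk_rows = 8      # input lines read per chunk (5000 in the real case)

out_row_bytes = 2 * num_rows    # each output row: num_rows cells of 2 bytes each

with open("array.txt", "r") as src, open("transposed_array.txt", "wb") as out:
    # pre-fill the output with zeros so every later seek target already exists
    zero_row = b"0 " * (num_rows - 1) + b"0\n"
    for _ in range(num_cols):
        out.write(zero_row)

    for chunk_start in range(0, num_rows, chunk_rows):
        # read the next block of input lines and transpose it
        chunk = np.atleast_2d(np.genfromtxt(src, max_rows=chunk_rows, dtype=str)).T
        for i, row in enumerate(chunk):
            # jump to output row i, to the cell corresponding to input line chunk_start
            out.seek(i * out_row_bytes + 2 * chunk_start)
            # write only the values; the pre-filled separators and newlines stay in place
            out.write(" ".join(row).encode())

The pre-filled zeros also cover the "blank space" mentioned above, so no padding needs to be written per chunk; only the actual values are overwritten in place.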