I want to read a file, 4 lines by 4 (it's a fastq file, with DNA sequences).
When I read the file one line by one or two by two, there's no issues, but when I read 3 or 4 lines at once, my code crashes (kernel appeared to have died on jupyter notebook). (Uncommenting the last part, or any 3 out of the 4 getline()
.
I tried with a double array of char (char**) to store the lines, with the same issue.
Any idea what can be the cause ?
Using Python 3.7.3, Cython 0.29, all other libraries updated. File being read is about 1.3GB, machine has 8GB, ubuntu 16.04. Code adapted from https://gist.github.com/pydemo/0b85bd5d1c017f6873422e02aeb9618a
%%cython
from libc.stdio cimport FILE, fopen, fclose, getline
def fastq_reader(early_stop=10):
cdef const char* fname = b'/path/to/file'
cdef FILE* cfile
cfile = fopen(fname, "rb")
cdef:
char * line_0 = NULL
char * line_1 = NULL
char * line_2 = NULL
char * line_3 = NULL
size_t seed = 0
ssize_t length_line
unsigned long long line_nb = 0
while True:
length_line = getline(&line_0, &seed, cfile)
if length_line < 0: break
length_line = getline(&line_1, &seed, cfile)
if length_line < 0: break
# length_line = getline(&line_2, &seed, cfile)
# if length_line < 0: break
# length_line = getline(&line_3, &seed, cfile)
# if length_line < 0: break
line_nb += 4
if line_nb > early_stop:
break
fclose(cfile)
return line_nb
fastq_reader(early_stop=20000)
However, does not reads more than one line, even if n exceeds the length of the line. It will be efficient when reading a large file because instead of fetching all the data in one go, it fetches line by line. readline () returns the next line of the file which contains a newline character in the end.
It’s possible to read a file using loops as well. Using the same wise_owl.txt file that we made in the previous section, we can read every line in the file using a while loop. A wise old owl lived in an oak.
Use readline () if you need to read all the lines at once. It’s possible to read a file using loops as well. Using the same wise_owl.txt file that we made in the previous section, we can read every line in the file using a while loop.
readlines () is used to read all the lines at a single go and then return them as each line a string element in a list. This function can be used for small files, as it reads the whole file content to the memory, then split it into separate lines.
The underlying problem was my misunderstanding of getline()
getline() c reference
To store lines in different variables, an associated n
is necessary for each line pointer *lineptr
.
If *lineptr is set to NULL and *n is set 0 before the call, then getline() will allocate a buffer for storing the line.
Alternatively, before calling getline(), *lineptr can contain a pointer to a malloc(3)-allocated buffer *n bytes in size. If the buffer is not large enough to hold the line, getline() resizes it with realloc(3), updating *lineptr and *n as necessary.
The n
(or seed
in my code) will hold the size of the buffer allocated for the pointer, where getline() puts the incoming line. As I set the same buffer variable for different pointers, getline was given the wrong information of the size of the char* line_xxx.
As fastq files are usually in this shape:
@read_id_usually_short
CTATACCACCAAGGCTGGAAATTGTAAAACACACCGCCTGACATATCAATAAGGTGTCAAATTCCCTTTTCTCTAGCTTTCGTACT_very_long
+
-///.)/.-/)//-//..-*...-.&%&.--%#(++*/.//////,/*//+(.///..,%&-#&)..,)/.,.._same_length_as_line_2
There was no error for one or two getline() with the same buffer length, as the buffer was too small and getline sized-up the pointers.
But when using 3 or 4 getlines(), the call to length_line = getline(&line_2, &seed, cfile)
was asked to store a char* of length 2 ('+\n'), while getting (the wrong information) that the pointer line_2
was already big enough (size of line_1).
So the (simple) solution is
%%cython
from libc.stdio cimport FILE, fopen, fclose, getline
def fastq_reader(early_stop=10):
cdef const char* fname = b'/path/to/file'
cdef FILE* cfile
cfile = fopen(fname, "rb")
cdef:
char * line_0 = NULL
char * line_1 = NULL
char * line_2 = NULL
char * line_3 = NULL
# One variable for each line pointer
size_t n_0 = 0
size_t n_1 = 0
size_t n_2 = 0
size_t n_3 = 0
ssize_t length_line
unsigned long long line_nb = 0
while True:
# Reading the same file (same cfile), but line_x and n_x by pairs)
length_line = getline(&line_0, &n_0, cfile)
if length_line < 0: break
length_line = getline(&line_1, &n_1, cfile)
if length_line < 0: break
length_line = getline(&line_2, &n_2, cfile)
if length_line < 0: break
length_line = getline(&line_3, &n_3, cfile)
if length_line < 0: break
line_nb += 4
if line_nb > early_stop:
break
fclose(cfile)
return line_nb
fastq_reader(early_stop=20000)
Thanks for pointing out my mistake.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With