Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can getline() be used multiple times within a loop? - Cython, file reading

I want to read a file, 4 lines by 4 (it's a fastq file, with DNA sequences).
When I read the file one line by one or two by two, there's no issues, but when I read 3 or 4 lines at once, my code crashes (kernel appeared to have died on jupyter notebook). (Uncommenting the last part, or any 3 out of the 4 getline().
I tried with a double array of char (char**) to store the lines, with the same issue.

Any idea what can be the cause ?

Using Python 3.7.3, Cython 0.29, all other libraries updated. File being read is about 1.3GB, machine has 8GB, ubuntu 16.04. Code adapted from https://gist.github.com/pydemo/0b85bd5d1c017f6873422e02aeb9618a

%%cython
from libc.stdio cimport FILE, fopen, fclose, getline
    
def fastq_reader(early_stop=10):
    cdef const char* fname = b'/path/to/file'
    cdef FILE* cfile
    cfile = fopen(fname, "rb")

    cdef:
        char * line_0 = NULL
        char * line_1 = NULL
        char * line_2 = NULL
        char * line_3 = NULL
        size_t seed = 0
        ssize_t length_line
        unsigned long long line_nb = 0

    while True:
        length_line = getline(&line_0, &seed, cfile)
        if length_line < 0: break
        
        length_line = getline(&line_1, &seed, cfile)
        if length_line < 0: break
        
#         length_line = getline(&line_2, &seed, cfile)
#         if length_line < 0: break
        
#         length_line = getline(&line_3, &seed, cfile)
#         if length_line < 0: break

        line_nb += 4
        if line_nb > early_stop:
            break

    fclose(cfile)
    return line_nb

fastq_reader(early_stop=20000)
like image 538
Sylvain Avatar asked Apr 25 '21 20:04

Sylvain


People also ask

Does readline read more than one line in Python?

However, does not reads more than one line, even if n exceeds the length of the line. It will be efficient when reading a large file because instead of fetching all the data in one go, it fetches line by line. readline () returns the next line of the file which contains a newline character in the end.

How to read a file in Python using loops?

It’s possible to read a file using loops as well. Using the same wise_owl.txt file that we made in the previous section, we can read every line in the file using a while loop. A wise old owl lived in an oak.

How to read all the lines in a file at once?

Use readline () if you need to read all the lines at once. It’s possible to read a file using loops as well. Using the same wise_owl.txt file that we made in the previous section, we can read every line in the file using a while loop.

How do you read a list of lines in Python?

readlines () is used to read all the lines at a single go and then return them as each line a string element in a list. This function can be used for small files, as it reads the whole file content to the memory, then split it into separate lines.


1 Answers

The underlying problem was my misunderstanding of getline() getline() c reference

To store lines in different variables, an associated n is necessary for each line pointer *lineptr.

If *lineptr is set to NULL and *n is set 0 before the call, then getline() will allocate a buffer for storing the line.

Alternatively, before calling getline(), *lineptr can contain a pointer to a malloc(3)-allocated buffer *n bytes in size. If the buffer is not large enough to hold the line, getline() resizes it with realloc(3), updating *lineptr and *n as necessary.

The n (or seed in my code) will hold the size of the buffer allocated for the pointer, where getline() puts the incoming line. As I set the same buffer variable for different pointers, getline was given the wrong information of the size of the char* line_xxx.


As fastq files are usually in this shape:

@read_id_usually_short
CTATACCACCAAGGCTGGAAATTGTAAAACACACCGCCTGACATATCAATAAGGTGTCAAATTCCCTTTTCTCTAGCTTTCGTACT_very_long
+
-///.)/.-/)//-//..-*...-.&%&.--%#(++*/.//////,/*//+(.///..,%&-#&)..,)/.,.._same_length_as_line_2

There was no error for one or two getline() with the same buffer length, as the buffer was too small and getline sized-up the pointers.
But when using 3 or 4 getlines(), the call to length_line = getline(&line_2, &seed, cfile) was asked to store a char* of length 2 ('+\n'), while getting (the wrong information) that the pointer line_2 was already big enough (size of line_1).


So the (simple) solution is

%%cython
from libc.stdio cimport FILE, fopen, fclose, getline
    
def fastq_reader(early_stop=10):
    cdef const char* fname = b'/path/to/file'
    cdef FILE* cfile
    cfile = fopen(fname, "rb")

    cdef:
        char * line_0 = NULL
        char * line_1 = NULL
        char * line_2 = NULL
        char * line_3 = NULL
        # One variable for each line pointer
        size_t n_0 = 0
        size_t n_1 = 0
        size_t n_2 = 0
        size_t n_3 = 0
        ssize_t length_line
        unsigned long long line_nb = 0

    while True:
        # Reading the same file (same cfile), but line_x and n_x by pairs)
        length_line = getline(&line_0, &n_0, cfile)  
        if length_line < 0: break
        
        length_line = getline(&line_1, &n_1, cfile)
        if length_line < 0: break
        
        length_line = getline(&line_2, &n_2, cfile)
        if length_line < 0: break
        
        length_line = getline(&line_3, &n_3, cfile)
        if length_line < 0: break

        line_nb += 4
        if line_nb > early_stop:
            break

    fclose(cfile)
    return line_nb

fastq_reader(early_stop=20000)

Thanks for pointing out my mistake.

like image 142
Sylvain Avatar answered Oct 16 '22 19:10

Sylvain