 

Python parsing a huge file

I am looking for an efficient way to load a huge file of data.

The file has the following format:

1\tword1\tdata

2\tword2\tdata

3\tword3\tdata

\r\n

1\tword4\tdata

2\tword2\tdata

\r\n

where \r\n marks the end of a sentence, and each sentence consists of words.

I want to load the file while preserving its structure, i.e. I want to be able to refer to a sentence and to a word within that sentence. In general, the result should look like this:

data = [sentence1, sentence2,... ]

where sentence = [word1,word2,...]

Loading the file line by line takes a lot of time; loading it in batches is much more efficient. However, I don't know how to parse the batches and divide the data into sentences.

Currently I use the following code:

from itertools import islice

def loadf(filename):
    n = 100000
    data = []
    with open(filename) as f:
        while True:
            next_n_lines = list(islice(f, n))
            if not next_n_lines:
                break
            data.extend([line.strip().split('\t') for line in next_n_lines])
    return data

With this code I don't know how to divide the data into sentences. In addition, I suspect that extend does not actually extend the current list but creates a new one and reassigns it, because it is extremely slow.
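One way to keep the batched reading while still splitting on sentence boundaries is to accumulate words into the current sentence and flush it whenever a blank line appears. This is a sketch, not from the original post; the batch size and the choice to keep only the word column (index 1) are assumptions:

```python
from itertools import islice

def loadf(filename, n=100000):
    """Load sentences in batches of n lines; a blank line ends a sentence."""
    data = []       # list of sentences
    sentence = []   # words of the sentence currently being built
    with open(filename) as f:
        while True:
            next_n_lines = list(islice(f, n))
            if not next_n_lines:
                break
            for line in next_n_lines:
                parts = line.strip().split('\t')
                if len(parts) < 2:
                    # blank line: sentence boundary
                    if sentence:
                        data.append(sentence)
                        sentence = []
                else:
                    sentence.append(parts[1])
    if sentence:
        # flush a trailing sentence with no final blank line
        data.append(sentence)
    return data
```

Note that extend does mutate the list in place; the slowness more likely comes from building the intermediate list comprehension for every batch.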

I would appreciate any help.

user16168 asked Mar 04 '26 09:03

1 Answer

How about:

import csv
from itertools import groupby

with open(yourfile) as fin:
    tabin = csv.reader(fin, delimiter='\t')
    # A blank line parses to an empty row ([]), which is falsy, so
    # groupby(tabin, bool) splits the rows into runs separated by blank
    # lines; keep the word (column 1) from each row of each truthy run.
    sentences = [[el[1] for el in g] for k, g in groupby(tabin, bool) if k]
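To see the grouping in action, here is the same logic run on an in-memory copy of the sample data from the question, with io.StringIO standing in for the real file:

```python
import csv
import io
from itertools import groupby

# Sample data mirroring the format in the question.
raw = ("1\tword1\tdata\n2\tword2\tdata\n3\tword3\tdata\n\n"
       "1\tword4\tdata\n2\tword2\tdata\n\n")

tabin = csv.reader(io.StringIO(raw), delimiter='\t')
sentences = [[el[1] for el in g] for k, g in groupby(tabin, bool) if k]
print(sentences)  # [['word1', 'word2', 'word3'], ['word4', 'word2']]
```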
Jon Clements answered Mar 05 '26 23:03