 

Python parsing a huge file

I am looking for an efficient way to load a huge file of data.

The file has the following format:

1\tword1\tdata

2\tword2\tdata

3\tword3\tdata

\r\n

1\tword4\tdata

2\tword2\tdata

\r\n

where \r\n marks the end of a sentence, and each sentence consists of words.

I want to load the file while preserving its structure, i.e. I want to be able to refer to a sentence and to a word within that sentence. In general, the result should look like this:

data = [sentence1, sentence2,... ]

where sentence = [word1,word2,...]

Loading the file line by line takes a lot of time; loading it in batches is much more efficient. However, I don't know how to parse the batches and divide the data into sentences.

Currently I use the following code:

from itertools import islice

def loadf(filename):
    n = 100000
    data = []
    with open(filename) as f:
        while True:
            next_n_lines = list(islice(f, n))
            if not next_n_lines:
                break
            data.extend([line.strip().split('\t') for line in next_n_lines])
    return data

With this code I don't know how to divide the data into sentences. In addition, I suspect that extend does not actually extend the current list but creates a new one and reassigns it, because it is extremely slow.
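One way to keep the batched reading while still splitting on sentence boundaries is to accumulate words into the current sentence and flush it whenever a blank line appears. This is a sketch, not from the original post; the batch size and the choice to keep only the word column (index 1) are assumptions:

```python
from itertools import islice

def loadf(filename, n=100000):
    """Load sentences in batches of n lines; a blank line ends a sentence."""
    data = []       # list of sentences
    sentence = []   # words of the sentence currently being built
    with open(filename) as f:
        while True:
            next_n_lines = list(islice(f, n))
            if not next_n_lines:
                break
            for line in next_n_lines:
                parts = line.strip().split('\t')
                if len(parts) < 2:
                    # blank line: sentence boundary
                    if sentence:
                        data.append(sentence)
                        sentence = []
                else:
                    sentence.append(parts[1])
    if sentence:
        # flush a trailing sentence with no final blank line
        data.append(sentence)
    return data
```

Note that extend does mutate the list in place; the slowness more likely comes from building the intermediate list comprehension for every batch.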

I would appreciate any help.

user16168 asked Mar 04 '26 09:03

1 Answer

How about:

import csv
from itertools import groupby

with open(yourfile) as fin:
    tabin = csv.reader(fin, delimiter='\t')
    # A blank line parses to an empty row ([]), which is falsy, so
    # groupby(tabin, bool) splits the rows into runs separated by blank
    # lines; keep the word (column 1) from each row of each truthy run.
    sentences = [[el[1] for el in g] for k, g in groupby(tabin, bool) if k]
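To see the grouping in action, here is the same logic run on an in-memory copy of the sample data from the question, with io.StringIO standing in for the real file:

```python
import csv
import io
from itertools import groupby

# Sample data mirroring the format in the question.
raw = ("1\tword1\tdata\n2\tword2\tdata\n3\tword3\tdata\n\n"
       "1\tword4\tdata\n2\tword2\tdata\n\n")

tabin = csv.reader(io.StringIO(raw), delimiter='\t')
sentences = [[el[1] for el in g] for k, g in groupby(tabin, bool) if k]
print(sentences)  # [['word1', 'word2', 'word3'], ['word4', 'word2']]
```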
Jon Clements answered Mar 05 '26 23:03