Loading large file (25k entries) into dict is slow in Python?

I have a file with about 25,000 lines in S19 format.

Each line looks like this: S214 780010 00802000000010000000000A508CC78C 7A

There are no spaces in the actual file. The first part, 780010, is the address of the line, and I want it to be the dict key. The data part, 00802000000010000000000A508CC78C, should be the value for that key. I wrote my code like this:

def __init__(self,filename):
    infile = file(filename,'r')
    self.all_lines = infile.readlines()
    self.dict_by_address = {}

    for i in range(0, self.get_line_number()):
        self.dict_by_address[self.get_address_of_line(i)] = self.get_data_of_line(i)

    infile.close()

get_address_of_line() and get_data_of_line() are both simple string-slicing functions. get_line_number() iterates over self.all_lines and returns an int.

The problem is that the init process takes over a minute. Is the way I construct the dict wrong, or does Python just need that long to do this?

By the way, I'm new to Python :) The code may look more like C/C++, so any advice on how to write more idiomatic Python is appreciated :)

asked Apr 16 '12 03:04 by shengy



1 Answer

How about something like this? (I made a test file containing just the line S21478001000802000000010000000000A508CC78C7A, so you might have to adjust the slicing.)

>>> with open('test.test') as f:
...     dict_by_address = {line[4:10]:line[10:-3] for line in f}
... 
>>> dict_by_address
{'780010': '00802000000010000000000A508CC78C'}
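
If you want the same idea as a reusable function, here is a minimal sketch. The name load_s2_records and the temporary sample file are made up for illustration, and it assumes every record you care about is an S2 record laid out like the sample line (2-character type, 2-character byte count, 6-character address, data, 2-character checksum). It strips each line before slicing, so the result does not depend on whether the last line ends with a newline.

```python
import os
import tempfile


def load_s2_records(path):
    """Map the 6-character address of each S2 record to its data field.

    Hypothetical helper: assumes the layout shown in the question's sample line.
    """
    dict_by_address = {}
    with open(path) as f:
        for line in f:  # stream the file line by line, no readlines() or index lookups
            record = line.strip()  # drop the newline so the slicing is uniform
            if not record.startswith("S2"):
                continue  # assumption: only S2 (24-bit address) records are wanted
            # "S2" + 2-char byte count + 6-char address + data + 2-char checksum
            dict_by_address[record[4:10]] = record[10:-2]
    return dict_by_address


# Usage with the sample line from the question, written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".s19", delete=False) as tmp:
    tmp.write("S21478001000802000000010000000000A508CC78C7A\n")
    sample_path = tmp.name

records = load_s2_records(sample_path)
os.remove(sample_path)
print(records)
```

Each line is read and sliced exactly once, so 25k lines should load almost instantly. If your file mixes in other record types with different address widths, the slice positions would need adjusting.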
answered Sep 19 '22 13:09 by Nolen Royalty