I am dealing with several large text files, each with about 8,000,000 lines. A short sample of the lines:
usedfor zipper fasten_coat
usedfor zipper fasten_jacket
usedfor zipper fasten_pant
usedfor your_foot walk
atlocation camera cupboard
atlocation camera drawer
atlocation camera house
relatedto more plenty
The code to store them in a dictionary is:
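import collections

dicCSK = collections.defaultdict(list)
for line in finCSK:  # finCSK is the already-opened input file
    line = line.strip('\n')
    try:
        r, c1, c2 = line.split(" ")
    except ValueError:
        # Malformed line: report it and skip it, so stale values are not appended
        print(line)
        continue
    dicCSK[c1].append(r + " " + c2)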
It runs fine on the first text file, but when it gets to the second one, I get a MemoryError.
I am using 64-bit Windows 7 with 32-bit Python 2.7, an Intel i5 CPU, and 8 GB of memory. How can I solve the problem?
Further explaining:
I have four large files, each containing different information about many entities. For example, I want to find all information for cat, its parent node animal, its child node persian cat, and so on. So my program first reads all the text files into dictionaries, then scans all the dictionaries to find information for cat and for its parent and children.
Python doesn't limit memory usage on your program. It will allocate as much memory as your program needs until your computer is out of memory. The most you can do is reduce the limit to a fixed upper cap. That can be done with the resource module (which is only available on Unix), but it isn't what you're looking for.
Simplest solution: You're probably running out of virtual address space (any other form of error usually means running really slowly for a long time before you finally get a MemoryError). This is because a 32-bit application on Windows (and most OSes) is limited to 2 GB of user-mode address space (Windows can be tweaked to make it 3 GB, but that's still a low cap). You've got 8 GB of RAM, but your program can't use (at least) 3/4 of it. Python has a fair amount of per-object overhead (object header, allocation alignment, etc.); odds are the strings alone are using close to a GB of RAM, and that's before you deal with the overhead of the dictionary, the rest of your program, the rest of Python, etc. If memory space fragments enough, and the dictionary needs to grow, it may not have enough contiguous space to reallocate, and you'll get a MemoryError.
Install a 64-bit version of Python (if you can, I'd recommend upgrading to Python 3 for other reasons); it will use more memory, but then, it will have access to a lot more memory space (and more physical RAM as well).
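To confirm which build is actually running, a quick check along these lines may help. It is only a sketch; the helper names `pointer_bits` and `is_64bit` are made up for illustration, while `struct.calcsize` and `sys.maxsize` are standard library features.

```python
import struct
import sys


def pointer_bits():
    """Return the pointer width of the running interpreter, in bits."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8


def is_64bit():
    """True if this interpreter is a 64-bit build."""
    # sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
    return sys.maxsize > 2 ** 32


print("%d-bit Python" % pointer_bits())
```

Note that this reports the interpreter's bitness, not the operating system's: a 32-bit Python on 64-bit Windows will still report 32.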
If that's not enough, consider converting to a sqlite3 database (or some other DB), so it naturally spills to disk when the data gets too large for main memory, while still having fairly efficient lookup.
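As a rough illustration of the sqlite3 route, the sketch below loads the three-field lines into a table and indexes it on the middle field, which mirrors the dictionary keyed on c1 in the question. The table name `csk`, the column names, and the helpers `load_triples` and `lookup` are assumptions made up for this example. It uses an in-memory database so it runs standalone; for real data you would pass a file path to `sqlite3.connect` so the data lives on disk.

```python
import sqlite3


def load_triples(conn, lines):
    """Insert 'relation c1 c2' lines into a table indexed by c1."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS csk (relation TEXT, c1 TEXT, c2 TEXT)"
    )
    # Generator expression: rows are streamed to SQLite one at a time
    # instead of being collected in a Python list first.
    rows = (
        parts
        for parts in (line.rstrip("\n").split(" ") for line in lines)
        if len(parts) == 3  # skip malformed lines
    )
    conn.executemany("INSERT INTO csk VALUES (?, ?, ?)", rows)
    # Build the index after the bulk insert so lookups by c1 avoid full scans.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_csk_c1 ON csk (c1)")
    conn.commit()


def lookup(conn, concept):
    """Return (relation, c2) pairs for one concept, in insertion order."""
    cur = conn.execute(
        "SELECT relation, c2 FROM csk WHERE c1 = ? ORDER BY rowid",
        (concept,),
    )
    return cur.fetchall()


# Demo with an in-memory database (assumption: swap in a file path for real use).
conn = sqlite3.connect(":memory:")
sample = [
    "usedfor zipper fasten_coat\n",
    "usedfor zipper fasten_jacket\n",
    "atlocation camera drawer\n",
    "bad line with too many fields\n",
]
load_triples(conn, sample)
print(lookup(conn, "zipper"))
```

Since a file object is itself an iterable of lines, `load_triples(conn, finCSK)` should work directly on the open file from the question, without reading the whole file into memory.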
Assuming your example text is representative of all the text, one line would consume about 75 bytes on my machine:
In [3]: sys.getsizeof('usedfor zipper fasten_coat')
Out[3]: 75
Doing some rough math:
75 bytes * 8,000,000 lines / 1024 / 1024 = ~572 MB
So roughly 572 meg to store the strings alone for one of these files. Once you start adding in additional, similarly structured and sized files, you'll quickly approach your virtual address space limits, as mentioned in @ShadowRanger's answer.
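The same back-of-the-envelope estimate can be reproduced on your own interpreter with a few lines. This is only a sketch: `estimate_string_mb` is a hypothetical helper, `sys.getsizeof` returns different values depending on the Python version and on whether the build is 32-bit or 64-bit, and the figure covers the string objects only, not the lists and dictionary that hold them.

```python
import sys


def estimate_string_mb(sample_line, n_lines):
    """Rough size, in MB (1024 * 1024 bytes), of n_lines strings like sample_line."""
    return sys.getsizeof(sample_line) * n_lines / (1024.0 * 1024.0)


estimate = estimate_string_mb("usedfor zipper fasten_coat", 8000000)
print("about %.0f MB per file for the strings alone" % estimate)
```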
If upgrading your Python isn't feasible for you, or if it only kicks the can down the road (you have finite physical memory after all), you really have two options: write your results to temporary files in between loading and reading the input files, or write your results to a database. Since you need to further post-process the strings after aggregating them, writing to a database would be the superior approach.