A friend of mine wrote this little program.
The textFile is 1.2 GB in size (7 years' worth of newspapers).
He successfully manages to create the dictionary, but he cannot write it to a file using pickle (the program hangs).
import sys
import string
import cPickle as pickle

biGramDict = {}

textFile = open(str(sys.argv[1]), 'r')
# Pickle protocol 2 is a binary format, so the output file is opened in binary mode.
biGramDictFile = open(str(sys.argv[2]), 'wb')

for line in textFile:
    if (line.find('<s>')!=-1):
        old = None
        # Consume the tokens of this sentence up to the closing </s>.
        for line2 in textFile:
            if (line2.find('</s>')!=-1):
                break
            else:
                line2=line2.strip()
                # Punctuation tokens are skipped and do not break a bigram.
                if line2 not in string.punctuation:
                    if old != None:
                        if old not in biGramDict:
                            biGramDict[old] = {}
                        if line2 not in biGramDict[old]:
                            biGramDict[old][line2] = 0
                        biGramDict[old][line2]+=1
                    old=line2

textFile.close()

print "going to pickle..."
pickle.dump(biGramDict, biGramDictFile,2)

print "pickle done. now load it..."
biGramDictFile.close()
biGramDictFile = open(str(sys.argv[2]), 'rb')

newBiGramDict = pickle.load(biGramDictFile)
Thanks in advance.
EDIT
For anyone interested, I will briefly explain what this program does.
Assuming you have a file formatted roughly like this:
<s>
Hello
,
World
!
</s>
<s>
Hello
,
munde
!
</s>
<s>
World
domination
.
</s>
<s>
Total
World
domination
!
</s>
<s> and </s> are sentence separators. A bigram dictionary is generated for later use,
something like this:
{
"Hello": {"World": 1, "munde": 1},
"World": {"domination": 2},
"Total": {"World": 1},
}
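For readers who want to try the idea on the sample above, here is a minimal Python 3 sketch (the original program is Python 2). The function name build_bigrams is hypothetical, and the sketch assumes each token sits on its own line and that a marker line contains nothing but <s> or </s>, which is slightly stricter than the substring check in the original.

```python
import string
from collections import defaultdict


def build_bigrams(lines):
    """Count word pairs inside <s> ... </s> blocks, skipping punctuation tokens."""
    bigrams = defaultdict(lambda: defaultdict(int))
    previous = None
    for raw in lines:
        token = raw.strip()
        if token in ("<s>", "</s>"):
            # A sentence boundary: no bigram may span it.
            previous = None
        elif token and token not in string.punctuation:
            if previous is not None:
                bigrams[previous][token] += 1
            previous = token
    # Convert to plain dicts; a defaultdict built with a lambda cannot be pickled.
    return {first: dict(followers) for first, followers in bigrams.items()}


sample = """<s> Hello , World ! </s> <s> Hello , munde ! </s>
<s> World domination . </s> <s> Total World domination ! </s>""".split()

result = build_bigrams(sample)
```

With the sample tokens, result should match the dictionary shown above.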
Hope this helps. Right now the strategy has changed to using MySQL, because SQLite just wasn't working (probably because of the size).
Python objects can be saved (or serialized) as pickle files for later use.
To save a pickle, use pickle.dump. A convention is to name pickle files *.pickle, but you can name them whatever you want.
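A small round trip might look like the following Python 3 sketch. The file name bigrams.pickle and the temporary directory are only illustrative choices; the point is that the file is opened in binary mode for both writing and reading.

```python
import os
import pickle
import tempfile

bigrams = {"Hello": {"World": 1, "munde": 1}, "World": {"domination": 2}}

# Illustrative location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "bigrams.pickle")

# Pickle data is binary, so use "wb" and "rb".
with open(path, "wb") as f:
    pickle.dump(bigrams, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    restored = pickle.load(f)
```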
In general, pickling a dict works as long as every key and value in it is itself picklable, such as strings, integers, lists, and nested dicts. It fails when the dict contains objects that cannot be pickled, such as lambdas or open file handles, so it depends on the contents. Or if you want to save your dict to a file...
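To illustrate how the outcome depends on the contents, this Python 3 sketch pickles a nested dict of strings and integers, then tries a dict that holds a lambda. The variable names are made up for the example, and the exact exception type raised for the lambda can differ between Python versions, so several are caught.

```python
import pickle

# A nested dict of strings and integers round-trips without trouble.
plain = {"Hello": {"World": 1, "munde": 1}}
plain_ok = pickle.loads(pickle.dumps(plain)) == plain

# A dict holding a lambda cannot be pickled by the standard pickle module.
with_lambda = {"callback": lambda x: x}
try:
    pickle.dumps(with_lambda)
    lambda_ok = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_ok = False
```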
Pickle is only meant to write complete (small) objects. Your dictionary is a bit large to even hold in memory, you'd better use a database instead so you can store and retrieve entries one by one instead of all at once.
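As one possible way to store and retrieve entries one at a time, here is a Python 3 sketch using the standard sqlite3 module. The table layout and the helper names open_store, add_bigram, and followers are assumptions made for illustration, not something from the original program; an in-memory database is used so the example leaves no file behind.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a bigram table keyed by the word pair."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bigrams ("
        " first TEXT NOT NULL,"
        " second TEXT NOT NULL,"
        " count INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (first, second))"
    )
    return conn


def add_bigram(conn, first, second):
    """Increment the count for one word pair, creating the row if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO bigrams (first, second, count) VALUES (?, ?, 0)",
        (first, second),
    )
    conn.execute(
        "UPDATE bigrams SET count = count + 1 WHERE first = ? AND second = ?",
        (first, second),
    )


def followers(conn, first):
    """Return {second_word: count} for a single first word."""
    rows = conn.execute(
        "SELECT second, count FROM bigrams WHERE first = ?", (first,)
    )
    return dict(rows)


conn = open_store()
for pair in [("Hello", "World"), ("Hello", "munde"),
             ("World", "domination"), ("World", "domination")]:
    add_bigram(conn, *pair)
conn.commit()

world_followers = followers(conn, "World")
hello_followers = followers(conn, "Hello")
```

For a corpus of this size, committing once per large batch rather than once per pair would likely matter a great deal for speed.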
Some good, easy-to-integrate single-file database formats you can use from Python are SQLite or one of the DBM variants. The latter acts just like a dictionary (i.e. you can read and write key/value pairs) but uses the disk as storage rather than 1.2 GB of memory.
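A rough Python 3 sketch of the DBM approach follows, using the standard dbm module. DBM stores flat byte strings rather than nested dicts, so this example assumes a key of the form "first<TAB>second" and stores the count as ASCII digits; that encoding and the helper name bump are illustrative choices only.

```python
import dbm
import os
import tempfile


def bump(db, first, second):
    """Increment the on-disk count for one word pair."""
    key = (first + "\t" + second).encode("utf-8")
    current = int(db[key]) if key in db else 0
    db[key] = str(current + 1).encode("ascii")


# Illustrative location; "c" creates the database if it does not exist.
path = os.path.join(tempfile.mkdtemp(), "bigrams")
with dbm.open(path, "c") as db:
    bump(db, "World", "domination")
    bump(db, "World", "domination")
    bump(db, "Hello", "World")
    world_domination = int(db["World\tdomination".encode("utf-8")])
    hello_world = int(db["Hello\tWorld".encode("utf-8")])
```

Only the entries being touched need to be in memory at any time, which is the main attraction over pickling the whole dictionary at once.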