Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently Working with (and generating) Large Text Files

Tags:

As part of my work, I am working with very large text files and, in part, analyzing them for word and phrase frequency. I am running into difficulties of computing time, memory restrictions, and in extracting relevant information.

For this program, I am taking a large text file (say 50MB) which has already been cleaned up, turned into lower case. But otherwise it is just unstructured text. I am trying to generate lists of 'bigrams,' 'trigrams,' 'quadgrams,' and 'fivegrams' - respectively, combinations of recurring two, three, four, and five word phrases (i.e. "i am" is a bigram, "i am free" is a trigram, "i am free always" is a quadgram).

What am I currently doing?

Here is my current code, where inputlower is an all lowercase String (scraped web data w/ Mathematica).

inputlower=Import["/directory/allTextLowered.txt"];
bigrams = 
  Sort[Tally[Partition[inputlower, 2, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/bigrams.txt", bigrams];    
Clear[bigrams];
trigrams = 
  Sort[Tally[Partition[inputlower, 3, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/trigrams.txt", trigrams];
Clear[trigrams];    
quadgrams = 
  Sort[Tally[Partition[inputlower, 4, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/quadrams.txt", quadgrams];
Clear[quadgrams];
fivegrams = 
  Sort[Tally[Partition[inputlower, 5, 1]], #1[[2]] > #2[[2]] &];
Export["/directory/fivegrams.txt", fivegrams];

In a way, it works well: I do get information generated, and on smaller scales I find that this code works fast enough that I can have something approximating a workable Manipulate[] program. But when we're dealing with big inputs...

What is wrong with it when I use big files?

Most importantly, my output files are too big to be usable. Is there a way to specify a breaking point in the code: for example, I do not want any 'bigrams' that only appear once? If that proves to still leave too much information, would there be a way to specify that I do not want any 'bigrams' in the file unless they appear more than say 10 times? i.e. if "my cheese" appears 20 times, I want to know about it, but if "i pad" only appears once, maybe losing it would make the file more manageable?

Secondly, these processes take a long time: it took well over two or three hours to generate the bigram output alone. Am I approaching this problem in an efficient manner?

Thirdly, if I did have a large bigram file (~650MB+) that contained ALL of the information, is there a way for Mathematica to access information without loading it all into memory - i.e. taking a file named bigrams.txt, learning that it contains {{"i","am"},55} without bogging the system down?

Edit

[As of 7 Dec 11, I've removed the example file that I've put up - thanks to all again]

like image 332
canadian_scholar Avatar asked Nov 23 '11 18:11

canadian_scholar


People also ask

How to open txt files larger than 1gb?

Solution 1: Download a Dedicated Large File Viewer If all you need to do is read the large file, you can download a dedicated large file viewer such as the Large Text File Viewer. Such tools will open large text files with ease.

What editor can open large files?

Modern editors can handle surprisingly large files. In particular, Vim (Windows, macOS, Linux), Emacs (Windows, macOS, Linux), Notepad++ (Windows), Sublime Text (Windows, macOS, Linux), and VS Code (Windows, macOS, Linux) support large (~4 GB) files, assuming you have the RAM.


1 Answers

Introduction

What I will propose is different from most suggestions given before, and based on a combination of indexing, hash-tables, packed arrays, Compress, .mx files and DumpSave, and a few other things. The basic idea is to preprocess the file in a smart way, and save the preprocessed definitions in an .mx file for fast loading. Rather than base most of the work on reading from disk, I propose to shift the accents, and do most of the work in-memory, but find ways to load data from disk, store it in RAM, work with data, and save it on disk in both time and memory - efficient manner. In trying to achieve this goal, I will use most of the efficient Mathematica constructions I am aware of, both for in-memory work and interaction with the file system.

Code

Here is the code:

Clear[words];
words[text_String] :=  ToLowerCase[StringCases[text, WordCharacter ..]];

(* Rules to replace words with integer indices, and back *)
Clear[makeWordIndexRules];
makeWordIndexRules[sym_Symbol, words : {__String}] :=
   With[{distinctWords = DeleteDuplicates[words]},
    sym["Direct"] = Dispatch[Thread[distinctWords -> Range[Length[distinctWords]]]];
    sym["Inverse"] = Dispatch[Thread[ Range[Length[distinctWords]] -> distinctWords]];
    sym["keys"] = distinctWords;
];

(* Make a symbol with DownValues / OwnValues self - uncompressing *)
ClearAll[defineCompressed];
SetAttributes[defineCompressed, HoldFirst];
defineCompressed[sym_Symbol, valueType_: DownValues] :=
  With[{newVals = 
     valueType[sym] /.
       Verbatim[RuleDelayed][
         hpt : Verbatim[HoldPattern][HoldPattern[pt_]], rhs_] :>
            With[{eval = Compress@rhs}, hpt :> (pt = Uncompress@ eval)]
        },
        ClearAll[sym];
        sym := (ClearAll[sym]; valueType[sym] = newVals; sym)
];


(* Get a list of indices corresponding to a full list of words in a text *)
Clear[getWordsIndices];
getWordsIndices[sym_, words : {__String}] :=
    Developer`ToPackedArray[words /. sym["Direct"]];

(* Compute the combinations and their frequencies *)
Clear[getSortedNgramsAndFreqs];
getSortedNgramsAndFreqs[input_List, n_Integer] :=
   Reverse[#[[Ordering[#[[All, 2]]]]]] &@ Tally[Partition[input, n, 1]];


(* 
 ** Produce n-grams and store them in a hash-table. We split combinations from
 ** their frequencies, and assume indices for input, to utilize packed arrays 
 *)
Clear[produceIndexedNgrams];
produceIndexedNgrams[sym_Symbol, input_List, range : {__Integer}] :=
 Do[
   With[{ngramsAndFreqs = getSortedNgramsAndFreqs[input, i]},
     sym["NGrams", i] = Developer`ToPackedArray[ngramsAndFreqs[[All, 1]]];
     sym["Frequencies", i] =  Developer`ToPackedArray[ngramsAndFreqs[[All, 2]]]
   ],
   {i, range}];


(* Higher - level function to preprocess the text and populate the hash - tables *)
ClearAll[preprocess];
SetAttributes[preprocess, HoldRest];
preprocess[text_String, inputWordList_Symbol, wordIndexRuleSym_Symbol,
    ngramsSym_Symbol, nrange_] /; MatchQ[nrange, {__Integer}] :=
  Module[{},
    Clear[inputWordList, wordIndexRuleSym, ngramsSym];
    inputWordList = words@text;
    makeWordIndexRules[wordIndexRuleSym, inputWordList];
    produceIndexedNgrams[ngramsSym,
    getWordsIndices[wordIndexRuleSym, inputWordList], nrange]
  ];

(* Higher - level function to make the definitions auto-uncompressing and save them*)
ClearAll[saveCompressed];
SetAttributes[saveCompressed, HoldRest];
saveCompressed[filename_String, inputWordList_Symbol, wordIndexRuleSym_Symbol, 
   ngramsSym_Symbol] :=
Module[{},
   defineCompressed /@ {wordIndexRuleSym, ngramsSym};
   defineCompressed[inputWordList, OwnValues];
   DumpSave[filename, {inputWordList, wordIndexRuleSym, ngramsSym}];
];

The above functionality is very memory - hungry: for processing of @Ian's file it took at some point nearly 5Gb of RAM. However, this is worth it, and one can also test the above with smaller files if there is not enough RAM. Generally, large files could be split into several parts, to work around this problem.

Preprocessing and saving in optimized format

We now start. Preprocessing takes about a minute on my machine:

test = Import["C:\\Temp\\lowered-text-50.txt", "Text"];

In[64]:= preprocess[test,inputlower,wordIndexRules,ngrams,{2,3}];//Timing
Out[64]= {55.895,Null}

The symbols inputlower, wordIndexRules, ngrams here are some symbols I chose to use for the list of words in a file and for hash-tables. Here are some additional inputs illustrating how these symbols are used and what they mean:

In[65]:= ByteCount[inputlower]
Out[65]= 459617456

In[69]:= inputlower[[1000;;1010]]
Out[69]= {le,fort,edmonton,le,principal,entrepôt,de,la,compagnie,de,la}

In[67]:= toNumbers = inputlower[[1000;;1010]]/.wordIndexRules["Direct"]
Out[67]= {58,220,28,58,392,393,25,1,216,25,1}

In[68]:= toWords =toNumbers/. wordIndexRules["Inverse"]
Out[68]= {le,fort,edmonton,le,principal,entrepôt,de,la,compagnie,de,la}

In[70]:= {ngrams["NGrams",2],ngrams["Frequencies",2]}//Short
Out[70]//Short= {{{793,791},{25,1},{4704,791},<<2079937>>,{79,80},{77,78},{33,34}},{<<1>>}}

The main idea here is that we use integer indices instead of words (strings), which allows us to utilize packed arrays for n-grams.

Compressing and saving takes another half a minute:

In[71]:= saveCompressed["C:\\Temp\\largeTextInfo.mx", inputlower, 
      wordIndexRules, ngrams] // Timing
Out[71]= {30.405, Null}

The resulting .mx file is about 63MB, which is about the size of the original file. Note that since a part of what we save is a (self-compressed) variable inputlower, which contains all the input words in the original order, we are not losing any information as compared to the original file. In principle, one can from now on start working with the new .mx file only.

Working with the optimized file

We now start a new session by quitting the kernel. It takes almost no time to load the file (.mx format is extremely efficient):

In[1]:= Get["C:\\Temp\\largeTextInfo.mx"] // Timing
Out[1]= {0.016, Null}

Loading the list of words takes some time (self - uncompressing):

In[2]:= inputlower//Short//Timing
Out[2]= {6.52,{la,présente,collection,numérisée,<<8000557>>,quicktime,3,0}}

but we don't use it for anything - it is stored just in case. Loading the 2-grams and their frequencies:

In[3]:= Timing[Short[ngrams2 = {ngrams["NGrams",2],ngrams["Frequencies",2]}]]
Out[3]= {0.639,{{{793,791},{25,1},{4704,791},<<2079937>>,{79,80},{77,78},{33,34}},{<<1>>}}}

Note that most of the time here was spent on self - uncompressing, which is selective (so, e.g., ngrams["NGrams",3] is still compressed). Loading the 3-grams and their frequencies:

In[4]:= Timing[Short[ngrams3 = {ngrams["NGrams",3],ngrams["Frequencies",3]}]]
Out[4]= {1.357,{{{11333,793,11334},{793,11334,11356},<<4642628>>,{18,21,22},{20,18,21}},{<<1>>}}}

The timings are decent, given the size of the lists. Note that neither DumpSave - Get nor Compress - Uncompress unpack packed arrays, so our data is stored in Mathematica's memory pretty efficiently:

In[5]:= Developer`PackedArrayQ/@ngrams3
Out[5]= {True,True}

Here we uncompress the rules relating indices to words:

In[6]:= Timing[Short[wordIndexRules["Inverse"]]]
Out[6]= {0.905,Dispatch[{1->la,2->présente,<<160350>>,160353->7631,160354->jomac},-<<14>>-]}

This is enough to start working with the data, but in the next section I outline some hints on how to make this work yet more efficient.

Efficient work with uncompressed data

If we try to find, for example, all positions of 2-grams with frequency 1, the naive way is this:

In[8]:= Position[ngrams["Frequencies",3],1,{1}]//Short//Timing
Out[8]= {1.404,{{870044},{870045},{870046},<<3772583>>,{4642630},{4642631},{4642632}}}

However, we can leverage the fact that we work with integer indices (instead of words) stored in a packed array. Here is one version of the custom position function (due to Norbert Pozar):

extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]]; 
positionExtr[x_List, n_] := 
    extractPositionFromSparseArray[SparseArray[Unitize[x - n], Automatic, 1]]

Using this, we get it 10 times faster (one can use a compiled to C function which is yet twice faster):

In[9]:= positionExtr[ngrams["Frequencies",3],1]//Short//Timing
Out[9]= {0.156,{{870044},{870045},{870046},<<3772583>>,{4642630},{4642631},{4642632}}}

Here are some more convenience functions:

Clear[getNGramsWithFrequency];
getNGramsWithFrequency[ngramSym_Symbol, n_Integer, freq_Integer] :=
  Extract[ngramSym["NGrams", n], positionExtr[ngramSym["Frequencies", n], freq]];

Clear[deleteNGramsWithFrequency];
deleteNGramsWithFrequency[{ngrams_List, freqs_List}, freq_Integer] :=  
    Delete[#, positionExtr[freqs, freq]] & /@ {ngrams, freqs};

deleteNGramsWithFrequency[ngramSym_Symbol, n_Integer, freq_Integer] :=  
   deleteNGramsWithFrequency[{ngramSym["NGrams", n], ngramSym["Frequencies", n]}, freq];

Using which, we can get many things pretty efficiently. For example, delete the 2-grams with frequency 1:

In[15]:= deleteNGramsWithFrequency[ngrams,2,1]//Short//Timing
Out[15]= {0.218,{{{793,791},{25,1},{4704,791},<<696333>>,{29,66},{36,37},{18,21}},{<<1>>}}}

Or, 2-grams with frequencies less than 100 (this is sub-optimal way to do this, but it is still pretty fast):

In[17]:= (twogramsLarger100 = 
  Fold[deleteNGramsWithFrequency,deleteNGramsWithFrequency[ngrams,2,1],Range[2,100]])
  //Short//Timing
Out[17]= {0.344,{{{793,791},{25,1},{4704,791},{25,10},<<6909>>,
  {31,623},{402,13},{234,25}},{<<1>>}}}

The main idea is that integer indices play a role of "pointers" for words, and most things can be done with them. When needed, we can return to the normal words:

In[18]:= twogramsLarger100/.wordIndexRules["Inverse"]//Short//Timing
Out[18]= {0.063,{{{of,the},{de,la},<<6912>>,{société,du},{processus,de}},{<<1>>}}}

Concluding remarks

The speed-up achieved here seems substantial. One can control how much RAM is occupied by the data, by loading data in fine-grained chunks. The memory usage itself has been hugely optimized by utilizing packed arrays. The memory savings on disk are due to the combination of Compress and DumpSave. Hash-tables, Dispatch-ed rules and self-uncompressing are techniques used to make it more convenient.

There is plenty of room here for further refinements. One can split data in smaller chunks and compress / save those separately, to avoid high memory usage in intermediate steps. One can also split data according to the frequency ranges, and save the data so split into separate files, to speed up the load / self - uncompress stage. For many files, one would need to generalize this, since global symbols for hashes were used here. This seems a good place to apply some OOP techniques. Generally, this is just a starting point, but my message is that this approach IMO has a good potential for efficient work with such files.

like image 107
Leonid Shifrin Avatar answered Sep 21 '22 11:09

Leonid Shifrin