Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find most repeated phrase on huge text

Tags:

I have huge text data. My entire database is text format in UTF-8

I need to have list of most repeated phrase on my whole text data.

For example my desire output something like this:

{
  'a': 423412341,
  'this': 423412341,
  'is': 322472341,
  'this is': 222472341,
  'this is a': 122472341,
  'this is a my': 5235634
}

Process and store each phrase take huge size of database. For example store in MySQL or MongoDB. Question is is there any more efficient database or alghorithm for find this result ? Solr, Elasticsearch or etc ...

I think i have max 10 words in each phrase can be good for me.

like image 482
sweb Avatar asked Apr 20 '15 16:04

sweb


1 Answers

I'd suggest combining ideas from two fields, here: Streaming Algorithms, and the Apriori Algorithm From Market-Basket Analysis.

  1. Let's start with the problem of finding the k most frequent single words without loading the entire corpus into memory. A very simple algorithm, Sampling (see Finding Frequent Items in Data Streams]), can do so very easily. Moreover, it is very amenable to parallel implementation (described below). There is a plethora of work on top-k queries, including some on distributed versions (see, e.g., Efficient Top-K Query Calculation in Distributed Networks).

  2. Now to the problem of k most frequent phrases (of possibly multiple phrases). Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity. Hence, once you have the k most frequent single words, you can scan the corpus for only them (which is faster) to build the most frequent phrases of length 2. Using this, you can build the most frequent phrases of length 3, and so on. The stopping condition is when a phrase of length l + 1 does not evict any phrase of length l.


A Short Description of The Sampling Algorithm

This is a very simple algorithm which will, with high probability, find the top k items out of those having frequency at least f. It operates in two stages: the first finds candidate elements, and the second counts them.

In the first stage, randomly select ~ log(n) / f words from the corpus (note that this is much less than n). With high probability, all your desired words appear in the set of these words.

In the second stage, maintain a dictionary of the counts of these candidate elements; scan the corpus, and count the occurrences.

Output the top k of the items resulting from the second stage.

Note that the second stage is very amenable to parallel implementation. If you partition the text into different segments, and count the occurrences in each segment, you can easily combine the dictionaries at the end.

like image 170
Ami Tavory Avatar answered Oct 02 '22 19:10

Ami Tavory