Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding dictionary words

I have a lot of compound strings that are a combination of two or three English words.

    e.g. "Spicejet" is a combination of the words "spice" and "jet"

I need to separate these individual English words from such compound strings. My dictionary is going to consist of around 100000 words.

What would be the most efficient by which I can separate individual English words from such compound strings.

like image 381
Manas Avatar asked Aug 18 '09 04:08

Manas


2 Answers

I'm not sure how much time or frequency you have to do this (is it a one-time operation? daily? weekly?) but you're obviously going to want a quick, weighted dictionary lookup.

You'll also want to have a conflict resolution mechanism, perhaps a side-queue to manually resolve conflicts on tuples that have multiple possible meanings.

I would look into Tries. Using one you can efficiently find (and weight) your prefixes, which are precisely what you will be looking for.

You'll have to build the Tries yourself from a good dictionary source, and weight the nodes on full words to provide yourself a good quality mechanism for reference.

Just brainstorming here, but if you know your dataset consists primarily of duplets or triplets, you could probably get away with multiple Trie lookups, for example looking up 'Spic' and then 'ejet' and then finding that both results have a low score, abandon into 'Spice' and 'Jet', where both Tries would yield a good combined result between the two.

Also I would consider utilizing frequency analysis on the most common prefixes up to an arbitrary or dynamic limit, e.g. filtering 'the' or 'un' or 'in' and weighting those accordingly.

Sounds like a fun problem, good luck!

like image 51
jscharf Avatar answered Nov 15 '22 04:11

jscharf


If the aim is to find the "the largest possible break up for the input" as you replied, then the algorithm could be fairly straightforward if you use some graph theory. You take the compound word and make a graph with a vertex before and after every letter. You'll have a vertex for each index in the string and one past the end. Next you find all legal words in your dictionary that are substrings of the compound word. Then, for each legal substring, add an edge with weight 1 to the graph connecting the vertex before the first letter in the substring with the vertex after the last letter in the substring. Finally, use a shortest path algorithm to find the path with fewest edges between the first and the last vertex.

The pseudo code is something like this:

parseWords(compoundWord)
    # Make the graph
    graph = makeGraph()
    N = compoundWord.length
    for index = 0 to N
        graph.addVertex(i)

    # Add the edges for each word
    for index = 0 to N - 1
        for length = 1 to min(N - index, MAX_WORD_LENGTH)
            potentialWord = compoundWord.substr(index, length)
            if dictionary.isElement(potentialWord)
                graph.addEdge(index, index + length, 1)

    # Now find a list of edges which define the shortest path
    edges = graph.shortestPath(0, N)

    # Change these edges back into words.
    result = makeList()
    for e in edges
        result.add(compoundWord.substr(e.start, e.stop - e.start + 1))
    return result

I, obviously, haven't tested this pseudo-code, and there may be some off-by-one indexing errors, and there isn't any bug-checking, but the basic idea is there. I did something similar to this in school and it worked pretty well. The edge creation loops are O(M * N), where N is the length of the compound word, and M is the maximum word length in your dictionary or N (whichever is smaller). The shortest path algorithm's runtime will depend on which algorithm you pick. Dijkstra's comes most readily to mind. I think its runtime is O(N^2 * log(N)), since the max edges possible is N^2.

You can use any shortest path algorithm. There are several shortest path algorithms which have their various strengths and weaknesses, but I'm guessing that for your case the difference will not be too significant. If, instead of trying to find the fewest possible words to break up the compound, you wanted to find the most possible, then you give the edges negative weights and try to find the shortest path with an algorithm that allows negative weights.

like image 23
A. Levy Avatar answered Nov 15 '22 02:11

A. Levy