
Efficient algorithm for finding all keywords in a text

I have many strings containing text in many different spellings. I tokenize these strings by searching for keywords; when a keyword is found, I use the text associated with that keyword.

Let's say the search string can contain the text "schw.", "schwa." and "schwarz". I have three keywords that all resolve to the text "schwarz".

Now I'm looking for an efficient way to find all the keywords without calling string.Contains(keyword) for every single keyword.

Sample data:

H-Fuss ahorn 15 cm/SH48cm
Metall-Fuss chrom 9 cm/SH42cm
Metall-Kufe alufbg.12 cm/SH45c
Metall-Kufe verchr.12 cm/SH45c
Metall-Zylind.aluf.12cm/SH45cm
Kufe alufarbig
Metall-Zylinder hoch alufarbig
Kunststoffgl.schw. - hoch
Kunststoffgl.schw. - Standard
Kunststoffgleiter - schwarz für Sitzhoehe 42 cm

Sample keywords (key, value):

h-fuss, Holz
ahorn, Ahorn
metall, Metall
chrom, Chrom
verchr, Chrom
alum, Aluminium
aluf, Aluminium
kufe, Kufe
zylind, Zylinder
hoch, Hoch
kunststoffgl, Gleiter
gleiter, Gleiter
schwarz, Schwarz
schw., Schwarz

Sample result:

Holz, Ahorn
Metall, Chrom
Metall, Kufe, Aluminium
Metall, Kufe, Chrom
Metall, Zylinder, Aluminium
Kufe, Aluminium
Metall, Zylinder, Hoch, Aluminium
Gleiter, Schwarz, Hoch
Gleiter, Schwarz
Gleiter, Schwarz
VVS asked Nov 18 '10 11:11



1 Answer

This seems to fit the category "Algorithms using a finite set of patterns":

The Aho–Corasick string matching algorithm is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick. It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns "at once", so the complexity of the algorithm is linear in the length of the patterns plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).
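As an illustration of the Aho–Corasick idea described above, here is a minimal Python sketch applied to the sample data from the question. The function names (build_automaton, find_keywords, tokenize) are hypothetical, and lowercasing the input and dropping repeated values are assumptions made to reproduce the sample result; neither comes from the answer itself.

```python
from collections import deque

def build_automaton(keywords):
    """Build a trie with failure links (Aho-Corasick) for the given keywords."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # keywords recognised when a state is reached
    for key in keywords:
        state = 0
        for ch in key:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(key)
    # Breadth-first pass: compute failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure target (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_keywords(text, automaton):
    """Return (start_index, keyword) for every match, in a single pass over text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for key in out[state]:
            hits.append((i - len(key) + 1, key))
    return hits

def tokenize(text, keywords, automaton):
    """Map matched keywords to their values, dropping repeats (an assumption)."""
    hits = find_keywords(text.lower(), automaton)
    return list(dict.fromkeys(keywords[key] for _, key in hits))

# Keyword table copied from the question.
KEYWORDS = {
    "h-fuss": "Holz", "ahorn": "Ahorn", "metall": "Metall", "chrom": "Chrom",
    "verchr": "Chrom", "alum": "Aluminium", "aluf": "Aluminium", "kufe": "Kufe",
    "zylind": "Zylinder", "hoch": "Hoch", "kunststoffgl": "Gleiter",
    "gleiter": "Gleiter", "schwarz": "Schwarz", "schw.": "Schwarz",
}
AUTOMATON = build_automaton(KEYWORDS)

print(tokenize("Kunststoffgl.schw. - hoch", KEYWORDS, AUTOMATON))
# ['Gleiter', 'Schwarz', 'Hoch']
```

The automaton is built once and reused for every input string, so each string is scanned a single time regardless of how many keywords there are. Values come out in the order the keywords end in the text, which may differ from the order shown in the sample result.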

The Rabin–Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find any one of a set of pattern strings in a text. For a text of length n and p patterns of combined length m, its average- and best-case running time is O(n+m) in O(p) space, but its worst-case time is O(nm). In contrast, the Aho–Corasick string matching algorithm has asymptotic worst-case time complexity O(n+m) in O(m) space.
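A rough Python sketch of multi-pattern Rabin–Karp follows. The textbook variant assumes all patterns have the same length; because the keywords in the question differ in length, this sketch groups patterns by length and makes one rolling-hash pass per distinct length. That grouping, the function name rabin_karp_multi, and the base and modulus values are assumptions for illustration, not part of the answer.

```python
def rabin_karp_multi(text, patterns, base=256, mod=(1 << 61) - 1):
    """Return sorted (start_index, pattern) pairs for every occurrence."""
    # Group patterns by length: a rolling hash only compares equal-sized windows.
    by_length = {}
    for p in patterns:
        if p:
            by_length.setdefault(len(p), set()).add(p)

    hits = []
    for m, group in by_length.items():
        if m > len(text):
            continue
        # Hash every pattern of this length once.
        targets = {}
        for p in group:
            h = 0
            for ch in p:
                h = (h * base + ord(ch)) % mod
            targets.setdefault(h, []).append(p)
        high = pow(base, m - 1, mod)  # weight of the window's leading character
        # Hash of the first window.
        h = 0
        for ch in text[:m]:
            h = (h * base + ord(ch)) % mod
        for start in range(len(text) - m + 1):
            # A hash hit is only a candidate; compare strings to rule out collisions.
            for p in targets.get(h, ()):
                if text[start:start + m] == p:
                    hits.append((start, p))
            # Slide the window: drop text[start], append text[start + m].
            if start + m < len(text):
                h = ((h - ord(text[start]) * high) * base + ord(text[start + m])) % mod
    return sorted(hits)

print(rabin_karp_multi("kunststoffgl.schw. - hoch",
                       ["kunststoffgl", "schw.", "hoch", "schwarz"]))
# [(0, 'kunststoffgl'), (13, 'schw.'), (21, 'hoch')]
```

With many distinct keyword lengths the text is rescanned once per length, so for a keyword table like the one in the question the single-pass Aho–Corasick approach is likely the better fit.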

The Archetypal Paul answered Oct 19 '22 13:10