What is the best way to match substring from a big string to a huge list of keywords

Tags: string-matching, c#, regex, lookup, named-entity-recognition

Imagine you have millions of records, each containing text that averages 2,000 words, and you also have a separate list of about 100,000 keywords.

For example, the keyword list has an item like "president Obama", and one of the text records contains something like "..... president Obama ....". I want to find this keyword in the text and replace it with something like "..... {president Obama} ...." to highlight it. As in this example, the keyword list contains multi-word entries.

What is the fastest way to do this with such a huge keyword list and millions of text records?

Notes:

1. Right now I do this in a greedy way, checking word by word and matching each one, but it takes about 2 seconds per text record, and I want something near zero time.
2. I know this is similar to named-entity recognition, and I have worked with many NER frameworks such as Gate and ..., but because I need this for a language that those frameworks do not support, I want to do it manually.


asked Nov 26 '13 07:11

Reza M.A

2 Answers

Assumptions: Most keywords are single words, but there are some multi-word keywords.

My suggestion:

Hash the keywords based on their first word, so "President", "President Obama" and "President Clinton" will all hash to the same value.

Then search the text word by word, computing the hash of each word. On a hash match, add logic to check whether the following words complete a multi-word keyword.

Calculating the hashes will be the most expensive operation of this solution, and it should be linear in the length of the input string.
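
The idea above could be sketched roughly as follows. This is a minimal illustration in Python, not a tuned implementation; the function names `build_index` and `highlight` are made up for this example, and it assumes whitespace tokenization, case-insensitive matching, and no punctuation attached to words. The same structure maps onto any hash-based dictionary type.

```python
from collections import defaultdict


def build_index(keywords):
    """Group keywords (as word lists) under their lowercased first word."""
    index = defaultdict(list)
    for keyword in keywords:
        words = keyword.split()
        if words:
            index[words[0].lower()].append(words)
    # Try longer keywords first so "president Obama" wins over "president".
    for candidates in index.values():
        candidates.sort(key=len, reverse=True)
    return index


def highlight(text, index):
    """Wrap every keyword occurrence in braces, scanning word by word."""
    words = text.split()
    out = []
    i = 0
    while i < len(words):
        matched = None
        # One hash lookup per word; only bucket members are compared in full.
        for candidate in index.get(words[i].lower(), ()):
            window = words[i:i + len(candidate)]
            if [w.lower() for w in window] == [c.lower() for c in candidate]:
                matched = window
                break
        if matched:
            out.append("{" + " ".join(matched) + "}")
            i += len(matched)
        else:
            out.append(words[i])
            i += 1
    # Assumption: collapsing whitespace runs to single spaces is acceptable.
    return " ".join(out)
```

Each text word costs one dictionary lookup, and the full comparison only runs against the few keywords that share its first word, so the work per record no longer grows with the size of the keyword list.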


answered Oct 18 '22 20:10

Taemyr

As for the exact keyword match:

10^6 records * 2*10^3 words = 2*10^9 words, i.e. billions of possible match positions. Comparing each of these against 10^5 keywords leads to 10^6 * 2*10^3 * 10^5 = 2*10^14 operations in the worst case (no match). The probability of no match is high, because 100,000 keywords is probably small compared to all possible words.
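
The estimate can be reproduced with a few lines of arithmetic. The figures are the round numbers from the question (taking "millions" of records as 10^6, which is an assumption), and the variable names are only illustrative.

```python
records = 10**6                # assumed: "millions" of records taken as one million
words_per_record = 2 * 10**3   # average words per record, from the question
keywords = 10**5               # size of the keyword list, from the question

total_words = records * words_per_record
# Naive matching compares every word position against every keyword.
brute_force_comparisons = total_words * keywords

print(total_words)              # 2000000000
print(brute_force_comparisons)  # 200000000000000
```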

"and i want some thing near zero time"

Not possible.

As for NER: you would have to drop the keyword list and instead classify the words of the text into the grammatical categories you would like to highlight.

Things like:

verbs
adverbs
nouns
names
quantities
etc.

can be identified. After you have done that, you could define a list of special words per category. E.g., President might be in such a (noun) list so that it is highlighted with special properties. Because you end up with a much smaller special list, split into several categories, you can decrease the number of operations needed.

(I just realized that, as you know all about NER, you already know that.)

So, you could extract NER-like logic (or another non-100%-match algorithm) for the language you're targeting.

Another approach might be:

Put all your keywords in a hashtable or other (indexed) dictionary, and check whether each target word exists in that hashtable. Because it is indexed, this will be significantly faster than regular matching. You can also store additional info for each keyword in the hashtable.
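
A minimal sketch of that hashtable approach, in Python, might look like this. The names `build_table` and `find_keywords` and the `category` field are invented for illustration, and the code assumes whitespace tokenization and case-insensitive matching. To cover multi-word keywords it probes windows of up to the longest keyword's word count, longest first.

```python
def build_table(entries):
    """Map each lowercased keyword to its extra info.

    Also returns the length, in words, of the longest keyword.
    """
    table = {" ".join(k.lower().split()): info for k, info in entries.items()}
    max_words = max((len(k.split()) for k in table), default=0)
    return table, max_words


def find_keywords(text, table, max_words):
    """Return (start_word_index, matched_text, info) for each keyword found."""
    words = text.split()
    hits = []
    i = 0
    while i < len(words):
        # Probe the longest possible window first, then shrink it.
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            key = phrase.lower()
            if key in table:
                hits.append((i, phrase, table[key]))
                i += n
                break
        else:
            # No keyword starts at this word; move on to the next one.
            i += 1
    return hits
```

With this layout each record needs at most `max_words` dictionary lookups per word, and the stored info (here a small dict) is available as soon as a keyword matches.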

answered Oct 18 '22 20:10

Stefan
