 

regular expression search engine [closed]

Is there a search engine, that would allow me to search by a regular expression?

asked Jan 01 '11 by Elwhis


3 Answers

Google Code Search allows you to search using a regular expression.

As far as I am aware no such search engine exists for general searches.

answered Oct 06 '22 by Mark Byers


There are a few problems with regular expressions that currently prohibit employing them in real-world scenarios. The most pressing is that the entire cached Internet would have to be matched against your regex, which would take significant computing resources; indexes are pretty much useless in a regex context, since a regex can match an unbounded set of strings (/fo*bar/).
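The unboundedness is easy to see with /fo*bar/ itself (a quick illustration, not part of any engine): no finite keyword list an inverted index could precompute would cover everything the pattern accepts.

```python
import re

pat = re.compile(r"fo*bar")

# /fo*bar/ matches infinitely many distinct strings ("fbar", "fobar",
# "foobar", ...), so there is no finite set of index keys equivalent to it.
for n in range(4):
    s = "f" + "o" * n + "bar"
    assert pat.fullmatch(s) is not None
```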

answered Oct 06 '22 by user502515


I don't have a specific engine to suggest.

However, if you could live with a subset of regex syntax, a search engine could store additional tokens to efficiently match rather complex expressions. Solr/Lucene allows for custom tokenization, where the same word can generate multiple tokens under various rule sets.

I'll use my name as an example: "Mark marks the spot."

Case insensitive with stemming: (mark, mark, spot)

Case sensitive with no stemming: (Mark, marks, spot)

Case sensitive with NLP thesaurus expansion: ( [Mark, Marc], [mark, indicate, to-point], [spot, position, location, beacon, coordinate] )

And now evolving towards your question, case insensitive, stemming, dedupe, autocomplete prefix matching: ( [m, ma, mar, mark], [s, sp, spo, spot] )

And if you wanted "substring" style matching it would be: ( [m, ma, mar, mark, a, ar, ark, r, rk, k], [s, sp, spo, spot, p, po, pot, o, ot, t] )
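The prefix and substring token sets above can be generated with a small sketch (plain Python; the function names are mine, not a Solr/Lucene API):

```python
def prefix_tokens(word):
    """Emit every prefix of a word, for autocomplete-style matching."""
    return [word[:i] for i in range(1, len(word) + 1)]

def substring_tokens(word):
    """Emit every contiguous substring, for "substring"-style matching.
    Token count grows roughly as O(n^2) per word, which is one reason
    the resulting index gets so large."""
    seen, out = set(), []
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):
            sub = word[i:j]
            if sub not in seen:      # dedupe repeated substrings
                seen.add(sub)
                out.append(sub)
    return out

print(prefix_tokens("spot"))      # ['s', 'sp', 'spo', 'spot']
print(substring_tokens("mark"))   # ['m', 'ma', 'mar', 'mark', 'a', 'ar', 'ark', 'r', 'rk', 'k']
```

In a real engine these would run inside a custom analyzer chain at index time, one token stream per matching style.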

A single search index can contain all of these different forms of tokens, and the engine can choose which ones to use for each type of search.

Let's try the word "Mississippi" in a regex style with literal tokens: [ m, m?, m+, i, i?, i+, s, ss, s+, ss+ ... ] etc.

The actual rules would depend on the regex subset, but hopefully the pattern is becoming clearer. You would extend even further to match other regex fragments, and then use a form of phrase searching to locate matches.
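As a toy version of that idea (my own sketch, assuming a tiny regex subset of literal characters each optionally followed by '?', and a hypothetical inverted index of token → document ids): expand the pattern into the finite set of strings it can match, then look each one up.

```python
import re

def expand_optional(pattern):
    """Expand a tiny regex subset -- literal chars, each optionally
    followed by '?' -- into the finite set of strings it can match."""
    parts = re.findall(r"(.)(\??)", pattern)
    results = [""]
    for ch, opt in parts:
        if opt:  # 'x?' doubles the candidates: with and without 'x'
            results = results + [r + ch for r in results]
        else:
            results = [r + ch for r in results]
    return set(results)

# Hypothetical inverted index: literal token -> set of document ids.
index = {"color": {1}, "colour": {2}, "grey": {3}}

def search(pattern):
    hits = set()
    for literal in expand_optional(pattern):
        hits |= index.get(literal, set())
    return hits

print(expand_optional("colou?r"))  # {'color', 'colour'}
print(search("colou?r"))           # {1, 2}
```

A fuller subset (character classes, bounded repetition) would expand into phrase queries over the fragment tokens rather than whole-word lookups, but the shape of the approach is the same.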

Of course the index would be quite large, BUT it might be worth it, depending on the project's requirements. And you'd also need a query parser and application logic.

I realize if you're looking for a canned engine this doesn't do it, but in terms of theory this is how I'd approach it (assuming it's really a requirement!). If all somebody wanted was substring matching and flexible wildcard matching, you could get away with far fewer tokens in the index.

In terms of canned apps, you might check out OpenGrok, used for source code indexing, which is not full regex, but understands source code pretty well.

answered Oct 06 '22 by Mark Bennett