Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Automatically built regex expressions that fit set of strings

We have written the system to analyse log messages from the large network. The system takes log messages from lots of different network elements, and analyses it by regex expressions. For example user may have written two rules:

^cron/script\.sh.*
.*script\.sh [0-9]+$

In this case only logs that match given patterns will be selected. The reason of the filtering is that there may be really lots of log messages, up to 1 GB per day.

Now the main part of my question. Since there is lots of network elements, and several types of them, and every one of them has different parameters in path... Is there any way to automatically generate set of regexes that will somehow group the logs? The system can learn on historical data, e.g. from the last week. Generated regex must not be very accurate, it is supposed to be the hint for user to add such new rule into system.

I was thinking about unsupervised machine learning to divide input into groups and then in each group find proper regex. Is there any other way, maybe faster or better? And, last but not least, how to find regex that matches all strings in obtained group? (Non-trivial, so .* is not the answer.)


Edit After some thinking I'll try to simplify the problem. Suppose I have already grouped logs. I'd like to find (at most) three largest substrings (at least one) common to all the strings in set. For example:

Set of strings:
cron/script1.sh -abc 1243 all
cron/script2.sh 1
bin/script1.sh -asdf 15

Obtained groups:
/script
.sh 

Now I can build some simple regex by concatenating these groups with .*?. In this example it would be .*?(/script).*?(\.sh ).*?. It seems to be simpler solution.

like image 840
Archie Avatar asked Oct 06 '11 11:10

Archie


People also ask

How do you create expressions in regex?

If you want to match for the actual '+', '. ' etc characters, add a backslash( \ ) before that character. This will tell the computer to treat the following character as a search character and consider it for matching pattern. Example : \d+[\+-x\*]\d+ will match patterns like "2+2" and "3*9" in "(2+2) * 3*9".

What does \d mean in regex?

In regex, the uppercase metacharacter denotes the inverse of the lowercase counterpart, for example, \w for word character and \W for non-word character; \d for digit and \D or non-digit. The above regex matches two words (without white spaces) separated by one or more whitespaces.

What is difference [] and () in regex?

[] denotes a character class. () denotes a capturing group. [a-z0-9] -- One character that is in the range of a-z OR 0-9. (a-z0-9) -- Explicit capture of a-z0-9 .

What does (? I do in regex?

(? i) makes the regex case insensitive. (? c) makes the regex case sensitive.


2 Answers

You could try the tool hosted at this site: http://regex.inginf.units.it/

This tool automatically generates a regex from a set of examples, so it should be perfect for your use case. In the website it is also described how it works in details (it is based on genetic programming).

like image 60
Marco Mauri Avatar answered Nov 03 '22 13:11

Marco Mauri


OK, we'll try to break this down into manageable steps.

  1. For each substring w in s1, in order of non-increasing length,
  2.  assume w is a substring of the other sM
  3.  for each string of the other sN,
  4.   if w is not a substring of sN, disprove assumption and break
  5.  if the assumption held, save w
  6.  if you've found three w that work, break
  7. You have recorded between 0 and 3 w that work.

Note that not all sets of strings are guaranteed to have common substrings (except the empty string). In the worst case, assume s1 is the longest string. There are O(n^2) substrings of s1 (|s1| = n) and it takes O(n) to compare to each of m other strings... so the asymptotic complexity is, I believe, O(n^2 * nm)... even though the algorithm is naive, this should be pretty manageable (polynomial, after all, and quadratic at that).

The transformation to e.g. C code should be straightforward... use a sliding window with a decrementing length loop to get substrings of s1, and then use linear searchers to find matches in the other strings.

I'm sure there are smarter / asymptotically better ways of doing this, but any algorithm will have to look at all characters in all strings, so O(nm)... may not be completely right here.

like image 27
Patrick87 Avatar answered Nov 03 '22 13:11

Patrick87