Automatically building regular expressions that fit a set of strings

We have written a system to analyse log messages from a large network. The system takes log messages from many different network elements and analyses them with regular expressions. For example, a user may have written two rules:

^cron/script\.sh.*
.*script\.sh [0-9]+$

In this case, only logs that match the given patterns will be selected. The reason for the filtering is that there can be a very large volume of log messages, up to 1 GB per day.
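As a rough illustration of this filtering step, here is a minimal Python sketch. Only the two patterns come from the rules above; the sample log lines and the `matches_any_rule` helper are invented for the example, and treating a log as selected when at least one rule matches is an assumption about how the rules combine.

```python
import re

# The two user-written rules from above.
RULES = [
    re.compile(r"^cron/script\.sh.*"),
    re.compile(r".*script\.sh [0-9]+$"),
]

def matches_any_rule(line, rules=RULES):
    """Return True if at least one rule matches the log line (assumed semantics)."""
    return any(rule.search(line) for rule in rules)

# Hypothetical log lines, made up for illustration.
logs = [
    "cron/script.sh -abc 1243 all",
    "bin/script.sh 15",
    "bin/other.sh 15",
]

selected = [line for line in logs if matches_any_rule(line)]
print(selected)
```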

Now the main part of my question. There are lots of network elements, of several types, and each of them has different parameters in its paths. Is there any way to automatically generate a set of regexes that will somehow group the logs? The system can learn from historical data, e.g. from the last week. The generated regex need not be very accurate; it is only meant as a hint for the user to add such a new rule to the system.

I was thinking about using unsupervised machine learning to divide the input into groups and then finding a proper regex for each group. Is there any other way, maybe faster or better? And, last but not least, how do I find a regex that matches all strings in an obtained group? (It must be non-trivial, so .* is not the answer.)

Edit: After some thinking, I'll try to simplify the problem. Suppose I have already grouped the logs. I'd like to find the (at most) three longest substrings (at least one) common to all the strings in the set. For example:

Set of strings:
cron/script1.sh -abc 1243 all
cron/script2.sh 1
bin/script1.sh -asdf 15

Obtained groups:
/script
.sh

Now I can build a simple regex by concatenating these groups with .*?. In this example it would be .*?(/script).*?(\.sh ).*?. This seems to be a simpler solution.
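A minimal sketch of that concatenation step in Python, assuming the common substrings are already known and are listed in the order in which they appear in every string. The helper name `regex_from_parts` is made up for this example. `re.escape` is used so that characters such as `.` are matched literally; its output may escape a few more characters than the hand-written pattern above, but it should match the same strings.

```python
import re

def regex_from_parts(parts):
    """Join literal substrings, in order of appearance, with lazy wildcards."""
    groups = ["(" + re.escape(part) + ")" for part in parts]
    return ".*?" + ".*?".join(groups) + ".*?"

# Common substrings from the example above (note the space after ".sh").
pattern = regex_from_parts(["/script", ".sh "])

strings = [
    "cron/script1.sh -abc 1243 all",
    "cron/script2.sh 1",
    "bin/script1.sh -asdf 15",
]

# Every string of the set should match the generated pattern.
print(all(re.match(pattern, s) for s in strings))
```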

asked Oct 06 '11 11:10

Archie

2 Answers

You could try the tool hosted at this site: http://regex.inginf.units.it/

This tool automatically generates a regex from a set of examples, so it should fit your use case well. The website also describes in detail how it works (it is based on genetic programming).

answered Nov 03 '22 13:11

Marco Mauri

OK, we'll try to break this down into manageable steps.

  1. For each substring w of s1, in order of non-increasing length,
  2.  assume w is a substring of every other string sN
  3.  for each other string sN,
  4.   if w is not a substring of sN, disprove the assumption and break
  5.  if the assumption held, save w
  6.  if you've found three w that work, break
  7. You have recorded between 0 and 3 w that work.

Note that not all sets of strings are guaranteed to have a common substring (except the empty string). For the worst case, assume s1 is the longest string. There are O(n^2) substrings of s1 (|s1| = n), and it takes O(n) to search for each one in each of the m other strings, so the asymptotic complexity is, I believe, O(n^2 * nm) = O(n^3 * m). Even though the algorithm is naive, this should be pretty manageable (polynomial, after all: cubic in the string length and linear in the number of strings).

The transformation to e.g. C code should be straightforward: use a sliding window with a decrementing length loop to get the substrings of s1, and then use linear searches to find matches in the other strings.
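For illustration, here is a sketch of the steps above in Python rather than C. The function name `common_substrings` and the `limit` parameter are made up for this example. It also makes one addition that is not in the steps above: a candidate that is already contained in a saved substring is skipped, because otherwise the three results would mostly be trimmed copies of the longest one.

```python
def common_substrings(strings, limit=3):
    """Return up to `limit` longest substrings of strings[0] found in every string."""
    if not strings:
        return []
    first, others = strings[0], strings[1:]
    found = []
    # Sliding window over the first string, longest window first.
    for length in range(len(first), 0, -1):
        for start in range(len(first) - length + 1):
            w = first[start:start + length]
            # Addition to the original steps: skip pieces of saved results.
            if any(w in saved for saved in found):
                continue
            # Keep w only if it occurs in every other string.
            if all(w in s for s in others):
                found.append(w)
                if len(found) == limit:
                    return found
    return found

logs = [
    "cron/script1.sh -abc 1243 all",
    "cron/script2.sh 1",
    "bin/script1.sh -asdf 15",
]
print(common_substrings(logs))
```

On the three sample strings from the question this returns 'n/script', '.sh ' and ' 1', so the longest shared piece comes out one character longer than the /script listed in the question.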

I'm sure there are smarter / asymptotically better ways of doing this, but any algorithm will have to look at all characters in all strings, so it cannot do better than O(nm)... though I may not be completely right here.

answered Nov 03 '22 13:11

Patrick87

Tags: string, regex, algorithm
