Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimize a Regex

I'm using the following code to discard unsupported physical interfaces / subinterfaces from routers that connects to a big ISP network (by big I mean tens of thousands of routers):

private final static Pattern INTERFACES_TO_FILTER = 
   Pattern.compile("unrouted VLAN|GigabitEthernet.+-mpls layer|FastEthernet.+-802\\.1Q vLAN subif"); 

// Simplification
List<String> interfaces;
// lots of irrelevant code to query the routers 

for (String intf : interfaces) {
   if (INTERFACES_TO_FILTER.matcher(intf).find()) {
      // code to prevent the interface from being used
   } 
}

The idea is discarding entries such as:

  • unrouted VLAN 2000 for GigabitEthernet2/11.2000
  • GigabitEthernet1/2-mpls layer
  • FastEthernet6/0/3.2000-802.1Q vLAN subif

This code is hit often enough (several times per minute) over huge sets of interfaces (some routers have 50k+ subintefaces), cache doesn't really help much either because new subinterfaces are being configured / discarded very often. The plan is to optimize the regex so that the procedure completes a tad faster (every nanosecond counts). Can you guys enlighten me?

Note: mpls layer and 802.1Q are supported for other kinds of interfaces, unrouted VLANs isn't.

like image 927
Anthony Accioly Avatar asked May 07 '26 13:05

Anthony Accioly


2 Answers

There are some string search algorithms that allow you to try to search in a string of length n for k strings at once cheaper than the obvious O(n*k) cost.

They usually compare a rolling hash against a list of existing hashes of your words. A prime example of this would be the Rabin-Karp algorithm. The wiki page even has a section about this. There are more advanced versions of the principle out there as well, but it's easy to understand the principle.

No idea if there already are libraries in Java that do this (I'd think so), but that's what I'd try - although 5 strings is rather small here (and different size makes it more complex too). So better check whether a good KMP string search isn't faster - I'd think that'd be by far the best solution really (the default java api uses a naive string search, so use a lib)

About your regexes: backtracking regex implementation for performance critical search code? I doubt that's a good idea.

PS: If you'd post a testset and a test harness for your problem, chances are good people would see how much they could beat the favorite - has worked before.. human nature is so easy to trick :)

like image 121
Voo Avatar answered May 09 '26 01:05

Voo


I'm answering my own question for further reference, although the credits goes to @piotrekkr since he was the one that pointed the way. Also my Kudos to @JB and @ratchet. I ended up using matches(), and the logic using indexOf and several contains was almost as fast (that's news to me, I always assumed that a single regex would be faster than several calls to contains).

Here's a solution that is several times faster (according to the profiler, about 7 times less time is spent at Matcher class methods):

^(?:unrouted VLAN.++|GigabitEthernet.+?-mpls layer|FastEthernet.+?-802\\.1Q vLAN subif)$
like image 21
Anthony Accioly Avatar answered May 09 '26 03:05

Anthony Accioly



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!