Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Very complicated regular expression

I've been stuck trying to write this regular expression I need. Basically, I have a long string comprised of two different types of data:

  1. [a-f0-9]{32}
  2. [a-zA-Z0-9=]{x}

The thing is, x is only constant in the particular instance: if in one case, it happens to be 12, it will be 12 for that particular dataset, but next time I run the regular expression it might need to be 15, or 45 for example. I have an unpredictable number of type (1)'s between each piece of type (2). My goal is to "harvest" all the data of type (2).

For example I could have a string of the following form:

[a-f0-9]{192}
[a-zA-Z0-9=]{11}
[a-f0-9]{96}
[a-zA-Z0-9=]{11}
[af-0-9]{160}
[a-zA-Z0-9=]{11}

(All put together with no delimitations). I need it to return a string comprised of the 33 characters of the [a-zA-Z0-9=] character set. The fact that the number of characters in each of the substrings is constant in the instance (in the case above it was 11, but it could just have easily have been 13) is vital as since it contains the smaller character set it would otherwise be impossible to know where one string begins and the other ends.

I've been trying to get this to work for almost a month now, and I'm close to tearing out my hair. I'm not particularly good at regular expressions...

Example data:

3c21e03a10b9415fb3e1067ea75f8205
c8dc9900a5089d31e01241c7a947ed7e
d5f8cd6bb86ebef6d7d104c84ae6e8a7
e23c99af9c9d6d0294d8b51094c39021
4bb4af7e61760735ba17c29e8f542a66
875da91e90863f1ddb7e149297fc59af
cf5de951fb65d06d2927aab7b9b54830
e2d935616a54c381c2f38db3731d5a37
SGVsbG8gbXk
6dd11d15c419ac219901f14bdd999f38
0ad94e978ad624d15189f5230e5435a9
2dc19fe95e583e7d593dd52ae7e68a6e
465ffa6074a371a8958dad3ad271181a
23310939b981b4e56f2ecee26f82ec60
fe04bef49be47603d1278cc80673b226
gbmFtZSBpcy
3c21e03a10b9415fb3e1067ea75f8205
c8dc9900a5089d31e01241c7a947ed7e
d5f8cd6bb86ebef6d7d104c84ae6e8a7
e23c99af9c9d6d0294d8b51094c39021
BvbGl2ZXIga
4bb4af7e61760735ba17c29e8f542a66
875da91e90863f1ddb7e149297fc59af
cf5de951fb65d06d2927aab7b9b54830
e2d935616a54c381c2f38db3731d5a37
G9vcmF5IQ==

I would want to extract "SGVsbG8gbXkgbmFtZSBpcyBvbGl2ZXIgaG9vcmF5IQ==".

like image 207
Mala Avatar asked Dec 28 '09 13:12

Mala


4 Answers

It's your lucky day! The problem is not solvable in general, but I believe that the following will nearly always give the right answer for typical data from real life:

<?php

$s = '
3c21e03a10b9415fb3e1067ea75f8205
c8dc9900a5089d31e01241c7a947ed7e
d5f8cd6bb86ebef6d7d104c84ae6e8a7
e23c99af9c9d6d0294d8b51094c39021
4bb4af7e61760735ba17c29e8f542a66
875da91e90863f1ddb7e149297fc59af
cf5de951fb65d06d2927aab7b9b54830
e2d935616a54c381c2f38db3731d5a37
SGVsbG8gbXk
6dd11d15c419ac219901f14bdd999f38
0ad94e978ad624d15189f5230e5435a9
2dc19fe95e583e7d593dd52ae7e68a6e
465ffa6074a371a8958dad3ad271181a
23310939b981b4e56f2ecee26f82ec60
fe04bef49be47603d1278cc80673b226
gbmFtZSBpcy
3c21e03a10b9415fb3e1067ea75f8205
c8dc9900a5089d31e01241c7a947ed7e
d5f8cd6bb86ebef6d7d104c84ae6e8a7
e23c99af9c9d6d0294d8b51094c39021
BvbGl2ZXIga
4bb4af7e61760735ba17c29e8f542a66
875da91e90863f1ddb7e149297fc59af
cf5de951fb65d06d2927aab7b9b54830
e2d935616a54c381c2f38db3731d5a37
G9vcmF5IQ==
';
$s = preg_replace('/\r?\n/', '', $s);

for ($i = 1; $i < 20; ++$i) {
    $pattern = "/^(([a-f0-9]{32})+([a-zA-Z0-9=]{{$i}})?)+$/";
    if (preg_match($pattern, $s)) {
        $pattern = "/(?:[a-f0-9]{32})+([a-zA-Z0-9=]{{$i}})/";
        $matches = array();
        preg_match_all($pattern, $s, $matches);
        print_r(join('', $matches[1]));
        break;
    }
}

Output in this case:

SGVsbG8gbXkgbmFtZSBpcyBvbGl2ZXIgaG9vcmF5IQ==

I believe that the code could be improved, but I'm sure you're just happy to get something that works. I think this is similar to the "bazooka" method you described above, but I honestly don't think there is a better way. Note also that it is important to start with small guesses first to minimize the chance of false matches. The order of the terms in the regex is also important to increase the likelihood of choosing correctly when more than one choice is possible (try hardest to match first, greedy, then easiest match only if that fails).

like image 139
Mark Byers Avatar answered Sep 28 '22 02:09

Mark Byers


I do not believe that regular expressions are the right tool for this problem.

One thing that bothers me is that the range [a-f0-9] is included in the range [a-zA-Z0-9=] and since there are no delimiters and the length of the records is variable, the boundary between two records seems pretty fuzzy.

You may have a heuristic that works to determine where the records start and end by finding a pattern in the data, and you may then apply regular expressions using this pattern, but it is unlikely that regular expressions will help you to uncover this pattern in the first place.

like image 34
Eric Bréchemier Avatar answered Sep 28 '22 00:09

Eric Bréchemier


I don't think your "types" of data are well-defined enough to make the problem solvable for all cases, irrespective of whether you use regular expressions at all.

Since, judging from your example, type 1 can occur multiple times in a row, and type 2 can look like type 1 since the character sets overlap, I don't see how you can tell them apart for all cases even when you know X (which, judging from the question, I'm not sure you do).

As a primitive example, given a string of 2000 repetitions of the letter "a", how could you tell apart types 1 and 2?

If there is any possibility at all of having whatever is giving you that data put in explicit delimiters, do that. Otherwise, you'll have to use heuristics to disambiguate, and I don't think a regexp is the right tool for that.

like image 41
Michael Borgwardt Avatar answered Sep 28 '22 00:09

Michael Borgwardt


It seems that the data you are parsing from between the hex strings is Base64. The actual problem you are describing seems unsolvable with the restrictions you have given (cannot assume any lengths etc.).

But the big thing you should be aware is that base64 character set also contains the characters '+' and '/'. The '=' characters are padding as the length of the entire (in your case, concatenated) base64 encoded bit is always an even multiple of 4 characters.

like image 44
Nakedible Avatar answered Sep 28 '22 02:09

Nakedible