
RegEx for hashtag separated string

Tags: regex, php

I have a bunch of strings like this:

a#aax1aay222b#bbx4bby555bbz6c#mmm1d#ara1e#abc

What I need to do is split them up based on the hashtag positions, into something like this:

Array
(
    [0] => A
    [1] => AAX1AAY222
    [2] => B
    [3] => BBX4BBY555BBZ6
    [4] => C
    [5] => MMM1
    [6] => D
    [7] => ARA1
    [8] => E
    [9] => ABC
)

So, as you can see, the character immediately before each hashtag is captured, along with everything after the hashtag up to the next char+hashtag.

I have the following RegEx, which works only when each part ends with a numeric value.

Here is the RegEx setup:

preg_split('/([A-Z])+#/', $text, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

And it works fine with something like this:

C#mmm1D#ara1

But, if I change it to this (removing the numbers):

C#mmmD#ara

Then this is the result, which is not what I want:

Array
(
    [0] => C
    [1] => D
)

I've looked at this question and this one as well; they are similar, but neither worked for me.

So my question is: why does it work only when each part ends with a number, and how can I fix it?

Here are some of the sample strings I have:

a#123b#abcc#def456         // A:123, B:ABC, C:DEF456
a#abc1def2efg3b#abcdefc#8  // A:ABC1DEF2EFG3, B:ABCDEF, C:8
a#abcdef123b#5c#xyz789     // A:ABCDEF123, B:5, C:XYZ789

P.S. Strings are case-insensitive.

P.P.S. In case you're wondering what the hell these strings are: they are user-submitted answers to a questionnaire, and I can't refactor them because they are already stored and just need to be processed.

Why not use explode()?

If you look at my examples you will see that I need to capture the character right before the # as well. If you think it's possible with explode() please post the output as well, thanks!

Update

Should we focus on why /([A-Z])+#/ works only when numbers are included? Thanks.

Mahdi asked May 16 '13 07:05


2 Answers

Rather than using preg_split(), decide what you want to match:

  1. A set of "words" if followed by either <any-char># or <end-of-string>.

  2. A character if immediately followed by #.

    $str = 'a#aax1aay222b#bbx4bby555bbz6c#mmm1d#ara1e#abc';
    
    preg_match_all('/\w+(?=.#|$)|\w(?=#)/', $str, $matches);
    

This expression uses two look-ahead assertions. The results are in $matches[0].
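
As a quick check, the sketch below runs this pattern against the digit-free input from the question and against the first sample string. The helper name splitParts() and the strtoupper() call are assumptions added for illustration (the question's expected output is uppercase); they are not part of the original answer.

```php
<?php
// Hypothetical helper: returns keys and values as one flat list.
function splitParts(string $str): array
{
    // \w+(?=.#|$) : a run of word characters that is followed by
    //               <any-char># or by the end of the string
    // \w(?=#)     : a single word character directly followed by '#'
    preg_match_all('/\w+(?=.#|$)|\w(?=#)/', $str, $matches);

    // Uppercasing is an assumption based on the question's expected output.
    return array_map('strtoupper', $matches[0]);
}

print_r(splitParts('c#mmmd#ara'));          // C, MMM, D, ARA
print_r(splitParts('a#123b#abcc#def456'));  // A, 123, B, ABC, C, DEF456
```

Because the first alternative is greedy, it initially grabs the key letter of the next entry too; the look-ahead then forces it to give that letter back, which is what leaves the key available for the second alternative.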

Update

Another way of looking at it would be this:

preg_match_all('/(\w)#(\w+)(?=\w#|$)/', $str, $matches);

print_r(array_combine($matches[1], $matches[2]));

Each entry starts with a single character, followed by a hash, followed by one or more word characters, up to either the end of the string or the start of the next entry.

The output is this:

Array
(
    [a] => aax1aay222
    [b] => bbx4bby555bbz6
    [c] => mmm1
    [d] => ara1
    [e] => abc
)
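
The same pattern can be checked against the sample strings from the question. This is only a sketch under assumptions: parseAnswers() is a hypothetical name, and the strtoupper() call is added to match the uppercase output the question asks for. Note that array_combine() keeps only the last value if a key letter repeats within one string.

```php
<?php
// Hypothetical helper: maps each key letter to its value.
function parseAnswers(string $str): array
{
    // (\w)#      : one key character followed by '#'
    // (\w+)      : the value
    // (?=\w#|$)  : stop right before the next key, or at end of string
    preg_match_all('/(\w)#(\w+)(?=\w#|$)/', strtoupper($str), $matches);

    return array_combine($matches[1], $matches[2]);
}

print_r(parseAnswers('a#abc1def2efg3b#abcdefc#8'));
// A => ABC1DEF2EFG3, B => ABCDEF, C => 8
print_r(parseAnswers('a#abcdef123b#5c#xyz789'));
// A => ABCDEF123, B => 5, C => XYZ789
```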
Ja͢ck answered Oct 19 '22 02:10


If you still want to use preg_split(), you can remove the + and it should work as expected:

'/([A-Z])#/i'

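That way you match only the hashtag and ONE alpha character before it, not all of them. With the +, a part made up only of letters is consumed as part of the next delimiter, which is why the original pattern seemed to need a digit at the end of each part.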

Example: http://codepad.viper-7.com/z1kFDb

Edit: Added a case-insensitive flag i in the pattern.
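
The difference is easiest to see side by side. The sketch below assumes uppercased input, as in the question's expected output; the variable names are illustrative only.

```php
<?php
$flags = PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE;
$text  = 'C#MMMD#ARA';

// With the +, "MMMD#" is consumed as one delimiter and the group
// keeps only its last repetition, "D", so "MMM" is lost.
$withPlus = preg_split('/([A-Z])+#/i', $text, 0, $flags);

// Without the +, only the single letter before each '#' is consumed.
$withoutPlus = preg_split('/([A-Z])#/i', $text, 0, $flags);

print_r($withPlus);     // C, D, ARA
print_r($withoutPlus);  // C, MMM, D, ARA
```

When each part ends with a digit, the digit interrupts the run of letters, so the + can only ever match the single key letter and both patterns give the same result.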

Marcus answered Oct 19 '22 02:10