
Perl split function - use repeating characters as delimiter

Tags: regex, perl

I want to split a string using repeating letters as delimiter, for example, "123aaaa23a3" should be split as ('123', '23a3') while "123abc4" should be left unchanged.
So I tried this:

@s = split /([[:alpha:]])\1+/, '123aaaa23a3';

But this returns '123', 'a', '23a3', which is not what I wanted. I know this happens because the 'a' captured by the parentheses is kept in the list that split() returns. But I can't just add ?: to make the group non-capturing, since [[:alpha:]] must be captured for the back reference. How can I resolve this?

asked Sep 21 '15 03:09 by AaronS




2 Answers

Hmm, it's an interesting one. My first thought would be that the captured delimiters will always sit at the odd-numbered array indices, so you can just discard any odd-numbered array elements.

Something like this perhaps?:

my %s = (split (/([[:alpha:]])\1+/, '123aaaa23a3'), '' );
print Dumper \%s;

This'll give you:

$VAR1 = {
          '23a3' => '',
          '123' => 'a'
        };

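So you can extract the fields you want via keys %s. Note that hash keys come back in no particular order, and repeated fields collapse into a single key.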
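A minimal sketch of the same discard-the-odd-elements idea that keeps the fields in an array, so their order survives. The helper name split_on_repeats is hypothetical and is used only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: split on runs of a repeated letter and
# drop the captured letter that split() inserts between fields.
sub split_on_repeats {
    my ($str) = @_;

    # split returns: field, captured letter, field, captured letter, ...
    my @parts = split /([[:alpha:]])\1+/, $str;

    # Fields sit at even indices; captured delimiters at odd ones.
    return @parts[ grep { $_ % 2 == 0 } 0 .. $#parts ];
}

print join( ',', split_on_repeats('123aaaa23a3') ), "\n";   # 123,23a3
print join( ',', split_on_repeats('123abc4') ),     "\n";   # 123abc4
```

The index filter relies on there being exactly one capture group in the pattern; with more groups the stride between fields would change.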

Unfortunately my second approach of 'selecting out' the pattern matches via %+ doesn't help particularly (split doesn't populate the regex stuff).

But something like this:

my @delims = '123aaaa23a3' =~ m/(?<delim>[[:alpha:]])\g{delim}+/g;
print Dumper \%+;

By using a named capture, we identify that a is from the capture group. Unfortunately, this doesn't seem to be populated when you do this via split - which might lead to a two-pass approach.

This is the closest I got:

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

my $str = '123aaaa23a3';

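# Collect every letter that starts a run of two or more.
my @letters = $str =~ m/([[:alpha:]])\1+/g;

# Build an alternation of '2-or-more' patterns, e.g. a{2,}|b{2,}.
# With no repeated letters, fall back to (?!), which never matches;
# an empty pattern would split the string into single characters.
my $regex = @letters ? join( "|", map { $_ . "{2,}" } @letters ) : '(?!)';
# Make the regex non-capturing.
$regex = qr/(?:$regex)/;
print "Using: $regex\n";

# Split on the regex.
my @s = split m/$regex/, $str;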

print Dumper \@s;

We first process the string to extract the "2-or-more" character patterns to use as our delimiters. Then we assemble a regex out of them, using a non-capturing group, so we can split without the delimiters showing up in the result.

answered Oct 16 '22 15:10 by Sobrique


One solution would be to use your original split call and throw away every other value. Conveniently, List::Util::pairkeys is a function that keeps the first of every pair of values in its input list:

use List::Util 1.29 qw( pairkeys );

my @vals = pairkeys split /([[:alpha:]])\1+/, '123aaaa23a3';

Gives

Odd number of elements in pairkeys at (eval 6) line 1.
[ '123', '23a3' ]

That warning comes from the fact that pairkeys wants an even-sized list. We can solve that by adding one more value at the end:

my @vals = pairkeys split( /([[:alpha:]])\1+/, '123aaaa23a3' ), undef;

Alternatively, and maybe a little more neatly, add that extra value at the start of the list and use pairvalues instead:

use List::Util 1.29 qw( pairvalues );

my @vals = pairvalues undef, split /([[:alpha:]])\1+/, '123aaaa23a3';
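
As a sketch of how the pairvalues form might be wrapped for reuse: the sub name fields_only is hypothetical, and the -1 limit passed to split is an addition not present in the lines above. It is there on the assumption that the input may end in a delimiter; without it split drops the trailing empty field, the list becomes odd-sized again, and the warning returns.

```perl
use strict;
use warnings;
use List::Util 1.29 qw( pairvalues );

# Hypothetical wrapper around the pairvalues approach.
sub fields_only {
    my ($str) = @_;

    # A negative limit keeps trailing empty fields, so the list stays
    # even-sized after undef is prepended, even for input like '123aaaa'.
    return pairvalues undef, split /([[:alpha:]])\1+/, $str, -1;
}

print join( ',', fields_only('123aaaa23a3') ), "\n";   # 123,23a3
print join( ',', fields_only('123abc4') ),     "\n";   # 123abc4
```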
answered Oct 16 '22 16:10 by LeoNerd