
Splitting a string containing space-delimited "key=value" pairs when the value can also contain spaces

I am working with logs of file transfers between servers. These logs eventually have to be uploaded to a database, so I am preprocessing them to check for errors. Each log entry represents one transfer and has the format:

key1=value1 key2=value2 

for a total of 16 fields. Most of the transfers are fine, except when someone transfers a file whose name contains a space. This breaks my processing because I just call split on a space in my Perl script. Example:

DATE=20130411140806.384553 HOST=somehost PROG=someserver NL.EVNT=FTP_INFO START=20130411140806.384109 USER=someuser FILE=/extended_path/Wallpapers Folder.ico BUFFER=98720 BLOCK=262144 NBYTES=0 VOLUME=/ STREAMS=2 STRIPES=1 DEST=[0.0.0.0] TYPE=STOR CODE=226

This is just one example where there is a space between "Wallpapers" and "Folder.ico". Is there any way to design a regular expression that could account for that and split all those key-value pairs? If there is no regular expression way to do it, could you suggest any other way for me to handle it?

My objective is to replace those spaces with nothing (i.e. remove the space) or with an underscore, so that the script that loads the data into the database can simply split on a single space. I am using Perl for all of this, by the way.

shaun asked Jul 02 '13 13:07


1 Answer

You can find the unwanted spaces by using a negative lookahead that ensures they don't precede a key:

$input =~ s/[ ](?!\S+=)/_/g;

The lookahead makes sure that there is no = before the next space character.

Then you can split on spaces.
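As a rough sketch of the two steps together (substitute, then split), the snippet below runs the substitution on a shortened version of the log line from the question and loads the result into a hash. The variable names `$clean` and `%field` are only illustrative, and the sample line is cut down to five fields for brevity.

```perl
use strict;
use warnings;

# Shortened sample line based on the one in the question.
my $input = 'USER=someuser FILE=/extended_path/Wallpapers Folder.ico BUFFER=98720 VOLUME=/ CODE=226';

# Replace every space that is NOT followed by "key=" with an underscore.
(my $clean = $input) =~ s/[ ](?!\S+=)/_/g;

# A plain split on single spaces now yields one "key=value" token per field.
# The limit of 2 keeps any further "=" inside the value intact.
my %field = map { my ($k, $v) = split /=/, $_, 2; ($k => $v) } split / /, $clean;

print "$_ => $field{$_}\n" for sort keys %field;
```

With this input, `$field{FILE}` ends up as `/extended_path/Wallpapers_Folder.ico`, while the other fields are unchanged.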

Alternatively, to match right away, you can use a similar technique:

while ($input =~ m/(\S+)=((?:\S|[ ](?!\S+=))+)/g)
{
    # $1 is the key
    # $2 is the value
}

For the value we repeat either non-space characters or spaces that do not precede a key.
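To make the matching variant concrete, here is a minimal sketch that fills a hash directly from the same shortened sample line. The hash name `%field` is an assumption for illustration. The space inside the file name is kept as-is in the captured value.

```perl
use strict;
use warnings;

# Shortened sample line based on the one in the question.
my $input = 'USER=someuser FILE=/extended_path/Wallpapers Folder.ico BUFFER=98720 VOLUME=/ CODE=226';

my %field;
while ($input =~ m/(\S+)=((?:\S|[ ](?!\S+=))+)/g) {
    # $1 is the key, $2 is the value (possibly containing spaces).
    $field{$1} = $2;
}

print "$_ => $field{$_}\n" for sort keys %field;
```

If the database loader still needs space-free values, something like `tr/ /_/` can be applied to each value afterwards.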


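If your keys are always upper case, you can replace all \S+ in my code with [A-Z]+. Note that the example line has a key with a dot in it (NL.EVNT), so for that data the class would need to be [A-Z.]+.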
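A small sketch of why narrowing the key pattern can help: the file name `report v=2.ico` below is a made-up example, not from the question, chosen because it contains both a space and an `=`. The class `[A-Z.]+` is an assumption that keys consist only of upper-case letters and dots, which covers `NL.EVNT` from the sample log line.

```perl
use strict;
use warnings;

# Assumption: keys contain only upper-case letters and dots (e.g. NL.EVNT).
my $key_re = qr/[A-Z.]+/;

# Hypothetical line whose file name contains a space followed by "v=".
my $input = 'NL.EVNT=FTP_INFO FILE=/extended_path/report v=2.ico TYPE=STOR';

# With \S+ the space before "v=2.ico" would look like a field boundary;
# with the narrower key class it is recognised as part of the value.
(my $clean = $input) =~ s/[ ](?!${key_re}=)/_/g;

my @pairs = split / /, $clean;
print "$_\n" for @pairs;
```

Here the line splits into three tokens, and the middle one is `FILE=/extended_path/report_v=2.ico`.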

Martin Ender answered Nov 03 '22 01:11