How can I change my regular expression to read UTF-8?

Tags: regex, utf-8, perl

I got very far in a script I am working on only to find out it has a problem reading UTF-8 characters.

A contact in Sweden created a VM on his machine with some UTF-8 characters in its name. When my script hit that VM it lost its mind, although it was able to read all of the other VMs, which are in the "normal" charset.

Anyhow, maybe my code will make more sense.

#!/usr/bin/perl
use strict;
use warnings;
#use utf8;
use Net::OpenSSH;

# Create a hash for storing the options needed by Net::OpenSSH
my %ssh_options = (
    port => '22',
    user => 'root',
    password => 'password'
);

# Create a new Net::OpenSSH object
my $ssh = Net::OpenSSH->new('192.168.2.101', %ssh_options);

# Create an array and capture the ESX\ESXi output from the current server
my @getallvms = $ssh->capture('vim-cmd vmsvc/getallvms');
shift @getallvms;
# Process data gathered from server
foreach my $vm (@getallvms) {
    # Match ID, NAME
    $vm =~  m/^(?<id> \d+)\s+(?<name> .+?)\s+/xm;
    my $id = "$+{id}";
    my $name = "$+{name}";
    print "$id\n";
    print "$name\n";
    print "\n";
}

I have narrowed the problem down to my regular expression. Here is the raw output from the server before the regular expression is applied:

416
TEST Box åäö!"''*#

And this is what I get after I apply my regular expression:

416
TEST

For some reason the regular expression is not matching the whole name, and I don't know why. The regular expression in the example is my third attempt at getting it to work.

My regular expression is written this way because I only need the first two blocks of information, the ID and the name; the expression you have wants to copy the entire line. The FULL line that I am matching looks like this:

432    TEST Box åäö!"''*#   [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ _''_+Iw.vmx   slesGuest    vmx-04
ianc1215 asked Dec 03 '22 09:12


2 Answers

The subpattern

(?<name> .+?)\s+

in your regular expression means “match and remember one or more non-newline characters, but stop as soon as you find whitespace,” so $name contains TEST because the pattern stopped matching when it saw the space just before Box.
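The behavior described above can be reproduced in isolation. This is a minimal sketch, assuming an ASCII-only sample line modeled on the output in the question; the helper name lazy_name is made up for illustration.

```perl
use strict;
use warnings;

# Sample line modeled on the question's output (ASCII-only for simplicity).
my $line = '416    TEST Box abc   [Store] TEST Box/TEST Box.vmx   slesGuest   vmx-04';

# Hypothetical helper: applies the pattern from the question and returns
# the "name" capture, or undef when the pattern does not match.
sub lazy_name {
    my ($text) = @_;
    return $text =~ m/^(?<id> \d+)\s+(?<name> .+?)\s+/x ? $+{name} : undef;
}

my $name = lazy_name($line);
print "$name\n";   # prints "TEST": .+? gives up at the first whitespace
```

Nothing after the trailing \s+ forces the lazy .+? to keep going, so it stops at the earliest point where the whole pattern can succeed.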

The VI Toolkit wiki gives an example of the getallvms subcommand's output:

# vmware-vim-cmd -H 10.10.10.10 -U root -P password /vmsvc/getallvms
Vmid    Name               File                 Guest OS       Version   Annotation
64     bartPE    [store] BartPE/BartPE.vmx     winXPProGuest     vmx-04
96     trustix   [store] Trustix/Trustix.vmx   otherLinuxGuest   vmx-04

The case of [store] differs slightly from the [Store] in your question, but it appears that we can look for it as a bumper for the match, using the /i modifier to ignore case:

/^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix

The non-greedy quantifier +? means match one or more of something, but hand control to the rest of the pattern as quickly as possible. Remember that [ has a special meaning in regular expressions, so the pattern uses \[ to match a literal left bracket rather than introduce a character class.

I think of this technique as bookending or tacking-and-stretching. If you want to extract a chunk of text that's difficult to characterize, look for surrounding features that are easy to match—often as simple as ^ or $. Then use a stretchy pattern to grab everything in between, usually (.+) or (.+?). Read the “Quantifiers” section of the perlre documentation for an explanation of your many options.
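As a rough illustration of bookending, the sketch below anchors on ^ at one end and a literal [store] at the other, and lets a lazy .+? stretch between them. The helper name id_and_name and the second sample line are assumptions made for the example, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (id, name), or an empty list on no match.
sub id_and_name {
    my ($line) = @_;
    if ($line =~ /^ (?<id> \d+) \s+ (?<name> .+?) \s+ \[store\] /xi) {
        return ($+{id}, $+{name});
    }
    return;
}

# A line taken from the getallvms sample output.
my ($id, $name) = id_and_name('64     bartPE    [store] BartPE/BartPE.vmx     winXPProGuest     vmx-04');
print "$id: $name\n";   # prints "64: bartPE"

# A made-up line whose name contains spaces; /i lets [Store] match too.
($id, $name) = id_and_name('416 TEST Box one two   [Store] TEST Box/TEST Box.vmx   slesGuest');
print "$id: $name\n";   # prints "416: TEST Box one two"
```

Because the right-hand bookend is required, the lazy quantifier can no longer stop at the first space inside the name.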

This fixes the immediate problem, and you can also add polish in a few areas.

Do not use $1, $2, and friends unconditionally! Always test that the pattern matches before using capture variables. For example:

if (/(foo|bar|baz)/) {
  print "got $1\n";
}
else {
  print "no match\n";
}

An unprotected print $1 can produce surprising results that are tough to debug.
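One way such a surprise can arise: a failed match leaves the capture variables untouched, so they still hold the values from the last successful match. The following is a small sketch of that effect; the helper name keyword is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured word, or undef when the
# pattern fails, so the caller can tell the two cases apart.
sub keyword {
    my ($text) = @_;
    return $text =~ /(foo|bar|baz)/ ? $1 : undef;
}

'foo'  =~ /(foo|bar|baz)/;   # succeeds and sets $1 to "foo"
'quux' =~ /(foo|bar|baz)/;   # fails, so $1 is left as it was
my $stale = $1;              # still "foo", although "quux" never matched
print "stale capture: $stale\n";

my $checked = keyword('quux');
print defined $checked ? "got $checked\n" : "no match\n";
```

The unchecked version silently reports a value from an earlier line, while the checked version makes the failure visible.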

Judicious use of Perl's defaults can help emphasize the computation and let the mechanism fade into the background. Dropping $vm in favor of $_ as the implicit loop variable and implicit match target makes for a nicer result.
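A short sketch of the implicit style, using made-up input lines rather than real vim-cmd output:

```perl
use strict;
use warnings;

# Made-up sample lines; the middle one deliberately does not match.
my @lines = ("64 bartPE\n", "not a vm line\n", "96 trustix\n");

my @ids;
for (@lines) {               # each element is aliased to $_
    next unless /^(\d+)/;    # the match targets $_ implicitly
    push @ids, $1;           # only reached after a successful match
}

print "@ids\n";              # prints "64 96"
```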

Your comments merely translate from Perl to English. The most helpful comments explain the why, not the what. Also keep in mind Rob Pike's advice on commenting:

If your code needs a comment to be understood, it would be better to rewrite it so it's easier to understand.

In the assignments from %+, the quotes don't do anything useful. The values are already strings, so remove the quotes.

my $id   = $+{id};
my $name = $+{name};

Below is a modified version of your code that captures everything after the number but before [store] into $name. The utf8 pragma declares that your source code contains UTF-8; it says nothing about your input, which is a common misconception. The test below uses a canned echo to simulate the output from vim-cmd on the Swedish VM.

As Tom suggested, I use the Encode module to decode the output that arrives through the SSH connection and encode it for the benefit of the local host before printing it out.

The perlunifaq documentation advises decoding external data into Perl's internal format and then encoding any output just before it's written. I assume that the value returned from $ssh->capture(...) uses UTF-8 encoding, that is, that the remote host is sending UTF-8. We see the expected result because I'm running a modern distribution of Linux and ssh-ing back to it, but in the wild, you may be dealing with some other encoding.

You're able to get away with skipping the calls to decode and encode because Perl's internal format happens to match those of the hosts you're using. In general, however, cutting corners can get you into trouble:

  • What if I don't decode?
  • What if I don't encode?
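To make the decode and encode steps concrete, here is a minimal sketch that assumes the input really is UTF-8. The octets are written as escapes so that the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 octets for "åäö".
my $bytes = "\xC3\xA5\xC3\xA4\xC3\xB6";

# Without decoding, Perl sees six octets rather than three letters.
my $byte_length = length $bytes;

# After decoding, each letter is a single character.
my $chars       = decode('UTF-8', $bytes);
my $char_length = length $chars;

# Encoding just before output turns the characters back into octets.
my $round_trip = encode('UTF-8', $chars);

print "$byte_length octets, $char_length characters\n";   # prints "6 octets, 3 characters"
print "round trip ", ($round_trip eq $bytes ? "ok" : "differs"), "\n";
```

Length is only the most visible symptom: anything that counts or classifies characters, including regular expressions, works on octets until the data has been decoded.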

Finally, the code!

#! /usr/bin/env perl

use strict;
use utf8;
use warnings;

use Encode;
use Net::OpenSSH;

my %ssh_options = ();
my $ssh = Net::OpenSSH->new('localhost', %ssh_options);

# Create an array and capture the ESX\ESXi output from the current server
#my @getallvms = $ssh->capture('vim-cmd vmsvc/getallvms');
my @getallvms = $ssh->capture(<<EOEcho);
echo -e 'JUNK\n416 TEST Box åäö!"'\\'\\''*#    [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ _''_+Iw.vmx   slesGuest    vmx-04'
EOEcho
shift @getallvms;

for (@getallvms) {
  $_ = decode "utf8", $_, Encode::FB_CROAK;

  if (/^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix) {
    my $id   = $+{id};
    my $name = $+{name};
    print encode("utf8", $id),   "\n",
          encode("utf8", $name), "\n",
          "\n";
  }
  else {
    print "no match\n";
  }
}

Output:

416
TEST Box åäö!"''*#

Greg Bacon answered Jan 23 '23 01:01


If you know the string you work on is UTF-8 and Net::OpenSSH doesn't (and hence doesn't mark it as such), you can convert it to an internal representation Perl can work on with one of:

use Encode;
utf8::decode( $in_place );   # core function; decodes its argument in place
$decoded = decode_utf8( $raw );
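A small sketch of the difference decoding makes to a regular expression. It assumes the captured output is valid UTF-8, and it uses the core utf8::decode function for the in-place form, which is one way to do it rather than something Net::OpenSSH requires.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

my $raw = "\xC3\xA5\xC3\xA4\xC3\xB6";   # UTF-8 octets for "åäö"

# decode_utf8 returns a decoded copy of its argument.
my $decoded = decode_utf8($raw);

# utf8::decode converts its argument in place and returns false
# when the octets are not valid UTF-8.
my $in_place = $raw;
utf8::decode($in_place) or die "input was not valid UTF-8";

# A pattern that expects exactly three characters.
my $three = qr/\A.{3}\z/s;

printf "raw: %s, decoded: %s, in place: %s\n",
    map { $_ =~ $three ? 'match' : 'no match' } $raw, $decoded, $in_place;
# prints "raw: no match, decoded: match, in place: match"
```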
JB. answered Jan 23 '23 02:01