
Raku: Is there a fast method to find and remove/replace non-ASCII or malformed utf8 characters?

Tags: regex, char, raku

I frequently encounter malformed UTF-8 characters that break my code. I have read some (not all) of the related questions and answers on Stack Overflow, but found nothing specific to Raku/Perl 6. Is there a fast way to remove these pesky characters from strings? The predefined character classes in "https://docs.raku.org/language/regexes#Predefined_character_classes" just won't do it:

Example: from REPL:

> say "â " ~~ /\w/ # you have to have a space following the "a" with "^" for it to work
「â」
> say "�" ~~ /\w/ # without the space, the character doesn't look normal
Malformed UTF-8 at line 1 col 6

> say "â ".chars # looks like 2 chars, but it says 1 char
1
> say "â ".comb.[0] # strange, the pesky char makes the space precede the cursor as I type
â
> say "â".comb.[0 ] # strange, the pesky char makes the space precede the cursor as I type
â
> say "â".comb.[0]  # there is a space following ']' or it won't work
â
> say "â".comb.[0 ] # very strange, must have space before ']'
â
> say "â".comb
(â)
> say "â".comb.[0] .ord # # same here, very strange, it makes space precede the cursor
226
> my $a = Buf.new(226)
Buf:0x<E2>
> say $a.decode
Malformed termination of UTF-8 string
  in block <unit> at <unknown file> line 1

> say $a.decode('utf8-c8')
􏿽xE2
> for @$a { say $_.chr; }
â
> say (@$a).elems
1
> say "â " ~~ / <alpha> / # again, must have space in the quote
「â」
 alpha => 「â」
> say "â " ~~ / <cntrl> /
Nil

This is very troublesome. How can I remove these non-utf8 chars? Is there a predefined character class for all good utf-8 chars or for good ASCII chars that are model citizens?

asked Feb 23 '20 18:02 by lisprogtor



1 Answer

Hopefully someone will have a better answer. In the meantime...


There are several very different things going on in your question.

Is there a fast method to find and remove/replace non-ASCII or malformed utf8 characters?

There is supposed to be a nice, obvious, fairly simple one:

say .decode: replacement => '�'
given $buf-that's-supposed-to-be-utf8

This should decode the same way a plain slurp does, except that, instead of just giving up on the decode when it encounters "Malformed UTF-8", it should just replace malformed data with the replacement character you've specified and continue as best it can.

Unfortunately (as far as I know) this doesn't work due to bugs in rakudo/moarvm, as outlined in my answer to "decode with replacement does not seem to work".

I did not file an issue at the time I wrote that SO answer. Your new SO question has prompted me to file two bug reports:

  • .decode's replacement option didn't work in Rakudo v2019.03.01 and presumably still doesn't #3509

  • decoder replacement options didn't work in Rakudo v2019.03.01 and presumably still don't #1245


Some other options are given in the answers to "error message: Malformed UTF-8".

I see in your repl examples you've tried .decode('utf8-c8'). This may be your best bet within raku as it stands.


If none of the above is helpful, I think you're stuck for now with using an external tool to preprocess files before they get to raku.
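As one possible shape for such an external preprocessing step, here is a minimal sketch in Python (not Raku). It relies on Python's standard `bytes.decode(..., errors="replace")`, which substitutes U+FFFD for each malformed sequence instead of raising. The names `sanitize_utf8` and `sanitize_file` are hypothetical helpers made up for this illustration.

```python
from pathlib import Path


def sanitize_utf8(data: bytes, replacement: str = "\ufffd") -> str:
    """Decode bytes as UTF-8, substituting `replacement` for malformed input."""
    # errors="replace" makes the decoder emit U+FFFD rather than raise.
    text = data.decode("utf-8", errors="replace")
    # Swap U+FFFD for the caller's choice; "" removes it entirely.
    # Caveat: a U+FFFD that was already validly encoded in the input
    # is swapped too, because it is indistinguishable after decoding.
    if replacement != "\ufffd":
        text = text.replace("\ufffd", replacement)
    return text


def sanitize_file(src: str, dst: str, replacement: str = "\ufffd") -> None:
    """Read `src` as raw bytes and write a well-formed UTF-8 copy to `dst`."""
    raw = Path(src).read_bytes()
    cleaned = sanitize_utf8(raw, replacement)
    # Write bytes directly so newlines are not translated.
    Path(dst).write_bytes(cleaned.encode("utf-8"))
```

Running something like `sanitize_file("dirty.txt", "clean.txt")` (hypothetical file names) before the data reaches raku would mean a plain slurp no longer hits "Malformed UTF-8", at the cost of losing whatever the bad bytes were meant to say.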

Is there a predefined character class for all good utf-8 chars

utf8 data is not characters. It's just bytes. The data encodes characters, or at least it's supposed to, but it's very important to keep encodings and characters separate in your mind.

If you know how old-fashioned telegrams work, it's like that. There's a message in characters. And then Morse code for transmitting it. They're very different things.

When you see "Malformed UTF-8" or similar, it means the decoder is choking on some part of the data (the bytes). They make no sense to it as characters. It's like Morse code that doesn't follow the rules for Morse code.
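To make the character/encoding split concrete, here is a small Python sketch (Python rather than Raku, purely for illustration). It uses the character â and the byte 0xE2 (decimal 226) from the question's `Buf.new(226)` example. In UTF-8, â is the two bytes C3 A2. A lone E2 is the lead byte of a three-byte sequence, so on its own it is truncated data. `is_valid_utf8` is a hypothetical helper name.

```python
A_CIRCUMFLEX = "\u00e2"  # the character â, code point 226

# The same character under two different encodings.
utf8_bytes = A_CIRCUMFLEX.encode("utf-8")      # two bytes: C3 A2
latin1_bytes = A_CIRCUMFLEX.encode("latin-1")  # one byte:  E2


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The byte E2 is a fine Latin-1 encoding of â, but malformed as UTF-8.
print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))
print(latin1_bytes.hex(), is_valid_utf8(latin1_bytes))
```

The character is the same in both cases. Only the bytes differ, and only one of the two byte strings follows the UTF-8 rules.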

Such data is considered to be confusing crap at best and dangerous crap at worst. The Unicode standard requires that it is entirely eliminated before you can do anything with it.

The obvious friendly solution is to replace crap with a user specified replacement character as you asked. In contrast, a regex character class is both the wrong tool and too late.
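The ordering matters: the bytes have to be decoded, with malformed parts replaced, before any character-level matching can happen. A hedged Python sketch of that two-stage order (again an external-tool illustration, not Raku; `to_ascii` is a hypothetical name):

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def to_ascii(data: bytes, replacement: str = "") -> str:
    """Decode leniently, then replace every non-ASCII character."""
    # Stage 1 (bytes): malformed sequences become U+FFFD, never an exception.
    text = data.decode("utf-8", errors="replace")
    # Stage 2 (characters): only now is a character class meaningful.
    # U+FFFD is itself non-ASCII, so it is replaced along with the rest.
    return NON_ASCII.sub(replacement, text)
```

For example, `to_ascii(b"a\xe2 b")` yields `"a b"`: the stray byte is handled by the decoder in stage 1 and the resulting replacement character is stripped in stage 2.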

Example: from REPL

This is another whole ball of wax.

There's:

  • The encoding used by your (terminal on your) local system;

  • The characters you see rendered, and the indication of the cursor, when you use your local system;

  • What's in your cut/paste buffer when you copy from your repl display;

  • What your browser does with that buffer when you paste into the edit window for an SO question;

  • What SO's servers do with the contents of the edit window when you click the Post your question button and when SO renders your question;

  • What my local system, browser, terminal, cut/paste buffer, etc. are doing when I look at your SO question;

  • Etc.

This complexity exists even if both our systems and both you and I are doing what we're supposed to be doing. So, sure, something is amiss with the cursor and other issues, but I'm not going to try to nail that down in this answer because, unlike the first part of your question, which I answered above, it doesn't really have to do with Raku or Rakudo.

answered Oct 18 '22 13:10 by raiph