
Can raku avoid this Malformed UTF-8 error?

Tags: unicode, raku

When I run this raku script...

my $proc = run( 'tree', '--du', :out);
$proc.out.slurp(:close).say;

I get this error on MacOS...

Malformed UTF-8 near bytes ef b9 5c

... instead of something like this tree output from zsh which is what I want...

.
├── 00158825_20210222_0844.csv
├── 1970-Article\ Text-1971-1-2-20210118.docx
├── 1976-Article\ Text-1985-1-2-20210127.docx
├── 2042-Article\ Text-2074-1-10-20210208.pdf
├── 2045-Article\ Text-2076-1-10-20210208.pdf
├── 6.\ Guarantor\ Form\ (A).pdf

I have tried slurp(:close, enc=>'utf8-c8') and the error is the same.

I have also tried...

 shell( "tree --du >> .temp.txt" );
 my @lines = open(".temp.txt").lines;
 dd @lines;

... and the error is the same.

Opening .temp.txt reveals this...

.
â<94><9c>â<94><80>â<94><80> [    1016739]  True  
â<94><9c>â<94><80>â<94><80> [ 9459042241]  dir-name
â<94><82>   â<94><9c>â<94><80>â<94><80> [     188142]  Business
â<94><82>   â<94><82>   â<94><9c>â<94><80>â<94><80> [       9117]  KeyDates.xlsx
â<94><82>   â<94><82>   â<94><9c>â<94><80>â<94><80> [      13807]  MondayNotes.docx

file -I gives this...

.temp.txt: text/plain; charset=unknown-8bit

Any advice?

[this is Catalina 10.15.17, Terminal encoding Unicode(UTF-8) Welcome to 𝐑𝐚𝐤𝐮𝐝𝐨™ v2020.10. Implementing the 𝐑𝐚𝐤𝐮™ programming language v6.d. Built on MoarVM version 2020.10.]

asked Mar 13 '21 18:03 by p6steve



2 Answers

It seems like you have a codepage/locale that is not UTF-8. (Or tree is ignoring the codepage and using something different.)

A quick … get something, anything out of it; is to use an 8-bit single-byte encoding.

run( 'tree', '--du', :out, :enc<latin1> );

That is generally enough to see where UTF-8 decoding starts to go wrong.


That said, let's look at your expected output, and the file output.

say '├──'.encode; # utf8:0x<E2 94 9C E2 94 80 E2 94 80>

In your file you have

â<94><9c>â<94><80>â<94><80> [    1016739]  True

Wait …

say 'â'.encode('latin1'); # Blob[uint8]:0x<E2>
<E2><94><9c><E2><94><80><E2><94><80>

       <E2 94 9c E2 94 80 E2 94 80>

utf8:0x<E2 94 9C E2 94 80 E2 94 80>

Yeah, those look an awful lot alike.
In that they are exactly the same.

So it does appear to be producing the expected output to some extent.

This seems to confirm that there is an encoding problem somewhere between tree and your code, which suggests that the codepage/locale is set incorrectly.
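
The byte comparison above can also be reproduced as a round trip. This is only a minimal sketch with illustrative variable names: it encodes the expected prefix as UTF-8, then decodes the same bytes as latin1, which gives the kind of `â`-prefixed text seen in .temp.txt.

```raku
# Encode the expected tree prefix as UTF-8 bytes.
my $bytes = '├──'.encode('utf8');
say $bytes;                       # utf8:0x<E2 94 9C E2 94 80 E2 94 80>

# Decoding those bytes as latin1 maps every byte to one character,
# so each three-byte sequence becomes 'â' plus two control characters.
my $mojibake = $bytes.decode('latin1');
say $mojibake.chars;              # 9
say $mojibake.ords.map(*.fmt('%02X')).join(' ');   # E2 94 9C E2 94 80 E2 94 80

# Decoding the same bytes as UTF-8 recovers the original text.
say $bytes.decode('utf8') eq '├──';   # True
```

If the viewer used to open the file shows unprintable characters as `<94>`, `<9c>` and `<80>`, this would account for the text in the question.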


You haven't really provided enough information to figure out exactly what's going wrong where. You should have used run in binary mode to give us the exact output.

say run('echo', 'hello', :out, :bin).out.slurp;
# Buf[uint8]:0x<68 65 6C 6C 6F 0A>
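
Building on the binary-mode idea, here is a hedged sketch of how the raw bytes could be inspected before any decoding is attempted. `describe-bytes` is a hypothetical helper written for this illustration, not something Raku provides; it tries strict UTF-8 and falls back to a hex dump of the leading bytes.

```raku
# Hypothetical helper: try strict UTF-8 first, and if decoding fails
# report the first few bytes in hex instead of dying.
sub describe-bytes(Blob $raw, Int :$limit = 32 --> Str) {
    my $text = try $raw.decode('utf8');
    return "valid UTF-8: $text" with $text;
    my $hex = $raw.list.head($limit).map(*.fmt('%02x')).join(' ');
    "not valid UTF-8, first bytes: $hex";
}

say describe-bytes(Blob.new(0x68, 0x65, 0x6C, 0x6C, 0x6F));
# valid UTF-8: hello
say describe-bytes(Blob.new(0xEF, 0xB9, 0x5C));
# not valid UTF-8, first bytes: ef b9 5c

# With the command from the question (assumes tree is installed):
# say describe-bytes(run('tree', '--du', :out, :bin).out.slurp(:close));
```

The second call uses the three bytes from the error message, so it shows the same failure without needing the original directory.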

You also didn't say if <9c> is literally in the file as four text characters, or if it is a feature of whatever you used to open the file turning binary data into text.

It also would be nice if all of the example data was of the same thing.


On a slightly related note…

Since tree gives filenames, and filenames are not Unicode, using utf8-c8 is appropriate here.
(Same generally goes for usernames and passwords.)
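
As a rough sketch of what that could look like: the encoding is passed to run with `:enc`, in the same way as the latin1 example earlier. The round-trip check below uses the three bytes from the error message and assumes utf8-c8 behaves as documented, that is, it decodes without dying and reproduces the original bytes when encoded again.

```raku
# The bytes from the error message; strict UTF-8 decoding rejects them.
my $raw = Blob.new(0xEF, 0xB9, 0x5C);

# utf8-c8 keeps enough information in the string to get the
# original bytes back when encoding with utf8-c8 again.
my $str = $raw.decode('utf8-c8');
say $str.encode('utf8-c8').list eqv $raw.list;

# Applied to the command from the question (assumes tree is installed):
# my $proc = run 'tree', '--du', :out, :enc<utf8-c8>;
# print $proc.out.slurp(:close);
```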

Here's some code that I ran on my computer to hopefully show why.

say dir(:test(/^ r.+sum.+ $/)).map: *.relative.encode('utf8-c8').decode
# (résumé résumé résumé résumé)

dir(:test(/^ r.+sum.+ $/)).map: *.relative.encode('utf8-c8').say
# Blob[uint8]:0x<72 65 CC 81 73 75 6D 65 CC 81>
# Blob[uint8]:0x<72 C3 A9 73 75 6D 65 CC 81>
# Blob[uint8]:0x<72 C3 A9 73 75 6D C3 A9>
# Blob[uint8]:0x<72 65 CC 81 73 75 6D C3 A9>

say 'é'.NFC;
# NFC:0x<00e9>
say 'é'.NFD
# NFD:0x<0065 0301>

sub to-Utf8 ( Uni:D $_ ){
   .map: *.chr.encode
}

say to-Utf8 'é'.NFC
# (utf8:0x<C3 A9>)
say to-Utf8 'é'.NFD
# (utf8:0x<65> utf8:0x<CC 81>)

So é is either encoded as one composed codepoint <C3 A9> or two decomposed codepoints <65> <CC 81>.
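
A small sketch of why plain utf8 cannot tell such names apart: Raku strings are normalized when decoded, so both byte sequences end up as the same string, and encoding that string again gives only the composed form. The variable names are illustrative.

```raku
my $decomposed = Blob.new(0x72, 0x65, 0xCC, 0x81);   # "r", "e", combining acute
my $composed   = Blob.new(0x72, 0xC3, 0xA9);         # "r", precomposed "é"

# Different bytes on disk, but equal once decoded as ordinary UTF-8.
say $decomposed.decode('utf8') eq $composed.decode('utf8');   # True

# Re-encoding the decomposed input yields the composed bytes,
# so the original file name can no longer be reconstructed.
say $decomposed.decode('utf8').encode('utf8');   # utf8:0x<72 C3 A9>
```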

Did I really create 4 files with the “same name” just for this purpose?
Yes. Yes I did.

answered Oct 11 '22 06:10 by Brad Gilbert


Update I had deleted this nanswer because Brad's excellent answer and Valle Lukas's spot-on comment seemed to render it moot. Then @p6steve confirmed that both Brad's answer and Valle Lukas's solutions worked for them, so all the more reason to keep it deleted. But too late! A mistake in my nanswer had misled @p6steve, who made a similar mistake in a follow-up SO question. Mea culpa. To atone for my sins, I'm now permanently undeleting it and leaving my shameful past for all to see.


This is a nanswer. I don't know Mac, but do love investigation, and what I've got to say won't fit in the comments.


Update The 'find .' in the following should be 'find', '.'. See run doc.

What do you get with this?:

say .out.lines given run 'find .', :out

If find . works, the problem is presumably tree.

If find . doesn't work, then try something really simple, that's built into MacOS, something that really should work. If it doesn't work, then the problem isn't tree but something more basic.


Malformed UTF-8 near bytes ef b9 5c

That means Raku was expecting UTF-8 but the input wasn't UTF-8.

Translating the message from computerese into English:

The supposedly English string "[Linux] xshell远程登陆CentOS时中文乱码解决_Cindy的博客 ..." is Malformed near 远程登.

In other words, the tree command is not generating UTF-8.

(Therefore using utf8-c8 will almost certainly be useless in the first instance. Its purpose is to cheat. It's for when text is almost all UTF-8 except for a handful of rogue bytes and you can't be bothered to sort out the input, or when you have absolutely no choice but to accept the input as it is and still want to muddle through. But in this case you surely ought either to sort the problem out by getting to the bottom of things, or to find some alternative to tree.)


Terminal encoding Unicode(UTF-8)

A google for "Terminal encoding Unicode(UTF-8)" yields just 7 matches. None appeared to be exact matches for "Terminal encoding Unicode(UTF-8)". All but one look to me like ... ef b9 5c looks to Rakudo. :)

If you copy/pasted that string, where did you copy it from?

If you yourself wrote that string, why were you so sure MacOS really was encoding tree's output as UTF-8 when run via the kernel (not a shell) that you wrote that it was?


run doesn't use a shell.


The current doc claims shell uses /bin/sh -c on MacOS.

What's the output of this?:

readlink -e $(which sh)

Is the output zsh?

If so sh -c should be using it.

If not, that may be the problem.


When one uses shell, one has to ensure the passed string is appropriately quoted and escaped. What do you get when you try these?:

say .out.lines given shell "'find .'", :out;
say .out.lines given shell "'tree --du'", :out;
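
One caveat on quoting, sketched here with echo so that it runs without find or tree: run takes the program and each argument as separate strings, while shell takes one command line that the shell splits into words. Wrapping the entire command in single quotes inside the shell string, as in the two lines above, would make sh look for a single command literally named `find .` or `tree --du`.

```raku
# run: program and arguments are separate strings; no shell is involved.
my $via-run = run('echo', 'hello world', :out).out.slurp(:close);

# shell: one command line, which the shell itself splits into words.
my $via-shell = shell('echo hello world', :out).out.slurp(:close);

say $via-run eq $via-shell;   # True on a typical Unix-like system

# For the commands discussed here, the equivalent forms would be:
#   run 'find', '.', :out;       run 'tree', '--du', :out;
#   shell 'find .', :out;        shell 'tree --du', :out;
```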

What exactly is tree invoking? Is it a shell alias in zsh? If it's a binary, where did you install it from and how did you configure it, especially in terms of influencing zsh's handling of encodings?

answered Oct 11 '22 06:10 by raiph