Can raku avoid this Malformed UTF-8 error?

Tags:

unicode

raku

When I run this raku script...

my $proc = run( 'tree', '--du', :out);
$proc.out.slurp(:close).say;

I get this error on MacOS...

Malformed UTF-8 near bytes ef b9 5c

... instead of something like this tree output from zsh which is what I want...

.
├── 00158825_20210222_0844.csv
├── 1970-Article\ Text-1971-1-2-20210118.docx
├── 1976-Article\ Text-1985-1-2-20210127.docx
├── 2042-Article\ Text-2074-1-10-20210208.pdf
├── 2045-Article\ Text-2076-1-10-20210208.pdf
├── 6.\ Guarantor\ Form\ (A).pdf

I have tried slurp(:close, enc=>'utf8-c8') and the error is the same.

I have also tried...

 shell( "tree --du >> .temp.txt" );
 my @lines = open(".temp.txt").lines;
 dd @lines;

... and the error is the same.

Opening .temp.txt reveals this...

.
â<94><9c>â<94><80>â<94><80> [    1016739]  True  
â<94><9c>â<94><80>â<94><80> [ 9459042241]  dir-name
â<94><82>Â Â  â<94><9c>â<94><80>â<94><80> [     188142]  Business
â<94><82>Â Â  â<94><82>Â Â  â<94><9c>â<94><80>â<94><80> [       9117]  KeyDates.xlsx
â<94><82>Â Â  â<94><82>Â Â  â<94><9c>â<94><80>â<94><80> [      13807]  MondayNotes.docx

file -I gives this...

.temp.txt: text/plain; charset=unknown-8bit

Any advice?

[this is Catalina 10.15.17, Terminal encoding Unicode(UTF-8) Welcome to 𝐑𝐚𝐤𝐮𝐝𝐨™ v2020.10. Implementing the 𝐑𝐚𝐤𝐮™ programming language v6.d. Built on MoarVM version 2020.10.]

asked Mar 13 '21 18:03

p6steve

2 Answers

It seems like you have a codepage/locale that is not Utf8. (Or tree is ignoring the codepage and using something different.)

A quick … get something, anything out of it; is to use an 8-bit single-byte encoding.

run( 'tree', '--du', :out, :enc<latin1> );

It generally is enough to see where decoding starts to go wrong with Utf8.
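As a rough sketch of that fallback idea: try strict UTF-8 first and only drop to latin1 when decoding fails. The helper name `decode-leniently` is made up for illustration and is not part of Raku or of the original answer.

```raku
# Hypothetical helper: decode as strict UTF-8 if possible, otherwise fall
# back to latin1, which maps every byte to a character and so never fails.
sub decode-leniently(Blob:D $bytes --> Str:D) {
    (try $bytes.decode('utf8')) // $bytes.decode('latin1')
}

my $good = Blob.new(0xE2, 0x94, 0x9C);   # valid UTF-8 for '├'
my $bad  = Blob.new(0xEF, 0xB9, 0x5C);   # the bytes from the error message

say decode-leniently($good);             # decoded as UTF-8
say decode-leniently($bad).chars;        # latin1 gives one character per byte
```

To use it on real output you would capture bytes rather than text, for example with `run('tree', '--du', :out, :bin).out.slurp(:close)`, and pass the result to the helper.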

That said, let's look at your expected output, and the file output.

say '├──'.encode; # utf8:0x<E2 94 9C E2 94 80 E2 94 80>

In your file you have

â<94><9c>â<94><80>â<94><80> [    1016739]  True

Wait …

say 'â'.encode('latin1'); # Blob[uint8]:0x<E2>

<E2><94><9c><E2><94><80><E2><94><80>

       <E2 94 9c E2 94 80 E2 94 80>

utf8:0x<E2 94 9C E2 94 80 E2 94 80>

Yeah, those look an awful lot alike.
In that they are exactly the same.

So it does appear to be producing the expected output to some extent.

That seems to confirm that, yes, there is an encoding problem somewhere between tree and your code: the bytes are correct UTF-8, but something is reading them as a single-byte encoding. That suggests the codepage/locale is set wrong.
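The byte comparison above can be reproduced in a few lines. This is only a sketch of the mechanism (UTF-8 bytes read as latin1), not a claim about what is actually happening on the asker's machine.

```raku
my $utf8-bytes = '├──'.encode;                   # the bytes tree should emit
my $mojibake   = $utf8-bytes.decode('latin1');   # what a latin1 reader sees
say $mojibake.chars;                             # 9: one character per byte

# Reversing the wrong step recovers the original text.
my $repaired = $mojibake.encode('latin1').decode('utf8');
say $repaired eq '├──';                          # True
```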

You haven't really provided enough information to figure out exactly what's going wrong where. You should have used run in binary mode to give us the exact output.

say run('echo', 'hello', :out, :bin).out.slurp;
# Buf[uint8]:0x<68 65 6C 6C 6F 0A>

You also didn't say if <9c> is literally in the file as four text characters, or if it is a feature of whatever you used to open the file turning binary data into text.

It also would be nice if all of the example data was of the same thing.
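For reporting the exact bytes, something like the following might do. `hex-dump` is a hypothetical helper written for this example, and the default limit of 32 bytes is an arbitrary choice.

```raku
# Hypothetical helper: render the first $limit bytes as space-separated hex.
sub hex-dump(Blob:D $bytes, Int:D $limit = 32 --> Str:D) {
    $bytes.list.head($limit).map(*.fmt('%02X')).join(' ')
}

say hex-dump '├──'.encode;    # E2 94 9C E2 94 80 E2 94 80

# With the real command (assumes tree is installed):
# say hex-dump run('tree', '--du', :out, :bin).out.slurp(:close);
```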

On a slightly related note…

Since tree gives filenames, and filenames are not guaranteed to be valid, normalized Unicode (on many filesystems they are just bytes), using utf8-c8 is appropriate here.
(Same generally goes for usernames and passwords.)

Here's some code that I ran on my computer to hopefully show why.

say dir(:test(/^ r.+sum.+ $/)).map: *.relative.encode('utf8-c8').decode
# (résumé résumé résumé résumé)

dir(:test(/^ r.+sum.+ $/)).map: *.relative.encode('utf8-c8').say
# Blob[uint8]:0x<72 65 CC 81 73 75 6D 65 CC 81>
# Blob[uint8]:0x<72 C3 A9 73 75 6D 65 CC 81>
# Blob[uint8]:0x<72 C3 A9 73 75 6D C3 A9>
# Blob[uint8]:0x<72 65 CC 81 73 75 6D C3 A9>

say 'é'.NFC;
# NFC:0x<00e9>
say 'é'.NFD
# NFD:0x<0065 0301>

sub to-Utf8 ( Uni:D $_ ){
   .map: *.chr.encode
}

say to-Utf8 'é'.NFC
# (utf8:0x<C3 A9>)
say to-Utf8 'é'.NFD
# (utf8:0x<65> utf8:0x<CC 81>)

So é is either encoded as one composed codepoint <C3 A9> or two decomposed codepoints <65> <CC 81>.

Did I really create 4 files with the “same name” just for this purpose?
Yes. Yes I did.
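The same point can be seen without touching the filesystem. In this sketch the byte sequences are typed in by hand to mimic two of the names above. The last line relies on the documented behaviour of utf8-c8 (keeping track of bytes that would not round-trip), so no expected output is claimed for it.

```raku
my $decomposed = Blob.new(0x72, 0x65, 0xCC, 0x81);   # "re" + combining acute
my $composed   = Blob.new(0x72, 0xC3, 0xA9);         # "r" + precomposed é

# Plain utf8 decoding normalizes, so both names become the same Str ...
say $decomposed.decode('utf8') eq $composed.decode('utf8');   # True
# ... and the original bytes of the decomposed name are lost on the way back.
say $decomposed.decode('utf8').encode;                        # utf8:0x<72 C3 A9>

# utf8-c8 is documented to remember the original bytes instead.
say $decomposed.decode('utf8-c8').encode('utf8-c8');
```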

answered Oct 11 '22 06:10

Brad Gilbert

Update I had deleted this nanswer because Brad's excellent answer and Valle Lukas's spot-on comment seemed to render it moot. Then @p6steve confirmed both Brad's answer and Valle Lukas's solutions worked for them, so all the more reason to keep it deleted. But too late! A mistake in my nanswer had misled @p6steve, who made a similar mistake in a follow-up SO question. Mea Culpa. To atone for my sins, I'm now permanently undeleting and leaving my shameful past for all to see.

This is a nanswer. I don't know Mac, but do love investigation, and what I've got to say won't fit in the comments.

Update The 'find .' in the following should be 'find', '.'. See run doc.

What do you get with this?:

say .out.lines given run 'find .', :out

If find . works, the problem is presumably tree.

If find . doesn't work, then try something really simple, that's built into MacOS, something that really should work. If it doesn't work, then the problem isn't tree but something more basic.
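To illustrate the difference the update refers to, here is a small sketch using echo as a stand-in command (assuming a Unix-like system where echo is available): run takes the program name and each argument as separate strings, while shell takes one command line.

```raku
# run: program name and each argument are separate strings; no shell parses them.
my $via-run   = run('echo', 'hello', :out).out.slurp(:close);

# shell: one command-line string, split into words by the system shell.
my $via-shell = shell('echo hello', :out).out.slurp(:close);

say $via-run eq $via-shell;

# Applied to the command above, the corrected call would be:
# say .out.lines given run 'find', '.', :out;
```

Passing 'find .' as a single string to run asks the system for a program literally named "find ." (with the space and dot), which is why it has to be split into 'find', '.'.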

Malformed UTF-8 near bytes ef b9 5c

That means Raku was expecting UTF-8 but the input wasn't UTF-8.

Translating the message from computerese into English:

The supposedly English string "[Linux] xshell远程登陆CentOS时中文乱码解决_Cindy的博客 ..." is Malformed near 远程登.

In other words, the tree command is not generating UTF-8.

(Therefore using utf8-c8 will almost certainly be useless in the first instance. Its purpose is to cheat. It's for when text is almost all UTF-8 except for a handful of rogue bytes and you can't be bothered to sort out the input, or when you have absolutely no choice but to accept the input as it is and still want to muddle through. But in this case you surely ought to either sort the problem out by getting to the bottom of things, or find some alternative to tree.)

Terminal encoding Unicode(UTF-8)

A google for "Terminal encoding Unicode(UTF-8)" yields just 7 matches. None appeared to be exact matches for "Terminal encoding Unicode(UTF-8)". All but one look to me like ... ef b9 5c looks to Rakudo. :)

If you copy/pasted that string, where did you copy it from?

If you yourself wrote that string, why were you so sure MacOS really was encoding tree's output as UTF-8 when run via the kernel (not a shell) that you wrote that it was?

run doesn't use a shell.

The current doc claims shell uses /bin/sh -c on MacOS.

What's the output of this?:

readlink -e $(which sh)

Is the output zsh?

If so sh -c should be using it.

If not, that may be the problem.

When one uses shell, one has to ensure the passed string is appropriately quoted and escaped. What do you get when you try these?:

say .out.lines given shell "'find .'", :out;
say .out.lines given shell "'tree --du'", :out;

What exactly is tree invoking? Is it a shell alias in zsh? If it's a binary, where did you install it from and how did you configure it, especially in terms of influencing zsh's handling of encodings?

answered Oct 11 '22 06:10

raiph
