
Portable (cross platform) scripting with unicode filenames

Tags:

bash

This is driving me crazy. I have the following bash script.

testdir="./test.$$"
echo "Creating a testing directory: $testdir"
mkdir "$testdir"
cd "$testdir" || exit 1

echo "Creating a file word.txt with content á.txt"
echo 'á.txt' > word.txt

fname=$(cat word.txt)
echo "The word.txt contains:$fname"

echo "creating a file $fname with a touch"
touch "$fname"
ls -l

echo "command: bash cycle"
while read -r line
do
    [[ -e "$line" ]] && echo "$line is a file"
done < word.txt

echo "command: find . -name $fname -print"
find . -name "$fname" -print

echo "command: find . -type f -print | grep $fname"
find . -type f -print | grep "$fname"

echo "command: find . -type f -print | fgrep -f word.txt"
find . -type f -print | fgrep -f word.txt

On FreeBSD (and probably on Linux too) it gives this result:

Creating a testing directory: ./test.64511
Creating a file word.txt with content á.txt
The word.txt contains:á.txt
creating a file á.txt with a touch
total 1
-rw-r--r--  1 clt  clt  7  3 júl 12:51 word.txt
-rw-r--r--  1 clt  clt  0  3 júl 12:51 á.txt
command: bash cycle
á.txt is a file
command: find . -name á.txt -print
./á.txt
command: find . -type f -print | grep á.txt
./á.txt
command: find . -type f -print | fgrep -f word.txt
./á.txt

Even on Windows 7 (with Cygwin installed), running the script gives the correct result.

But when I run this script with bash on OS X, I get this:

Creating a testing directory: ./test.32534
Creating a file word.txt with content á.txt
The word.txt contains:á.txt
creating a file á.txt with a touch
total 8
-rw-r--r--  1 clt  staff  0  3 júl 13:01 á.txt
-rw-r--r--  1 clt  staff  7  3 júl 13:01 word.txt
command: bash cycle
á.txt is a file
command: find . -name á.txt -print
command: find . -type f -print | grep á.txt
command: find . -type f -print | fgrep -f word.txt

So only bash found the file á.txt; neither find nor grep did. :(

I asked first on apple.stackexchange, and one answer suggested using iconv to convert the filenames.

$ find . -name $(iconv -f utf-8 -t utf-8-mac <<< á.txt)

While this works on OS X, it is terrible anyway: it needs an extra command for every UTF-8 string entered in the terminal.

I'm trying to find a general cross-platform bash programming solution. So, the questions are:

  • Why does bash "find" the file on OS X when find doesn't?

and

  • How do I write a cross-platform bash script where Unicode filenames are stored in a file?
  • Is the only solution to write a special version just for OS X using iconv?
  • Is there a portable solution for other scripting languages, such as Perl?

PS: Finally, not really a programming question, but I wonder what the rationale is behind Apple's decision to use decomposed filenames, which don't play nicely with UTF-8 on the command line.

EDIT

Simple od.

$ ls | od -bc
0000000   141 314 201 056 164 170 164 012 167 157 162 144 056 164 170 164
           a   ́    **   .   t   x   t  \n   w   o   r   d   .   t   x   t
0000020   012                                                            
          \n   

and

$ od -bc word.txt
0000000   303 241 056 164 170 164 012                                    
           á  **   .   t   x   t  \n                                    
0000007

so the read loop sees the same bytes as word.txt:

$ while read -r line; do echo "$line" | od -bc; done < word.txt
0000000   303 241 056 164 170 164 012                                    
           á  **   .   t   x   t  \n                                    
0000007

and the output from find is the same as from ls:

$ find . -print | od -bc
0000000   056 012 056 057 167 157 162 144 056 164 170 164 012 056 057 141
           .  \n   .   /   w   o   r   d   .   t   x   t  \n   .   /   a
0000020   314 201 056 164 170 164 012                                    
           ́    **   .   t   x   t  \n      

So the content of word.txt IS DIFFERENT from the name of the file created from that content. Therefore, I still have no explanation for why bash found the file.

jm666 asked Jul 03 '13 11:07


1 Answer

Unicode is hard. Repeat it every time you brush your teeth.

Your á.txt filename contains 5 characters, of which á is the troublesome one. There is more than one way to represent á as a sequence of Unicode code points. There's the precomposed representation, and the decomposed one. Unfortunately most software is not prepared to deal with characters, settling for code points instead (yes most software is cr*p). This means that given precomposed and decomposed representations of the same character, software will not recognize them as the same.

You have a precomposed á, represented as Unicode code point U+00E1 LATIN SMALL LETTER A WITH ACUTE. Windows uses the precomposed representation. Mac filesystems insist on the decomposed representation (well, mostly; utf-8-mac does not decompose certain character ranges, but á is decomposed OK). So on a Mac your á becomes U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT (writing off the top of my head, not having a Mac handy). Linux filesystems accept whatever you throw at them.

If you give find a precomposed á, it will not find a file with a decomposed á in its name, because it's not prepared to deal with this brouhaha.
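The mismatch can be reproduced without a Mac. The following is a minimal sketch, not part of the original script: it builds both spellings of the name from the byte values shown in the od dumps in the question (303 241 for the precomposed form, 141 314 201 for the decomposed one) and compares them the way a byte-oriented tool would. The variable names are made up for illustration.

```bash
#!/usr/bin/env bash
# Two spellings of the same visible name, built from raw UTF-8 bytes.
precomposed=$(printf '\xc3\xa1.txt')   # U+00E1
decomposed=$(printf 'a\xcc\x81.txt')   # U+0061 followed by U+0301

# A plain string comparison looks at bytes, not at characters.
if [ "$precomposed" = "$decomposed" ]; then
    echo "same"
else
    echo "different"
fi

# Hex dump of each spelling, for comparison with the od output above.
printf '%s' "$precomposed" | od -An -tx1
printf '%s' "$decomposed"  | od -An -tx1
```

Both names look identical in a terminal, yet the comparison takes the "different" branch, which is the same thing find and grep run into.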

So what's the solution? There isn't any. If you want to handle Unicode, you have to work around defects of the common tools.

Here's one slightly less ugly workaround. Write a small bash function (using iconv or whatever) that converts a name to a representation acceptable on the current system, and use it throughout. Let's call it u8:

find . -name "$(u8 "$myfilename")" -print
find . -type f -print | fgrep "$(u8 "$myfilename")"

and so on. Pretty it's not, but it should work.
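One possible shape for u8 is sketched below. It is only an illustration under a few assumptions: that `uname -s` reporting `Darwin` is a good enough test for OS X, that the system iconv there supports the `utf-8-mac` encoding (as in the iconv command quoted in the question), and that every other system can take the name unchanged.

```bash
#!/usr/bin/env bash
# u8: print the given filename in the form the local filesystem is
# assumed to use. Hypothetical helper, not a standard utility.
u8() {
    case "$(uname -s)" in
        Darwin)
            # OS X stores decomposed names; utf-8-mac is specific to
            # the iconv shipped with OS X.
            printf '%s' "$1" | iconv -f utf-8 -t utf-8-mac
            ;;
        *)
            # Assumption: other systems keep the bytes as given.
            printf '%s' "$1"
            ;;
    esac
}

# Usage: convert the name read from a file before handing it to find.
myfilename='word.txt'
find . -name "$(u8 "$myfilename")" -print
```

Quoting the command substitution matters here, otherwise a name containing spaces or glob characters would be split or expanded before find sees it.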

Oh and I think we all should start sending bug reports for this cr*p. Our software should eventually strive to understand basic human concepts like characters (I'm not even starting to talk about strings). Code points just don't cut it, sorry, even if they're Unicode code points.

n. 1.8e9-where's-my-share m. answered Oct 13 '22 22:10