Setting IFS to null byte does not split lines correctly in command line

Tags: bash, ifs

~ ls
A B C

On bash (looks wrong)

~IFS=$'\x00' read -a vars < <(find -type f -print0); echo "${vars}"
ABC

On zsh (looks good)

~IFS=$'\x00' read -A vars < <(find -type f -print0); echo "${vars}"
A B C

Is it a bash bug?

Pan Ruochen asked Mar 06 '19 03:03

2 Answers

The null character is special: neither POSIX sh nor bash allows it inside a string, because NUL marks the end of a C string. As a result, $'\x00' and $'\000' collapse to an empty string in bash, so IFS=$'\x00' is the same as IFS=''. Inian's answer on this page links to a workaround for passing the null character, but you cannot expect it to be preserved when you assign it to a variable. zsh doesn't mind NUL bytes in strings, but bash does.
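
A quick way to see this for yourself in bash. This is only a sketch; the variable names nul, bytes, and captured are illustrative and not part of the question.

```bash
# $'\x00' collapses to an empty string, so IFS=$'\x00' is just IFS=''.
nul=$'\x00'
echo "length of nul: ${#nul}"

# The NUL byte survives inside a pipe, so wc counts three bytes.
bytes=$(printf 'a\0b' | wc -c)
echo "bytes through a pipe: $((bytes))"

# Command substitution drops the NUL (newer bash versions also warn on stderr).
captured=$(printf 'a\0b')
echo "length after capture: ${#captured}"
```

The NUL byte is fine as long as it stays in a stream; it disappears the moment bash has to hold it in a string.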

Here's a test that illustrates the problems representing space, tab, and newline characters in filenames:

$ touch 'two words' tabbed$'\t'words "two
lines"
$ ls            # GNU coreutils ls displays using bash's $'string' notation
'tabbed'$'\t''words'  'two'$'\n''lines'  'two words'
$ ls |cat       # … except when piped elsewhere
tabbed  words
two
lines
two words
$ find *        # GNU findutils find displays tabs & newlines as question marks
tabbed?words
two?lines
two words
$ find * |cat   # … except when piped elsewhere
tabbed  words
two
lines
two words
$ touch a b c   # (more tests for later)

The GNU tools know this is a problem and work around it in creative ways, but not consistently. ls assumes you're using bash or zsh (the $'…' syntax for a literal is not present in POSIX), while find prints a question mark. The question mark is itself a valid filename character, but it is also a glob that matches any single character, so e.g. rm two?lines tabbed?words will delete both files, just like rm 'two'$'\n''lines' 'tabbed'$'\t''words'. Both tools present the true names when piped to another command such as cat.

GNU/BSD/MacOSX/Busybox find and xargs

I see you're using GNU extensions: POSIX and BSD/OSX find don't allow an implicit path, and POSIX find doesn't support -print0, though the POSIX find spec does mention it:

Other implementations have added other ways to get around this problem, notably a -print0 primary that wrote filenames with a null byte terminator. This was considered here, but not adopted. Using a null terminator meant that any utility that was going to process find's -print0 output had to add a new option to parse the null terminators it would now be reading.

The POSIX xargs spec similarly lacks support for -0 (there is no reference to it either), though it is supported by xargs in GNU, BSD/OSX, and busybox.

Therefore, you can probably do this:

$ find . -type f -print0 |xargs -0
./c ./b ./a ./two
lines ./tabbed  words ./two words

However, you might actually want the array, so perhaps I'm overfitting to your simplified question.
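
If running a command on each file is all you need, the find and xargs pairing can be put to work directly. A hedged sketch, assuming find and xargs implementations that support -print0 and -0, and using a throwaway directory (tmpdir and remaining are just illustrative names):

```bash
# Create a scratch directory with awkward filenames.
tmpdir=$(mktemp -d)
touch "$tmpdir/two words" "$tmpdir/tabbed"$'\t'"words" "$tmpdir/two"$'\n'"lines"

# Each NUL-terminated name reaches rm as exactly one argument,
# so the space, tab, and newline cause no trouble.
find "$tmpdir" -type f -print0 | xargs -0 rm --

# Count what is left; the arithmetic expansion trims wc's padding.
remaining=$(find "$tmpdir" -type f | wc -l)
echo "files left: $((remaining))"
rmdir "$tmpdir"
```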

mapfile

You can use mapfile in Bash 4.4 and later:

$ mapfile -d '' vars < <(find . -type f -print0)
$ printf '<%s>\n' "${vars[@]}"
<./c>
<./b>
<./a>
<./two
lines>
<./tabbed   words>
<./two words>

Some commands, including mapfile, read, and readarray (a synonym of mapfile), accept -d '' as if it were -d $'\0', likely [citation needed] as a workaround for POSIX shell's aforementioned inability to deal with null characters in strings.

This mapfile command merely reads an input file (standard input in this case) into the $vars array as delimited by null characters. Standard input is populated via pipeline by means of a file descriptor created by the <(…) process substitution at the end of the line, which handles the output of our find command.

A short aside: you'd think you could simply do find … |mapfile …, but each part of a pipeline runs in a subshell, so any variables you set or modify there are lost when the pipeline completes. The process substitution trick doesn't trap you in the same way.
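
The difference is easy to demonstrate. The sketch below assumes Bash 4.4 or later with default shell options, and uses printf in place of find so the input is predictable; the array names piped and kept are illustrative.

```bash
# Pipeline: mapfile runs in a subshell, so the array is gone afterwards.
unset piped
printf 'a\0b\0' | mapfile -d '' piped
echo "after pipeline: ${#piped[@]} items"

# Process substitution: mapfile runs in the current shell and the array survives.
unset kept
mapfile -d '' kept < <(printf 'a\0b\0')
echo "after process substitution: ${#kept[@]} items"
```

In a non-interactive script, shopt -s lastpipe is another way to keep the last pipeline command in the current shell, though process substitution works without changing any options.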

The printf command simply demonstrates the contents of the array. The angle brackets denote the start and end of each item so you aren't confused by the newline, space, or tab.

Adam Katz answered Oct 19 '22 00:10

There are several misconceptions in both of the attempts above. In bash you simply cannot store a NUL byte (\x00) in a variable, be it the special IFS or any user-defined variable, so splitting the output of find on the NUL byte this way can never work. Instead, bash drops the NUL bytes and stores the whole output of find at the first index of the array as one long concatenated entry.

You can get around the problem of using the NUL byte in a variable with a few tricks described in How to pass \x00 as argument to program?. Any other single character works as a custom IFS, though, for example:

IFS=: read -r -a splitList <<<"foo:bar:dude" 
declare -p splitList

The ideal way to read NUL-delimited data is to set the delimiter of the read command so that it reads until a NUL byte is encountered.

But then if you simply do

IFS= read -r -d '' -a files < <(find -type f -print0)

you only read the first filename, up to the first NUL byte, so the array "${files[@]}" contains just one entry. You need to read in a loop until the last NUL byte has been read and there are no more characters to read:

declare -a array=()
while IFS= read -r -d '' file; do
    array+=( "$file" )
done < <(find -type f -print0)

which stores each filename in a separate array entry, as printing the array shows:

printf '%s\n' "${array[@]}"
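
To compare the single read with the loop side by side, here is a small sketch. The records function is a hypothetical stand-in for find's -print0 output, and single, looped, and item are illustrative names.

```bash
# Three NUL-terminated records; one has a space, one has a newline.
records() { printf 'a\0b c\0d\ne\0'; }

# A single read stops at the first NUL, so only one item arrives.
IFS= read -r -d '' -a single < <(records)
echo "single read: ${#single[@]} item(s)"

# The loop keeps calling read until it reaches end of input.
looped=()
while IFS= read -r -d '' item; do
    looped+=( "$item" )
done < <(records)
echo "loop: ${#looped[@]} item(s)"
```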

Inian answered Oct 19 '22 00:10