The problem is I want to sort by groups (e.g. 3 entities separated by a delimiter act as a group).
For certain reasons, the only separator I can use is the NULL char '\0'.
Take this input example (where each entity is one char for simplicity):
'b\0s\0n\0c\0p\0f\0a\0z\0m\0'
The result would be (by taking groups of 3 entities):
'a\0z\0m\0b\0s\0n\0c\0p\0f\0', since a<b<c.
I am aware of sort, but unfortunately it does not work for what I am trying to do, since it sorts each entity separately (no groups).
One solution would be to delimit the groups with 2 NULL chars (e.g. 'b\0s\0n\0\0c\0p\0f\0\0a\0z\0m\0\0'), but again sort is not the right tool, since it does not accept a multi-character separator (GNU sort rejects one with a "multi-character tab" error).
So, for now, I don't know any solution in shell.
The typical data is about a hundred groups of 3 entities: size\0filepath\0folderpath\0 (where folderpath/filepath is the absolute path to the file). Since paths on Unix systems can contain any character except NULL, the only delimiter I can use is '\0'.
Ideally I would love code that sorts files by their size. The problem, as opposed to other SO questions I have read, is that here paths can contain '\n' chars.
With GNU awk and an input file containing newline characters:
$ od -w -An -tx1 file
62 00 73 00 6e 0a 6e 00 63 00 70 00 66 0a 0a 66 00 61 00 7a 00 6d 0a 0a 0a 6d
$ awk -v RS="\x0" '
NR % 3 == 1 {i = i + 1}
{ t[i] = t[i] $0 "\x0" }
END { n = asort(t); for(i = 1; i <= n; i++) printf("%s", t[i]) }
' file | od -w -An -tx1
61 00 7a 00 6d 0a 0a 0a 6d 00 62 00 73 00 6e 0a 6e 00 63 00 70 00 66 0a 0a 66 00
We set the record separator to NUL, populate the array t with groups of 3 tokens and, at the end, sort the array with asort and print it. If the default sorting algorithm is not what you want, see https://www.gnu.org/software/gawk/manual/html_node/Array-Sorting-Functions.html for an explanation of the sorting options.
With GNU sed and sort -z:
$ sed -zn 'N;s!\x0! !;N;s!\x0!/!p' file |
sort -z |
sed -zn 's! !\x0!;s!/!\x0!p' |
od -w -An -tx1
61 00 7a 00 6d 0a 0a 0a 6d 00 62 00 73 00 6e 0a 6e 00 63 00 70 00 66 0a 0a 66 00
If your first token is a size, we can assume that it does not contain a space. If the second is a filename, we can also assume that it does not contain a slash. So we preprocess the input to replace the first NUL in each group with a space and the second with a slash, run sort -z, and post-process to revert the preprocessing. Adapt the sort options to your specific needs (for example, add -n to compare the leading sizes numerically).
Note that in both solutions a final NUL is added. Remove it with head -c-1 if it is an issue.
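For readers who find the placeholder trick easier to follow outside of sed, here is a minimal Python sketch of the same idea. It relies on the same assumptions as the pipeline above (no space in the size, no slash in the filename). The function name sort_groups_with_placeholders is made up for illustration, and like the shell versions it always ends the output with a NUL.

```python
def sort_groups_with_placeholders(data: bytes) -> bytes:
    """Sort NUL-separated triples by joining each one into a single record.

    Assumes (as the sed pipeline does) that the first field contains no
    space and the second field contains no slash.
    """
    tokens = data.split(b"\0")
    # A trailing NUL leaves an empty last piece; drop it.
    if tokens and tokens[-1] == b"":
        tokens.pop()

    # Preprocess: first NUL of each group -> space, second NUL -> slash.
    records = []
    for i in range(0, len(tokens) - 2, 3):
        size, name, folder = tokens[i:i + 3]
        records.append(size + b" " + name + b"/" + folder)

    # Plain byte-wise sort, comparable to sort -z in the C locale.
    records.sort()

    # Postprocess: revert the first space and the first slash to NUL.
    out = []
    for rec in records:
        size, rest = rec.split(b" ", 1)
        name, folder = rest.split(b"/", 1)
        out.append(size + b"\0" + name + b"\0" + folder + b"\0")
    return b"".join(out)
```

An incomplete group at the end of the input (fewer than 3 fields) is silently dropped in this sketch; whether that is acceptable depends on your data.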
If you really have to insist on only using null delimiters, traditional Unix text tools are not generally able to cope with them portably. But many non-traditional, non-standard tools which cope fine with them are fairly ubiquitous in practice. Here are solutions in Python 3 and Perl.
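import sys
# Read the whole file as bytes; paths need not be valid text in any encoding.
with open(sys.argv[1], 'rb') as fh:
    seq = fh.read().split(b'\x00')
# Rejoin every 3 fields into one record; len(seq)-1 skips the empty piece after a trailing NUL.
items = [b'\x00'.join(seq[i:i + 3]) for i in range(0, len(seq)-1, 3)]
# Write raw bytes: print() would append a newline, and decoding as ASCII fails on non-ASCII paths.
sys.stdout.buffer.write(b'\x00'.join(sorted(items)) + b'\x00')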
Python 2 might still be the default version on some platforms where stability of the toolchain trumps convenience, security, and robustness. Python 2 had a simpler string type which was able to accommodate arbitrary binary data, so this code would probably actually be simpler in Python 2, and closer to the Perl version.
perl -0 -ne 'push @rec, $_;
if ($#rec == 2) { push @items, join("", @rec); @rec = (); }
END { print(join("", sort @items)) }' file
Both of these read all the data into memory, and thus will be inconvenient if you have more data than free RAM.
Demo: https://ideone.com/Kk7mxH
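Since the question ultimately asks for sorting files by size, note that a plain string sort puts "10" before "9". Here is a minimal sketch of a numeric variant, assuming the first field of every group is a plain decimal integer. The name sort_by_size is hypothetical, and like the versions above it holds all the data in memory.

```python
def sort_by_size(data: bytes) -> bytes:
    """Sort NUL-separated (size, filepath, folderpath) triples by numeric size.

    Assumes every size field is a plain decimal integer; int() raises
    ValueError otherwise.
    """
    tokens = data.split(b"\0")
    # A trailing NUL leaves an empty last piece; drop it.
    if tokens and tokens[-1] == b"":
        tokens.pop()

    # Keep each group as a list of 3 fields so no separator can clash
    # with the contents of a path.
    groups = [tokens[i:i + 3] for i in range(0, len(tokens) - 2, 3)]

    # list.sort is stable, so groups with equal sizes keep their input order.
    groups.sort(key=lambda group: int(group[0]))

    # Emit every field followed by NUL, matching the input layout.
    return b"".join(field + b"\0" for group in groups for field in group)
```

To use it as a filter, one option is to pass sys.stdin.buffer.read() to the function and write the result with sys.stdout.buffer.write(), so that the bytes are never decoded as text.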