I have some CSV/tabular data in a file, like so:
1,7,3,2
8,3,8,0
4,9,5,3
8,5,7,3
5,6,1,9
(They're not always numbers, just random comma-separated values. Single-digit numbers are easier for an example, though.)
I want to shuffle a random 40% of any of the columns. As an example, say the 3rd one. So perhaps 3 and 1 get swapped with each other. Now the third column is:
1 << Came from the last position
8
5
7
3 << Came from the first position
I am trying to do this in place in a file from within a bash
script that I am working on, and I am not having much luck. I keep wandering down some pretty crazy and fruitless grep
rabbit holes that leave me thinking that I'm going the wrong way (the constant failure is what's tipping me off).
I tagged this question with a litany of things because I'm not entirely sure which tool(s) I should even be using for this.
Edit: I'm probably going to end up accepting Rubens' answer, however wacky it is, because it directly contains the swapping concept (which I guess I could have emphasized more in the original question), and it allows me to specify a percentage of the column for swapping. It also happens to work, which is always a plus.
For someone who doesn't need this, and just wants a basic shuffle, Jim Garrison's answer also works (I tested it).
A word of warning, however, on Rubens' solution. I took this:
for (i = 1; i <= NF; ++i) {
delim = (i != NF) ? "," : "";
...
}
printf "\n";
removed the printf "\n";
and moved the newline character up like this:
for (i = 1; i <= NF; ++i) {
delim = (i != NF) ? "," : "\n";
...
}
because just having ""
on the else case was causing awk
to write broken characters at the end of each line (\00
). At one point, it even managed to replace my entire file with Chinese characters. Although, honestly, this probably involved me doing something extra stupid on top of this problem.
This will work for a specifically designated column, but should be enough to point you in the right direction. This works on modern bash shells including Cygwin:
paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
The operative feature is "process substitution".
The paste
command joins files horizontally, and the three pieces are split from the original file via cut
, with the second piece (the column to be randomized) run through the shuf
command to reorder the lines. Here's the output from running it a couple of times:
$ cat test.dat
1,7,3,2
8,3,8,0
4,9,5,3
8,5,7,3
5,6,1,9
$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
1,7,1,2
8,3,8,0
4,9,7,3
8,5,3,3
5,6,5,9
$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
1,7,8,2
8,3,1,0
4,9,3,3
8,5,7,3
5,6,5,9
Algorithm:
n
pairs, from 1
to number of lines
, and the respective value in the line (for the selected column), and then sort it randomly;num_random = percentage * num_lines / 100
;num_random
entries from your randomized vector;printing output:
i = 0
for num_line, value in column; do
if num_line not in random_vector:
print value; # printing non-randomized value
else:
print random_vector[i]; # randomized entry
i++;
done
Implementation:
#! /bin/bash
infile=$1
col=$2
n_lines=$(wc -l < ${infile})
prob=$(bc <<< "$3 * ${n_lines} / 100")
# Selected lines
tmp=$(tempfile)
paste -d ',' <(seq 1 ${n_lines}) <(cut -d ',' -f ${col} ${infile}) \
| sort -R | head -n ${prob} > ${tmp}
# Rewriting file
awk -v "col=$col" -F "," '
(NR == FNR) {id[$1] = $2; next}
(FNR == 1) {
i = c = 1;
for (v in id) {value[i] = id[v]; ++i;}
}
{
for (i = 1; i <= NF; ++i) {
delim = (i != NF) ? "," : "";
if (i != col) {printf "%s%c", $i, delim; continue;}
if (FNR in id) {printf "%s%c", value[c], delim; c++;}
else {printf "%s%c", $i, delim;}
}
printf "\n";
}
' ${tmp} ${infile}
rm ${tmp}
In case you want a close approach to in-placement, you may pipe the output back to the input file, using sponge.
Execution:
To execute, simply use:
$ ./script.sh <inpath> <column> <percentage>
As in:
$ ./script.sh infile 3 40
1,7,3,2
8,3,8,0
4,9,1,3
8,5,7,3
5,6,5,9
Conclusion:
This allows you to select the column, randomly sort a percentage of entries in that column, and replace the new column in the original file.
This script goes as proof like no other, not only that shell scripting is extremely entertaining, but that there are cases where it should definitely be used not. (:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With