Bash

Question

I have some CSV/tabular data in a file, like so:

1,7,3,2
8,3,8,0
4,9,5,3
8,5,7,3
5,6,1,9

(They're not always numbers, just random comma-separated values. Single-digit numbers are easier for an example, though.)

I want to shuffle a random 40% of any of the columns. As an example, say the 3rd one. So perhaps 3 and 1 get swapped with each other. Now the third column is:

1 << Came from the last position
8
5
7
3 << Came from the first position

I am trying to do this in place in a file from within a bash script that I am working on, and I am not having much luck. I keep wandering down some pretty crazy and fruitless grep rabbit holes that leave me thinking that I'm going the wrong way (the constant failure is what's tipping me off).

I tagged this question with a litany of things because I'm not entirely sure which tool(s) I should even be using for this.

Edit: I'm probably going to end up accepting Rubens' answer, however wacky it is, because it directly contains the swapping concept (which I guess I could have emphasized more in the original question), and it allows me to specify a percentage of the column for swapping. It also happens to work, which is always a plus.

For someone who doesn't need this, and just wants a basic shuffle, Jim Garrison's answer also works (I tested it).

A word of warning, however, on Rubens' solution. I took this:

for (i = 1; i <= NF; ++i) {
  delim = (i != NF) ? "," : "";
  ...
}
printf "
";

removed the printf " "; and moved the newline character up like this:

for (i = 1; i <= NF; ++i) {
  delim = (i != NF) ? "," : "
";
  ...
}

because just having "" on the else case was causing awk to write broken characters at the end of each line (\00). At one point, it even managed to replace my entire file with Chinese characters. Although, honestly, this probably involved me doing something extra stupid on top of this problem.

Jim Garrison · Accepted Answer

This will work for a specifically designated column, but should be enough to point you in the right direction. This works on modern bash shells including Cygwin:

paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)

The operative feature is "process substitution".

The paste command joins files horizontally, and the three pieces are split from the original file via cut, with the second piece (the column to be randomized) run through the shuf command to reorder the lines. Here's the output from running it a couple of times:

$ cat test.dat
1,7,3,2
8,3,8,0
4,9,5,3
8,5,7,3
5,6,1,9

$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
1,7,1,2
8,3,8,0
4,9,7,3
8,5,3,3
5,6,5,9

$ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
1,7,8,2
8,3,1,0
4,9,3,3
8,5,7,3
5,6,5,9

Rubens · Answer

Algorithm:

create a vector with n pairs, from 1 to number of lines, and the respective value in the line (for the selected column), and then sort it randomly;
find how many lines should be randomized: num_random = percentage * num_lines / 100;
select the first num_random entries from your randomized vector;
you may sort the selected lines randomly, but it should be already randomly sorted;

printing output:

i = 0
for num_line, value in column; do
    if num_line not in random_vector:
        print value; # printing non-randomized value
    else:
        print random_vector[i]; # randomized entry
        i++;
done

Implementation:

#! /bin/bash

infile=$1
col=$2
n_lines=$(wc -l < ${infile})
prob=$(bc <<< "$3 * ${n_lines} / 100")

# Selected lines
tmp=$(tempfile)
paste -d ',' <(seq 1 ${n_lines}) <(cut -d ',' -f ${col} ${infile}) \
    | sort -R | head -n ${prob} > ${tmp}

# Rewriting file
awk -v "col=$col" -F "," '
(NR == FNR) {id[$1] = $2; next}
(FNR == 1) {
    i = c = 1;
    for (v in id) {value[i] = id[v]; ++i;}
}
{
    for (i = 1; i <= NF; ++i) {
        delim = (i != NF) ? "," : "";
        if (i != col) {printf "%s%c", $i, delim; continue;}
        if (FNR in id) {printf "%s%c", value[c], delim; c++;}
        else {printf "%s%c", $i, delim;}
    }
    printf "
";
}
' ${tmp} ${infile}

rm ${tmp}

In case you want a close approach to in-placement, you may pipe the output back to the input file, using sponge.

Execution:

To execute, simply use:

$ ./script.sh <inpath> <column> <percentage>

As in:

$ ./script.sh infile 3 40
1,7,3,2
8,3,8,0
4,9,1,3
8,5,7,3
5,6,5,9

Conclusion:

This allows you to select the column, randomly sort a percentage of entries in that column, and replace the new column in the original file.

This script goes as proof like no other, not only that shell scripting is extremely entertaining, but that there are cases where it should definitely be used not. (:

Bash - Swap Values in Column

Tags:

linux

grep

sed

awk

Random Human

2 Answers

Jim Garrison

Rubens

Recent Activity

Donate For Us