I have a flat file (.txt) with 606,347 columns and I want to extract 50,000 random columns, not counting the first column, which holds the sample identification and should always be kept. How can I do that using Linux commands? My file looks like:
ID SNP1 SNP2 SNP3
1 0 0 2
2 1 0 2
3 2 0 1
4 1 1 2
5 2 1 0
It is TAB delimited.
Thank you so much.
Cheers,
Paula.
cut -c (character): To cut by character position, use the -c option. It takes a list of numbers separated by commas or a range of numbers separated by a hyphen (-). Tabs and backspaces are each treated as one character. A list of character positions must be given with this option; otherwise cut reports an error.
The shuf command writes a random permutation of its input lines to standard output. Given a file, it shuffles that file's lines; with no file, it reads standard input. It can also limit the number of lines returned (-n), which makes it useful for selecting random lines from a file or random items from a list.
The column command in Linux formats its input into multiple columns for display. Input is taken from a file or, if none is given, from standard input. Rows are filled before columns. Empty lines in the input are ignored unless the -e option is used.
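One way to put these pieces together, as a rough sketch rather than something tested on the full 606,347-column file: let shuf draw the column numbers and let awk pull those fields out. cut -f with a comma-separated list would be the more obvious tool for tab-delimited fields, but a list of 50,000 numbers may be longer than Linux allows for a single command-line argument, so here the list is fed to awk on standard input instead. The function name pick_random_columns and its two arguments are made up for illustration, and shuf -i / -n are assumed to be the GNU coreutils versions.

```shell
# Hypothetical helper: pick_random_columns FILE K   (K >= 1)
# Prints column 1 plus K randomly chosen other columns of a tab-delimited FILE.
pick_random_columns() {
    file=$1
    k=$2
    # Count the columns from the header line.
    ncols=$(head -n 1 "$file" | awk -F'\t' '{ print NF }')
    # Draw K distinct column numbers from 2..ncols, sort them so the
    # output keeps the original column order, and hand them to awk on stdin.
    shuf -i "2-$ncols" -n "$k" | sort -n |
    awk -F'\t' -v OFS='\t' '
        NR == FNR { col[++n] = $1; next }   # first input (stdin): chosen column numbers
        {
            printf "%s", $1                 # always keep the ID column
            for (i = 1; i <= n; i++) printf "%s%s", OFS, $(col[i])
            print ""
        }' - "$file"
}
```

With a file called data.txt (a placeholder name), this would be run as pick_random_columns data.txt 50000 > subset.txt. The selected columns come out in the same left-to-right order as in the input.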
awk to the rescue!
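$ cat shuffle.awk
# Partial Fisher-Yates shuffle: after the call, a[1..k] holds k distinct
# column numbers drawn from 1..n. i, j and t are local variables.
function shuffle(a,n,k,   i,j,t) {
  for(i=1;i<=k;i++) {
    j=int(rand()*(n-i+1))+i     # random position in i..n
    t=(i in a) ? a[i] : i       # value currently at position i
    a[i]=(j in a) ? a[j] : j    # take the value at position j
    a[j]=t                      # finish the swap
  }
}
BEGIN {srand()}
NR==1 {shuffle(ar,NF,ncols)}
{for(i=1;i<=ncols;i++) printf "%s%s", $(ar[i]), (i<ncols ? FS : ""); print ""}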
General usage (the column order is random, so the output varies between runs):
$ echo $(seq 5) | awk -f shuffle.awk -v ncols=5
3 4 1 5 2
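In your special case, you can always print $1 and start the shuffle loop from 2,
i.e. change
for(i=1;i<=k;i++)
to a[1]=1; for(i=2;i<=k;i++)
Then run it with ncols=50001 (the ID column plus 50,000 random columns), and add -F'\t' if the output should stay tab-delimited.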
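For reference, here is roughly what the script looks like with that change applied, wrapped in a shell function. This is only a sketch: the name random_snp_columns and its arguments are invented for the example, the swap is written as a standard partial Fisher-Yates step (an assumption about what the shuffle is meant to do), and K is assumed to be smaller than the number of columns in the file.

```shell
# Hypothetical wrapper: random_snp_columns FILE K
# Prints the ID column followed by K other columns in random order.
random_snp_columns() {
    awk -F'\t' -v ncols="$(( $2 + 1 ))" '
    function shuffle(a, n, k,    i, j, t) {
        a[1] = 1                                # position 1 is pinned to the ID column
        for (i = 2; i <= k; i++) {
            j = int(rand() * (n - i + 1)) + i   # random position in i..n
            t = (i in a) ? a[i] : i             # value currently at position i
            a[i] = (j in a) ? a[j] : j          # take the value at position j
            a[j] = t                            # finish the swap
        }
    }
    BEGIN { srand() }
    NR == 1 { shuffle(ar, NF, ncols) }          # choose the columns once, on the header line
    {
        for (i = 1; i <= ncols; i++)
            printf "%s%s", $(ar[i]), (i < ncols ? FS : "\n")
    }' "$1"
}
```

For the file in the question this would be called as random_snp_columns data.txt 50000 > subset.txt, where data.txt stands in for the real file name. Unlike a sorted column list, the selected SNP columns come out in shuffled order.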