Random selection of columns using linux command

I have a flat file (.txt) with 606,347 columns and I want to extract 50,000 RANDOM columns, with exception of the first column, which is sample identification. How can I do that using Linux commands? My file looks like:

ID  SNP1    SNP2    SNP3
1   0   0   2
2   1   0   2
3   2   0   1
4   1   1   2
5   2   1   0

It is TAB delimited.

Thank you so much.

Cheers,

Paula.

PaulaF asked Mar 23 '16 20:03


1 Answer

awk to the rescue!

$ cat shuffle.awk

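   # Partial Fisher-Yates shuffle: afterwards a[1..k] holds k distinct
   # column numbers drawn from 1..n. An unset a[x] stands for x itself.
   function shuffle(a,n,k,   i,j,t) {
     for(i=1;i<=k;i++) {
       j=int(rand()*(n-i+1))+i     # random position in i..n
       t=(i in a) ? a[i] : i        # value currently at position i
       a[i]=(j in a) ? a[j] : j     # move the value at position j to i
       a[j]=t                       # and the old value at i to j
     }
   }

   BEGIN {srand()}
   NR==1 {shuffle(ar,NF,ncols)}
         {for(i=1;i<=ncols;i++) printf "%s%s", $(ar[i]), (i<ncols ? FS : "\n")}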

General usage (the selection is random, so the output differs between runs):

$ echo $(seq 5) | awk -f shuffle.awk -v ncols=5
3 4 1 5 2

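In your case, keep the ID column in first position and shuffle only the rest: set a[1]=1 and start the loop in shuffle() from 2, i.e. change

for(i=1;i<=k;i++) to a[1]=1; for(i=2;i<=k;i++)

ncols then counts the ID column too, so pass -v ncols=50001 to get the ID plus 50,000 random columns, and add -F'\t' so the output stays tab-delimited.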
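Put together, the special case might look like the sketch below. It is only an illustration under a few assumptions: pick_cols and sample.txt are hypothetical names, the swap is a standard partial Fisher-Yates step, ncols here counts only the random columns (the ID column is added on top), and the request is capped at the number of columns available.

```shell
# Hypothetical helper: pick_cols N FILE
# Prints column 1 (the ID) followed by N distinct random columns
# of a tab-delimited FILE, in random order.
pick_cols() {
  awk -F'\t' -v ncols="$1" '
    function shuffle(a, n, k,   i, j, t) {
      a[1] = 1                              # column 1 always stays first
      for (i = 2; i <= k; i++) {
        j = int(rand() * (n - i + 1)) + i   # random position in i..n
        t = (i in a) ? a[i] : i             # value currently at position i
        a[i] = (j in a) ? a[j] : j
        a[j] = t
      }
    }
    BEGIN { srand() }
    # Assumption: cap the request at the number of columns in the file.
    NR == 1 { k = ncols + 1; if (k > NF) k = NF; shuffle(ar, NF, k) }
    { for (i = 1; i <= k; i++) printf "%s%s", $(ar[i]), (i < k ? FS : "\n") }
  ' "$2"
}

# Rebuild the sample from the question and draw 2 of its 3 SNP columns.
printf 'ID\tSNP1\tSNP2\tSNP3\n1\t0\t0\t2\n2\t1\t0\t2\n3\t2\t0\t1\n4\t1\t1\t2\n5\t2\t1\t0\n' > sample.txt
pick_cols 2 sample.txt
```

For the real file this would be called as pick_cols 50000 yourfile.txt > subset.txt. Since srand() seeds from the clock, two runs within the same second select the same columns.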

karakfa answered Sep 25 '22 04:09