
How to extract every N columns and write into new files?

Tags: linux, awk

I've been struggling to write code that extracts every N columns from an input file and writes them to output files in the order they were extracted.

(My real world case is to extract every 800 columns from a total 24005 columns file starting at column 6, so I need a loop)

In the simpler case below, I extract every 3 columns (fields) from an input file, starting at the 2nd column.

For example, if the input file looks like:

aa 1 2 3 4 5 6 7 8 9 
bb 1 2 3 4 5 6 7 8 9 
cc 1 2 3 4 5 6 7 8 9 
dd 1 2 3 4 5 6 7 8 9 

and I want the output to look like this:

output_file_1:

1 2 3
1 2 3
1 2 3
1 2 3

output_file_2:

4 5 6  
4 5 6 
4 5 6 
4 5 6 

output_file_3:

7 8 9
7 8 9 
7 8 9
7 8 9

I tried this, but it doesn't work:

awk 'for(i=2;i<=10;i+a) {{printf "%s ",$i};a=3}' <inputfile>

It gave me a syntax error, and the more I fix, the more problems come up.

I also tried the Linux command cut, but with large files it did not seem practical. I also wonder whether cut can cut every 3 fields in a loop, the way awk can.

Can someone please help me with this and give a quick explanation? Thanks in advance.

user1687130 asked Dec 21 '25 09:12


2 Answers

Actions that awk performs on the input data must be enclosed in curly braces, so the one-liner you tried fails with a syntax error because the for loop is outside any braces. A syntactically correct version would be:

awk '{for(i=2;i<=10;i+a) {printf "%s ",$i};a=3}' <inputfile>

This is syntactically correct (almost; see the end of this post), but it does not do what you expect.

To send different groups of columns to different files, the simplest approach is awk's output redirection operator >. This gives the desired output, provided your input file always has 10 columns:

awk '{ print $2,$3,$4 > "file_1"; print $5,$6,$7 > "file_2"; print $8,$9,$10 > "file_3"}' <inputfile>

Note the double quotes around the filenames: without them, awk would treat file_1 as a variable name rather than a literal filename.


EDITED: REAL WORLD CASE

If you have to loop along the columns because you have too many of them, you can still use awk (gawk), with two loops: one on the output files and one on the columns per file. This is a possible way:

#!/usr/bin/gawk -f

BEGIN {
  CTOT  = 24005   # total number of columns; you can use NF as well
  DELTA = 800     # columns per file
  START = 6       # first useful column
  # number of output files: only the columns from START onward count
  d = int((CTOT - START + 1) / DELTA)
}
{
  for (i = 0; i < d; i++)
  {
    out = "file_out_" i   # build the name once so the redirection target is unambiguous
    for (j = 0; j < DELTA; j++)
    {
      printf("%f\t", $(START + j + i*DELTA)) > out
    }
    printf("\n") > out
  }
}

I have tried this on the simple input file in your example. It writes every complete group of DELTA columns from START onward; leftover columns that do not fill a whole group are ignored. I assumed you had floats (%f); change that to whatever you need, for example %s.
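A possible variation, as a minimal sketch: instead of hard-coding the total column count, the loop can run up to NF, and %s-style string handling copies the fields verbatim whatever their type. The names start, delta and prefix are illustrative and do not come from the script above; the sample input is the one from the question. A final group with fewer than delta columns is still written.

```shell
# Sample input from the question: a label followed by nine fields.
cat > input.txt <<'EOF'
aa 1 2 3 4 5 6 7 8 9
bb 1 2 3 4 5 6 7 8 9
cc 1 2 3 4 5 6 7 8 9
dd 1 2 3 4 5 6 7 8 9
EOF

# start, delta and prefix are hypothetical names passed in with -v.
awk -v start=2 -v delta=3 -v prefix="output_file_" '
{
    n = 0                              # index of the current output file
    for (i = start; i <= NF; i += delta) {
        n++
        out = prefix n                 # output_file_1, output_file_2, ...
        last = i + delta - 1
        if (last > NF) last = NF       # a short final group is still written
        line = $i
        for (j = i + 1; j <= last; j++)
            line = line " " $j         # join the group with single spaces
        print line > out               # awk keeps the file open between records
    }
}' input.txt
```

For the real-world case the same command would be run with -v start=6 -v delta=800, which on a 24005-column file would give 30 files of 800 columns each.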

Let me know.


P.S. Going back to your original one-liner: note that the loop is infinite, because i is never incremented. i+a must be replaced by i+=a, and a=3 must be inside the inner braces:

awk '{for(i=2;i<=10;i+=a) {printf "%s ",$i;a=3}}' <inputfile>

This evaluates a=3 on every iteration, which is a bit pointless. A better version would thus be:

awk '{for(i=2;i<=10;i+=3) {printf "%s ",$i}}' <inputfile>

Still, this just prints the 2nd, 5th and 8th columns of your file, which is not what you wanted.
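To see what the corrected one-liner actually prints, here is a small sketch run on the sample input from the question. The trailing print "" is an addition so that each record ends with a newline, and the file names input.txt and stride_out.txt are placeholders.

```shell
# Sample input from the question.
cat > input.txt <<'EOF'
aa 1 2 3 4 5 6 7 8 9
bb 1 2 3 4 5 6 7 8 9
cc 1 2 3 4 5 6 7 8 9
dd 1 2 3 4 5 6 7 8 9
EOF

# Print fields 2, 5 and 8 of every record, then end the line.
awk '{ for (i = 2; i <= 10; i += 3) printf "%s ", $i; print "" }' input.txt > stride_out.txt

cat stride_out.txt
# each of the four lines reads "1 4 7 " (with a trailing space)
```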

lev.tuby answered Dec 23 '25 01:12


awk '{ print $2, $3,  $4 >"output_file_1";
       print $5, $6,  $7 >"output_file_2";
       print $8, $9, $10 >"output_file_3";
     }' input_file

This makes one pass through the input file, which is preferable to multiple passes. Clearly, the code shown only handles a fixed number of columns (and therefore a fixed number of output files). It can be modified, if necessary, to handle a variable number of columns, generate the file names dynamically, etc.
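For comparison, and as a partial answer to whether cut can do this in a loop, a multi-pass version might look like the sketch below. It assumes fields are separated by exactly one space with no trailing space, since cut does not collapse repeated delimiters the way awk does. The names start, delta and cut_out_ are illustrative. It reads the input once per output file, which is why the single-pass awk approach is usually preferable for large inputs.

```shell
# Sample input from the question (single spaces, no trailing space).
cat > input.txt <<'EOF'
aa 1 2 3 4 5 6 7 8 9
bb 1 2 3 4 5 6 7 8 9
cc 1 2 3 4 5 6 7 8 9
dd 1 2 3 4 5 6 7 8 9
EOF

start=2   # first column to extract (hypothetical parameter name)
delta=3   # columns per output file (hypothetical parameter name)

# Take the column count from the first line.
ncols=$(head -n 1 input.txt | awk '{ print NF }')

n=1
first=$start
while [ "$first" -le "$ncols" ]; do
    last=$((first + delta - 1))
    if [ "$last" -gt "$ncols" ]; then
        last=$ncols                      # short final group
    fi
    # One full pass over the input for each output file.
    cut -d ' ' -f "$first-$last" input.txt > "cut_out_$n"
    first=$((last + 1))
    n=$((n + 1))
done
```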


(My real world case is to extract every 800 columns from a total 24005 columns file starting at column 6, so I need a loop)

In that case, you're correct; you need a loop. In fact, you need two loops:

awk 'BEGIN { gap = 800; start = 6; filebase = "output_file_"; }
     {
         for (i = start; i < start + gap; i++)
         {
             file = sprintf("%s%d", filebase, i);
             for (j = i; j <= NF; j += gap)
                  printf("%s ", $j) > file;
             printf "\n" > file;
         }
     }' input_file

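I demonstrated this to my satisfaction with an input file with 25 columns (numbers 1-25 in the corresponding columns), with gap set to 8 and start set to 2. Each output file is named after its first column and receives every gap-th column from there on (output_file_2 holds columns 2, 10 and 18), so there are gap files in total. The output below is the resulting 8 files pasted horizontally.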

2 10 18    3 11 19    4 12 20    5 13 21    6 14 22    7 15 23    8 16 24    9 17 25
2 10 18    3 11 19    4 12 20    5 13 21    6 14 22    7 15 23    8 16 24    9 17 25
2 10 18    3 11 19    4 12 20    5 13 21    6 14 22    7 15 23    8 16 24    9 17 25
2 10 18    3 11 19    4 12 20    5 13 21    6 14 22    7 15 23    8 16 24    9 17 25
Jonathan Leffler answered Dec 23 '25 00:12


