How to extract one column from multiple files, and paste those columns into one file?

Tags: linux, shell, paste

I want to extract the 5th column from multiple files, which are named in numerical order, and paste those columns side by side, in sequence, into one output file.

The file names look like:

sample_problem1_part1.txt
sample_problem1_part2.txt

sample_problem2_part1.txt
sample_problem2_part2.txt

sample_problem3_part1.txt
sample_problem3_part2.txt
......

Each problem file (1,2,3...) has two parts (part1, part2). Each file has the same number of lines. The content looks like:

sample_problem1_part1.txt
1 1 20 20 1
1 7 21 21 2
3 1 22 22 3
1 5 23 23 4
6 1 24 24 5
2 9 25 25 6
1 0 26 26 7

sample_problem1_part2.txt
1 1 88 88 8
1 1 89 89 9
2 1 90 90 10
1 3 91 91 11
1 1 92 92 12
7 1 93 93 13
1 5 94 94 14

sample_problem2_part1.txt
1 4 330 30 a
3 4 331 31 b
1 4 332 32 c
2 4 333 33 d
1 4 334 34 e
1 4 335 35 f
9 4 336 36 g

The output should look like this, with the columns in the order problem1_part1, problem1_part2, problem2_part1, problem2_part2, problem3_part1, problem3_part2, and so on:

1 8 a ...
2 9 b ...
3 10 c ...
4 11 d ...
5 12 e ...
6 13 f ...
7 14 g ...

I was using:

 paste sample_problem1_part1.txt sample_problem1_part2.txt > \
     sample_problem1_partall.txt
 paste sample_problem2_part1.txt sample_problem2_part2.txt > \
     sample_problem2_partall.txt
 paste sample_problem3_part1.txt sample_problem3_part2.txt > \
     sample_problem3_partall.txt

And then:

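for i in $(find . -name "sample_problem*_partall.txt")
do
    # Derive the output name, e.g. extracted_col_problem1_partall.txt
    l=$(echo "$i" | sed 's/sample_/extracted_col_/')
    # Columns 5 and 10 of the pasted file are column 5 of part1 and part2
    awk '{print $5, $10}' "$i" > "$l"
done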

And:

paste extracted_col_problem1_partall.txt \
      extracted_col_problem2_partall.txt \
      extracted_col_problem3_partall.txt > \
    extracted_col_problemall_partall.txt

It works fine with a few files, but it's a crazy method when the number of files is large (over 4000). Could anyone help me with simpler solutions that are capable of dealing with multiple files, please? Thanks!

asked Jan 29 '13 22:01 by user1687130


1 Answer

Here's one way using awk and a sorted glob of files (the -v option of GNU ls sorts numbers within names naturally, so problem2 comes before problem10):

awk '{ a[FNR] = (a[FNR] ? a[FNR] FS : "") $5 } END { for(i=1;i<=FNR;i++) print a[i] }' $(ls -1v *)

Results:

1 8 a
2 9 b
3 10 c
4 11 d
5 12 e
6 13 f
7 14 g

Explanation:

  • For each line of each input file:

    • Use the file's line number (FNR) as the array key, and append column 5 to the value stored under that key.

    • (a[FNR] ? a[FNR] FS : "") is a ternary expression that builds up each array value as an output record. It tests whether a[FNR] already holds a non-empty, non-zero value. If so, it yields the existing value followed by the default field separator (FS, a single space), and the fifth column is appended after it. Otherwise it yields an empty string, so the element is simply set to the fifth column.

  • At the end of the script:

    • Use a C-style loop to iterate over line numbers 1 through FNR (the line count of the last file read), printing each array value.
answered Oct 13 '22 00:10 by Steve