We want to subset a text file on rows and columns, where rows and columns numbers are read from a file. Excluding header (row 1) and rownames (col 1). inputFile.txt Tab delimited text file <pre class="prettyprint"><code>header 62 9 3 54 6 1 25 1 2 3 4 5 6 96 1 1 1 1 0 1 72 3 3 3 3 3 3 18 0 1 0 1 1 0 82 1 0 0 0 0 1 77 1 0 1 0 1 1 15 7 7 7 7 7 7 82 0 0 1 1 1 0 37 0 1 0 0 1 0 18 0 1 0 0 1 0 53 0 0 1 0 0 0 57 1 1 1 1 1 1 </code></pre> subsetCols.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 500K columns, and need to subset ~10K. <pre class="prettyprint"><code>1,4,6 </code></pre> subsetRows.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 20K rows, and need to subset about ~300. <pre class="prettyprint"><code>1,3,7 </code></pre> Current solution using cut and awk loop (Related post: Select rows using awk): <pre class="prettyprint"><code># define vars fileInput=inputFile.txt fileRows=subsetRows.txt fileCols=subsetCols.txt fileOutput=result.txt # cut columns and awk rows cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput </code></pre> Output file: result.txt <pre class="prettyprint"><code>1 4 6 3 3 3 7 7 7 </code></pre> Question: This solution works fine for small files, for bigger files 50K rows and 200K columns, it is taking too long, 15 minutes plus, still running. I think cutting the columns works fine, selecting rows is the slow bit. Any better way? Real input files info: <pre class="prettyprint"><code># $fileInput: # Rows = 20127 # Cols = 533633 # Size = 31 GB # $fileCols: 12000 comma separated col numbers # $fileRows: 300 comma separated row numbers </code></pre> More information about the file: file contains GWAS genotype data. Every row represents sample (individual) and every column represents SNP. For further region based analysis we need to subset samples(rows) and SNPs(columns), to make the data more manageable (small) as an input for other statistical softwares like r. System: <pre class="prettyprint"><code>$ uname -a Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux </code></pre> <hr> Update: Solution provided below by @JamesBrown was mixing the orders of columns in my system, as I am using different version of awk, my version is: <code>GNU Awk 3.1.7</code>

One in Gnu awk version 4.0 or later as column ordering relies on <code>for</code> and <code>PROCINFO["sorted_in"]</code>. The row and col numbers are read from files: <pre class="prettyprint"><code>$ awk ' BEGIN { PROCINFO["sorted_in"]="@ind_num_asc"; } FILENAME==ARGV[1] { # process rows file n=split($0,t,","); for(i=1;i<=n;i++) r[t[i]] } FILENAME==ARGV[2] { # process cols file m=split($0,t,","); for(i=1;i<=m;i++) c[t[i]] } FILENAME==ARGV[3] && ((FNR-1) in r) { # process data file for(i in c) printf "%s%s", $(i+1), (++j%m?OFS:ORS) }' subsetRows.txt subsetCols.txt inputFile.txt 1 4 6 3 3 3 7 7 7 </code></pre> Some performance gain could probably come from moving the <code>ARGV[3]</code> processing block to the top berore 1 and 2 and adding a <code>next</code> to it's end.

Subset a file by row and column numbers

Tags:

We want to subset a text file on rows and columns, where rows and columns numbers are read from a file. Excluding header (row 1) and rownames (col 1).

inputFile.txt Tab delimited text file

header  62  9   3   54  6   1
25  1   2   3   4   5   6
96  1   1   1   1   0   1
72  3   3   3   3   3   3
18  0   1   0   1   1   0
82  1   0   0   0   0   1
77  1   0   1   0   1   1
15  7   7   7   7   7   7
82  0   0   1   1   1   0
37  0   1   0   0   1   0
18  0   1   0   0   1   0
53  0   0   1   0   0   0
57  1   1   1   1   1   1

subsetCols.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 500K columns, and need to subset ~10K.

1,4,6

subsetRows.txt Comma separated with no spaces, one row, numbers ordered. In real data we have 20K rows, and need to subset about ~300.

1,3,7

Current solution using cut and awk loop (Related post: Select rows using awk):

# define vars
fileInput=inputFile.txt
fileRows=subsetRows.txt
fileCols=subsetCols.txt
fileOutput=result.txt

# cut columns and awk rows
cut -f2- $fileInput | cut -f`cat $fileCols` | sed '1d' | awk -v s=`cat $fileRows` 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' > $fileOutput

Output file: result.txt

1   4   6
3   3   3
7   7   7

Question:
This solution works fine for small files, for bigger files 50K rows and 200K columns, it is taking too long, 15 minutes plus, still running. I think cutting the columns works fine, selecting rows is the slow bit.

Any better way?

Real input files info:

# $fileInput:
#        Rows = 20127
#        Cols = 533633
#        Size = 31 GB
# $fileCols: 12000 comma separated col numbers
# $fileRows: 300 comma separated row numbers

More information about the file: file contains GWAS genotype data. Every row represents sample (individual) and every column represents SNP. For further region based analysis we need to subset samples(rows) and SNPs(columns), to make the data more manageable (small) as an input for other statistical softwares like r.

System:

$ uname -a
Linux nYYY-XXXX ZZZ Tue Dec 18 17:22:54 CST 2012 x86_64 x86_64 x86_64 GNU/Linux

Update: Solution provided below by @JamesBrown was mixing the orders of columns in my system, as I am using different version of awk, my version is: GNU Awk 3.1.7

854

asked Nov 28 '16 10:11

zx8754

2 Answers

Even though in If programming languages were countries, which country would each language represent? they say that...

Awk: North Korea. Stubbornly resists change, and its users appear to be unnaturally fond of it for reasons we can only speculate on.

... whenever you see yourself piping sed, cut, grep, awk, etc, stop and say to yourself: awk can make it alone!

So in this case it is a matter of extracting the rows and columns (tweaking them to exclude the header and first column) and then just buffering the output to finally print it.

awk -v cols="1 4 6" -v rows="1 3 7" '
    BEGIN{
       split(cols,c); for (i in c) col[c[i]]  # extract cols to print
       split(rows,r); for (i in r) row[r[i]]  # extract rows to print
    }
    (NR-1 in row){
       for (i=2;i<=NF;i++) 
              (i-1) in col && line=(line ? line OFS $i : $i); # pick columns
              print line; line=""                             # print them
    }' file

With your sample file:

$ awk -v cols="1 4 6" -v rows="1 3 7" 'BEGIN{split(cols,c); for (i in c) col[c[i]]; split(rows,r); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' file
1 4 6
3 3 3
7 7 7

With your sample file, and inputs as variables, split on comma:

awk -v cols="$(<$fileCols)" -v rows="$(<$fileRows)" 'BEGIN{split(cols,c, /,/); for (i in c) col[c[i]]; split(rows,r, /,/); for (i in r) row[r[i]]} (NR-1 in row){for (i=2;i<=NF;i++) (i-1) in col && line=(line ? line OFS $i : $i); print line; line=""}' $fileInput

I am quite sure this will be way faster. You can for example check Remove duplicates from text file based on second text file for some benchmarks comparing the performance of awk over grep and others.

answered Sep 29 '22 23:09

fedorqui 'SO stop harming'

One in Gnu awk version 4.0 or later as column ordering relies on for and PROCINFO["sorted_in"]. The row and col numbers are read from files:

$ awk '
BEGIN {
    PROCINFO["sorted_in"]="@ind_num_asc";
}
FILENAME==ARGV[1] {                       # process rows file
    n=split($0,t,","); 
    for(i=1;i<=n;i++) r[t[i]]
} 
FILENAME==ARGV[2] {                       # process cols file
    m=split($0,t,","); 
    for(i=1;i<=m;i++) c[t[i]]
} 
FILENAME==ARGV[3] && ((FNR-1) in r) {     # process data file
    for(i in c) 
        printf "%s%s", $(i+1), (++j%m?OFS:ORS)
}' subsetRows.txt subsetCols.txt inputFile.txt   
1 4 6
3 3 3
7 7 7

Some performance gain could probably come from moving the ARGV[3] processing block to the top berore 1 and 2 and adding a next to it's end.

answered Sep 30 '22 00:09

James Brown

Related questions
                            
                                Visual Studio 2015 does not find existing Web Apps on Azure Cloud during Web Deploy
                            
                                What is Leading Margin in Android?
                            
                                Why does Spark think this is a cross / Cartesian join
                            
                                How to resolve "TypeError: NetworkError when attempting to fetch resource."
                            
                                Remove type hints in Python source programmatically
                            
                                VS 2017 not discovering tests on .NET Core
                            
                                weak vs unowned in Swift. What are the internal differences?
                            
                                How to use private NPM packages with cloud functions?
                            
                                Why does Fish shell have dark blue as the default color for directories
                            
                                How do you make a GET request in Rust?
                            
                                Record Sound on Ubuntu Docker Image
                            
                                Difference between @Self and @Host Angular 2+ Dependency Injection Decorators

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With