I would like your advice/help on how to subset a big file (millions of rows or lines).
For example,
(1) I have a big file (millions of rows, tab-delimited). I want a subset of this file with only rows 10000 to 100000.
(2) I have a big file (millions of columns, tab-delimited). I want a subset of this file with only columns 10000 to 100000.
I know there are tools like head, tail, cut, split, awk, and sed, and I can use them for simple subsetting, but I do not know how to do this particular job.
Could you please give any advice? Thanks in advance.
Filtering rows is easy, for example with awk:
cat largefile | awk 'NR >= 10000 && NR <= 100000 { print }'
Filtering columns is easy with cut. Tab is already cut's default delimiter, so no -d option is needed (a literal '\t' would be rejected as a multi-character delimiter):
cat largefile | cut -f 10000-100000
As Rahul Dravid mentioned, cat is not a must here, and as Zsolt Botykai added, you can improve performance by reading the file directly and exiting once the range has been printed:
awk 'NR > 100000 { exit } NR >= 10000 && NR <= 100000' largefile
cut -f 10000-100000 largefile
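Since the question already mentions head and tail, a row range can also be taken with a plain head/tail pipeline; this is just a sketch of the same idea, and head stops reading the file once it has produced line 100000:
# print lines 10000 through 100000 of largefile
head -n 100000 largefile | tail -n +10000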
Some other solutions:
For row ranges, in sed:
sed -n 10000,100000p somefile.txt
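If the file is very long, you can also tell sed to quit as soon as the last wanted line has been printed, so it does not keep scanning to the end of the file (works with GNU sed; some older seds may want the two commands as separate -e expressions):
sed -n '10000,100000p;100000q' somefile.txt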
For column ranges, in awk (setting the input and output field separators to a tab so the output stays tab-delimited):
awk -F'\t' -v OFS='\t' -v f=10000 -v t=100000 '{ for (i = f; i <= t; i++) printf("%s%s", $i, (i == t) ? "\n" : OFS) }' details.txt
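If you need both at once (rows 10000 to 100000 and columns 10000 to 100000 of the same tab-delimited file), the two ideas combine into a single awk pass. This is only a sketch along the lines of the commands above; largefile and subset.txt are placeholder names:
awk -F'\t' -v OFS='\t' -v r1=10000 -v r2=100000 -v c1=10000 -v c2=100000 '
    NR > r2 { exit }    # stop reading once past the last wanted row
    NR >= r1 { for (i = c1; i <= c2; i++) printf("%s%s", $i, (i == c2) ? "\n" : OFS) }
' largefile > subset.txt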