
how to subset a file - select a numbers of rows or columns

I would like to have your advice/help on how to subset a big file (millions of rows or lines).

For example,

(1) I have a big file (millions of rows, tab-delimited). I want a subset of this file containing only rows 10000 through 100000.

(2) I have a big file (millions of columns, tab-delimited). I want a subset of this file containing only columns 10000 through 100000.

I know there are tools like head, tail, cut, split, awk, and sed, and I can use them for simple subsetting, but I do not know how to combine them for this task.

Could you please give any advice? Thanks in advance.

asked Jun 27 '11 by jianfeng.mao




2 Answers

Filtering rows is easy, for example with AWK:

cat largefile | awk 'NR >= 10000  && NR <= 100000 { print }' 

Filtering columns is easier with cut. Tab is already cut's default delimiter, so no -d option is needed (passing -d '\t' would actually fail, since cut expects a single literal character):

cat largefile | cut -f 10000-100000 

As Rahul Dravid mentioned, cat is not a must here, and as Zsolt Botykai added, you can improve performance by exiting early and reading the file directly:

awk 'NR > 100000 { exit } NR >= 10000 && NR <= 100000' largefile 
cut -f 10000-100000 largefile 
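For completeness, a head/tail pipeline selects the same row range and also stops reading early, because head exits after line 100000. A small sketch, using a generated stand-in file in place of the real data:

```shell
# Generate a stand-in file (the real data would be the existing large file).
seq 1 200000 > bigfile.txt
# head stops reading after line 100000; tail then drops lines before 10000.
head -n 100000 bigfile.txt | tail -n +10000 > subset.txt
wc -l < subset.txt   # 90001 lines: rows 10000 through 100000 inclusive
```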
answered Oct 18 '22 by Drakosha


Some different solutions:

For row ranges, in sed:

sed -n 10000,100000p somefile.txt 
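One refinement: as written, sed keeps scanning the file after line 100000. Adding the standard q (quit) command stops it at the last wanted line. A sketch on generated stand-in data:

```shell
seq 1 500000 > somefile.txt   # stand-in sample data
# 'p' prints the range; '100000q' quits right after the last wanted line,
# so sed never reads the remaining 400000 lines.
sed -n '10000,100000p;100000q' somefile.txt | wc -l   # 90001
```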

For column ranges in awk:

awk -v f=10000 -v t=100000 '{ for (i=f; i<=t;i++) printf("%s%s", $i,(i==t) ? "\n" : OFS) }' details.txt 
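The two approaches compose: pipe a row range into a column range. A small self-contained demo (the 20x10 file and the 5-10/3-6 ranges are illustrative stand-ins; substitute the question's 10000-100000 ranges for real data):

```shell
# Build a 20-row x 10-column tab-separated demo file with cells like "r5c3".
awk 'BEGIN { OFS="\t"; for (r=1; r<=20; r++) for (c=1; c<=10; c++) printf("r%dc%d%s", r, c, (c==10) ? "\n" : OFS) }' > demo.tsv
# Rows 5-10, then columns 3-6 (cut's default delimiter is already a tab).
sed -n '5,10p' demo.tsv | cut -f 3-6
```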
answered Oct 18 '22 by Vijay