I have very large genotype files that are basically impossible to open in R, so I am trying to extract the rows and columns of interest using the Linux command line. Rows are straightforward enough using head/tail, but I'm having difficulty figuring out how to handle the columns.
If I attempt to extract (say) the 100th through 105th tab- or space-delimited columns using
cut -c100-105 myfile >outfile
this obviously won't work if there are strings of multiple characters in each column. Is there some way to modify cut with appropriate arguments so that it extracts the entire string within a column, where columns are defined as space or tab (or any other character) delimited?
Explanation: the cut command is used for extracting specific columns.
A nifty command called cut lets you select a list of character positions or fields from one or more files. You must specify either the -c option to cut by character position or -f to cut by field. (Fields are separated by tabs unless you specify a different field separator with -d.)
1) The cut command is used to display selected parts of file content in UNIX. 2) The default delimiter for cut is a tab; you can change it with the -d option. 3) The cut command in Linux lets you select parts of each line by byte (-b), by character (-c), or by field (-f).
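To make the difference between -c and -f concrete, here is a minimal sketch on a made-up two-line sample (the sample IDs and genotype values are hypothetical, not taken from the question's file). It assumes single tabs between fields.

```shell
# Hypothetical sample: tab-delimited, with first fields of different widths.
sample=$(printf 's1\tAA\tAG\tGG\ns100\tAG\tGG\tAA\n')

# -c selects character positions, so the result depends on field widths.
by_char=$(printf '%s\n' "$sample" | cut -c4-5)

# -f selects whole tab-delimited fields, whatever their width.
by_field=$(printf '%s\n' "$sample" | cut -f2-3)

printf 'by character:\n%s\n' "$by_char"
printf 'by field:\n%s\n' "$by_field"
```

With -c the first line happens to yield a whole field, while the second line is sliced mid-field; with -f both lines yield their second and third fields.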
If the command should work with both tabs and spaces as the delimiter, I would use awk:
awk '{print $100,$101,$102,$103,$104,$105}' myfile > outfile
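As long as you only need to specify six fields, it is in my opinion fine to just type them out. For longer ranges you can use a for loop. Use printf inside the loop rather than print, because print would put every field on its own line:
awk '{for(i=100;i<=105;i++) printf "%s%s", $i, (i<105 ? OFS : ORS)}' myfile > outfile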
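If the range changes often, one option is to pass the bounds in as awk variables instead of hard-coding them. The following is only a sketch on a small made-up sample: the variable names first and last are illustrative, and the output separator is set to a tab by assumption.

```shell
# Hypothetical sample: fields separated by a mix of spaces and tabs.
sample=$(printf 'a  b\tc d   e\nf\tg  h i\tj\n')

# first/last are illustrative names; OFS is the separator used on output.
range=$(printf '%s\n' "$sample" |
  awk -v first=2 -v last=4 'BEGIN { OFS = "\t" }
    {
      line = $first
      for (i = first + 1; i <= last; i++) line = line OFS $i
      print line   # one output line per input line
    }')

printf '%s\n' "$range"
```

For the real file the same awk program would be run with -v first=100 -v last=105 on myfile. With its default settings awk treats any run of spaces and tabs as a single separator, which is why the mixed delimiters in the sample do not matter.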
If you want to use cut, you need to use the -f option:
cut -f100-105 myfile > outfile
If the field delimiter is something other than TAB, you need to specify it using -d:
cut -d' ' -f100-105 myfile > outfile
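One caveat with -d' ' is that cut treats every single space as a delimiter, so runs of spaces produce empty fields and shift the column numbers. A possible workaround, sketched here on a made-up sample, is to squeeze runs of spaces and tabs into single tabs with tr before calling cut.

```shell
# Hypothetical sample with uneven runs of spaces between fields.
sample=$(printf 'a  b   c d\ne f  g    h\n')

# Each space is its own delimiter, so fields 2-3 are not "b c" here.
raw=$(printf '%s\n' "$sample" | cut -d' ' -f2-3)

# Map spaces and tabs to tabs, squeeze repeats, then cut on the default tab.
squeezed=$(printf '%s\n' "$sample" | tr -s ' \t' '\t\t' | cut -f2-3)

printf 'raw:\n%s\n' "$raw"
printf 'squeezed:\n%s\n' "$squeezed"
```

This only fits files with no genuinely empty fields and no leading whitespace: squeezing would merge an empty field into its neighbours, and leading whitespace would become an empty first field.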
Check the man page for more info on the cut command.
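Since the question also asks about picking rows, the two selections can be chained in a pipeline so that only the wanted lines ever reach cut. This is a sketch on a small temporary file with made-up contents; the row and column numbers are placeholders.

```shell
# Hypothetical tab-delimited file with four rows.
tmpfile=$(mktemp)
printf 'r1\ta\tb\tc\nr2\td\te\tf\nr3\tg\th\ti\nr4\tj\tk\tl\n' > "$tmpfile"

# sed prints rows 2-3 and quits after row 3; cut then keeps columns 2-3.
subset=$(sed -n '2,3p;3q' "$tmpfile" | cut -f2-3)

printf '%s\n' "$subset"
rm -f "$tmpfile"
```

Quitting at the last wanted row avoids reading the rest of a very large file.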