Extracting columns from text file with different delimiters in Linux

Tags:

linux

I have very large genotype files that are basically impossible to open in R, so I am trying to extract the rows and columns of interest using the Linux command line. Rows are straightforward enough using head/tail, but I'm having difficulty figuring out how to handle the columns.

If I attempt to extract (say) the 100th to 105th tab- or space-delimited columns using

 cut -c100-105 myfile >outfile

this obviously won't work if there are strings of multiple characters in each column. Is there some way to modify cut with appropriate arguments so that it extracts the entire string within a column, where columns are defined as space or tab (or any other character) delimited?

659

asked Nov 13 '13 16:11

user1815498

1 Answer

If the command has to work with both tabs and spaces as delimiters, I would use awk, which by default splits fields on any run of spaces or tabs:

awk '{print $100,$101,$102,$103,$104,$105}' myfile > outfile
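To see why awk copes with mixed delimiters, here is a minimal sketch on a made-up sample file (the file name and contents are hypothetical, and columns 2-3 stand in for 100-105 to keep it readable):

```shell
# Hypothetical sample mixing tabs and runs of spaces between columns.
tmpdir=$(mktemp -d)
printf 'id1\tA  C G\tT\n'   >  "$tmpdir/sample.txt"
printf 'id2  G\tT\tA   C\n' >> "$tmpdir/sample.txt"

# With the default field separator, awk treats any run of spaces/tabs as one delimiter.
awk '{print $2, $3}' "$tmpdir/sample.txt" > "$tmpdir/out.txt"

cat "$tmpdir/out.txt"
# A C
# G T
```

The output fields are joined with a single space, which is awk's default output separator; pass `-v OFS='\t'` if the result should be tab-delimited.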

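As long as you only need a handful of fields it is fine to just type them out. For longer ranges you can use a for loop; printf keeps the selected fields of each input row on a single output line, separated by OFS, whereas a plain print $i inside the loop would put every field on its own line:

awk '{for(i=100;i<=105;i++) printf "%s%s", $i, (i<105 ? OFS : ORS)}' myfile > outfile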
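If the range changes often, the bounds can be passed in as awk variables instead of being hard-coded. The following is only a sketch: `first`, `last`, and the sample file are illustrative names, and the output is made tab-delimited via OFS as an assumption about what downstream tools expect.

```shell
# Hypothetical space-delimited sample with six columns.
tmpdir=$(mktemp -d)
printf 'a b c d e f\n1 2 3 4 5 6\n' > "$tmpdir/sample.txt"

first=2   # first column to keep (illustrative value)
last=4    # last column to keep (illustrative value)

# Build each output row from fields first..last, joined by OFS (a tab here).
awk -v first="$first" -v last="$last" -v OFS='\t' '{
    line = $first
    for (i = first + 1; i <= last; i++) line = line OFS $i
    print line
}' "$tmpdir/sample.txt" > "$tmpdir/range.txt"

cat "$tmpdir/range.txt"
```

Each input row yields exactly one output row, so the result still lines up with the original file row for row.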

If you want to use cut, use the -f option, which selects fields instead of characters (fields are TAB-delimited by default):

cut -f100-105 myfile > outfile

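If the field delimiter is not TAB, specify it with -d (cut takes a single delimiter character and treats every occurrence as a separator, so runs of spaces produce empty fields):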

cut -d' ' -f100-105 myfile > outfile
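One thing to watch with `-d' '` is padding: cut counts every single space as a delimiter. The sketch below, on a made-up line with two spaces after the first column, shows the effect and one possible workaround, squeezing repeated spaces with `tr -s` before cutting:

```shell
# Hypothetical line where the first two columns are separated by two spaces.
tmpdir=$(mktemp -d)
printf 'id1  AA AG\n' > "$tmpdir/spaces.txt"

# cut sees an empty field between the two spaces, so field 2 is empty.
naive=$(cut -d' ' -f2 "$tmpdir/spaces.txt")

# Squeezing runs of spaces first makes field 2 the value one would expect.
squeezed=$(tr -s ' ' < "$tmpdir/spaces.txt" | cut -d' ' -f2)

printf 'naive=[%s] squeezed=[%s]\n' "$naive" "$squeezed"
# naive=[] squeezed=[AA]
```

If the file mixes tabs and spaces as well, awk's default splitting is probably the simpler choice.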

Check the man page for more info on the cut command.
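Since the question also selects rows with head/tail, the two steps can be chained in one pipeline so the large file is only read once. This is a sketch on invented data; the row and column numbers are placeholders:

```shell
# Hypothetical tab-delimited file with four rows and four columns.
tmpdir=$(mktemp -d)
printf 'r1\ta\tb\tc\nr2\td\te\tf\nr3\tg\th\ti\nr4\tj\tk\tl\n' > "$tmpdir/rows.tsv"

# tail -n +2 starts at row 2, head -n 2 keeps two rows (rows 2-3),
# and cut then keeps columns 2-3 of those rows.
tail -n +2 "$tmpdir/rows.tsv" | head -n 2 | cut -f2-3 > "$tmpdir/subset.tsv"

cat "$tmpdir/subset.tsv"
```

Putting head before cut means cut only processes the rows that are kept, and head stops reading once it has printed enough lines.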

185

answered Sep 21 '22 00:09

hek2mgl
