I have a file like this:
NODE_1_length_4014_cov_1.97676 1 1
NODE_1_length_4014_cov_1.97676 2 1
NODE_1_length_4014_cov_1.97676 3 1
NODE_1_length_4014_cov_1.97676 4 1
NODE_1_length_4014_cov_1.97676 5 1
NODE_1_length_4014_cov_1.97676 6 1
NODE_1_length_4014_cov_1.97676 7 1
NODE_1_length_4014_cov_1.97676 8 1
NODE_1_length_4014_cov_1.97676 9 1
NODE_1_length_4014_cov_1.97676 10 1
I want to split the first column into new columns based on the character '_', like this:
1 4014 1.97676 1 1
1 4014 1.97676 2 1
1 4014 1.97676 3 1
1 4014 1.97676 4 1
1 4014 1.97676 5 1
1 4014 1.97676 6 1
1 4014 1.97676 7 1
1 4014 1.97676 8 1
1 4014 1.97676 9 1
1 4014 1.97676 10 1
I know I can do it in pandas:
df.columns = ['contig', 'loci', 'depth']
df['node'] = df.contig.str.split('_').str[1]
df['len'] = df.contig.str.split('_').str[3]
df['cov'] = df.contig.str.split('_').str[5]
df.drop(['contig'], axis=1, inplace=True)
But my file is too large (>2 GB) and it takes forever on my computer to do this simple thing. Is there a faster way? In my experience pandas is generally very slow at dealing with large files, although it is easy to use.
Thank you very much.
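(For reference, even within pandas the three repeated .str.split() passes can be collapsed into a single split with expand=True. A minimal sketch of the same logic, assuming a space-separated file named infile:)
import pandas as pd

# One split pass with expand=True instead of three .str.split() calls;
# 'infile' and the single-space separator are assumptions about the file above.
df = pd.read_csv('infile', sep=' ', header=None,
                 names=['contig', 'loci', 'depth'])
parts = df['contig'].str.split('_', expand=True)  # NODE | 1 | length | 4014 | cov | 1.97676
df['node'] = parts[1]
df['len'] = parts[3]
df['cov'] = parts[5]
df.drop(columns='contig', inplace=True)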
Using awk
With _ as the field separator, fields 2, 4 and 6 are the ones you want; field 6 still contains the trailing space-separated columns, so it carries the last two values along:
$ awk -F'_' '{print $2, $4, $6}' infile
1 4014 1.97676 1 1
1 4014 1.97676 2 1
1 4014 1.97676 3 1
1 4014 1.97676 4 1
1 4014 1.97676 5 1
1 4014 1.97676 6 1
1 4014 1.97676 7 1
1 4014 1.97676 8 1
1 4014 1.97676 9 1
1 4014 1.97676 10 1
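If you would rather stay in Python without loading the whole file at once, the same one-pass split can be streamed line by line; a rough sketch ('infile' and 'outfile' are placeholder names):
# Stream the file: split each line on '_' once and keep fields 2, 4 and 6
# (1-indexed), mirroring the awk command above.
with open('infile') as src, open('outfile', 'w') as dst:
    for line in src:
        f = line.rstrip('\n').split('_')
        # f[5] still holds "1.97676 1 1" since only '_' was split
        dst.write(f"{f[1]} {f[3]} {f[5]}\n")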
You can also use cut
$ cut -d_ --output-delimiter=$'\t' -f2,4,6 ip.txt
1 4014 1.97676 1 1
1 4014 1.97676 2 1
1 4014 1.97676 3 1
1 4014 1.97676 4 1
1 4014 1.97676 5 1
1 4014 1.97676 6 1
1 4014 1.97676 7 1
1 4014 1.97676 8 1
1 4014 1.97676 9 1
1 4014 1.97676 10 1
The -d option specifies the input delimiter, and --output-delimiter=$'\t' sets the output delimiter. Note the use of $'\t' for the tab character (not sure if this works in all shells; I tested it on bash).
Speed comparison
$ cut --version | head -n1
cut (GNU coreutils) 8.25
$ awk --version | head -n1
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
$ perl -0777 -lne 'print $_ x 1000000' ip.txt > f1
$ du -h f1
363M f1
$ time cut -d_ --output-delimiter=$'\t' -f2,4,6 f1 > t1
real 0m2.097s
user 0m1.660s
sys 0m0.324s
$ time cut -d_ --output-delimiter=$'\t' -f2,4,6 f1 > t1
real 0m3.267s
user 0m1.612s
sys 0m0.376s
$ time awk -F'_' -v OFS='\t' '{print $2,$4,$6}' f1 > t2
real 0m17.394s
user 0m16.200s
sys 0m0.328s
$ time awk -F'_' -v OFS='\t' '{print $2,$4,$6}' f1 > t2
real 0m16.329s
user 0m15.336s
sys 0m0.432s
$ diff -s t1 t2
Files t1 and t2 are identical
If the input is ASCII, awk shows a lot of improvement: LC_ALL=C forces the C locale and avoids locale-aware multibyte character handling.
$ time LC_ALL=C awk -F'_' -v OFS='\t' '{print $2,$4,$6}' f1 > t2
real 0m7.783s
user 0m6.832s
sys 0m0.428s