I have a file like this:
NODE_1_length_4014_cov_1.97676 1 1
NODE_1_length_4014_cov_1.97676 2 1
NODE_1_length_4014_cov_1.97676 3 1
NODE_1_length_4014_cov_1.97676 4 1
NODE_1_length_4014_cov_1.97676 5 1
NODE_1_length_4014_cov_1.97676 6 1
NODE_1_length_4014_cov_1.97676 7 1
NODE_1_length_4014_cov_1.97676 8 1
NODE_1_length_4014_cov_1.97676 9 1
NODE_1_length_4014_cov_1.97676 10 1
I want to split the first column into new columns based on the character '_', like this:
1 4014 1.97676 1 1
1 4014 1.97676 2 1
1 4014 1.97676 3 1
1 4014 1.97676 4 1
1 4014 1.97676 5 1
1 4014 1.97676 6 1
1 4014 1.97676 7 1
1 4014 1.97676 8 1
1 4014 1.97676 9 1
1 4014 1.97676 10 1
I know I can do it in pandas:
df.columns = ['contig', 'loci', 'depth']
df['node'] = df.contig.str.split('_').str[1]
df['len'] = df.contig.str.split('_').str[3]
df['cov'] = df.contig.str.split('_').str[5]
df.drop(['contig'], axis=1, inplace=True)
But my file is too large (>2 GB) and it takes forever on my computer to do this simple thing. Is there a faster way? In my experience pandas is generally very slow at dealing with large files, although it is easy to use.
Thank you very much.
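(For reference, even within pandas the three repeated .str.split() passes can be collapsed into a single split with expand=True. A minimal sketch of the same logic, assuming a space-separated file named infile:)
import pandas as pd

# One split pass with expand=True instead of three .str.split() calls;
# 'infile' and the single-space separator are assumptions about the file above.
df = pd.read_csv('infile', sep=' ', header=None,
                 names=['contig', 'loci', 'depth'])
parts = df['contig'].str.split('_', expand=True)  # NODE | 1 | length | 4014 | cov | 1.97676
df['node'] = parts[1]
df['len'] = parts[3]
df['cov'] = parts[5]
df.drop(columns='contig', inplace=True)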
Using awk
With _ as the field separator, fields 2, 4 and 6 are the ones you want; field 6 still contains the trailing space-separated columns, so it carries the last two values along:
$ awk -F'_' '{print $2, $4, $6}' infile
1 4014 1.97676 1 1
1 4014 1.97676 2 1
1 4014 1.97676 3 1
1 4014 1.97676 4 1
1 4014 1.97676 5 1
1 4014 1.97676 6 1
1 4014 1.97676 7 1
1 4014 1.97676 8 1
1 4014 1.97676 9 1
1 4014 1.97676 10 1
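If you would rather stay in Python without loading the whole file at once, the same one-pass split can be streamed line by line; a rough sketch ('infile' and 'outfile' are placeholder names):
# Stream the file: split each line on '_' once and keep fields 2, 4 and 6
# (1-indexed), mirroring the awk command above.
with open('infile') as src, open('outfile', 'w') as dst:
    for line in src:
        f = line.rstrip('\n').split('_')
        # f[5] still holds "1.97676 1 1" since only '_' was split
        dst.write(f"{f[1]} {f[3]} {f[5]}\n")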
You can also use cut
$ cut -d_ --output-delimiter=$'\t' -f2,4,6 ip.txt
1 4014 1.97676 1 1
1 4014 1.97676 2 1
1 4014 1.97676 3 1
1 4014 1.97676 4 1
1 4014 1.97676 5 1
1 4014 1.97676 6 1
1 4014 1.97676 7 1
1 4014 1.97676 8 1
1 4014 1.97676 9 1
1 4014 1.97676 10 1
The -d option specifies the input delimiter, and --output-delimiter=$'\t' sets the output delimiter. Note the use of $'\t' for the tab character (not sure if this works in all shells; I tested it on bash).
Speed comparison
$ cut --version | head -n1
cut (GNU coreutils) 8.25
$ awk --version | head -n1
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
$ perl -0777 -lne 'print $_ x 1000000' ip.txt > f1
$ du -h f1
363M f1
$ time cut -d_ --output-delimiter=$'\t' -f2,4,6 f1 > t1
real 0m2.097s
user 0m1.660s
sys 0m0.324s
$ time cut -d_ --output-delimiter=$'\t' -f2,4,6 f1 > t1
real 0m3.267s
user 0m1.612s
sys 0m0.376s
$ time awk -F'_' -v OFS='\t' '{print $2,$4,$6}' f1 > t2
real 0m17.394s
user 0m16.200s
sys 0m0.328s
$ time awk -F'_' -v OFS='\t' '{print $2,$4,$6}' f1 > t2
real 0m16.329s
user 0m15.336s
sys 0m0.432s
$ diff -s t1 t2
Files t1 and t2 are identical
If the input is ASCII, awk shows a lot of improvement: LC_ALL=C forces the C locale and avoids locale-aware multibyte character handling.
$ time LC_ALL=C awk -F'_' -v OFS='\t' '{print $2,$4,$6}' f1 > t2
real 0m7.783s
user 0m6.832s
sys 0m0.428s