How to split columns faster in Python?

I have a file like this:

NODE_1_length_4014_cov_1.97676  1   1
NODE_1_length_4014_cov_1.97676  2   1
NODE_1_length_4014_cov_1.97676  3   1
NODE_1_length_4014_cov_1.97676  4   1
NODE_1_length_4014_cov_1.97676  5   1
NODE_1_length_4014_cov_1.97676  6   1
NODE_1_length_4014_cov_1.97676  7   1
NODE_1_length_4014_cov_1.97676  8   1
NODE_1_length_4014_cov_1.97676  9   1
NODE_1_length_4014_cov_1.97676  10  1

I want to split the first column into new columns based on the character '_', like this:

1 4014 1.97676  1   1
1 4014 1.97676  2   1
1 4014 1.97676  3   1
1 4014 1.97676  4   1
1 4014 1.97676  5   1
1 4014 1.97676  6   1
1 4014 1.97676  7   1
1 4014 1.97676  8   1
1 4014 1.97676  9   1
1 4014 1.97676  10  1

I know I can do it in pandas:

df.columns = ['contig', 'loci', 'depth']
df['node'] = df.contig.str.split('_').str[1]
df['len'] = df.contig.str.split('_').str[3]
df['cov'] = df.contig.str.split('_').str[5]
df.drop(['contig'], axis=1, inplace=True)
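For reference, the three repeated str.split passes can be collapsed into a single split with expand=True, which avoids re-splitting the same column three times. A sketch with a tiny inline sample, since the construction of df isn't shown:

```python
import pandas as pd

# Inline sample standing in for the real DataFrame from the question.
df = pd.DataFrame({"contig": ["NODE_1_length_4014_cov_1.97676"] * 2,
                   "loci": [1, 2], "depth": [1, 1]})

# Split the contig column once; columns 1, 3, 5 hold node, length, coverage.
parts = df["contig"].str.split("_", expand=True)
df[["node", "len", "cov"]] = parts[[1, 3, 5]].to_numpy()
df = df.drop(columns="contig")
```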

But my file is too large (>2 GB), and this simple operation takes forever on my computer. Is there a faster way to do it? In my experience pandas is easy to use but generally slow when dealing with large files.
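For a file this size, a plain streaming pass in Python (no pandas) keeps memory use flat regardless of file size. A minimal sketch, assuming the three-column layout shown above; io.StringIO stands in for the real input and output files, whose names are not given:

```python
import io

def split_contig_columns(src, dst):
    # Stream line by line: split the whitespace-separated columns,
    # then split the contig name on '_' exactly once per line.
    for line in src:
        contig, loci, depth = line.split()
        # contig looks like NODE_1_length_4014_cov_1.97676
        _, node, _, length, _, cov = contig.split("_")
        dst.write(f"{node}\t{length}\t{cov}\t{loci}\t{depth}\n")

sample = "NODE_1_length_4014_cov_1.97676  1   1\n"
out = io.StringIO()
split_contig_columns(io.StringIO(sample), out)
```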

Thank you very much.

asked Dec 31 '25 by Steve Xu

2 Answers

Using awk. Since the underscores occur only in the first column, splitting on '_' leaves everything after cov_ (here "1.97676  1   1") in $6; $7 and $8 are empty and only add trailing separators, so printing $2, $4 and $6 would be enough:

$ awk -F'_' '{print $2, $4,$6,$7,$8}' infile
1 4014 1.97676  1   1  
1 4014 1.97676  2   1  
1 4014 1.97676  3   1  
1 4014 1.97676  4   1  
1 4014 1.97676  5   1  
1 4014 1.97676  6   1  
1 4014 1.97676  7   1  
1 4014 1.97676  8   1  
1 4014 1.97676  9   1  
1 4014 1.97676  10  1  
answered Jan 02 '26 by Akshay Hegde


You can also use cut:

$ cut -d_ --output-delimiter=$'\t' -f2,4,6 ip.txt
1   4014    1.97676  1   1
1   4014    1.97676  2   1
1   4014    1.97676  3   1
1   4014    1.97676  4   1
1   4014    1.97676  5   1
1   4014    1.97676  6   1
1   4014    1.97676  7   1
1   4014    1.97676  8   1
1   4014    1.97676  9   1
1   4014    1.97676  10  1
  • the -d option lets you specify the input delimiter
  • --output-delimiter=$'\t' specifies the output delimiter. Note the use of $'\t' (ANSI-C quoting) for the Tab character; this works in bash and other shells that support $'...', but may not be portable to all shells


Speed comparison

$ cut --version | head -n1
cut (GNU coreutils) 8.25
$ awk --version | head -n1
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)

$ # repeat the sample input a million times to build a large test file
$ perl -0777 -lne 'print $_ x 1000000' ip.txt > f1
$ du -h f1
363M    f1

$ time cut -d_ --output-delimiter=$'\t' -f2,4,6 f1 > t1

real    0m2.097s
user    0m1.660s
sys     0m0.324s
$ time cut -d_ --output-delimiter=$'\t' -f2,4,6 f1 > t1

real    0m3.267s
user    0m1.612s
sys     0m0.376s

$ time awk -F'_' -v OFS='\t' '{print $2,$4,$6}' f1 > t2

real    0m17.394s
user    0m16.200s
sys     0m0.328s
$ time awk -F'_' -v OFS='\t' '{print $2,$4,$6}' f1 > t2

real    0m16.329s
user    0m15.336s
sys     0m0.432s

$ diff -s t1 t2
Files t1 and t2 are identical

If the input is ASCII, forcing the C locale gives awk a large speedup:

$ time LC_ALL=C awk -F'_' -v OFS='\t' '{print $2,$4,$6}' f1 > t2

real    0m7.783s
user    0m6.832s
sys     0m0.428s
answered Jan 03 '26 by Sundeep

