What is the fastest way to remove a number from the beginning of so many files?

Question

I have 1000 files each having one million lines. Each line has the following form:

a number,a text

I want to remove the number from the beginning of every line of every file, including the comma.

Example:

14671823,aboasdyflj -> aboasdyflj

What I'm doing is:

os.system("sed -i -- 's/^.*,//g' data/*")

and it works fine but it's taking a huge amount of time.

What is the fastest way to do this?

I'm coding in Python.

klutt · Accepted Answer

This is much faster:

cut -f2 -d ',' data.txt > tmp.txt && mv tmp.txt data.txt

On a file with 11 million rows it took less than one second.
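To see what the cut call does to a single line, here is a small sketch. The sample lines and the two helper names are made up for illustration. One thing to be aware of: -f2 keeps only the second field, so if the text itself can contain commas, -f2- (field 2 through the end of the line) is probably what you want.

```shell
# Hypothetical helpers for illustration; each reads lines from stdin.
keep_second_field() { cut -f2 -d ','; }     # same flags as the command above
keep_after_first_comma() { cut -f2- -d ','; }  # field 2 through end of line

# Line taken from the question's example: both variants give the same result.
printf '14671823,aboasdyflj\n' | keep_second_field

# Made-up line whose text contains a comma: the variants differ.
printf '14671823,abo,asd\n' | keep_second_field        # drops ",asd"
printf '14671823,abo,asd\n' | keep_after_first_comma   # keeps "abo,asd"
```

If every line has exactly one comma, the two forms behave the same.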

To use this on several files in a directory, use:

TMP=/pathto/tmpfile
for file in dir/*; do
    cut -f2 -d ',' "$file" > "$TMP" && mv "$TMP" "$file"
done

It is worth mentioning that editing a file in place often takes much longer than writing to a separate file. I tried your sed command but wrote the output to a temporary file instead of editing in place, and the total time went down from 26s to 9s.
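The sed variant with a temporary file could look roughly like the sketch below. This is not the exact command I timed; the function name strip_prefix_sed is made up, and I am assuming mktemp is available. I also used [^,]* instead of .* so the match stops at the first comma, since ^.*, is greedy and would remove everything up to the last comma on the line.

```shell
# strip_prefix_sed FILE
# Hypothetical helper: removes everything up to and including the first
# comma on every line of FILE, writing to a temporary file and then
# moving it over the original instead of using sed -i.
strip_prefix_sed() {
    file=$1
    tmp=$(mktemp) || return 1
    if sed 's/^[^,]*,//' "$file" > "$tmp"; then
        # Note: the result takes on the temporary file's permissions.
        mv "$tmp" "$file"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

You would call it in a loop, for example: for file in data/*; do strip_prefix_sed "$file"; done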

Tags:

performance

regex

bash

shell

text-processing

yukashima huksay

1 Answer

klutt
