I have a task of creating a script which takes a huge text file as input. It then needs to find all words and their number of occurrences, and create a new file in which each line shows a unique word and its count.
As an example take a file with this content:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure
dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum.
I need to create a file which looks like this:
1 AD
1 ADIPISICING
1 ALIQUA
...
1 ALIQUIP
1 DO
2 DOLOR
2 DOLORE
...
For this I wrote a script using tr, sort and uniq:
#!/bin/sh
INPUT=$1
OUTPUT=$2
if [ -a "$INPUT" ]
then
    tr '[:space:][\-_?!.;\:]' '\n' < "$INPUT" |
    tr -d '[:punct:][:special:][:digit:]' |
    tr '[:lower:]' '[:upper:]' |
    sort |
    uniq -c > "$OUTPUT"
fi
What this does is split the input into words using whitespace as the delimiter. If a word contains any of -_?!.;: it is split into words again at those characters. I then remove punctuation, special characters and digits, and convert the entire string to uppercase. Once this is done I sort the words and pass them through uniq to get the format I want.
Now I downloaded the bible in txt format and used it as the input. Timing this I got:
scripts|$ time ./text-to-word.sh text.txt b
./text-to-word.sh text.txt b 16.17s user 0.09s system 102% cpu 15.934 total
I did the same with a Python script:
import re
from collections import Counter
from itertools import chain
import sys
file = open(sys.argv[1])
c = Counter()
for line in file.readlines():
    c.update([re.sub('[^a-zA-Z]', '', l).upper()
              for l in chain(*[re.split('[-_?!.;:]', word)
                               for word in line.split()])])
file2 = open('output.txt', 'w')
for key in sorted(c):
    file2.write(key + ' ' + str(c[key]) + '\n')
When I executed the script I got:
scripts|$ time python text-to-word.py text.txt
python text-to-word.py text.txt 7.23s user 0.04s system 97% cpu 7.456 total
As you can see it ran in 7.23s compared to the shell script's 16.17s. I have tried with bigger files, and Python always seems to win. I have a few questions about the scenario above:
To be clear, I am not comparing Python to shell scripts. I am not trying to start a flame war, and I do not need answers written in any other language claiming to be faster. Using the UNIX philosophy of piping small commands together to do a task, how do I make the shell script faster?
According to my tests, ksh is the winner and zsh is the runner-up. Both shells are 2–30 times faster than bash depending on the test.
C is by far the fastest of them all. Bash (the Bourne Again Shell) is an interpreter written in C, so every script statement goes through an extra layer of interpretation, which costs speed. The same goes for any other shell.
Bash, like Python, is a general-purpose language, and each has its strengths and weaknesses. Bash is the default shell in most Linux distributions, which makes it convenient for this kind of task, but being the default does not by itself make it faster.
An important point here is probably inter-process I/O. The Python script has all data in memory, so no I/O happens while it processes the data.
Also note that Python isn't slow as such. Most functionality in Python is implemented in C.
The shell script has to start 5 processes, and because each of them reads the whole text from stdin and writes it to stdout, the entire text is copied across a pipe four times.
There might be a way to make the Python script a bit faster: You can read the whole text into a single string, then remove all punctuation, split words and then count them:
text = file.read()
text = re.sub(r'[-.,:;_]', '', text)  # '-' placed first so it is not read as a range
text = text.upper()
words = re.split(r'\s+', text)
c = Counter()
c.update(words)
That would avoid the overhead of several nested loops.
As for the shell script: you should try to reduce the number of processes. The three tr processes could probably be replaced with one call to sed.
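As a sketch of that suggestion (not the poster's script; GNU sed is assumed, for \n in a replacement): a single substitution that turns every run of non-letters into a newline covers the word splitting, the punctuation removal and the digit removal in one step, and sed's y command handles the case conversion.

```shell
#!/bin/sh
# Sketch: one sed process instead of three tr processes (GNU sed assumed).
# s///: every run of non-letters becomes a newline, which splits words and
#       drops digits/punctuation at the same time.
# y///: sed's transliteration command upcases the letters.
sed -e 's/[^A-Za-z]\{1,\}/\n/g' \
    -e 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/' "$1" |
    sort | uniq -c > "$2"
```

As with the original tr pipeline, an empty line can appear when input starts with a non-letter; an extra grep . stage would drop those, at the cost of one more process.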
It isn't a matter of one language vs another. Your approach is different.
In Python, you are incrementing a counter for every word as you encounter it, and then iterating your counter to produce the output. This is going to be O(n).
In the shell pipeline, you are putting all of the words, one per line, into one long stream, sorting that stream, and then counting repeated instances. The sort is most likely going to be O(n log n).
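The O(n) hash-counting approach is also available inside a pipeline: a single awk process can count words in an associative array, much like Python's Counter. A sketch (not the poster's script; POSIX awk assumed):

```shell
#!/bin/sh
# Sketch: one-pass O(n) word count with awk's associative array.
# Only the unique words are sorted at the end, not every occurrence.
awk '{
    n = split(toupper($0), words, /[^A-Z]+/)   # split each line on non-letter runs
    for (i = 1; i <= n; i++)
        if (words[i] != "")                    # skip empty fields at line edges
            count[words[i]]++
}
END {
    for (w in count)
        print count[w], w
}' "$1" | sort -k2 > "$2"
```

The trailing sort still costs n log n, but only over the unique words, which for natural-language text is far fewer items than the full word stream the sort | uniq -c pipeline has to order.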