 

Python sum of ASCII values of all characters in a string

I am searching for a more efficient way to sum up the ASCII values of all characters in a given string, using only standard Python (2.7 preferred).

Currently I have:

print sum(ord(ch) for ch in text)

I want to emphasize that my main focus is the question as stated above.

The following is a somewhat less important aspect of this question and should be treated as such:

So why am I asking? I have compared this approach against embedding a simple C function that does the same thing (using PyInline), and the embedded C function turned out to be about 17 times faster.

If there is no approach faster than what I suggested (using only standard Python), it seems strange that the Python developers haven't added such an implementation to the core.

Current results for the suggested answers, on my Windows 7 machine (Core i7, Python 2.7):

 text = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
 sum(ord(ch) for ch in text)
 >> 0.00521324663262
 sum(array.array("B", text))
 >> 0.0010040770317
 sum(map(ord, text ))
 >> 0.00427160369234
 sum(bytearray(text))
 >> 0.000864669402933

 C-code embedded:
 >> 0.000272828426841
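For reference, the comparison above can be reproduced with the standard timeit module. This is a sketch written for Python 3 (where `timeit.timeit` accepts a `globals` argument and `bytearray` needs an encoding for a str); under 2.7 the same statements work with `bytearray(text)` and a setup string instead of `globals`:

```python
import timeit

# Sketch: time the pure-Python candidates from the question.
# Absolute numbers will differ from the Python 2.7 figures quoted above.
text = "a" * 56

candidates = {
    "generator": "sum(ord(ch) for ch in text)",
    "map":       "sum(map(ord, text))",
    "bytearray": "sum(bytearray(text, 'ascii'))",  # Python 3 form
}

for name, stmt in candidates.items():
    t = timeit.timeit(stmt, globals={"text": text}, number=100000)
    print("%-10s %.6f s per 100k calls" % (name, t))
```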
Michael asked Sep 19 '12 09:09


3 Answers

print sum(map(ord, my_string))

This would be the easiest.
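For example, `map(ord, ...)` applies `ord` to every character and `sum` adds the results:

```python
# "abcdefgh" has ord values 97..104, which sum to 804.
s = "abcdefgh"
print(sum(map(ord, s)))  # -> 804
```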

joe answered Oct 23 '22 11:10


You can use an intermediate bytearray to speed things up:

>>> sum(bytearray("abcdefgh"))
804

This is not 17 times faster than the generator—it involves the creation of an intermediate bytearray and sum still has to iterate over Python integer objects—but on my machine it does speed up summing an 8-character string from 2μs to about 700ns. If a timing in this ballpark is still too inefficient for your use case, you should probably write the speed-critical parts of your application in C anyway.
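A side note in case you later move to Python 3: there, `bytearray` requires an encoding when given a str, and a bytes object already iterates as integers, so you can sum it directly:

```python
# Python 3: a str must be encoded before it can be viewed as bytes.
print(sum(bytearray("abcdefgh", "ascii")))  # -> 804

# A bytes literal yields ints when iterated, so no conversion is needed.
print(sum(b"abcdefgh"))  # -> 804
```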

If your strings are sufficiently large, and if you can use numpy, you can avoid creating temporary copies by directly referring to the string's buffer using numpy.frombuffer:

>>> import numpy as np
>>> np.frombuffer("abcdefgh", "uint8").sum()
804

For smaller strings this is slower than a temporary array because of the complexities in numpy's view creation machinery. However, for sufficiently large strings, the frombuffer approach starts to pay off, and it of course always creates less garbage. On my machine the cutoff point is a string length of about 200 characters.
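The same idea carries over to Python 3, where you pass a bytes object to `numpy.frombuffer` (a sketch, assuming numpy is installed):

```python
import numpy as np

# frombuffer creates a read-only uint8 view over the bytes object's
# buffer, so no copy of the data is made before summing.
data = b"abcdefgh" * 100          # a "large enough" input
total = int(np.frombuffer(data, dtype=np.uint8).sum())
print(total)  # -> 80400 (i.e. 804 * 100)
```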

Also, see Guido's classic essay Python Optimization Anecdote. While some of its specific techniques may by now be obsolete, the general lesson of how to think about Python optimization is still quite relevant.


You can time the different approaches with the timeit module:

$ python -m timeit -s 's = "a" * 20' 'sum(ord(ch) for ch in s)' 
100000 loops, best of 3: 3.85 usec per loop
$ python -m timeit -s 's = "a" * 20' 'sum(bytearray(s))'
1000000 loops, best of 3: 1.05 usec per loop
$ python -m timeit -s 'from numpy import frombuffer; s = "a" * 20' \
                      'frombuffer(s, "uint8").sum()' 
100000 loops, best of 3: 4.8 usec per loop
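The same measurements can also be taken from inside a script via `timeit.repeat`, which mirrors the command line's "best of 3" behavior (sketch written for Python 3, hence the encoding argument to `bytearray`; exact numbers will vary by machine):

```python
import timeit

# Equivalent of the command-line invocations above, from Python code.
setup = 's = "a" * 20'
for stmt in ('sum(ord(ch) for ch in s)', 'sum(bytearray(s, "ascii"))'):
    best = min(timeit.repeat(stmt, setup=setup, repeat=3, number=100000))
    print("%-30s %.3f usec per loop" % (stmt, best / 100000 * 1e6))
```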
user4815162342 answered Oct 23 '22 13:10


You can speed it up a bit (roughly 40%, though nowhere near native C speed) by avoiding the creation of the generator.

Instead of:

sum(ord(c) for c in string)

Do:

sum(map(ord, string))

Timings:

>>> timeit.timeit(stmt="sum(map(ord, 'abcdefgh'))")
# TP: 1.5709713941578798
# JC: 1.425781011581421
>>> timeit.timeit(stmt="sum(ord(c) for c in 'abcdefgh')")
# TP: 1.7807035140629637
# JC: 1.9981679916381836
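As a quick sanity check, all of the variants discussed here compute the same value; only their speed differs (the `bytearray` call is written in its Python 3 form, which requires an encoding):

```python
# Collect the results of each variant in a set: it collapses to one value.
s = "abcdefgh"
results = {
    sum(ord(c) for c in s),
    sum(map(ord, s)),
    sum(bytearray(s, "ascii")),  # plain bytearray(s) on Python 2.7
}
print(results)  # -> {804}
```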
Jon Clements answered Oct 23 '22 13:10