A simple problem, really: you have one billion (1e+9) unsigned 32-bit integers stored as decimal ASCII strings in a TSV (tab-separated values) file. Conversion using int() is horribly slow compared to other tools working on the same dataset. Why? And more importantly, how can it be made faster?
Therefore the question: what is the fastest way possible to convert a string to an integer, in Python?
What I'm really thinking about is some semi-hidden Python functionality that could be (ab)used for this purpose, not unlike Guido's use of array.array in his "Optimization Anecdote".
Sample data (with tabs expanded to spaces):
38262904 "pfv" 2002-11-15T00:37:20+00:00
12311231 "tnealzref" 2008-01-21T20:46:51+00:00
26783384 "hayb" 2004-02-14T20:43:45+00:00
812874 "qevzasdfvnp" 2005-01-11T00:29:46+00:00
22312733 "bdumtddyasb" 2009-01-17T20:41:04+00:00
The time it takes to read the data is irrelevant here; processing the data is the bottleneck.
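For reference, here is a minimal sketch of the baseline conversion, assuming the integer is the first tab-separated field of each row, as in the sample above. It is written for Python 3; the function name first_column_ints is made up for illustration, and three of the sample rows are reproduced with their tabs restored.

```python
# Three rows from the sample data, with real tab separators restored.
# The layout (integer, quoted string, timestamp) is taken from the sample.
SAMPLE = (
    '38262904\t"pfv"\t2002-11-15T00:37:20+00:00\n'
    '12311231\t"tnealzref"\t2008-01-21T20:46:51+00:00\n'
    '812874\t"qevzasdfvnp"\t2005-01-11T00:29:46+00:00\n'
)


def first_column_ints(text):
    """Convert the first tab-separated field of every line with int()."""
    # split("\t", 1) stops at the first tab, so the remainder of the row
    # is not scanned for further separators.
    return [int(line.split("\t", 1)[0]) for line in text.splitlines()]


print(first_column_ints(SAMPLE))
```

The int() call in the list comprehension is the part the question is about; the splitting is only there to make the example self-contained.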
Microbenchmarks
All of the following are interpreted languages. The host machine is running 64-bit Linux.
Python 2.6.2 with IPython 0.9.1, ~214k conversions per second (100%):
In [1]: strings = map(str, range(int(1e7)))
In [2]: %timeit map(int, strings);
10 loops, best of 3: 4.68 s per loop
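Outside IPython, roughly the same measurement can be sketched with the standard timeit module. This version assumes Python 3, where map() is lazy and has to be wrapped in list() to do the work, and it uses a much smaller input than the 1e7 strings above so that it finishes quickly. The names N and convert_all are illustrative only.

```python
import timeit

N = 100_000  # far smaller than the 1e7 used above, to keep the run short
strings = [str(i) for i in range(N)]


def convert_all(items):
    """Build a list of ints, as map(int, strings) did in Python 2."""
    return list(map(int, items))


# Take the best of 3 single runs, mirroring what %timeit reports.
best = min(timeit.repeat(lambda: convert_all(strings), number=1, repeat=3))
print("%.0f conversions per second" % (N / best))
```

Absolute numbers from a run like this are not comparable to the figures quoted here, since the interpreter version and the hardware differ.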
REBOL 3.0 Version 2.100.76.4.2, ~231kcps (108%):
>> strings: array n: to-integer 1e7 repeat i n [poke strings i mold (i - 1)]
== "9999999"
>> delta-time [map str strings [to integer! str]]
== 0:00:04.328675
REBOL 2.7.6.4.2 (15-Mar-2008), ~523kcps (261%):
As John noted in the comments, this version does not build a list of converted integers, so the speed-ratio given is relative to Python's 4.99s runtime of for str in strings: int(str).
>> delta-time: func [c /local t] [t: now/time/precise do c now/time/precise - t]
>> strings: array n: to-integer 1e7 repeat i n [poke strings i mold (i - 1)]
== "9999999"
>> delta-time [foreach str strings [to integer! str]]
== 0:00:01.913193
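The difference between building a list of results and discarding them, which the note on the REBOL 2 figure refers to, can be checked on the Python side with a sketch like the following. It assumes Python 3 and a reduced input size; build_list and discard are hypothetical names.

```python
import timeit

strings = [str(i) for i in range(100_000)]  # reduced size, for a quick run


def build_list(items):
    # Converts every string and keeps the results in a new list.
    return [int(s) for s in items]


def discard(items):
    # Converts every string but throws each result away,
    # like the REBOL 2 foreach loop.
    for s in items:
        int(s)


for fn in (build_list, discard):
    best = min(timeit.repeat(lambda: fn(strings), number=1, repeat=3))
    print("%-10s %.4f s" % (fn.__name__, best))
```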
KDB+ 2.6t 2009.04.15, ~2016kcps (944%):
q)strings:string til "i"$1e7
q)\t "I"$strings
496
Converting strings to numbers is extremely common. In JavaScript, the easiest and fastest way (according to jsPerf) is the unary + (plus) operator. The unary - (minus) operator also converts the value to a number, but negates it as well.
To convert, or cast, a string to an integer in Python, you use the built-in int() function. It takes the string you want to convert as its argument and returns the equivalent integer. int() also accepts numeric values such as floats, which it truncates toward zero.
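A few calls illustrate what int() accepts and what it rejects. This is a small sketch for Python 3, not an exhaustive list of the accepted forms.

```python
print(int("38262904"))   # plain decimal digits
print(int("  42\n"))     # surrounding whitespace is ignored
print(int("-17"))        # an optional sign is accepted
print(int("ff", 16))     # an explicit base for non-decimal input
print(int(3.9))          # floats are truncated toward zero

try:
    int("12a")           # trailing garbage is an error, not ignored
except ValueError as exc:
    print("rejected:", exc)
```

This generality (sign, whitespace, base, arbitrary length, validation) is work a specialised converter can skip.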
atoi is the fastest I could come up with. I compiled with MSVC 2010, so it might be possible to combine both templates.
The following most simplistic C extension already improves heavily on the builtin, managing to convert over three times as many strings per second (650kcps vs 214kcps):
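/* Minimal decimal parser: no sign, whitespace, overflow, or error handling. */
static PyObject *fastint_int(PyObject *self, PyObject *args) {
    const char *s;
    unsigned int r = 0;
    if (!PyArg_ParseTuple(args, "s", &s))
        return NULL;
    /* Accumulate digits left to right until the terminating NUL. */
    for (; *s; s++)
        r = r * 10 + (unsigned int)(*s - '0');
    /* "I" builds from an unsigned int; "i" would turn values >= 2**31 negative. */
    return Py_BuildValue("I", r);
}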
This obviously does not cater for integers of arbitrary length and various other special cases, but that's no problem in our scenario.
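To make those limitations concrete, the same digit-accumulation loop can be mirrored in pure Python. This sketch is for illustration only (it will be slower than the built-in int(), not faster), it assumes a 32-bit unsigned int on the C side, and the name fastint is made up here.

```python
def fastint(s):
    """Pure-Python mirror of the C loop: digits only, no validation."""
    r = 0
    for ch in s:
        # ord(ch) - ord("0") is the digit value, like *s - '0' in C.
        r = r * 10 + (ord(ch) - ord("0"))
    # Unsigned arithmetic in C wraps around at 2**32 (assuming a 32-bit
    # unsigned int); the mask reproduces that wrap-around.
    return r & 0xFFFFFFFF


print(fastint("812874"))       # ordinary input converts as expected
print(fastint("4294967295"))   # largest unsigned 32-bit value
print(fastint("4294967296"))   # one past the limit wraps around
print(fastint("-5"))           # '-' is not rejected, so the result is garbage
```

Inputs outside the range of a 32-bit unsigned integer, or containing anything other than digits, silently produce wrong results, which is acceptable only because the dataset is known to contain neither.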