Fast string to integer conversion in Python

A simple problem, really: you have one billion (1e+9) unsigned 32-bit integers stored as decimal ASCII strings in a TSV (tab-separated values) file. Conversion using int() is horribly slow compared to other tools working on the same dataset. Why? And more importantly: how to make it faster?

Hence the question: what is the fastest possible way to convert a string to an integer in Python?

What I'm really thinking about is some semi-hidden Python functionality that could be (ab)used for this purpose, not unlike Guido's use of array.array in his "Optimization Anecdote".

Sample data (with tabs expanded to spaces)

38262904        "pfv"              2002-11-15T00:37:20+00:00
12311231        "tnealzref"        2008-01-21T20:46:51+00:00
26783384        "hayb"             2004-02-14T20:43:45+00:00
812874          "qevzasdfvnp"      2005-01-11T00:29:46+00:00
22312733        "bdumtddyasb"      2009-01-17T20:41:04+00:00

The time it takes to read the data is irrelevant here; processing the data is the bottleneck.
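To make the task concrete, here is a minimal sketch of the per-row work, assuming Python 3 and that only the first column needs converting. The function name first_column_as_ints is hypothetical, and the two rows are taken from the sample above with real tab characters restored.

```python
# Sketch only: pull the first tab-separated field out of each row and convert it.
# Two rows from the sample data, with tabs instead of expanded spaces.
sample = (
    '38262904\t"pfv"\t2002-11-15T00:37:20+00:00\n'
    '812874\t"qevzasdfvnp"\t2005-01-11T00:29:46+00:00\n'
)


def first_column_as_ints(text):
    """Return the integer found in the first field of every non-empty line."""
    # maxsplit=1 means only the first tab is used as a split point;
    # the remaining fields are left as one unsplit string and ignored.
    return [int(line.split("\t", 1)[0]) for line in text.splitlines() if line]


ids = first_column_as_ints(sample)
print(ids)
```

The int() call inside the list comprehension is the step whose cost the question is about.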

Microbenchmarks

All of the following are interpreted languages. The host machine is running 64-bit Linux.

Python 2.6.2 with IPython 0.9.1, ~214k conversions per second (100%):

In [1]: strings = map(str, range(int(1e7)))

In [2]: %timeit map(int, strings);
10 loops, best of 3: 4.68 s per loop
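For readers without IPython or on Python 3, the same measurement can be approximated with the standard timeit module. This is a rough sketch, not the original benchmark: the helper name conversions_per_second and the smaller default n are assumptions chosen so it runs quickly, and absolute numbers will differ by machine and Python version.

```python
import timeit


def conversions_per_second(n=100_000, repeat=3):
    """Time list(map(int, strings)) over n decimal strings; return conversions/s."""
    strings = [str(i) for i in range(n)]
    # list() forces the conversions to run: in Python 3, map() alone is lazy,
    # so timing a bare map(int, strings) would measure almost nothing.
    timings = timeit.repeat(lambda: list(map(int, strings)), number=1, repeat=repeat)
    # Like %timeit, report the best (smallest) of the repeated runs.
    return n / min(timings)


rate = conversions_per_second()
print(f"{rate / 1000:.0f}k conversions per second")
```

Passing n=int(1e7) reproduces the input size used in the session above.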

REBOL 3.0 Version 2.100.76.4.2, ~231kcps (108%):

>> strings: array n: to-integer 1e7 repeat i n [poke strings i mold (i - 1)]
== "9999999"

>> delta-time [map str strings [to integer! str]]
== 0:00:04.328675

REBOL 2.7.6.4.2 (15-Mar-2008), ~523kcps (261%):

As John noted in the comments, this version does not build a list of converted integers, so the speed ratio given is relative to the 4.99 s Python takes to run the plain loop: for str in strings: int(str).

>> delta-time: func [c /local t] [t: now/time/precise do c now/time/precise - t]

>> strings: array n: to-integer 1e7 repeat i n [poke strings i mold (i - 1)]
== "9999999"

>> delta-time [foreach str strings [to integer! str]]
== 0:00:01.913193

KDB+ 2.6t 2009.04.15, ~2016kcps (944%):

q)strings:string til "i"$1e7

q)\t "I"$strings
496
earl asked Aug 20 '09 22:08


1 Answer

Even the following, very simplistic C extension function already improves heavily on the builtin, converting over three times as many strings per second (650kcps vs 214kcps):

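/* Convert a string of decimal digits to an unsigned int; no validation. */
static PyObject *fastint_int(PyObject *self, PyObject *args) {
    const char *s;
    unsigned r = 0;
    if (!PyArg_ParseTuple(args, "s", &s)) return NULL;
    /* Multiply the running total by 10 and add the next digit. */
    for (; *s; s++) r = r * 10 + (*s - '0');
    /* "I" builds from an unsigned int; "i" would turn values >= 2^31 negative. */
    return Py_BuildValue("I", r);
}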

This obviously does not cater for integers of arbitrary length and various other special cases, but that's no problem in our scenario.
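To see exactly which special cases are skipped, the loop can be mirrored in pure Python. This is an illustrative sketch, not part of the extension: the name fastint_reference is hypothetical, and the 0xFFFFFFFF mask assumes that C's unsigned is 32 bits wide, as it is on typical 64-bit Linux. It is far slower than int() and only models the behaviour.

```python
def fastint_reference(s):
    """Pure-Python model of the C loop: r = r * 10 + digit, wrapping at 32 bits."""
    r = 0
    for ch in s:
        # No validation: any character is treated as if it were a digit.
        # The mask imitates unsigned 32-bit overflow (an assumption about the platform).
        r = (r * 10 + ord(ch) - ord("0")) & 0xFFFFFFFF
    return r


print(fastint_reference("38262904"))    # 38262904
print(fastint_reference("4294967295"))  # 4294967295, the largest 32-bit value
print(fastint_reference("4294967296"))  # 0, silently wrapped around
```

Signs, whitespace, and non-digit characters are likewise accepted without complaint and produce meaningless results, which is acceptable only because the input is known to consist of well-formed unsigned 32-bit values.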

earl answered Sep 23 '22 01:09