Converting string to int is too slow

I've got a program that reads in three strings per line for 50,000 lines and then does other things. The part that reads the file and converts the strings to integers is taking 80% of the total running time.

My code snippet is below:

import time
file = open ('E:/temp/edges_big.txt').readlines()
start_time = time.time()
for line in file[1:]:
    label1, label2, edge = line.strip().split()
    # label1 = int(label1); label2 = int(label2); edge = float(edge)
    # Rest of the loop deleted
print ('processing file took ', time.time() - start_time, "seconds")

The above takes about 0.84 seconds. Now, when I uncomment the line

label1 = int(label1);label2 = int(label2);edge = float(edge)

the runtime rises to about 3.42 seconds.

The input file has the form str1 str2 str3, one triple per line.

Are the functions int() and float() that slow? How could I optimize this?

James Otigo asked Dec 13 '12

1 Answer

If the file is in the OS cache, then parsing it takes milliseconds on my machine:

name                                 time ratio comment
read_read                        145 usec  1.00 big.txt
read_readtxt                    2.07 msec 14.29 big.txt
read_readlines                  4.94 msec 34.11 big.txt
read_james_otigo                29.3 msec 201.88 big.txt
read_james_otigo_with_int_float 82.9 msec 571.70 big.txt
read_map_local                  93.1 msec 642.23 big.txt
read_map                        95.6 msec 659.57 big.txt
read_numpy_loadtxt               321 msec 2213.66 big.txt

Where the read_*() functions are:

def read_read(filename):
    with open(filename, 'rb') as file:
        data = file.read()

def read_readtxt(filename):
    with open(filename, 'rU') as file:
        text = file.read()

def read_readlines(filename):
    with open(filename, 'rU') as file:
        lines = file.readlines()

def read_james_otigo(filename):
    file = open (filename).readlines()
    for line in file[1:]:
        label1, label2, edge = line.strip().split()

def read_james_otigo_with_int_float(filename):
    file = open (filename).readlines()
    for line in file[1:]:
        label1, label2, edge = line.strip().split()
        label1 = int(label1); label2 = int(label2); edge = float(edge)

def read_map(filename):
    with open(filename) as file:
        # Parse each non-blank line into an (int, int, float) tuple.
        L = [(int(l1), int(l2), float(edge))
             for line in file
             for l1, l2, edge in [line.split()] if line.strip()]

def read_map_local(filename, _i=int, _f=float):
    with open(filename) as file:
        # Same as read_map, but int/float are bound to local names via
        # default arguments, avoiding a global name lookup on each call.
        L = [(_i(l1), _i(l2), _f(edge))
             for line in file
             for l1, l2, edge in [line.split()] if line.strip()]

import numpy as np

def read_numpy_loadtxt(filename):
    a = np.loadtxt(filename, dtype=[('label1', 'i'),
                                    ('label2', 'i'),
                                    ('edge', 'f')])
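
For comparison, here is a variant the benchmark above does not include: read the whole file and split it into tokens once, which avoids the per-line strip()/split() overhead. This is an untimed sketch, and the name read_split_all is mine, not part of the original benchmark:

def read_split_all(filename):
    with open(filename) as file:
        tokens = file.read().split()  # all whitespace-separated tokens at once
    it = iter(tokens)
    # zip(it, it, it) walks the flat token list in groups of three.
    L = [(int(l1), int(l2), float(edge)) for l1, l2, edge in zip(it, it, it)]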

And big.txt is generated using:

#!/usr/bin/env python
import numpy as np

n = 50000
# np.random.random_integers is deprecated/removed in newer NumPy; randint's
# high bound is exclusive, so add 1 to keep the inclusive range [0, 1<<10].
a = np.random.randint(low=0, high=(1 << 10) + 1, size=2*n).reshape(-1, 2)
np.savetxt('big.txt', np.c_[a, np.random.rand(n)], fmt='%i %i %s')

It produces 50,000 lines of the form:

150 952 0.355243621018
582 98 0.227592557278
478 409 0.546382780254
46 879 0.177980983303
...

To reproduce results, download the code and run:

# write big.txt
python generate-file.py
# run benchmark
python read-array.py
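
read-array.py itself isn't reproduced above; a minimal sketch of what its timing loop might look like is below. The use of timeit, the repeat count, and the output formatting are my assumptions, not the original script:

import timeit

funcs = [read_read, read_readtxt, read_readlines,
         read_james_otigo, read_james_otigo_with_int_float,
         read_map_local, read_map, read_numpy_loadtxt]

for func in funcs:
    # Best of 5 runs, one call per run, against the generated big.txt.
    best = min(timeit.repeat(lambda: func('big.txt'), number=1, repeat=5))
    print('%-33s %8.2f msec' % (func.__name__, best * 1e3))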
jfs answered Sep 21 '22