
Converting a list of strings into a numpy array in a faster way

br is the name of a list of strings that goes like this:

['14 0.000000 -- (long term 0.000000)\n',
 '19 0.000000 -- (long term 0.000000)\n',
 '22 0.000000 -- (long term 0.000000)\n',
...

I am interested in the first two columns, which I would like to convert to a numpy array. So far, I've come up with the following solution:

import numpy as N

x = N.array([0., 0.])
for i in br:
    x = N.vstack((x, N.array(map(float, i.split()[:2]))))

This results in a 2-D array:

array([[  0.,   0.],
       [ 14.,   0.],
       [ 19.,   0.],
       [ 22.,   0.],
...

However, since br is rather big (~10^5 entries), this procedure takes some time. I was wondering, is there a way to accomplish the same result, but in less time?

asked Aug 31 '11 by Jir


3 Answers

This is dramatically faster for me:

import numpy as N

br = ['14 0.000000 -- (long term 0.000000)\n']*50000
aa = N.zeros((len(br), 2))

for i,line in enumerate(br):
    al, strs = aa[i], line.split(None, 2)[:2]
    al[0], al[1] = float(strs[0]), float(strs[1])

Changes:

  • Preallocate the numpy array (this is big). You already know you want a 2-dimensional array with particular dimensions.
  • Only split() for the first 2 columns, since you don't want the rest.
  • Don't use map(): it's slower than list comprehensions. I didn't even use list comprehensions, since you know you only have 2 columns.
  • Assign directly into the preallocated array instead of generating new temp arrays as you iterate.
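The same idea can also be written without an explicit Python-level assignment loop, using np.fromiter over a generator. This is a sketch under the assumption that every line has at least two whitespace-separated numeric fields (as in the question's data):

```python
import numpy as np

br = ['14 0.000000 -- (long term 0.000000)\n'] * 50000

# Parse only the first two fields of every line in one fromiter pass,
# then reshape the flat 1-D result into the (len(br), 2) array.
flat = (float(tok) for line in br for tok in line.split(None, 2)[:2])
x = np.fromiter(flat, dtype=float).reshape(len(br), 2)
```

fromiter preallocates and fills a 1-D array as it consumes the generator, so like the loop above it avoids building temporary per-row arrays.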
answered Oct 16 '22 by sunetos


You can try to preprocess the list of strings (with awk, for example) if they come from a file, and use numpy.loadtxt. If you can't do anything about the way you get this list, you have several possibilities:

  • give up. You will run this function once a day. You don't care about speed, and your current solution is good enough
  • write an IO plugin with cython. You have a big potential gain because you can do all the loops in C and assign values directly into a big (10^5, 2) numpy ndarray
  • try another language for this part of the problem. With C or Haskell, you can compile the function into a shared library and call it from Python via ctypes
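If the lines do come from a file (or can be joined into a file-like object), the loadtxt route mentioned above looks like this. A sketch, assuming the columns are whitespace-separated as in the question; usecols restricts conversion to the first two columns, so the non-numeric "--" and "(long term ...)" text is never parsed as numbers:

```python
import io
import numpy as np

data = ('14 0.000000 -- (long term 0.000000)\n'
        '19 0.000000 -- (long term 0.000000)\n'
        '22 0.000000 -- (long term 0.000000)\n')

# loadtxt splits each line on whitespace and converts only the
# requested columns, returning a (n_lines, 2) float array.
x = np.loadtxt(io.StringIO(data), usecols=(0, 1))
```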

edit

maybe this approach is slightly faster:

def conv(mystr):
    return map(float, mystr.split()[:2])

br_float = map(conv, br)
x = N.array(br_float)
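Under Python 3, where map returns a lazy iterator rather than a list, np.array cannot consume the nested map objects as a 2-D sequence, so the same approach needs explicit list comprehensions. A sketch:

```python
import numpy as np

br = ['14 0.000000 -- (long term 0.000000)\n',
      '19 0.000000 -- (long term 0.000000)\n']

# Nested list comprehension in place of map(): builds a list of
# two-element float lists that np.array can turn into a 2-D array.
x = np.array([[float(tok) for tok in line.split(None, 2)[:2]]
              for line in br])
```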
answered Oct 16 '22 by Simon Bergot


Changing

map(float, i.split()[:2])

to

map(float, i.split(' ', 2)[:2])

might result in a slight speedup. Since you only care about the first two space-separated items in each line, there is no need to split the entire line. The 2 in i.split(' ',2) tells split to perform at most 2 splits. (Note that split(' ') treats every single space as a separator, unlike split(), so this only works when the columns are separated by exactly one space.) For example,

In [11]: x='14 0.000000 -- (long term 0.000000)\n' 

In [12]: x.split()
Out[12]: ['14', '0.000000', '--', '(long', 'term', '0.000000)']

In [13]: x.split(' ',2)
Out[13]: ['14', '0.000000', '-- (long term 0.000000)\n']
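A quick way to check the claimed speedup on your own data is timeit. A sketch; the actual numbers depend on line length and Python version, so no particular ratio is guaranteed:

```python
import timeit

line = '14 0.000000 -- (long term 0.000000)\n'

# Compare splitting the whole line against splitting at most twice.
full = timeit.timeit(lambda: line.split()[:2], number=100_000)
limited = timeit.timeit(lambda: line.split(' ', 2)[:2], number=100_000)
print(f'full split:    {full:.3f}s')
print(f'limited split: {limited:.3f}s')
```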
answered Oct 16 '22 by unutbu