I have a pandas dataframe with a column values that looks like this:
0 16 0
1 7 1 2 0
2 5
3 1
4 18
What I want is to create another column, modified_values, that contains a list of all the numbers I get after splitting each value. The new column will look like this:
0 [16, 0]
1 [7, 1, 2, 0]
2 [5]
3 [1]
4 [18]
Beware: the values in these lists should be int, not strings.
Things that I am aware of:
1) I can split the column in a vectorized way like this: df['values'].str.split(" "). This will give me the lists, but the objects inside each list will be strings. I can add another operation on top of that, like df['values'].str.split(" ").apply(func to convert values to int), but that wouldn't be vectorized.
2) I can directly do this: df['modified_values'] = df['values'].apply(func that splits as well as converts to int)
The second one will surely be much slower than the first, but I am wondering if the same thing can be achieved in a vectorized way.
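For concreteness, the two approaches described above might be written like this (a sketch using the column names from the question):

```python
import pandas as pd

df = pd.DataFrame({'values': ['16 0', '7 1 2 0', '5', '1', '18']})

# Approach 1: str.split first, then a second pass to convert to int
df['modified_values'] = df['values'].str.split(' ').apply(
    lambda lst: [int(x) for x in lst])

# Approach 2: a single apply that splits and converts in one pass
df['modified_values'] = df['values'].apply(
    lambda s: [int(x) for x in s.split()])
```

Both produce the same list column; the question is whether either pass can be avoided.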
I'm highlighting this because it's a common mistake to assume pd.Series.str methods are vectorised. They aren't. They offer convenience and error-handling at the cost of efficiency. For clean data only, e.g. no NaN values, a list comprehension is likely your best option:
df = pd.DataFrame({'A': ['16 0', '7 1 2 0', '5', '1', '18']})
df['B'] = [list(map(int, i.split())) for i in df['A']]
print(df)
A B
0 16 0 [16, 0]
1 7 1 2 0 [7, 1, 2, 0]
2 5 [5]
3 1 [1]
4 18 [18]
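If the data is not clean, the comprehension above raises on the first missing value. One hedged workaround (my addition, with a hypothetical None in column 'A') is to guard each element:

```python
import pandas as pd

df = pd.DataFrame({'A': ['16 0', '7 1 2 0', None, '1', '18']})

# Only split real strings; pass missing values through unchanged
df['B'] = [list(map(int, x.split())) if isinstance(x, str) else x
           for x in df['A']]
```

This keeps the speed of the comprehension while tolerating gaps in the data.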
To illustrate the performance issues with pd.Series.str, you can see for larger dataframes that the more operations you pass to Pandas, the more performance deteriorates:
df = pd.concat([df]*10000)
%timeit [list(map(int, i.split())) for i in df['A']] # 55.6 ms
%timeit [list(map(int, i)) for i in df['A'].str.split()] # 80.2 ms
%timeit df['A'].str.split().apply(lambda x: list(map(int, x))) # 93.6 ms
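The timings above can be reproduced with a small self-contained script (absolute numbers will vary by machine; the ordering is what matters):

```python
import timeit
import pandas as pd

df = pd.DataFrame({'A': ['16 0', '7 1 2 0', '5', '1', '18']})
df = pd.concat([df] * 10000, ignore_index=True)

candidates = {
    'comprehension':     lambda: [list(map(int, i.split())) for i in df['A']],
    'str.split + comp':  lambda: [list(map(int, i)) for i in df['A'].str.split()],
    'str.split + apply': lambda: df['A'].str.split().apply(lambda x: list(map(int, x))),
}

for name, fn in candidates.items():
    # Best of 3 repeats, 5 calls each, reported per call in milliseconds
    t = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f'{name:18s} {t * 1e3:8.2f} ms')
```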
Using list as elements in pd.Series is also anti-Pandas. As described here, holding lists in a series gives two layers of pointers and is not recommended:
Don't do this. Pandas was never designed to hold lists in series / columns. You can concoct expensive workarounds, but these are not recommended.
The main reason holding lists in a series is not recommended is that you lose the vectorised functionality which comes with using NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers, much like list. You will lose benefits in terms of memory and performance, as well as access to optimized Pandas methods. See also What are the advantages of NumPy over regular Python lists? The arguments in favour of Pandas are the same as for NumPy.
The double for comprehension is 33% faster than the map comprehension from jpp's answer. The Numba trick is 250 times faster than the map comprehension from jpp's answer, but you get a pandas DataFrame with floats and NaNs rather than a series of lists. Numba is included in Anaconda.
Benchmarks:
%timeit pd.DataFrame(nb_calc(df.A)) # numba trick 0.144 ms
%timeit [int(x) for i in df['A'] for x in i.split()] # 23.6 ms
%timeit [list(map(int, i.split())) for i in df['A']] # 35.6 ms
%timeit [list(map(int, i)) for i in df['A'].str.split()] # 50.9 ms
%timeit df['A'].str.split().apply(lambda x: list(map(int, x))) # 56.6 ms
Code for Numba function:
import numba
import numpy as np

@numba.jit(nopython=True, nogil=True)
def str2int_nb(nb_a):
    n1 = nb_a.shape[0]
    n2 = nb_a.shape[1]
    res = np.empty(nb_a.shape)
    res[:] = np.nan
    j_res_max = 0
    for i in range(n1):
        j_res = 0
        s = 0
        for j in range(n2):
            x = nb_a[i, j]
            if x == 32:          # ASCII space: close out the current number
                res[i, j_res] = np.float64(s)
                s = 0
                j_res += 1
            elif x == 0:         # null padding: end of this string
                break
            else:                # ASCII digit: accumulate its value
                s = s * 10 + x - 48
        res[i, j_res] = np.float64(s)
        if j_res > j_res_max:
            j_res_max = j_res
    return res[:, :j_res_max + 1]

def nb_calc(s):
    # View the fixed-width unicode buffer as codepoints, then narrow to int8
    a_temp = s.values.astype("U")
    nb_a = a_temp.view("uint32").reshape(len(s), -1).astype(np.int8)
    return str2int_nb(nb_a)
Numba does not support strings. So I first convert to array of int8 and only then work with it. Conversion to int8 actually takes 3/4 of the execution time.
The output of my numba function looks like this:
0 1 2 3
-----------------------
0 16.0 0.0 NaN NaN
1 7.0 1.0 2.0 0.0
2 5.0 NaN NaN NaN
3 1.0 NaN NaN NaN
4 18.0 NaN NaN NaN
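If a list column is still needed in the end, the wide float output can be converted back by stripping the NaN padding. This post-processing step is my addition, not part of the answer, and it runs in plain Python, so it gives back some of the time the Numba trick saved:

```python
import numpy as np
import pandas as pd

# Hypothetical wide result in the shape the Numba routine produces:
# floats with NaN padding marking the end of each row's numbers
res = pd.DataFrame([[16.0, 0.0, np.nan, np.nan],
                    [7.0, 1.0, 2.0, 0.0],
                    [5.0, np.nan, np.nan, np.nan]])

# Drop the NaN padding row by row and cast back to lists of int
lists = [[int(v) for v in row[~np.isnan(row)]] for row in res.to_numpy()]
```

Note that NaN only marks padding here; a genuine 0.0 in the data (as in the second row) survives the filter.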