Combine pandas string columns with missing values

Tags:

python

pandas

I need to concatenate the strings in two or more columns of a pandas DataFrame.

I found this answer, which works fine if there are no missing values. Unfortunately, my data has some, and this leads to results like "ValueA; None", which is not really clean.

Example data:

col_A  | col_B
------ | ------
val_A  | val_B 
None   | val_B 
val_A  | None 
None   | None

I need this result:

col_merge
---------
val_A;val_B
val_B
val_A
None

asked Aug 31 '17 08:08

CoMartel

1 Answer

You can use apply with if-else:

# assign to a new name so the original df stays available for the code further down
s = df.apply(lambda x: None if x.isnull().all() else ';'.join(x.dropna()), axis=1)
print(s)
0    val_A;val_B
1          val_B
2          val_A
3           None
dtype: object
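For anyone who wants to run the row-wise approach above from scratch, here is a minimal self-contained sketch. The frame is rebuilt from the example data in the question; the helper name `join_non_null` and its `sep` parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd

# Example frame from the question; None marks a missing value.
df = pd.DataFrame({
    "col_A": ["val_A", None, "val_A", None],
    "col_B": ["val_B", "val_B", None, None],
})

def join_non_null(row, sep=";"):
    """Join the non-missing strings of one row; return None if all are missing.

    Hypothetical helper, equivalent in intent to the lambda above.
    """
    values = row.dropna()
    return sep.join(values) if len(values) else None

# axis=1 passes each row to the helper as a Series.
col_merge = df.apply(join_non_null, axis=1)
print(col_merge)
```

Depending on the pandas version, the all-missing row may be displayed as None or as NaN, so it is safer to check it with `pd.isna` than with `is None`.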

A faster solution is possible with a list comprehension:

# append the separator to every value and replace NaN with an empty string,
# then convert to a list of lists
arr = df.add(';').fillna('').values.tolist()
# join each row and strip the trailing separator; empty strings become NaN
s = pd.Series([''.join(x).strip(';') for x in arr]).replace('^$', np.nan, regex=True)
# replace NaN with None
s = s.where(s.notnull(), None)
print(s)
0    val_A;val_B
1          val_B
2          val_A
3           None
dtype: object
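A variation on the same idea, offered only as a sketch and not as the answer's own method: filter out the missing values inside the comprehension instead of appending a separator and stripping it afterwards. This avoids touching separator characters that happen to sit at the start or end of a real value. The example frame is rebuilt from the question, and the names `merged` and `col_merge` are arbitrary.

```python
import pandas as pd

# Example frame from the question; None marks a missing value.
df = pd.DataFrame({
    "col_A": ["val_A", None, "val_A", None],
    "col_B": ["val_B", "val_B", None, None],
})

# For each row, keep only the non-missing values and join them.
# An all-missing row joins to '', which `or None` turns into None.
merged = [
    ";".join(v for v in row if pd.notna(v)) or None
    for row in df.to_numpy()
]

# dtype=object keeps None as None instead of converting it to NaN.
col_merge = pd.Series(merged, index=df.index, dtype=object)
print(merged)
```

Note that a row whose only values are empty strings would also come out as None here, which may or may not be what you want.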

#40000 rows
df = pd.concat([df]*10000).reset_index(drop=True)

In [70]: %%timeit
    ...: arr = df.add('; ').fillna('').values.tolist()
    ...: s = pd.Series([''.join(x).strip('; ') for x in arr]).replace('^$', np.nan, regex=True)
    ...: s.where(s.notnull(), None)
    ...: 
10 loops, best of 3: 74 ms per loop


In [71]: %%timeit
    ...: df.apply(lambda x: None if x.isnull().all() else ';'.join(x.dropna()), axis=1)
    ...: 
1 loop, best of 3: 12.7 s per loop

# another solution, but a bit slower
In [72]: %%timeit
     ...: arr = df.add('; ').fillna('').values  
     ...: s = [''.join(x).strip('; ') for x in arr]
     ...: pd.Series([y if y != '' else None for y in s])
     ...: 
     ...: 
10 loops, best of 3: 119 ms per loop
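The timings above come from IPython's `%%timeit`. Outside IPython, a rough comparison can be sketched with the standard `timeit` module, as below. The function names `with_apply` and `with_comprehension` are made up for this sketch, the comprehension filters missing values directly instead of using the add-and-strip code timed above, and the frame is kept at 4,000 rows so the slow row-wise `apply` finishes quickly. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Example frame from the question, repeated to 4,000 rows.
base = pd.DataFrame({
    "col_A": ["val_A", None, "val_A", None],
    "col_B": ["val_B", "val_B", None, None],
})
big = pd.concat([base] * 1000, ignore_index=True)

def with_apply(frame):
    # Row-wise apply, as in the first solution.
    return frame.apply(
        lambda x: None if x.isnull().all() else ";".join(x.dropna()), axis=1
    ).tolist()

def with_comprehension(frame):
    # Plain list comprehension over the underlying array.
    return [
        ";".join(v for v in row if pd.notna(v)) or None
        for row in frame.to_numpy()
    ]

for func in (with_apply, with_comprehension):
    # Average over 3 runs; increase `number` for steadier figures.
    seconds = timeit.timeit(lambda: func(big), number=3) / 3
    print(f"{func.__name__}: {seconds * 1000:.1f} ms per loop")
```

To match the 40,000-row run above, change the repeat factor from 1000 to 10000.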

answered Oct 09 '22 20:10

jezrael
