
Combine pandas string columns with missing values




I need to concatenate the strings in two or more columns of a pandas DataFrame.

I found this answer, which works fine if you don't have any missing values. Unfortunately, I do, and this leads to results like "ValueA; None", which is not really clean.

Example data:

col_A  | col_B
------ | ------
val_A  | val_B 
None   | val_B 
val_A  | None 
None   | None

I need this result:

CoMartel asked Aug 31 '17 08:08



1 Answer

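You can use apply with axis=1 and an if-else expression:

# join the non-missing values of each row; rows with only missing values give None
# (assign to s, not df, so the original DataFrame stays available below)
s = df.apply(lambda x: None if x.isnull().all() else ';'.join(x.dropna()), axis=1)
print (s)
0    val_A;val_B
1          val_B
2          val_A
3           None
dtype: object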
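
For reference, here is a minimal self-contained sketch of the same apply approach, run against the example data from the question. The helper name join_row and its sep parameter are made up for illustration and are not part of the original answer; how the all-missing row is represented (None or NaN) may depend on the pandas version.

```python
import pandas as pd

# Example frame from the question; None marks a missing value.
df = pd.DataFrame({
    "col_A": ["val_A", None, "val_A", None],
    "col_B": ["val_B", "val_B", None, None],
})

def join_row(row, sep=";"):
    """Hypothetical helper: join the non-missing values of one row.

    Returns None when every value in the row is missing.
    """
    present = row.dropna()
    return sep.join(present) if len(present) else None

# axis=1 passes each row to join_row as a Series.
combined = df.apply(join_row, axis=1)
print(combined.tolist())
```

Passing a different sep (for example "; ") changes only the separator; the missing-value handling stays the same.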

A faster solution is possible with a list comprehension:

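import numpy as np
import pandas as pd

# append the separator to every value, then replace NaN with empty strings
# and convert the rows to lists
arr = df.add('; ').fillna('').values.tolist()
# join each row and strip the trailing separator; empty strings become NaN
# (strip('; ') removes any leading or trailing ';' and space characters)
s = pd.Series([''.join(x).strip('; ') for x in arr]).replace('^$', np.nan, regex=True)
# replace NaN with None
s = s.where(s.notnull(), None)
print (s)
0    val_A; val_B
1           val_B
2           val_A
3            None
dtype: object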

#40000 rows
df = pd.concat([df]*10000).reset_index(drop=True)

In [70]: %%timeit
    ...: arr = df.add('; ').fillna('').values.tolist()
    ...: s = pd.Series([''.join(x).strip('; ') for x in arr]).replace('^$', np.nan, regex=True)
    ...: s.where(s.notnull(), None)
10 loops, best of 3: 74 ms per loop

In [71]: %%timeit
    ...: df.apply(lambda x: None if x.isnull().all() else ';'.join(x.dropna()), axis=1)
1 loop, best of 3: 12.7 s per loop

# another solution, but a bit slower
In [72]: %%timeit
     ...: arr = df.add('; ').fillna('').values  
     ...: s = [''.join(x).strip('; ') for x in arr]
     ...: pd.Series([y if y != '' else None for y in s])
10 loops, best of 3: 119 ms per loop
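
One caveat with the strip('; ') step used above: str.strip treats its argument as a set of characters, so leading or trailing spaces and semicolons that belong to the values themselves are removed too. A rough sketch of a variant that joins only the non-missing values, so that no stripping is needed, could look like this. The function name combine_columns is an assumption for illustration, not part of the answer, and it has not been timed against the solutions above.

```python
import pandas as pd

def combine_columns(frame, sep="; "):
    """Hypothetical helper: join non-missing values per row with sep.

    Rows without any values give None. Note that a row whose only
    value is an empty string also gives None, because '' is falsy.
    """
    joined = [
        sep.join(str(v) for v in row if pd.notna(v)) or None
        for row in frame.to_numpy()
    ]
    # dtype=object keeps None as None instead of converting it to NaN.
    return pd.Series(joined, index=frame.index, dtype=object)

# Same shape as the example data, with one padded value to show
# that surrounding spaces are preserved.
df = pd.DataFrame({
    "col_A": ["val_A", None, " padded ", None],
    "col_B": ["val_B", "val_B", None, None],
})
result = combine_columns(df)
print(result.tolist())
```

The values are cast with str() before joining, so this sketch also accepts non-string columns, which the '+'-based approach does not.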
jezrael answered Oct 09 '22 20:10
