Fill in same amount of characters where other column is NaN

Tags: python, pandas, explode, unnest

I have the following dummy dataframe:

import numpy as np
import pandas as pd
df = pd.DataFrame({'Col1':['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2':['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm']})

        Col1            Col2
0    a,b,c,d     aa~bb~cc~dd
1    e,f,g,h             NaN
2  i,j,k,l,m  ii~jj~kk~ll~mm

The real dataset has shape 500000, 90.

I need to unnest these values to rows and I'm using the new explode method for this, which works fine.

The problem is the NaN values: they cause unequal lengths after the explode, so I need to fill each one with the same number of delimiters as the populated column has in that row. In this case that is ~~~, since row 1 has three commas.

Expected output:

        Col1            Col2
0    a,b,c,d     aa~bb~cc~dd
1    e,f,g,h             ~~~
2  i,j,k,l,m  ii~jj~kk~ll~mm

Attempt 1:

df['Col2'].fillna(df['Col1'].str.count(',')*'~')

Attempt 2:

np.where(df['Col2'].isna(), df['Col1'].str.count(',')*'~', df['Col2'])

This works, but I feel like there's an easier method for this:

characters = df['Col1'].str.replace(r'\w', '', regex=True).str.replace(',', '~')
df['Col2'] = df['Col2'].fillna(characters)

print(df)

        Col1            Col2
0    a,b,c,d     aa~bb~cc~dd
1    e,f,g,h             ~~~
2  i,j,k,l,m  ii~jj~kk~ll~mm

d1 = df.assign(Col1=df['Col1'].str.split(',')).explode('Col1')[['Col1']]
d2 = df.assign(Col2=df['Col2'].str.split('~')).explode('Col2')[['Col2']]

final = pd.concat([d1,d2], axis=1)
print(final)

  Col1 Col2
0    a   aa
0    b   bb
0    c   cc
0    d   dd
1    e     
1    f     
1    g     
1    h     
2    i   ii
2    j   jj
2    k   kk
2    l   ll
2    m   mm

Question: is there an easier and more generalized method for this, or is my method fine as is?

asked Sep 03 '19 15:09

Erfan

3 Answers

`pd.concat`

delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
    k: df[k].str.split(delims[k], expand=True)
    for k in df}, axis=1
).stack()

    Col1 Col2
0 0    a   aa
  1    b   bb
  2    c   cc
  3    d   dd
1 0    e  NaN
  1    f  NaN
  2    g  NaN
  3    h  NaN
2 0    i   ii
  1    j   jj
  2    k   kk
  3    l   ll
  4    m   mm

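This loops on the columns in df, so any column without an entry in delims would raise a KeyError. It may be wiser to loop on the keys of the delims dictionary, which splits only the columns that have a delimiter defined.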

delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
    k: df[k].str.split(delims[k], expand=True)
    for k in delims}, axis=1
).stack()

Same thing, different look

delims = {'Col1': ',', 'Col2': '~'}
def f(c):
    return df[c].str.split(delims[c], expand=True)
pd.concat(map(f, delims), keys=delims, axis=1).stack()
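
The same idea can be wrapped in a small helper. This is only a sketch: the name split_to_rows is hypothetical, and the trailing dropna(how='all') is an assumption added so that padding rows (positions that exist in no column for a given row) are removed regardless of which stack implementation the installed pandas version uses.

```python
import numpy as np
import pandas as pd


def split_to_rows(frame, delims):
    """Split every column named in delims on its delimiter, one piece per row.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # One wide block per column; pieces land in positional columns 0, 1, 2, ...
    wide = pd.concat(
        {col: frame[col].str.split(sep, expand=True) for col, sep in delims.items()},
        axis=1,
    )
    # Move the piece position from the column level into the row index,
    # then drop rows where no column has a piece at that position.
    return wide.stack().dropna(how='all')


df = pd.DataFrame({
    'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
    'Col2': ['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm'],
})

out = split_to_rows(df, {'Col1': ',', 'Col2': '~'})
print(out)
```

Rows that were NaN in Col2 stay NaN in the result, so no filler string is needed with this approach.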

answered Sep 18 '22 15:09

piRSquared

One way is to use str.repeat together with fillna(), though I'm not sure how efficient this is:

df.Col2.fillna(pd.Series('~', index=df.index).str.repeat(df.Col1.str.count(',')))

0       aa~bb~cc~dd
1               ~~~
2    ii~jj~kk~ll~mm
Name: Col2, dtype: object
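
A variation on the same fill, sketched under a couple of assumptions: the filler is built with Series.map instead of str.repeat, and the final step passes a list of columns to explode, which is assumed to require pandas 1.3 or newer. The variable names (filler, filled, exploded) are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
    'Col2': ['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm'],
})

# One filler string per row: as many '~' as Col1 has commas.
# map applies Python's str * int to each count individually.
filler = df['Col1'].str.count(',').map(lambda n: '~' * n)

# filler carries df's index, so fillna lines up row by row.
filled = df.assign(Col2=df['Col2'].fillna(filler))

# With equal piece counts per row, both columns can be exploded together
# (list argument to explode: assumed to need pandas 1.3 or newer).
exploded = filled.assign(
    Col1=filled['Col1'].str.split(','),
    Col2=filled['Col2'].str.split('~'),
).explode(['Col1', 'Col2'])

print(exploded)
```

Because the filler is derived from df itself, it stays aligned even when the frame does not have a default integer index.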

answered Sep 17 '22 15:09

anky

Just split the dataframe into two: the rows without missing values (df1) and the remaining rows (df2).

df1=df.dropna()
df2=df.drop(df1.index)

d1 = df1['Col1'].str.split(',').explode()
d2 = df1['Col2'].str.split('~').explode()
d3 = df2['Col1'].str.split(',').explode()

final = pd.concat([pd.concat([d1, d2], axis=1), d3.to_frame()], sort=False)
print(final)
  Col1 Col2
0    a   aa
0    b   bb
0    c   cc
0    d   dd
2    i   ii
2    j   jj
2    k   kk
2    l   ll
2    m   mm
1    e  NaN
1    f  NaN
1    g  NaN
1    h  NaN
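
The output above places the rows that had a missing Col2 at the end. If the original row order matters, a stable sort on the index should restore it. A minimal self-contained sketch, assuming pd.concat is used for the row-wise combination and that the index labels of df are sortable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
    'Col2': ['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm'],
})

# Rows with both columns present, and rows where Col2 is missing.
df1 = df.dropna()
df2 = df.drop(df1.index)

d1 = df1['Col1'].str.split(',').explode()
d2 = df1['Col2'].str.split('~').explode()
d3 = df2['Col1'].str.split(',').explode()

final = pd.concat([pd.concat([d1, d2], axis=1), d3.to_frame()], sort=False)

# mergesort is stable, so pieces that share an index label keep their order.
final = final.sort_index(kind='mergesort')
print(final)
```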