Can I split this column containing a mix of tuples/None more efficiently?

I have a simple DataFrame:

import pandas as pd
df = pd.DataFrame({'id':list('abcd')})
df['tuples'] = df.index.map(lambda i:(i,i+1))

# outputs:
#   id  tuples
# 0  a  (0, 1)
# 1  b  (1, 2)
# 2  c  (2, 3)
# 3  d  (3, 4)

I can then split the tuples column into two very simply, e.g.

df[['x','y']] = pd.DataFrame(df.tuples.tolist())

# outputs:
#   id  tuples  x  y
# 0  a  (0, 1)  0  1
# 1  b  (1, 2)  1  2
# 2  c  (2, 3)  2  3
# 3  d  (3, 4)  3  4

This approach also works:

df[['x','y']] = df.apply(lambda x:x.tuples,result_type='expand',axis=1)

However, if my DataFrame is slightly more complex, e.g.

df = pd.DataFrame({'id':list('abcd')})
df['tuples'] = df.index.map(lambda i:(i,i+1) if i%2 else None)

# outputs:
#   id  tuples
# 0  a    None
# 1  b  (1, 2)
# 2  c    None
# 3  d  (3, 4)

then the first approach throws "Columns must be same length as key" (of course), because some rows hold two values and some hold none, while my code expects two columns.
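To see where that error comes from, it can help to look at the intermediate frame before assigning it. The following is a minimal sketch, not part of the original question; the name `expanded` is made up for illustration, and the shape noted in the comment is what I would expect on pandas versions I know of rather than a guaranteed result.

```python
import pandas as pd

df = pd.DataFrame({'id': list('abcd')})
df['tuples'] = [(i, i + 1) if i % 2 else None for i in df.index]

# Because the list starts with None, pandas does not treat it as nested
# row data, so the tuples are not unpacked into separate columns.
expanded = pd.DataFrame(df['tuples'].tolist())

# Expected here: one object column rather than two, which is why
# assigning it to ['x', 'y'] fails with a length mismatch.
print(expanded.shape)
```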

I can use .loc to create single columns, twice.

get_rows = df.tuples.notnull() # boolean mask: True for rows that hold a tuple

df.loc[get_rows,'x'] = df.tuples.str[0]
df.loc[get_rows,'y'] = df.tuples.str[1]

# outputs:
#   id  tuples    x    y
# 0  a    None  NaN  NaN
# 1  b  (1, 2)  1.0  2.0
# 2  c    None  NaN  NaN
# 3  d  (3, 4)  3.0  4.0

[Aside: it's useful how index alignment assigns only the relevant rows from the right-hand side, without my having to specify them.]
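That alignment behaviour can be shown in isolation. This is a small sketch with made-up names (`full`, `mask`): the right-hand side covers every row, but only the rows selected on the left receive values, matched by index label.

```python
import pandas as pd

df = pd.DataFrame({'id': list('abcd')})

# Hypothetical data: a value for every row, and a mask selecting rows 1 and 3.
full = pd.Series([100, 101, 102, 103], index=df.index)
mask = pd.Series([False, True, False, True], index=df.index)

# Only the masked rows are written; values are matched by index label,
# so rows 0 and 2 are left as NaN in the new column.
df.loc[mask, 'x'] = full

print(df)
```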

However, I can't use .loc to create two columns at once, e.g.

# This isn't valid use of .loc
df.loc[get_rows,['x','y']] = df.loc[get_rows,'tuples'].map(lambda x:list(x))

as it throws the error "shape mismatch: value array of shape (2,2) could not be broadcast to indexing result of shape (2,)".
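The shape mismatch arises because the right-hand side is a one-dimensional Series of lists, while the selection on the left is two-dimensional. One possible workaround, sketched here under the assumption that it is acceptable to create the `x` and `y` columns up front (the `assign` step is my addition, not from the question), is to hand `.loc` a genuine 2-D array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': list('abcd')})
df['tuples'] = [(i, i + 1) if i % 2 else None for i in df.index]
get_rows = df['tuples'].notna()

# Create the target columns first so .loc has existing columns to write to.
df = df.assign(x=np.nan, y=np.nan)

# np.array(...) turns the selected tuples into a 2-D array with one row per
# selected record and two columns, matching the (rows, ['x', 'y']) selection.
df.loc[get_rows, ['x', 'y']] = np.array(df.loc[get_rows, 'tuples'].tolist())

print(df)
```

This assumes every non-null entry is a tuple of exactly two numbers; ragged tuples would not form a rectangular array.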

I also can't use this

df[get_rows][['x','y']] = df[get_rows].apply(lambda x:x.tuples,result_type='expand',axis=1)

as it throws the usual "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc..."

I can't help thinking I'm missing something.

angus l asked Jul 31 '19 12:07


3 Answers

Here is another way (comments inline):

c=df.tuples.astype(bool) #similar to df.tuples.notnull()
#create a dataframe by dropping the None and assign index as df.index where c is True
d=pd.DataFrame(df.tuples.dropna().values.tolist(),columns=list('xy'),index=df[c].index)
final=pd.concat([df,d],axis=1) #concat them both

  id  tuples    x    y
0  a    None  NaN  NaN
1  b  (1, 2)  1.0  2.0
2  c    None  NaN  NaN
3  d  (3, 4)  3.0  4.0

anky answered Nov 13 '22 02:11


df[get_rows] is a copy, so setting values on df[get_rows][['x','y']] does not change the underlying data. Just assign to df[['x','y']] to create the new columns.

df = pd.DataFrame({'id':list('abcd')})

df['tuples'] = df.index.map(lambda i:(i,i+1) if i%2 else None)

get_rows = df.tuples.notnull()

df[['x','y']] = df[get_rows].apply(lambda x:x.tuples,result_type='expand',axis=1)

print(df)

  id  tuples    x    y
0  a    None  NaN  NaN
1  b  (1, 2)  1.0  2.0
2  c    None  NaN  NaN
3  d  (3, 4)  3.0  4.0
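
The claim that a boolean-mask selection is a copy can be checked directly. The sketch below is an illustration rather than part of the answer; `subset` is a hypothetical name, and the warning filter is only there to keep the demo output quiet.

```python
import warnings
import pandas as pd

df = pd.DataFrame({'id': list('abcd')})
df['tuples'] = [(i, i + 1) if i % 2 else None for i in df.index]
get_rows = df['tuples'].notna()

# Selecting with a boolean mask produces a new object, not a view of df.
subset = df[get_rows]

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence any copy warning for this demo
    subset['x'] = subset['tuples'].map(lambda t: t[0])

# The new column exists on the copy only; the original frame is untouched.
print('x' in subset.columns)
print('x' in df.columns)
```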

Yuan answered Nov 13 '22 03:11


Another quick fix:

pd.concat([df, pd.DataFrame(df.tuples.to_dict()).T], 
          axis=1)

returns:

  id  tuples     0     1
0  a    None  None  None
1  b  (1, 2)     1     2
2  c    None  None  None
3  d  (3, 4)     3     4

Quang Hoang answered Nov 13 '22 03:11