
Pandas convert array column into multiple columns with a condition

Tags: pandas

I have a pandas data frame with 2 columns:

  • embedding, an array column in which every array has length size_of_embedding
  • language

like this:

embedding              language
[0.1 0.2 0.3]           fr
[0.1 0.4 0.4]           en
[0.8 0.1 0.1]           fr

Given a starting integer n = 10, I want to add one column per embedding value to the above data frame, like this:

embedding            language          feature1     feature2  feature3
[0.1 0.2 0.3]          fr              10:0.1        11:0.2    12:0.3
[0.1 0.4 0.4]          en              13:0.1        14:0.4    15:0.4
[0.8 0.1 0.1]          fr              10:0.8        11:0.1    12:0.1

So feature1 holds the 1st embedding value, feature2 the 2nd, and so on, each prefixed with a running index. The index starts at n for the first language encountered and at n + size_of_embedding for the next language, so every language gets its own block of size_of_embedding indices. The number of columns added is always exactly size_of_embedding. Is there an easy way of doing this? Thanks.

learner asked May 09 '20 01:05


2 Answers

First, ensure that the embedding column really holds arrays. If it is stored as strings, you can convert each value to a NumPy array like so:

import numpy as np
df['embedding'] = df['embedding'].apply(lambda x: np.fromstring(x[1:-1], sep=' '))
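
If you are not sure whether the column holds strings or arrays, a small helper can accept both. This is only a sketch: parse_embedding is a hypothetical name, and it assumes the text form is space-separated numbers inside square brackets, as in the question.

```python
import numpy as np
import pandas as pd

def parse_embedding(value):
    """Return a float array from text like '[0.1 0.2 0.3]' or from an array-like value."""
    if isinstance(value, str):
        # drop the surrounding brackets, split on whitespace, convert to floats
        return np.array(value.strip('[]').split(), dtype=float)
    return np.asarray(value, dtype=float)

# sample data from the question, with embeddings stored as text (assumption)
df = pd.DataFrame({
    'embedding': ['[0.1 0.2 0.3]', '[0.1 0.4 0.4]', '[0.8 0.1 0.1]'],
    'language': ['fr', 'en', 'fr'],
})
df['embedding'] = df['embedding'].apply(parse_embedding)
```

Afterwards every cell in embedding is a NumPy array, so the rest of the answer works unchanged.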

Create a lookup dictionary that maps each language to its starting value, and use it to generate the features:

lookup = {'fr': 10, 'en': 13}

If you have too many languages to create this by hand, you could build it with the following statement, replacing 10 and 3 with the starting value and embedding size of your actual dataset:

lookup = {l:10+i*3 for i, l in enumerate(df.language.drop_duplicates().to_list())}
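
To avoid hard-coding 10 and 3, the same dictionary can be derived from the data. A minimal sketch on the question's sample frame, assuming every embedding has the same length (n and size_of_embedding are the names used in the question):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'embedding': [[0.1, 0.2, 0.3], [0.1, 0.4, 0.4], [0.8, 0.1, 0.1]],
    'language': ['fr', 'en', 'fr'],
})

n = 10
size_of_embedding = len(df['embedding'].iloc[0])  # assumes all rows share this length

# languages in order of first appearance; each one gets its own block of indices
languages = df['language'].drop_duplicates().tolist()
lookup = {lang: n + i * size_of_embedding for i, lang in enumerate(languages)}
print(lookup)  # {'fr': 10, 'en': 13}
```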

Generating the features is then just a lookup & a list comprehension. Here I've used the helper function f to keep the code tidy.

def f(lang, embedding):
    # prefix each value with its running index, starting at the language's offset
    return [f'{lookup[lang]+i}:{e}' for i, e in enumerate(embedding)]

new_names = ['feature1', 'feature2', 'feature3']
df[new_names] = df.apply(lambda x: f(x.language, x.embedding), axis=1, result_type='expand')

df now looks like:

         embedding language feature1 feature2 feature3
0  [0.1, 0.2, 0.3]       fr   10:0.1   11:0.2   12:0.3
1  [0.1, 0.4, 0.4]       en   13:0.1   14:0.4   15:0.4
2  [0.8, 0.1, 0.1]       fr   10:0.8   11:0.1   12:0.1
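
For embeddings of arbitrary size, the column names and offsets can be generated instead of typed out. The sketch below is a variant of the approach above, not the exact code from this answer: it swaps the lookup dictionary for pd.factorize, which numbers languages in order of first appearance, and it assumes all embeddings have the same length.

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'embedding': [np.array([0.1, 0.2, 0.3]),
                  np.array([0.1, 0.4, 0.4]),
                  np.array([0.8, 0.1, 0.1])],
    'language': ['fr', 'en', 'fr'],
})

n = 10
size = len(df['embedding'].iloc[0])  # assumes a fixed embedding length

# factorize gives 0 for the first language seen, 1 for the next, and so on
codes, _ = pd.factorize(df['language'])
starts = (n + codes * size).tolist()  # starting index per row: [10, 13, 10]

# build the 'index:value' strings row by row
features = [
    [f'{start + i}:{value}' for i, value in enumerate(np.asarray(emb).tolist())]
    for start, emb in zip(starts, df['embedding'])
]

names = [f'feature{i + 1}' for i in range(size)]
df = df.join(pd.DataFrame(features, index=df.index, columns=names))
```

Using join with an explicit index keeps the new columns aligned with the original rows even if the frame does not have a default 0..N-1 index.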

Haleemur Ali answered Oct 19 '22 18:10


Longhand

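import pandas as pd

df = pd.DataFrame({'embedding': ['[0.1 0.2 0.3]', '[0.1 0.4 0.4]', '[0.8 0.1 0.1]'], 'language': ['fr', 'en', 'fr']})
# create the feature columns up front so they sit right after embedding and language
df['feature1'] = 0
df['feature2'] = 0
df['feature3'] = 0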

df['z'] = df.embedding.str.strip('[]')  # remove the square brackets

df['y'] = df.z.str.findall(r'\d+\.\d+')  # extract each number (digits, dot, digits) as a list of strings

lst = ['10:', '11:', '12:']   # index prefixes for the rows that are not `en` (here `fr`)
lst2 = ['13:', '14:', '15:']  # index prefixes for `en`

Create two frames, one for fr and one for en, using boolean selection:

m = df.language.isin(['en'])
df2 = df[~m].copy()  # rows that are not `en`; copy to avoid SettingWithCopyWarning
df3 = df[m].copy()   # `en` rows

Compute feature1, feature2 and feature3

df2['k'] = [lst + i for i in df2['y']]   # e.g. ['10:', '11:', '12:', '0.1', '0.2', '0.3']
df3['m'] = [lst2 + i for i in df3['y']]
# the value paired with prefix j sits len(lst) positions later in the combined list
df2['feature1'] = [i[0] + i[len(lst)] for i in df2['k']]
df2['feature2'] = [i[1] + i[len(lst) + 1] for i in df2['k']]
df2['feature3'] = [i[2] + i[len(lst) + 2] for i in df2['k']]

df3['feature1'] = [i[0] + i[len(lst2)] for i in df3['m']]
df3['feature2'] = [i[1] + i[len(lst2) + 1] for i in df3['m']]
df3['feature3'] = [i[2] + i[len(lst2) + 2] for i in df3['m']]

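Concatenate df2 and df3, keeping only the first five columns:

pd.concat([df2.iloc[:, :5], df3.iloc[:, :5]]).sort_index()  # sort_index restores the original row order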

wwnde answered Oct 19 '22 20:10