I have a pandas DataFrame with two columns, like this:
embedding language
[0.1 0.2 0.3] fr
[0.1 0.4 0.4] en
[0.8 0.1 0.1] fr
Given a starting integer n = 10, I want to add one column per embedding value to the data frame above, like this:
embedding language feature1 feature2 feature3
[0.1 0.2 0.3] fr 10:0.1 11:0.2 12:0.3
[0.1 0.4 0.4] en 13:0.1 14:0.4 15:0.4
[0.8 0.1 0.1] fr 10:0.8 11:0.1 12:0.1
So feature1 holds the 1st embedding value, feature2 the 2nd, and so on, each prefixed with an index. The index starts at n for the first language. For each new language encountered, the starting index moves up by size_of_embedding, so the second language starts at n + size_of_embedding. Rows that share a language reuse that language's starting index. The number of columns added is exactly the size of the embedding array. Is there an easy way of doing this? Thanks.
First, ensure that the embedding column is in fact an array. If it is stored as a string, you can convert it to a numpy array like so:
import numpy as np
df.embedding = df.embedding.apply(lambda x: np.fromstring(x[1:-1], sep=' '))
Next, create a lookup dictionary that maps each language to its starting index, and use that to generate the features:
lookup = {'fr': 10, 'en': 13}
If you have too many languages to create this by hand, you could try the following statement, replacing 10 and 3 as appropriate for your actual dataset:
lookup = {l:10+i*3 for i, l in enumerate(df.language.drop_duplicates().to_list())}
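As a minimal sketch of the same idea, the 10 and 3 can be read from variables and from the data instead of being typed in. The names n and size are introduced here for illustration only, and the sketch assumes every embedding has the same length and that languages are numbered in order of first appearance.

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question; embeddings are already numpy arrays.
df = pd.DataFrame({
    "embedding": [np.array([0.1, 0.2, 0.3]),
                  np.array([0.1, 0.4, 0.4]),
                  np.array([0.8, 0.1, 0.1])],
    "language": ["fr", "en", "fr"],
})

n = 10                               # starting index (from the question)
size = len(df["embedding"].iloc[0])  # assumed identical for every row

# One starting index per language, in order of first appearance.
lookup = {lang: n + i * size
          for i, lang in enumerate(df["language"].drop_duplicates())}
```

With the sample data this should give the same dictionary as the hand-written one.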
Generating the features is then just a lookup and a list comprehension. Here I've used the helper function f to keep the code tidy.
def f(lang, embedding):
    # prefix each value with its index, counting up from the language's starting value
    return [f'{lookup[lang]+i}:{e}' for i, e in enumerate(embedding)]
new_names = ['feature1', 'feature2', 'feature3']
df[new_names] = df.apply(lambda x: f(x.language, x.embedding), axis=1, result_type='expand')
df now looks like:
embedding language feature1 feature2 feature3
0 [0.1, 0.2, 0.3] fr 10:0.1 11:0.2 12:0.3
1 [0.1, 0.4, 0.4] en 13:0.1 14:0.4 15:0.4
2 [0.8, 0.1, 0.1] fr 10:0.8 11:0.1 12:0.1
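If the frame is large, the row-wise apply can be slow. The following is a rough sketch of an alternative that computes all the indices at once with numpy. It is not part of the answer above: it swaps the lookup dictionary for pd.factorize, and it assumes every embedding has the same length. The names n, size, starts and indices are made up for this example.

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question; embeddings are already numpy arrays.
df = pd.DataFrame({
    "embedding": [np.array([0.1, 0.2, 0.3]),
                  np.array([0.1, 0.4, 0.4]),
                  np.array([0.8, 0.1, 0.1])],
    "language": ["fr", "en", "fr"],
})

n = 10
values = np.vstack(df["embedding"].tolist())  # shape (rows, size)
size = values.shape[1]

# factorize numbers languages 0, 1, 2, ... in order of first appearance.
codes, _ = pd.factorize(df["language"])
starts = n + codes * size                      # starting index per row
indices = starts[:, None] + np.arange(size)    # shape (rows, size)

# Build one "index:value" string column per embedding position.
for j in range(size):
    df[f"feature{j + 1}"] = [
        f"{i}:{v}"
        for i, v in zip(indices[:, j].tolist(), values[:, j].tolist())
    ]
```

The string formatting is still a Python loop, so the gain is mostly in avoiding a function call per row; whether that matters depends on the size of the data.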
Longhand
import pandas as pd
df=pd.DataFrame({'embedding':['[0.1 0.2 0.3]','[0.1 0.4 0.4]','[0.8 0.1 0.1]'],'language':['fre','en','fr']})
df['feature1']=0
df['feature2']=0
df['feature3']=0
df['z']=df.embedding.str.strip('[]')#Remove the square brackets (strip takes characters, not a regex)
df['y']=df.z.str.findall(r'(\d+[.]+\d+)')#extract each decimal number as a string
lst=['10:','11:','12:']#Create List lookup for `fr/fre`
lst2=['13:','14:','15:']##Create List lookup for `en`
Create two frames for fr and en using boolean selection
m=df.language.isin(['en'])
df2=df[~m].copy()#copy so the later column assignments do not raise SettingWithCopyWarning
df3=df[m].copy()
Compute feature1, feature2 and feature3
df2['k']=[lst+i for i in df2['y']]#each list: the 3 prefixes followed by the 3 values
df3['m']=[lst2+i for i in df3['y']]
#value j sits len(lst) places after prefix j; using the number of rows here gives wrong results
df2['feature1']=[i[0]+i[len(lst)] for i in df2['k']]
df2['feature2']=[i[1]+i[len(lst)+1] for i in df2['k']]
df2['feature3']=[i[2]+i[len(lst)+2] for i in df2['k']]
df3['feature1']=[i[0]+i[len(lst2)] for i in df3['m']]
df3['feature2']=[i[1]+i[len(lst2)+1] for i in df3['m']]
df3['feature3']=[i[2]+i[len(lst2)+2] for i in df3['m']]
Concatenate df2 and df3, keeping the first five columns and restoring the original row order
pd.concat([df3.iloc[:,:5],df2.iloc[:,:5]]).sort_index()