I have a pandas DataFrame with two columns, like this:
embedding              language
[0.1 0.2 0.3]           fr
[0.1 0.4 0.4]           en
[0.8 0.1 0.1]           fr
Given a starting integer n = 10, I want to add one column per embedding value to the data frame above, like this:
embedding            language          feature1     feature2  feature3
[0.1 0.2 0.3]          fr              10:0.1        11:0.2    12:0.3
[0.1 0.4 0.4]          en              13:0.1        14:0.4    15:0.4
[0.8 0.1 0.1]          fr              10:0.8        11:0.1    12:0.1
So feature1 holds the 1st embedding value, feature2 the 2nd, and so on, each prefixed with an index. For the first language the indices start at n; for each new language encountered, they start size_of_embedding higher than for the previous language (n + size_of_embedding for the second one, and so on). The number of columns added is always exactly equal to the size of the embedding array. Is there an easy way of doing this? Thanks.
First, ensure that the embedding column actually holds arrays. If the values are stored as strings, you can convert each one to a numpy array like so:
df.embedding = df.embedding.apply(lambda x: np.fromstring(x[1:-1], sep=' '))
Next, create a lookup dict that maps each language to its starting value, and use that to generate the features:
lookup = {'fr': 10, 'en': 13}
If you have too many languages to create this by hand, you could try the following statement, replacing 10 (the starting value n) and 3 (the embedding size) as appropriate for your actual dataset:
lookup = {l:10+i*3 for i, l in enumerate(df.language.drop_duplicates().to_list())}
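As a rough sketch of the same idea without the hard-coded 10 and 3, the lookup can be derived from the data itself. The names n and size are placeholders introduced here, and the sketch assumes every embedding has the same length and that languages should be numbered in order of first appearance:

```python
import pandas as pd

df = pd.DataFrame({
    "embedding": [[0.1, 0.2, 0.3], [0.1, 0.4, 0.4], [0.8, 0.1, 0.1]],
    "language": ["fr", "en", "fr"],
})

n = 10                                # starting index from the question
size = len(df["embedding"].iloc[0])   # assumption: all embeddings share this length

# pd.factorize returns the unique languages in order of first appearance
_, languages = pd.factorize(df["language"])

# The i-th distinct language starts at n + i * size
lookup = {lang: n + i * size for i, lang in enumerate(languages)}
print(lookup)
```

For the sample frame this gives fr a start of 10 and en a start of 13, matching the hand-written dict.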
Generating the features is then just a lookup & a list comprehension. Here I've used the helper function f to keep the code tidy.
def f(lang, embedding):
    # prefix each embedding value with its index, starting at the language's offset
    return [f'{lookup[lang]+i}:{e}' for i, e in enumerate(embedding)]
new_names = ['feature1', 'feature2', 'feature3']
df[new_names] = df.apply(lambda x: f(x.language, x.embedding), axis=1, result_type='expand')
df now looks like:
         embedding language feature1 feature2 feature3
0  [0.1, 0.2, 0.3]       fr   10:0.1   11:0.2   12:0.3
1  [0.1, 0.4, 0.4]       en   13:0.1   14:0.4   15:0.4
2  [0.8, 0.1, 0.1]       fr   10:0.8   11:0.1   12:0.1
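Putting the steps together, here is a minimal end-to-end sketch. The function name add_feature_columns and its parameter n are made up for illustration; it assumes the embedding column already holds arrays of equal length, and it derives the feature column names from that length instead of listing them by hand:

```python
import numpy as np
import pandas as pd

def add_feature_columns(df, n=10):
    """Return a copy of df with feature1..featureK columns of 'index:value' strings."""
    size = len(df["embedding"].iloc[0])  # assumption: all embeddings share this length
    languages = df["language"].drop_duplicates().tolist()
    # The i-th distinct language (in order of first appearance) starts at n + i * size
    lookup = {lang: n + i * size for i, lang in enumerate(languages)}
    rows = [
        [f"{lookup[lang] + i}:{float(value)}" for i, value in enumerate(emb)]
        for lang, emb in zip(df["language"], df["embedding"])
    ]
    names = [f"feature{i + 1}" for i in range(size)]
    features = pd.DataFrame(rows, columns=names, index=df.index)
    return pd.concat([df, features], axis=1)

df = pd.DataFrame({
    "embedding": [np.array([0.1, 0.2, 0.3]),
                  np.array([0.1, 0.4, 0.4]),
                  np.array([0.8, 0.1, 0.1])],
    "language": ["fr", "en", "fr"],
})
result = add_feature_columns(df)
print(result.drop(columns="embedding"))
```

Building the new columns as a separate frame and concatenating avoids a row-wise apply, which may matter on larger frames; the original df is left unchanged.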
Longhand
import pandas as pd
df=pd.DataFrame({'embedding':['[0.1 0.2 0.3]','[0.1 0.4 0.4]','[0.8 0.1 0.1]'],'language':['fr','en','fr']})
df['feature1']=0
df['feature2']=0
df['feature3']=0
df['z']=df.embedding.str.strip('[]')  # remove the square brackets
df['y']=df.z.str.findall(r'(\d+\.\d+)')  # extract each digit-dot-digit number as a list of strings
lst=['10:','11:','12:']  # index prefixes for `fr`
lst2=['13:','14:','15:']  # index prefixes for `en`
Create two frames for fr and en using boolean selection
m=df.language.isin(['en'])
df2=df[~m].copy()  # non-en rows; copy to avoid SettingWithCopyWarning
df3=df[m].copy()   # en rows
Compute feature1, feature2 and feature3
df2['k']=[lst+i for i in df2['y']]   # each entry: 3 prefixes followed by 3 values
df3['m']=[lst2+i for i in df3['y']]
# pair prefix j with value j; the values start at offset len(lst), not at the row count
df2['feature1']=[i[0]+i[len(lst)] for i in df2['k']]
df2['feature2']=[i[1]+i[len(lst)+1] for i in df2['k']]
df2['feature3']=[i[2]+i[len(lst)+2] for i in df2['k']]
df3['feature1']=[i[0]+i[len(lst2)] for i in df3['m']]
df3['feature2']=[i[1]+i[len(lst2)+1] for i in df3['m']]
df3['feature3']=[i[2]+i[len(lst2)+2] for i in df3['m']]
Concat df2 and df3, keeping the first five columns and restoring the original row order
pd.concat([df3.iloc[:,:5],df2.iloc[:,:5]]).sort_index()
