
How to split the string by '/' and reform it by the split substrings in a dataframe?

I need to split the strings in a dataframe column on the character '/' and rebuild them as described below.

This dataframe contains some kids and their presents for Easter. Some kids have two presents, while some have only one.

import pandas as pd

data = {'Presents': ['Pink Doll / Ball', 'Bear/ Ball', 'Barbie', 'Blue Sunglasses/Airplane', 'Orange Kitchen/Car', 'Bear/Doll', 'Purple Game'],
        'Kids':  ['Chris', 'Jane', 'Betty', 'Harry', 'Claire', 'Sofia', 'Alex']
        }

df = pd.DataFrame(data, columns=['Presents', 'Kids'])

print(df)

This dataframe looks like this:

                   Presents    Kids
0          Pink Doll / Ball   Chris
1                Bear/ Ball    Jane
2                    Barbie   Betty
3  Blue Sunglasses/Airplane   Harry
4        Orange Kitchen/Car  Claire
5                 Bear/Doll   Sofia
6               Purple Game    Alex

I am trying to split their presents and rebuild them in this way, keeping the associated colours:

'Pink Doll/Ball' should be split into two parts: 'Pink Doll' and 'Pink Ball'. In addition, each kid should stay associated with their presents.

The colours and the presents can be anything; we only know that the structure is Colour Present1/Present2, Colour Present, or just Present. So the result should be:

  • for Colour Present1/Present2 --> Colour Present1 and Colour Present2
  • for Colour Present --> Colour Present
  • for Present --> Present

So the final dataframe should look like this:

           Presents    Kids
0         Pink Doll   Chris
1         Pink Ball   Chris
2              Bear    Jane
3              Ball    Jane
4            Barbie   Betty
5   Blue Sunglasses   Harry
6     Blue Airplane   Harry
7    Orange Kitchen  Claire
8        Orange Car  Claire
9              Bear   Sofia
10             Doll   Sofia
11      Purple Game    Alex

My first approach was to transform the columns into lists and work with lists. Like this:

def count_total_words(string):
    total = 1
    for i in range(len(string)):
        if (string[i] == ' '):
            total = total + 1
    return total

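# The two dataframe columns converted to plain lists
coloured_presents_list = df['Presents'].tolist()
kids_list = df['Kids'].tolist()

coloured_presents_to_remove_list = []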
index_with_slash_list = []
first_present = ''
second_present= ''
index_with_slash = -1
refactored_second_present = ''
for coloured_present in coloured_presents_list:
    if (coloured_present.find('/') >= 0):
        index_with_slash = coloured_presents_list.index(coloured_present)
        index_with_slash_list.append(index_with_slash)
        first_present, second_present = coloured_present.split('/')
        coloured_presents_to_remove_list.append(coloured_present)
        if count_total_words(first_present) == 2:
            refactored_second_present = first_present.split(' ', 1)[0] + ' ' + second_present
            second_present = refactored_second_present
        coloured_presents_list.append(first_present)
        coloured_presents_list.append(second_present)
        kids_list.insert(coloured_presents_list.index(first_present), kids_list[index_with_slash])
        kids_list.insert(coloured_presents_list.index(second_present), kids_list[index_with_slash])
        
for present in coloured_presents_to_remove_list:
    coloured_presents_list.remove(present)

for index in index_with_slash_list:
    kids_list.pop(index)

However, I realized that at some point I might lose an index by mistake, so I tried working with pandas directly on the dataframe:

mask = df['Presents'].str.contains('/', na=False, regex=False)
df['First Present'], df['Second Present'] = df.loc[mask, 'Presents'].split('/')
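
Note on the attempt above: the second line raises an AttributeError, because a pandas Series has no split method of its own; the string methods live under the .str accessor. A minimal sketch of the same idea, assuming a reduced two-row frame and the column names from the attempt (the split_cols variable is hypothetical):

```python
import pandas as pd

# Reduced sample of the question's data (assumption: two rows are enough to illustrate)
df = pd.DataFrame({'Presents': ['Pink Doll / Ball', 'Barbie'],
                   'Kids': ['Chris', 'Betty']})

mask = df['Presents'].str.contains('/', na=False, regex=False)

# .str.split with expand=True returns a DataFrame with one column per part
split_cols = df.loc[mask, 'Presents'].str.split('/', n=1, expand=True)

# Column assignment aligns on the index, so rows without '/' get NaN
df['First Present'] = split_cols[0].str.strip()
df['Second Present'] = split_cols[1].str.strip()

print(df)
```

This only separates the two parts; it does not yet copy the colour to the second present or turn the parts into separate rows.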
asked Mar 31 '21 14:03 by Elisa L.



2 Answers

You could use str.split with a regex and expand=True to get the first and second present. Note that this handles the three cases 'present1/present2', 'colour present' and 'present'. In the latter two cases the newly created column 'present2' will be None.

To handle the case 'colour present1/present2' you can use str.extract with a regular expression containing the permissible colours (see colours_regex below). This distinguishes a colour from a present consisting of two words (e.g. 'Barbie Doll').

The final step is then to use melt with 'Kids' as the identifier.

# Split on '/' and drop any surrounding whitespace; present2 is None when there is no '/'
df[['present1', 'present2']] = df.Presents.str.split(r'\s*/\s*', expand=True)
colours_regex = '(Blue|Purple|Pink|Orange)'  # maybe not ideal if there are vast amounts of colours as this needs updating for every colour
df['colour'] = df.present1.str.extract(colours_regex, expand=False)
# Prefix the colour to present2 only where both a colour and a second present exist
has_both = df.colour.notnull() & df.present2.notnull()
df.loc[has_both, 'present2'] = df.loc[has_both, ['colour', 'present2']].agg(' '.join, axis=1)
# Stack present1/present2 into one column, keeping the kid for each row
result = df.melt(id_vars='Kids', value_vars=['present1', 'present2'], value_name='Present')
result = result.loc[result.Present.notnull(), ['Present', 'Kids']]
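
Note that melt stacks all present1 rows before the present2 rows, so result is not in the row order shown in the question. A minimal sketch of one way to restore that order, assuming pandas 1.1 or later (for the ignore_index parameter of melt) and a reduced four-row sample of the data; the has_both name is hypothetical:

```python
import pandas as pd

# Reduced sample of the question's data (assumption)
df = pd.DataFrame({
    'Presents': ['Pink Doll / Ball', 'Bear/ Ball', 'Barbie', 'Purple Game'],
    'Kids': ['Chris', 'Jane', 'Betty', 'Alex'],
})

df[['present1', 'present2']] = df['Presents'].str.split(r'\s*/\s*', expand=True)
df['colour'] = df['present1'].str.extract('(Blue|Purple|Pink|Orange)', expand=False)

# Prefix the colour only where both a colour and a second present exist
has_both = df['colour'].notna() & df['present2'].notna()
df.loc[has_both, 'present2'] = df.loc[has_both, 'colour'] + ' ' + df.loc[has_both, 'present2']

# ignore_index=False keeps the original row labels, so a stable sort
# puts each kid's present1 directly before their present2
result = df.melt(id_vars='Kids', value_vars=['present1', 'present2'],
                 value_name='Present', ignore_index=False)
result = (result.dropna(subset=['Present'])
                .sort_index(kind='mergesort')
                .reset_index(drop=True)[['Present', 'Kids']])

print(result)
```

mergesort is used because it is a stable sort, which keeps present1 ahead of present2 for rows that share the same original label.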
answered Oct 11 '22 12:10 by gofvonx


Try this one:

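import numpy as np

s = df['Presents'].str.split('/')
# First and last part of each entry, with surrounding whitespace removed
a, b = s.str[0].str.strip(), s.str[-1].str.strip()
# True where the first part has more than one word and a second present exists
c = a.str.count(' ').gt(0) & s.str.len().ge(2)
# Prefix the first word of the first part (the colour) to the second present where needed
arr = np.where(c, b.radd(a.str.split().str[0].str.strip() + ' '), b)
out = (pd.concat((a, pd.Series(arr, index=s.index, name=s.name)))
       .sort_index().to_frame().join(df[['Kids']]))
# Entries without '/' appear twice (a equals b), so drop the duplicates
out.drop_duplicates()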

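Before the final drop_duplicates call, out looks like this; that call then removes the repeated rows for the kids with a single present (Betty and Alex):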

          Presents    Kids
0        Pink Doll   Chris
0        Pink Ball   Chris
1             Bear    Jane
1             Ball    Jane
2           Barbie   Betty
2           Barbie   Betty
3  Blue Sunglasses   Harry
3    Blue Airplane   Harry
4   Orange Kitchen  Claire
4       Orange Car  Claire
5             Bear   Sofia
5             Doll   Sofia
6      Purple Game    Alex
6      Purple Game    Alex
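
As a possible alternative that avoids producing the duplicates in the first place, each entry can be turned into a list of presents and then expanded with explode. This is only a sketch on a reduced sample of the data: the to_presents helper is hypothetical, and it assumes, like the code above, that the first word of a multi-word first part is the colour.

```python
import pandas as pd

# Reduced sample of the question's data (assumption)
df = pd.DataFrame({
    'Presents': ['Pink Doll / Ball', 'Bear/ Ball', 'Barbie', 'Purple Game'],
    'Kids': ['Chris', 'Jane', 'Betty', 'Alex'],
})

def to_presents(text):
    # Hypothetical helper: returns the list of presents for one entry
    first, *rest = [part.strip() for part in text.split('/')]
    words = first.split()
    # Only prefix a colour when there is a second present and the first part has 2+ words
    prefix = words[0] + ' ' if rest and len(words) > 1 else ''
    return [first] + [prefix + item for item in rest]

# explode turns each list element into its own row and repeats the kid
out = (df.assign(Presents=df['Presents'].map(to_presents))
         .explode('Presents')
         .reset_index(drop=True))

print(out)
```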

Happy Coding!

answered Oct 11 '22 11:10 by Ariadne R.