How to split a string on '/' and rebuild it from the split substrings in a dataframe?

I need to split the strings in a column on the character '/' and rebuild them as described below.

This dataframe contains some kids and their presents for Easter. Some kids have two presents, while some have only one.

import pandas as pd

data = {'Presents': ['Pink Doll / Ball', 'Bear/ Ball', 'Barbie', 'Blue Sunglasses/Airplane', 'Orange Kitchen/Car', 'Bear/Doll', 'Purple Game'],
        'Kids': ['Chris', 'Jane', 'Betty', 'Harry', 'Claire', 'Sofia', 'Alex']}

df = pd.DataFrame(data, columns=['Presents', 'Kids'])

print(df)

This dataframe looks like this:

                   Presents    Kids
0          Pink Doll / Ball   Chris
1                Bear/ Ball    Jane
2                    Barbie   Betty
3  Blue Sunglasses/Airplane   Harry
4        Orange Kitchen/Car  Claire
5                 Bear/Doll   Sofia
6               Purple Game    Alex

I am trying to split the presents and rebuild them as follows, keeping the associated colours:

'Pink Doll / Ball' should be split into two parts: 'Pink Doll' and 'Pink Ball'. In addition, each resulting present should stay associated with the same kid.

The colours and the presents can be anything. We only know that the structure is 'Colour Present1/Present2', 'Colour Present', or just 'Present'. So the mapping should be:

for Colour Present1/Present2 --> Colour Present1 and Colour Present2
for Colour Present --> Colour Present
for Present --> Present
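
Read literally, the rule above can be sketched as a small plain-Python helper. This is only an illustration, not part of the original attempt: the name expand_presents is hypothetical, and it assumes that whenever the piece before '/' has two or more words, its first word is the colour.

```python
def expand_presents(text):
    """Return the list of presents described by one 'Presents' cell."""
    # Split on '/', dropping the whitespace around each piece.
    parts = [part.strip() for part in text.split('/')]
    first_words = parts[0].split()
    # Assumption: a first piece with two or more words starts with the colour.
    if len(parts) > 1 and len(first_words) > 1:
        colour = first_words[0]
        return [parts[0]] + [f'{colour} {part}' for part in parts[1:]]
    return parts


for cell in ['Pink Doll / Ball', 'Bear/ Ball', 'Barbie', 'Purple Game']:
    print(cell, '->', expand_presents(cell))
```

Note that under this assumption a two-word present without a colour in front of a '/' (for example 'Teddy Bear/Ball') would be treated as if 'Teddy' were the colour.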

So the final dataframe should look like this:

           Presents    Kids
0         Pink Doll   Chris
1         Pink Ball   Chris
2              Bear    Jane
3              Ball    Jane
4            Barbie   Betty
5   Blue Sunglasses   Harry
6     Blue Airplane   Harry
7    Orange Kitchen  Claire
8        Orange Car  Claire
9              Bear   Sofia
10             Doll   Sofia
11      Purple Game    Alex

My first approach was to convert the columns to lists and work with those, like this:

def count_total_words(string):
    total = 1
    for i in range(len(string)):
        if (string[i] == ' '):
            total = total + 1
    return total

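# Assumed setup (not shown in the original post): the two columns as plain lists
coloured_presents_list = df['Presents'].tolist()
kids_list = df['Kids'].tolist()

coloured_presents_to_remove_list = []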
index_with_slash_list = []
first_present = ''
second_present= ''
index_with_slash = -1
refactored_second_present = ''
for coloured_present in coloured_presents_list:
    if (coloured_present.find('/') >= 0):
        index_with_slash = coloured_presents_list.index(coloured_present)
        index_with_slash_list.append(index_with_slash)
        first_present, second_present = coloured_present.split('/')
        coloured_presents_to_remove_list.append(coloured_present)
        if count_total_words(first_present) == 2:
            refactored_second_present = first_present.split(' ', 1)[0] + ' ' + second_present
            second_present = refactored_second_present
        coloured_presents_list.append(first_present)
        coloured_presents_list.append(second_present)
        kids_list.insert(coloured_presents_list.index(first_present), kids_list[index_with_slash])
        kids_list.insert(coloured_presents_list.index(second_present), kids_list[index_with_slash])
        
for present in coloured_presents_to_remove_list:
    coloured_presents_list.remove(present)

for index in index_with_slash_list:
    kids_list.pop(index)

However, I realized that at some point I might lose track of an index by mistake, so I tried working directly on the pandas dataframe instead:

mask = df['Presents'].str.contains('/', na=False, regex=False)
df['First Present'], df['Second Present'] = df.loc[mask, 'Presents'].split('/')
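
The second line of this attempt raises an AttributeError, because a pandas Series has no split method of its own; string methods are reached through the .str accessor. A minimal sketch of what the line was presumably meant to do, on a shortened copy of the data (the mask is left out here, since expand=True already pads rows without a '/'):

```python
import pandas as pd

df = pd.DataFrame({
    'Presents': ['Pink Doll / Ball', 'Bear/ Ball', 'Barbie'],
    'Kids': ['Chris', 'Jane', 'Betty'],
})

# expand=True returns one column per piece; rows without '/' get a
# missing value in the second column.
pieces = df['Presents'].str.split('/', expand=True)
df['First Present'] = pieces[0].str.strip()
df['Second Present'] = pieces[1].str.strip()

print(df)
```

This only separates the two presents; carrying the colour over to the second present and producing one row per present still has to be done afterwards.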

asked Mar 31 '21 14:03

Elisa L.

2 Answers

You could use str.split with a regex and expand=True to get the first and second present. Note that this handles the three cases 'present1/present2', 'colour present' and 'present'. In the latter two cases the newly created column 'present2' will be None.

To handle the case 'colour present1/present2' you can use str.extract with a regular expression listing the permissible colours (see colours_regex below). This distinguishes a colour from a present that consists of two words (e.g. 'Barbie Doll').

The final step is then to use melt with 'Kids' as the identifier:

# Split on '/', swallowing surrounding whitespace; present2 is None when there is no '/'
df[['present1', 'present2']] = df.Presents.str.split(r'\s*/\s*', expand=True)
colours_regex = '(Blue|Purple|Pink|Orange)'  # needs updating for every new colour, so not ideal for a large set of colours
df['colour'] = df.present1.str.extract(colours_regex, expand=False)
# Prefix the second present with the colour found in the first one
has_both = df.colour.notnull() & df.present2.notnull()
df.loc[has_both, 'present2'] = df.loc[has_both, ['colour', 'present2']].agg(' '.join, axis=1)
result = df.melt(id_vars='Kids', value_vars=['present1', 'present2'], value_name='Present')
result = result.loc[result.Present.notnull(), ['Present', 'Kids']]
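
One thing to be aware of: melt stacks all 'present1' rows before all 'present2' rows, so the result is not grouped per kid the way the expected table in the question is. A possible way to restore that order, sketched on a small hypothetical frame shaped like df after the colour step, is to keep the original row labels and sort on them. This assumes pandas 1.1 or later, where melt accepts ignore_index.

```python
import pandas as pd

# Hypothetical intermediate frame, shaped like df after the colour step.
df = pd.DataFrame({
    'Kids': ['Chris', 'Jane', 'Betty'],
    'present1': ['Pink Doll', 'Bear', 'Barbie'],
    'present2': ['Pink Ball', 'Ball', None],
})

# ignore_index=False keeps each source row's label on both melted rows.
result = df.melt(id_vars='Kids', value_vars=['present1', 'present2'],
                 value_name='Present', ignore_index=False)
# A stable sort on those labels puts present1 directly before present2.
result = (result.sort_index(kind='mergesort')
                .dropna(subset=['Present'])
                .reset_index(drop=True)[['Present', 'Kids']])
print(result)
```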

answered Oct 11 '22 12:10

gofvonx

Try this one:

import numpy as np
import pandas as pd

s = df['Presents'].str.split('/')
# a: first piece, b: last piece (identical to a when there is no '/')
a, b = s.str[0].str.strip(), s.str[-1].str.strip()
c = a.str.count(' ').gt(0) & s.str.len().ge(2)
arr = np.where(c, b.radd(a.str.split().str[0] + ' '), b)
out = (pd.concat((a, pd.Series(arr, index=s.index, name=s.name)))
       .sort_index(kind='mergesort').to_frame().join(df[['Kids']]))
# rows without '/' appear twice at this point, so drop the repeats
out = out.drop_duplicates()

Printing out after running the code above gives:

          Presents    Kids
0        Pink Doll   Chris
0        Pink Ball   Chris
1             Bear    Jane
1             Ball    Jane
2           Barbie   Betty
3  Blue Sunglasses   Harry
3    Blue Airplane   Harry
4   Orange Kitchen  Claire
4       Orange Car  Claire
5             Bear   Sofia
5             Doll   Sofia
6      Purple Game    Alex

Happy Coding!

answered Oct 11 '22 11:10

Ariadne R.

Tags: python, pandas, dataframe
