Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

using rsplit on pandas dataframe column to separate based on second instance of a delimiter

I have a column of a pandas dataframe that I would like to split and expand into a new dataframe based on the second instance of a delimiter. I was splitting based on the last instance of the delimiter, but unfortunately there are a handful of instances in ~80k rows that have 4 '_' instead of 3.

For example, I have a dataframe with multiple columns where the one I would like to split into a new dataframe looks like the following:

df.head()

   gene
0  NM_000000_foo_blabla
1  NM_000001_bar

and I want to split & expand it such that it separates to this:

(Desired)

df2.head()

   col1          col2
0  NM_000000     foo_bar
1  NM_000001     foo

In using my current code:

df2 = df['gene'].str.rsplit('_', 1, expand=True).rename(lambda x: f'col{x + 1}', axis=1) 

I get this:

(Actual)

df2.head()

   col1          col2
0  NM_000000_foo bar
1  NM_000001     foo

Is there a simple way to achieve this my modifying the line of code I'm already using? I tried playing with the number of splits in rsplit but couldn't achieve the result I was looking for. Thanks!

like image 228
Margaret Gruca Avatar asked Jan 27 '26 01:01

Margaret Gruca


2 Answers

Since your data seems to be fairly well defined, you can extract on the second instance of the delimiter using a regular expression.

df['gene'].str.extract(r'(?:[^_]+_){2}(.*)')

            0
0  foo_blabla
1         bar

You can generalize this to be any delimiter, and match it any number of times using a simple function:

def build_regex(delimiter, num_matches=1):
    return rf'(?:[^{delimiter}]+{delimiter}){{{num_matches}}}(.*)'

>>> build_regex('_', 2)
'(?:[^_]+_){2}(.*)'

>>> df['gene'].str.extract(build_regex('_', 2))
            0
0  foo_blabla
1         bar

>>> df['gene'].str.extract(build_regex('_', 3))
        0
0  blabla
1     NaN

Regex Explanation

(?:            # non capture group
  [^_]+        # match anything but _ one or more times
  _            # match _
){2}           # match this group 2 times
(              # start of capture group 1
  .*           # match anything greedily
)              # end of matching group 1

If there wasn't guaranteed to be text before either of the first two delimiters, you can also make the not assertion match 0 or more times:

(?:[^_]*_){2}(.*)
like image 180
user3483203 Avatar answered Jan 29 '26 14:01

user3483203


Just replace 2nd '_' by your custom deliminator and split on it

df.gene.str.replace(r'([^_]+_[^_]+)_', r'\1|').str.split('|', expand=True)

Out[488]:
           0           1
0  NM_000000  foo_blabla
1  NM_000001  bar 
like image 45
Andy L. Avatar answered Jan 29 '26 13:01

Andy L.



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!