How to apply regex over all the rows of a dataset?

I have a dataset as follows:

    import pandas as pd

    data = {"C1": ['DDDSSDSSDS', 'SSDDDSSDDS', 'DDDDDDDDDD', 'SSSSSSSSSS',
                   'SSSSSSSDSS', 'DDDDDSDDDD', 'SDDDDDDDDD']}
    dt = pd.DataFrame(data)
    print(dt)

For each string, I want the start and end positions of every uninterrupted group of S characters. For example, the first row, 'DDDSSDSSDS', contains three groups of S, and the output I want for it is something like [(3, 5), (6, 8), (9, 10)], which gives the positions of the first, second, and third uninterrupted S groups in that row.

So the output could look like this:

           C1                         C2
0  DDDSSDSSDS  [(3, 5), (6, 8), (9, 10)]
1  SSDDDSSDDS  [(0, 2), (5, 7), (9, 10)]
2  DDDDDDDDDD                         []
3  SSSSSSSSSS                  [(0, 10)]
4  SSSSSSSDSS          [(0, 7), (8, 10)]
5  DDDDDSDDDD                   [(5, 6)]
6  SDDDDDDDDD                   [(0, 1)]

My current solution is:

    import re

    def split_it(mystring):
        x = re.findall('(S*)', mystring)
        if x:
            return x

    dt['C2'] = dt['C1'].apply(split_it)
    print(dt)

which leads to the following output:

0  DDDSSDSSDS  [, , , SS, , SS, , S, ]
1  SSDDDSSDDS  [SS, , , , SS, , , S, ]
2  DDDDDDDDDD   [, , , , , , , , , , ]
3  SSSSSSSSSS           [SSSSSSSSSS, ]
4  SSSSSSSDSS        [SSSSSSS, , SS, ]
5  DDDDDSDDDD  [, , , , , S, , , , , ]
6  SDDDDDDDDD  [S, , , , , , , , , , ]
Joe the Second asked Oct 10 '20 20:10


1 Answer

You can use

    import re

    def split_it(mystring):
        # (start, end) of every run of one or more consecutive 'S'; end is exclusive
        return [(m.start(), m.end()) for m in re.finditer(r'S+', mystring)]

Output:

>>> dt['C1'].apply(split_it)
0    [(3, 5), (6, 8), (9, 10)]
1    [(0, 2), (5, 7), (9, 10)]
2                           []
3                    [(0, 10)]
4            [(0, 7), (8, 10)]
5                     [(5, 6)]
6                     [(0, 1)]
Name: C1, dtype: object

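re.finditer('S+', mystring) returns an iterator over all match objects found in the string, and you can get each match's start and end positions via its .start() and .end() methods. The end position is exclusive, so (3, 5) covers the characters at indices 3 and 4.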
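As a small sketch of the same idea without pandas: each match object also has a .span() method that returns the (start, end) pair in one call. The helper name s_group_spans is made up for this illustration.

```python
import re

def s_group_spans(text):
    # m.span() is equivalent to (m.start(), m.end()); the end index is exclusive
    return [m.span() for m in re.finditer(r"S+", text)]

spans = s_group_spans("DDDSSDSSDS")
print(spans)  # [(3, 5), (6, 8), (9, 10)]
```

Because the function takes a plain string and returns a list, it should drop into dt['C1'].apply(...) the same way split_it does.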

Note that you got empty matches in your output because S* matches zero or more S characters, so it also matches the empty string at every position where there is no S. Use + instead to match one or more.
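To see the difference between the two quantifiers in isolation, here is a minimal sketch on a short made-up string (the sample text and variable names are only for illustration):

```python
import re

text = "DDSSD"

# S* can match zero characters, so it also yields empty strings
star = re.findall(r"S*", text)

# S+ needs at least one S, so only the real run is returned
plus = re.findall(r"S+", text)

print(star)
print(plus)  # ['SS']
```

Filtering the empty strings out of the S* result would recover the groups themselves, but not their positions, which is why re.finditer with S+ is the better fit here.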

Wiktor Stribiżew answered Oct 07 '22 04:10
