I have a dataset as follows:
import pandas as pd

data = {"C1": ['DDDSSDSSDS', 'SSDDDSSDDS', 'DDDDDDDDDD', 'SSSSSSSSSS',
               'SSSSSSSDSS', 'DDDDDSDDDD', 'SDDDDDDDDD']}
dt = pd.DataFrame(data)
print(dt)
For each string, I want the start and end positions of each uninterrupted group of S characters. For example, the first row is 'DDDSSDSSDS', which contains three groups of S, and my desired output for it is something like [(3, 5), (6, 8), (9, 10)], which gives the positions of the first, second, and third uninterrupted S groups in that row.
So the desired output would look like this:
   C1          C2
0  DDDSSDSSDS  [(3, 5), (6, 8), (9, 10)]
1  SSDDDSSDDS  [(0, 2), (5, 7), (9, 10)]
2  DDDDDDDDDD  []
3  SSSSSSSSSS  [(0, 10)]
4  SSSSSSSDSS  [(0, 7), (8, 10)]
5  DDDDDSDDDD  [(5, 6)]
6  SDDDDDDDDD  [(0, 1)]
My current solution is:
import re

def split_it(mystring):
    # '(S*)' also matches the empty string, which is why empty entries appear
    x = re.findall('(S*)', mystring)
    if x:
        return x

dt['C2'] = dt['C1'].apply(split_it)
print(dt)
which leads to the following output:
0 DDDSSDSSDS [, , , SS, , SS, , S, ]
1 SSDDDSSDDS [SS, , , , SS, , , S, ]
2 DDDDDDDDDD [, , , , , , , , , , ]
3 SSSSSSSSSS [SSSSSSSSSS, ]
4 SSSSSSSDSS [SSSSSSS, , SS, ]
5 DDDDDSDDDD [, , , , , S, , , , , ]
6 SDDDDDDDDD [S, , , , , , , , , , ]
You can use re.finditer:
import re

def split_it(mystring):
    # Collect the (start, end) position of every run of one or more S
    return [(m.start(), m.end()) for m in re.finditer('S+', mystring)]
Output:
>>> dt['C1'].apply(split_it)
0 [(3, 5), (6, 8), (9, 10)]
1 [(0, 2), (5, 7), (9, 10)]
2 []
3 [(0, 10)]
4 [(0, 7), (8, 10)]
5 [(5, 6)]
6 [(0, 1)]
Name: C1, dtype: object
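The re.finditer('S+', mystring) call returns an iterator over all match objects found in the string, and you can get each match's start and end positions via its .start() and .end() methods. The end position is exclusive, i.e. it is the index just past the last S of the group.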
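As a small illustration of how those positions behave, here is a sketch on a single string taken from the question (the variable names s, spans and runs are only for illustration). It uses m.span(), which returns the same pair as (m.start(), m.end()), and slices the string with each pair to show that the end index is exclusive:

```python
import re

s = "SSDDDSSDDS"

# m.span() is equivalent to (m.start(), m.end()).
spans = [m.span() for m in re.finditer(r"S+", s)]
print(spans)  # [(0, 2), (5, 7), (9, 10)]

# The end index is exclusive, so slicing with each pair
# gives back exactly the matched run of S characters.
runs = [s[start:end] for start, end in spans]
print(runs)  # ['SS', 'SS', 'S']
```

If you need 1-based or inclusive positions instead, adjust the pairs after collecting them.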
Note that you got empty matches in your output because S* matches zero or more S characters; you need to use + to match one or more.
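To see the difference between the two quantifiers directly, here is a minimal comparison on the first string from the question (the names star and plus are only for illustration). The exact number of empty strings returned for S* can vary between Python versions, so only the S+ result is annotated:

```python
import re

s = "DDDSSDSSDS"

# S* may match the empty string, so findall also reports
# empty matches at positions where no S begins.
star = re.findall(r"S*", s)

# S+ requires at least one S, so only the real runs are returned.
plus = re.findall(r"S+", s)

print(star)
print(plus)  # ['SS', 'SS', 'S']
```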