I have strings in a DataFrame column in the following formats:
abc.T01.xyz
abc.def.T01.xyz
abc.def.ghi.xyz
I need to keep only the rows where this string matches the following pattern:
[a-zA-Z].T[0-9].[a-zA-Z]
I have used the following command, but it also returns strings of the form [a-zA-Z].[a-zA-Z].T[0-9].[a-zA-Z], which I don't want in my result:
mydf2 = mydf1.where('col1 rlike ".*\.T.*\..*"')
mydf2.show()
I am missing something in my regex.
Translate your requirements directly into the pattern instead of using a dot-star soup, and add anchors:
# [a-zA-Z].T[0-9].[a-zA-Z]
mydf2 = mydf1.where('col1 rlike "^[a-zA-Z.]+\.T[0-9]+\.[a-zA-Z.]+$"')
See a demo on regex101.com.
Please note that I have also added the dot to the character classes (is this a requirement?); otherwise your second string won't be matched. If this is not what you want, remove it from the classes.
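To see how the three patterns behave on the sample strings from the question, here is a minimal sketch using Python's built-in re module instead of Spark. It assumes that re.search is a fair stand-in for rlike, which also looks for a match anywhere in the string unless the pattern is anchored. The names LOOSE, STRICT, WITH_DOT and matching are made up for this illustration.

```python
import re

# Sample values taken from the question.
samples = ["abc.T01.xyz", "abc.def.T01.xyz", "abc.def.ghi.xyz"]

# The original "dot-star" pattern: unanchored, and .* also matches dots.
LOOSE = r".*\.T.*\..*"
# Anchored pattern without the dot in the character classes.
STRICT = r"^[a-zA-Z]+\.T[0-9]+\.[a-zA-Z]+$"
# Anchored pattern with the dot allowed inside the character classes.
WITH_DOT = r"^[a-zA-Z.]+\.T[0-9]+\.[a-zA-Z.]+$"


def matching(pattern, values):
    """Return the values in which the pattern is found (search semantics)."""
    return [v for v in values if re.search(pattern, v)]


loose_hits = matching(LOOSE, samples)
strict_hits = matching(STRICT, samples)
with_dot_hits = matching(WITH_DOT, samples)

print("loose:   ", loose_hits)
print("strict:  ", strict_hits)
print("with dot:", with_dot_hits)
```

Only the variant without the dot in the classes rejects abc.def.T01.xyz, which appears to be what the question asks for. In PySpark, the same pattern could presumably be passed as a raw string through the Column API, e.g. mydf1.where(mydf1["col1"].rlike(r"^[a-zA-Z]+\.T[0-9]+\.[a-zA-Z]+$")), so that the backslashes are not reinterpreted by the SQL string parser.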