
pyspark regex string matching

I have strings in a dataframe in the following format:

abc.T01.xyz
abc.def.T01.xyz
abc.def.ghi.xyz

I need to keep only the rows where this string matches the following pattern:

[a-zA-Z].T[0-9].[a-zA-Z]

I used the following command, but it also returns strings of the form [a-zA-Z].[a-zA-Z].T[0-9].[a-zA-Z], which I don't want in my result:

mydf2 = mydf1.where('col1 rlike ".*\.T.*\..*"')
mydf2.show()

I am missing something in my regex.

akn asked Feb 10 '26 10:02


1 Answer

Translate your requirements directly into the pattern instead of using a dot-star soup, and add anchors:

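# Target shape: [a-zA-Z].T[0-9].[a-zA-Z]
# [.] matches a literal dot, so no backslash has to survive Python and Spark SQL string parsing.
mydf2 = mydf1.where('col1 rlike "^[a-zA-Z.]+[.]T[0-9]+[.][a-zA-Z.]+$"')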

See a demo on regex101.com.
Note that I have also added the dot to both character classes (is this a requirement?); without it, your second string, abc.def.T01.xyz, won't be matched. If that is not what you want, remove the dot from the classes.
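To compare the variants against the sample strings without a Spark session, here is a minimal sketch using Python's `re` module. It assumes that `re.search` is a fair stand-in for `rlike` on these simple patterns (Spark uses Java regular expressions, and both look for a match anywhere in the string unless the pattern is anchored). The names `samples`, `DOT_STAR`, `LOOSE`, `STRICT` and `matching` are illustrative and do not come from the original post.

```python
import re

# Sample values from the question.
samples = ["abc.T01.xyz", "abc.def.T01.xyz", "abc.def.ghi.xyz"]

# The question's pattern: unanchored, and ".*" also swallows extra dotted segments.
DOT_STAR = re.compile(r".*\.T.*\..*")
# Anchored pattern with dots allowed inside the first and last groups.
LOOSE = re.compile(r"^[a-zA-Z.]+[.]T[0-9]+[.][a-zA-Z.]+$")
# Variant with the dot removed from the classes: exactly three segments.
STRICT = re.compile(r"^[a-zA-Z]+[.]T[0-9]+[.][a-zA-Z]+$")


def matching(pattern, values):
    """Return the values in which the pattern is found (search semantics)."""
    return [v for v in values if pattern.search(v)]


dot_star_hits = matching(DOT_STAR, samples)
loose_hits = matching(LOOSE, samples)
strict_hits = matching(STRICT, samples)

print(dot_star_hits)
print(loose_hits)
print(strict_hits)
```

Only the strict variant drops abc.def.T01.xyz, which is the row the question wants to exclude. In PySpark, the same pattern string can presumably be passed through the Column API instead, for example `mydf1.where(mydf1.col1.rlike(pattern))` with `pattern` being a hypothetical variable holding the regex, which avoids the SQL string parsing step altogether.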

Jan answered Feb 17 '26 02:02


