I am trying to make sure that a particular column in a dataframe does not contain any illegal values (non-numerical data). For this purpose I am trying to use regex matching with rlike to collect the illegal values in the data:
I need to collect the values with string characters or spaces or commas or any other characters that are not like numbers. I tried:
spark.sql("select * from tabl where UPC not rlike '[0-9]*'").show()
but this doesn't work; it produces 0 rows.
Any help is appreciated. Thank you.
Spark SQL rlike() function: similar to SQL regexp_like(), Spark SQL has rlike(), which takes a regular expression (regex) as input and matches the input column value against it.
Neither Spark SQL nor Apache Hive provides an isnumeric function. You have to write a user-defined function in your preferred programming language and register it in Spark, or use an alternative SQL approach to check for numeric values.
Check whether all characters in each string are alphanumeric. This is equivalent to running the Python string method str.isalnum() for each element of the Series/Index.
rlike looks for a match anywhere within the string. The asterisk (*) means zero or more, and a run of zero digits can be found somewhere in every possible string. So '[0-9]*' matches every row, and not rlike therefore returns none.
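To see this behaviour outside Spark, here is a minimal sketch in plain Python. It assumes that Python's re.search behaves like rlike for this simple pattern (both succeed when the pattern matches anywhere in the string); the sample values are made up for illustration and are not from the original table.

```python
import re

# Hypothetical sample values; not taken from the original table.
samples = ["12345", "12a45", "abc", "1,234", " "]

# re.search, like rlike, succeeds if the pattern matches anywhere in the string.
# '[0-9]*' can match an empty run of digits at position 0, so it always succeeds.
unanchored_hits = [s for s in samples if re.search(r"[0-9]*", s)]

# Rough equivalent of `not rlike '[0-9]*'`: no value is ever left over.
flagged = [s for s in samples if not re.search(r"[0-9]*", s)]

print(unanchored_hits)  # every sample, including "abc" and " "
print(flagged)          # empty list
```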
You need to specify that you want to match from the beginning of the string (^) to the end of the string ($):
spark.sql("select * from tabl where UPC not rlike '^[0-9]*$'").show()
Alternatively, you can match any single non-numeric character anywhere in the string with [^0-9]:
spark.sql("select * from tabl where UPC rlike '[^0-9]'").show()
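The two queries can be compared with a small plain-Python sketch. As an assumption, re.search stands in for rlike here, and the UPC values are hypothetical. The last pattern, with + instead of *, is not part of the answer; it is shown only because * also accepts an empty string as "all digits".

```python
import re

# Hypothetical UPC values, including an empty string as an edge case.
samples = ["012345678905", "12a45", "1,234", "12 34", ""]

# Mirrors `not rlike '^[0-9]*$'`: flag values that are not digits from start to end.
not_all_digits = [s for s in samples if not re.search(r"^[0-9]*$", s)]

# Mirrors `rlike '[^0-9]'`: flag values containing at least one non-digit character.
has_non_digit = [s for s in samples if re.search(r"[^0-9]", s)]

# Variation (an assumption, not from the answer): + requires at least one digit,
# so the empty string is flagged as well.
not_one_or_more_digits = [s for s in samples if not re.search(r"^[0-9]+$", s)]

print(not_all_digits)
print(has_non_digit)
print(not_one_or_more_digits)
```

On these samples the first two approaches flag the same values, and neither flags the empty string. If empty values should count as illegal too, the + variant is one option.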