
PySpark: How to check if a column contains a number using isnan [duplicate]

I have a dataframe which looks like this:

+------------------------+----------+
|Postal code             |PostalCode|
+------------------------+----------+
|Muxía                   |null      |
|Fuensanta               |null      |
|Salobre                 |null      |
|Bolulla                 |null      |
|33004                   |null      |
|Santa Eulàlia de Ronçana|null      |
|Cabañes de Esgueva      |null      |
|Vallarta de Bureba      |null      |
|Villaverde del Monte    |null      |
|Villaluenga del Rosario |null      |
+------------------------+----------+

If the Postal code column contains only numbers, I want to create a new column where only the numerical postal codes are stored. If the Postal code column contains only text, I want to create a new column called 'Municipality'.

I tried to use 'isnan', as my understanding is that it checks whether a value is not a number, but this does not seem to work. Should the column type be string for this to work?

My attempt so far is:

> df2 = df.withColumn('PostalCode', when(isnan(df['Postal code']), df['Postal code']))

Looking at the example dataframe output posted above, you can see that null is returned for every row of the new column, including the row with postal code '33004'.

Any ideas will be much appreciated

Juanita Smith asked Jan 30 '23 11:01


1 Answer

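isnan only returns true for the special floating-point value NaN ("not a number"), for example float('nan'). It does not test whether a value looks numeric: in any other case, including strings such as 'Muxía' and ordinary numbers such as 33004, it returns false. That is why the when(...) condition in your attempt never matches and the new column is null in every row. If you want to check whether a column contains a numerical value, one option is to define your own udf, for example as shown below: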

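from pyspark.sql import SparkSession
from pyspark.sql.functions import when, udf
from pyspark.sql.types import BooleanType

# Reuse the active session (or start one) so the example runs standalone.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('33004', ''), ('Muxia', None), ('Fuensanta', None)], ("Postal code", "PostalCode"))

def is_digit(value):
    # None and empty strings do not count as numeric.
    if value:
        return value.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())

# Digit-only values go to PostalCode; everything else goes to Municipality.
# when() without otherwise() leaves non-matching rows as null.
df = df.withColumn('PostalCode', when(is_digit_udf(df['Postal code']), df['Postal code']))
df = df.withColumn('Municipality', when(~is_digit_udf(df['Postal code']), df['Postal code']))
df.show()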

This gives the following output:

+-----------+----------+------------+
|Postal code|PostalCode|Municipality|
+-----------+----------+------------+
|      33004|     33004|        null|
|      Muxia|      null|       Muxia|
|  Fuensanta|      null|   Fuensanta|
+-----------+----------+------------+  
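
One caveat with the is_digit check above: Python's str.isdigit() also accepts non-ASCII digit characters such as '²' or Arabic-Indic digits, which may not be what you want for postal codes. Below is a minimal plain-Python sketch of a stricter check, assuming postal codes consist of ASCII digits 0-9 only. The name is_ascii_digits is hypothetical and not part of the answer above.

```python
import re

# Assumption: a valid postal code is one or more ASCII digits, nothing else.
_ASCII_DIGITS = re.compile(r"[0-9]+")

def is_ascii_digits(value):
    """Return True only for a non-empty string made of ASCII digits 0-9."""
    if not value:
        # Covers None and the empty string.
        return False
    return _ASCII_DIGITS.fullmatch(value) is not None

if __name__ == "__main__":
    samples = ["33004", "Muxía", "", None, "33 004", "²", "٣٣"]
    for s in samples:
        # Same logic as is_digit in the answer, for comparison.
        builtin = bool(s) and s.isdigit()
        print(repr(s), "isdigit:", builtin, "ascii only:", is_ascii_digits(s))
```

This function could be wrapped with udf(is_ascii_digits, BooleanType()) in the same way as is_digit. If you would rather avoid a Python udf altogether, Column.rlike with a pattern such as '^[0-9]+$' is another route worth trying.
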
Alex answered May 01 '23 02:05