I am reading a JSON file that has some date columns. The issue is that some of the date columns contain dates written in Arabic/Urdu digits:
٠٤-٢٥-٢٠٢١
I want to convert it to a date with English (ASCII) digits in yyyy-mm-dd format.
How can I achieve this in PySpark?
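You can convert the Arabic digits to English ones by splitting the date and casting each part to decimal. Note that the result is not zero-padded (2021-4-25 rather than 2021-04-25).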
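from pyspark.sql.functions import col, concat_ws, split

df = spark.createDataFrame([('٠٤-٢٥-٢٠٢١',)], ['arabic'])
# Split MM-dd-yyyy on '-', cast each part to decimal, then rejoin as year-month-day
df.withColumn('split', split('arabic', '-')) \
    .withColumn('date', concat_ws('-', col('split')[2].cast('decimal'), col('split')[0].cast('decimal'), col('split')[1].cast('decimal'))) \
    .drop('split').show()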
+----------+---------+
|    arabic|     date|
+----------+---------+
|٠٤-٢٥-٢٠٢١|2021-4-25|
+----------+---------+
Finally, I decided to use pandas_udf and Python's unidecode library:
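from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from pyspark.sql.functions import pandas_udf
from unidecode import unidecode
import pandas as pd

def unidecode_(val):
    # Transliterate non-empty strings to ASCII; empty/None values come back as None
    if val:
        return unidecode(val)

@pandas_udf(StringType())
def a_to_n(col):
    # col is a pandas Series holding one batch of the column
    return pd.Series(col.apply(unidecode_))

df = df_json.withColumn('checkin_date', a_to_n(F.col("checkin_date")))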
This gives me the desired result.
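If you would rather avoid the extra dependency, the digit conversion can also be done with a plain translation table, and the same step can reorder the value into zero-padded yyyy-mm-dd. This is only a minimal sketch in plain Python: the helper name `normalize_date` and its format parameters are my own, and it assumes the input is always month-day-year, as in the sample value from the question.

```python
from datetime import datetime

# Map Arabic-Indic digits (U+0660-U+0669) and Extended Arabic-Indic digits
# (U+06F0-U+06F9, used for Urdu and Persian) to ASCII 0-9.
_DIGIT_TABLE = {}
for base in (0x0660, 0x06F0):
    for i in range(10):
        _DIGIT_TABLE[base + i] = str(i)


def normalize_date(value, in_fmt="%m-%d-%Y", out_fmt="%Y-%m-%d"):
    """Hypothetical helper: translate the digits, then reformat the date.

    Assumes the input is month-day-year; returns None for empty/None input.
    """
    if not value:
        return None
    ascii_value = value.translate(_DIGIT_TABLE)
    return datetime.strptime(ascii_value, in_fmt).strftime(out_fmt)


print(normalize_date("٠٤-٢٥-٢٠٢١"))  # 2021-04-25
```

A function like this could be passed to `col.apply(...)` inside a pandas_udf in the same way as `unidecode_`. Values that do not match the expected format will raise a `ValueError`, so you may want to catch that if the column is not clean.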