
How to parse dates written in Arabic digits (٠٤-٢٥-٢٠٢١) into English dates in PySpark

I am reading a JSON file that has some date columns. The issue is that some of the date columns contain dates written in Arabic/Urdu digits:

٠٤-٢٥-٢٠٢١

I want to convert it to an English date in yyyy-mm-dd format. How can I achieve this in PySpark?

Atif asked Sep 11 '21 20:09



2 Answers

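You can convert Arabic digits to English (ASCII) digits by casting each part of the date to decimal. Note that the cast drops leading zeros, so the result below is 2021-4-25 rather than 2021-04-25.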

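from pyspark.sql.functions import col, concat_ws, split

df = spark.createDataFrame([('٠٤-٢٥-٢٠٢١',)], ['arabic'])

# Split the MM-dd-yyyy string on '-', cast each part to decimal, and reassemble as year-month-day.
df.withColumn('split', split('arabic', '-')) \
    .withColumn('date', concat_ws('-', col('split')[2].cast('decimal'), col('split')[0].cast('decimal'), col('split')[1].cast('decimal'))) \
    .drop('split').show()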

+----------+---------+
|    arabic|     date|
+----------+---------+
|٠٤-٢٥-٢٠٢١ |2021-4-25|
+----------+---------+
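
The date above comes out as 2021-4-25, without zero padding. As a rough sketch of the same "parse the digits as numbers" idea in plain Python (not Spark), the built-in int() also accepts Arabic-Indic digits, and the parts can then be zero-padded explicitly. The function name arabic_mdy_to_iso is made up for illustration, and the input is assumed to be in MM-dd-yyyy order as in the question:

```python
def arabic_mdy_to_iso(value):
    """Turn an 'MM-dd-yyyy' string in any Unicode decimal digits into 'yyyy-MM-dd'."""
    if not value:
        # Pass nulls and empty strings through as None.
        return None
    # int() understands Unicode decimal digits such as U+0660..U+0669.
    month, day, year = (int(part) for part in value.strip().split("-"))
    # Zero-pad each component so the result is a proper yyyy-MM-dd string.
    return f"{year:04d}-{month:02d}-{day:02d}"


# The sample from the question, written with escapes to avoid bidi copy/paste issues.
sample = "\u0660\u0664-\u0662\u0665-\u0662\u0660\u0662\u0661"  # ٠٤-٢٥-٢٠٢١
print(arabic_mdy_to_iso(sample))  # 2021-04-25
```

A helper like this would have to be wrapped in a UDF to run on a Spark column, so it is likely slower than the built-in cast on large data.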
Mohana B C answered Sep 22 '22 09:09



Finally, I decided to use pandas_udf and Python's unidecode library:

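from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
from unidecode import unidecode
import pandas as pd

def unidecode_(val):
    # Transliterate non-ASCII characters (including Arabic digits) to ASCII.
    # Null or empty values fall through and return None.
    if val:
        return unidecode(val)


@pandas_udf(StringType())
def a_to_n(col):
    # Each batch arrives as a pandas Series; apply the transliteration element-wise.
    return pd.Series(col.apply(unidecode_))

# df_json is the DataFrame read from the JSON file.
df = df_json.withColumn('checkin_date', a_to_n(F.col("checkin_date")))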

This gives me the desired result.
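
If pulling in unidecode is not an option, a dependency-free variant might look like the sketch below. It maps only the digit characters with str.translate and then reorders the date with datetime. The names DIGIT_TABLE and to_iso_date are hypothetical, and the MM-dd-yyyy source format is assumed from the sample in the question:

```python
from datetime import datetime

# Map Arabic-Indic digits (U+0660..U+0669) and Extended Arabic-Indic digits
# (U+06F0..U+06F9, used for Urdu and Persian) to ASCII digits.
DIGIT_TABLE = {0x0660 + i: str(i) for i in range(10)}
DIGIT_TABLE.update({0x06F0 + i: str(i) for i in range(10)})


def to_iso_date(value, source_format="%m-%d-%Y"):
    """Return the date as 'yyyy-MM-dd', or None for null or unparseable input."""
    if not value:
        return None
    # Replace only the mapped digits; every other character is left untouched.
    ascii_value = value.translate(DIGIT_TABLE)
    try:
        parsed = datetime.strptime(ascii_value, source_format)
    except ValueError:
        # Not a valid date in the assumed format.
        return None
    return parsed.strftime("%Y-%m-%d")


sample = "\u0660\u0664-\u0662\u0665-\u0662\u0660\u0662\u0661"  # ٠٤-٢٥-٢٠٢١
print(to_iso_date(sample))  # 2021-04-25
```

Inside the pandas_udf, something like col.apply(to_iso_date) could take the place of col.apply(unidecode_). Unlike unidecode, this leaves non-digit text alone and also returns the date in yyyy-MM-dd order.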

Atif answered Sep 18 '22 09:09
