 

Spark DataFrame from pandas Series

I have a pandas Series object:

dates = pd.Series(pd.date_range(start_date, end_date)) \
    .dt.strftime('%y%m%d') \
    .astype(int)

And I would like to create a Spark DataFrame directly from the Series object, without an intermediate pandas DataFrame:

    _schema = StructType([
        StructField("date_id", IntegerType(), True),
    ])

    dates_rdd = sc.parallelize(dates)
    self.date_table = spark.createDataFrame(dates_rdd, _schema)

Error:

raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 160101 in type <class 'numpy.int64'>

If I change the Series object to:

    dates = pd.Series(pd.date_range(start_date, end_date)) \
        .dt.strftime('%y%m%d') \
        .astype(int).values.tolist()

Error becomes:

raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 160101 in type <class 'int'>

How can I properly map the int values contained in the dates list/RDD to native Python integers that Spark DataFrames will accept?

asked Nov 25 '25 by balalaika

1 Answer

This will work:

# int(x) converts each numpy.int64 to a native Python int; the one-element
# tuple makes each record match the single-field schema
dates_rdd = sc.parallelize(dates).map(lambda x: (int(x),))
date_table = spark.createDataFrame(dates_rdd, _schema)

The purpose of the additional map in defining dates_rdd is to make the format of the RDD match the schema: _schema is a StructType with a single IntegerType field, so each record must be a one-element tuple (or Row) containing a native Python int. Bare numpy.int64 values, and even bare Python ints, are rejected by createDataFrame, which is why both of your attempts failed.
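
For completeness, here is a minimal self-contained sketch of the whole flow; the start_date/end_date values are made up for illustration, and the SparkSession/SparkContext setup is assumed rather than taken from the question:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

start_date, end_date = "2016-01-01", "2016-01-10"  # example range

# Series of yymmdd integers; the dtype at this point is numpy.int64
dates = (pd.Series(pd.date_range(start_date, end_date))
         .dt.strftime('%y%m%d')
         .astype(int))

_schema = StructType([
    StructField("date_id", IntegerType(), True),
])

# int(x) yields a native Python int; the one-element tuple matches the schema
dates_rdd = sc.parallelize(dates).map(lambda x: (int(x),))
date_table = spark.createDataFrame(dates_rdd, _schema)
date_table.show(3)

With the example range above, date_table.show(3) should print the first three date_id rows: 160101, 160102, 160103.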

answered Nov 27 '25 by ags29


