 

Spark DataFrame from pandas Series

I have a pandas Series object:

dates = pd.Series(pd.date_range(start_date, end_date)) \
    .dt.strftime('%y%m%d') \
    .astype(int)

And I would like to create a Spark DataFrame directly from the Series object, without an intermediate pandas DataFrame:

    _schema = StructType([
        StructField("date_id", IntegerType(), True),
    ])

    dates_rdd = sc.parallelize(dates)
    self.date_table = spark.createDataFrame(dates_rdd, _schema)

Error:

raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 160101 in type <class 'numpy.int64'>

If I change the Series object to:

    dates = pd.Series(pd.date_range(start_date, end_date)) \
        .dt.strftime('%y%m%d') \
        .astype(int).values.tolist()

Error becomes:

raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object 160101 in type <class 'int'>

How can I properly map the int values contained in the dates list/RDD to native Python integers that Spark DataFrames will accept?

asked Nov 25 '25 by balalaika

1 Answer

This will work:

# int(x) converts each numpy.int64 to a native Python int; the one-element
# tuple makes each record match the single-field schema
dates_rdd = sc.parallelize(dates).map(lambda x: (int(x),))
date_table = spark.createDataFrame(dates_rdd, _schema)

The purpose of the additional map in defining dates_rdd is to make the format of the RDD match the schema: _schema is a StructType with a single IntegerType field, so each record must be a one-element tuple (or Row) containing a native Python int. Bare numpy.int64 values, and even bare Python ints, are rejected by createDataFrame, which is why both of your attempts failed.
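
For completeness, here is a minimal self-contained sketch of the whole flow; the start_date/end_date values are made up for illustration, and the SparkSession/SparkContext setup is assumed rather than taken from the question:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

start_date, end_date = "2016-01-01", "2016-01-10"  # example range

# Series of yymmdd integers; the dtype at this point is numpy.int64
dates = (pd.Series(pd.date_range(start_date, end_date))
         .dt.strftime('%y%m%d')
         .astype(int))

_schema = StructType([
    StructField("date_id", IntegerType(), True),
])

# int(x) yields a native Python int; the one-element tuple matches the schema
dates_rdd = sc.parallelize(dates).map(lambda x: (int(x),))
date_table = spark.createDataFrame(dates_rdd, _schema)
date_table.show(3)

With the example range above, date_table.show(3) should print the first three date_id rows: 160101, 160102, 160103.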

answered Nov 27 '25 by ags29


