pySpark Create DataFrame from RDD with Key/Value

If I have an RDD of key/value pairs (the key being the column index), is it possible to load it into a DataFrame? For example:

(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)

And have the dataframe look like:

1,2,18
1,10,18
2,20,18
asked May 02 '15 by theMadKing


1 Answer

Yes, it's possible (tested with Spark 1.3.1):

>>> rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
>>> sqlContext.createDataFrame(rdd, ["id", "score"])
Out[2]: DataFrame[id: bigint, score: bigint]
answered Apr 10 '23 by Olivier Girardot