If I have an RDD of key/value pairs (the key being the column index), is it possible to load it into a DataFrame? For example:
(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)
And have the dataframe look like:
1,2,18
1,10,18
2,20,18
Method 1: Using the createDataFrame() function. After creating the RDD, convert it to a DataFrame with createDataFrame(), passing in the RDD along with a schema that defines the DataFrame's columns.
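A minimal sketch of this method (assuming a Spark 2.x SparkSession named spark; the column names col_index and value are illustrative, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()

# The sample (column index, value) pairs from the question as an RDD of tuples.
rdd = spark.sparkContext.parallelize([(0, 1), (0, 1), (0, 2),
                                      (1, 2), (1, 10), (1, 20),
                                      (3, 18), (3, 18), (3, 18)])

# Pass the RDD plus a schema (here just a list of column names; types are inferred).
df = spark.createDataFrame(rdd, ["col_index", "value"])
df.printSchema()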
Converting a Spark RDD to a DataFrame can be done with toDF(), with createDataFrame(), or by transforming an RDD[Row] into a DataFrame.
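A sketch of the toDF() route, reusing the rdd from the sketch above (again assuming a Spark 2.x SparkSession is already active, since toDF() on an RDD of tuples only becomes available once one exists; the names are illustrative):

# toDF() converts an RDD of tuples directly, inferring the column types.
df = rdd.toDF(["col_index", "value"])
df.show()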
If you have semi-structured data, you can create a DataFrame from an existing RDD by programmatically specifying the schema.
Convert using the createDataFrame method: the SparkSession object has a utility method for creating a DataFrame, createDataFrame. This method can take an RDD and create a DataFrame from it. createDataFrame is overloaded, so it can be called by passing the RDD alone or together with a schema.
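A sketch of both overloads, with a programmatically built schema for the second call (assuming Spark 2.x; the schema, names, and sample rows are illustrative, not from the original post):

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.appName("rdd-schema").getOrCreate()
sc = spark.sparkContext

# Overload 1: an RDD of Row objects with no schema; Spark infers the column types.
row_rdd = sc.parallelize([Row(col_index=0, value=1), Row(col_index=1, value=2)])
df_inferred = spark.createDataFrame(row_rdd)

# Overload 2: an RDD of plain tuples plus an explicitly defined schema.
schema = StructType([
    StructField("col_index", LongType(), nullable=False),
    StructField("value", LongType(), nullable=True),
])
tuple_rdd = sc.parallelize([(0, 1), (1, 2), (3, 18)])
df_explicit = spark.createDataFrame(tuple_rdd, schema)
df_explicit.printSchema()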
Yes, it's possible (tested with Spark 1.3.1):
>>> rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
>>> sqlContext.createDataFrame(rdd, ["id", "score"])
Out[2]: DataFrame[id: bigint, score: bigint]
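As a quick follow-up (not in the original answer), the resulting DataFrame can be inspected in the same session, for example:

>>> df = sqlContext.createDataFrame(rdd, ["id", "score"])
>>> df.printSchema()
>>> df.show()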