pySpark Create DataFrame from RDD with Key/Value

If I have an RDD of key/value pairs (the key being the column index), is it possible to load it into a DataFrame? For example:

(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)

And have the dataframe look like:

1,2,18
1,10,18
2,20,18
asked May 02 '15 by theMadKing


1 Answer

Yes, it's possible (tested with Spark 1.3.1):

>>> rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
>>> sqlContext.createDataFrame(rdd, ["id", "score"])
Out[2]: DataFrame[id: bigint, score: bigint]
answered Apr 10 '23 by Olivier Girardot