Create DataFrame from list of tuples using pyspark

I am working with data extracted from SFDC using the simple-salesforce package. I am using Python 3 for scripting and Spark 1.5.2.

I created an RDD containing the following data:

[('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')]
[('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')]
[('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
...

This data is in an RDD called v_rdd.

My schema looks like this:

StructType(List(StructField(Id,StringType,true),StructField(PackSize,StringType,true),StructField(Name,StringType,true)))

I am trying to create a DataFrame out of this RDD:

sqlDataFrame = sqlContext.createDataFrame(v_rdd, schema)

I print my DataFrame:

sqlDataFrame.show()

And get the following:

+--------------------+--------------------+--------------------+
|                  Id|            PackSize|                Name|
+--------------------+--------------------+--------------------+
|[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
|[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|
|[Ljava.lang.Objec...|[Ljava.lang.Objec...|[Ljava.lang.Objec...|

I am expecting to see actual data, like this:

+----------------+--------+----+
|              Id|PackSize|Name|
+----------------+--------+----+
|a0w1a0000003xB1A|     1.0|   A|
|a0w1a0000003xAAI|     1.0|   B|
|a0w1a00000xB3AAI|    30.0|   C|

Can you please help me identify what I am doing wrong here?

My Python script is long and I am not sure it would be convenient for people to sift through it, so I posted only the parts I am having issues with.

Thanks a ton in advance!

204

asked Jan 25 '16 20:01

Pit

1 Answer

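Hey, next time could you provide a working example? That would make this easier.

The way your RDD is structured does not fit what createDataFrame expects. Each record is a list of (field name, value) pairs, so Spark takes each whole pair as the column value instead of just the value, which is most likely why you see [Ljava.lang.Objec... in every cell. This is how you create a DataFrame according to the Spark documentation: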

>>> l = [('Alice', 1)]
>>> sqlContext.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]
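
To make the difference in shape concrete, here is a small plain-Python sketch (no Spark needed). The helper name is_flat is just something I made up for illustration; it only checks whether a row holds one plain value per column, which is the shape the documentation example uses.

```python
# Row from the documentation example: one plain value per column
doc_row = ('Alice', 1)

# Row from the question: one (field name, value) pair per column
question_row = [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')]


def is_flat(row):
    """Return True if no element of the row is itself a tuple or list."""
    return not any(isinstance(value, (tuple, list)) for value in row)


print(is_flat(doc_row))
print(is_flat(question_row))
```

The first call reports a flat row and the second does not, so the records need to be flattened before they are handed to createDataFrame.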

For your example, you can get the desired output like this:

from pyspark.sql.types import StructType, StructField, StringType

# Assumes sc (SparkContext) and sqlContext (SQLContext) already exist,
# as they do in the pyspark shell.

# Your data at the moment
data = sc.parallelize([
    [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')],
    [('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')],
    [('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')]
])

# Convert each record to a flat tuple, keeping only the value of each (name, value) pair
data_converted = data.map(lambda x: (x[0][1], x[1][1], x[2][1]))

# Define schema
schema = StructType([
    StructField("Id", StringType(), True),
    StructField("Packsize", StringType(), True),
    StructField("Name", StringType(), True)
])

# Create dataframe
DF = sqlContext.createDataFrame(data_converted, schema)

# Output
DF.show()
+----------------+--------+----+
|              Id|Packsize|Name|
+----------------+--------+----+
|a0w1a0000003xB1A|     1.0|   A|
|a0w1a0000003xAAI|     1.0|   B|
|a0w1a00000xB3AAI|    30.0|   C|
+----------------+--------+----+
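
One caveat with the conversion above: it picks values by position, so it assumes every record lists Id, PackSize and Name in exactly that order. If that is not guaranteed in your data, a slightly more defensive sketch is to look the values up by field name instead. The names FIELDS and pairs_to_row are hypothetical, and returning None for a missing field is my assumption about what you would want. This part is plain Python, so you can try it without Spark.

```python
# Column order expected by the schema (assumed to match the question's fields)
FIELDS = ["Id", "PackSize", "Name"]


def pairs_to_row(record, fields=FIELDS):
    """Turn a list of (field name, value) pairs into a flat tuple.

    Values are looked up by name, so the order of the pairs in the
    record does not matter. Missing fields become None.
    """
    lookup = dict(record)
    return tuple(lookup.get(name) for name in fields)


records = [
    [('Id', 'a0w1a0000003xB1A'), ('PackSize', 1.0), ('Name', 'A')],
    [('Id', 'a0w1a0000003xAAI'), ('PackSize', 1.0), ('Name', 'B')],
    [('Id', 'a0w1a00000xB3AAI'), ('PackSize', 30.0), ('Name', 'C')],
]

rows = [pairs_to_row(record) for record in records]
print(rows)
```

With Spark, the same function could be passed to map in place of the lambda, e.g. data.map(pairs_to_row), and the result handed to createDataFrame together with the schema.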

Hope this helps

116

answered Oct 12 '22 02:10

Dat Tran

Tags:

python-3.x

pyspark

spark-dataframe
