Pyspark error on creating dataframe: 'StructField' object has no attribute 'encode'

Tags:

python

pyspark

I'm facing a little issue when creating a dataframe:

from pyspark.sql import SparkSession, types

spark = SparkSession.builder.appName('test').getOrCreate()

df_test = spark.createDataFrame(
    ['a string', 1],
    schema = [
        types.StructField('col1', types.StringType(), True),
        types.StructField('col2', types.IntegerType(), True)
    ]
)

## AttributeError: 'StructField' object has no attribute 'encode'

I don't see anything wrong with my code (it's so simple I feel really dumb). But I can't get this to work. Can you point me in the right direction?

Barranka asked Apr 23 '19 15:04


People also ask

What is StructType and StructField in spark?

The StructType and StructFields are used to define a schema or its part for the Dataframe. This defines the name, datatype, and nullable flag for each column. StructType object is the collection of StructFields objects. It is a Built-in datatype that contains the list of StructField.

How do I add a schema to a data frame?

We can create a DataFrame programmatically using the following three steps. Create an RDD of Rows from an Original RDD. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1. Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext.

How do I get the schema of a column PySpark?

You can find all column names and data types (DataType) of a PySpark DataFrame by using df.dtypes and df.schema, and you can retrieve the data type of a specific column with df.schema["name"].


1 Answer

You were most of the way there!

When you call createDataFrame with an explicit schema, the schema argument needs to be a StructType; a plain list of StructField objects isn't accepted.

  1. Create an RDD of tuples or lists from the original RDD;
  2. Create the schema represented by a StructType matching the structure of tuples or lists in the RDD created in the step 1.
  3. Apply the schema to the RDD via createDataFrame method provided by SparkSession.

Also, the first argument to createDataFrame is a list of rows, not a list of values for a single row, so a one-dimensional list causes errors. Wrapping each row in a dict that explicitly maps column names to values is one solution; rows can also be given as tuples or Row objects.

The result should look something like:

df_test = spark.createDataFrame(
    [{'col1': 'a string', 'col2': 1}],
    schema = types.StructType([
        types.StructField('col1', types.StringType(), True),
        types.StructField('col2', types.IntegerType(), True)
    ])
)
Jesse Amano answered Oct 02 '22 21:10