I'm facing a little issue when creating a dataframe:
from pyspark.sql import SparkSession, types
spark = SparkSession.builder.appName('test').getOrCreate()
df_test = spark.createDataFrame(
    ['a string', 1],
    schema=[
        types.StructField('col1', types.StringType(), True),
        types.StructField('col2', types.IntegerType(), True)
    ]
)
## AttributeError: 'StructField' object has no attribute 'encode'
I don't see anything wrong with my code (it's so simple I feel really dumb). But I can't get this to work. Can you point me in the right direction?
You were most of the way there!

When you call createDataFrame specifying a schema, the schema needs to be a StructType. An ordinary list of StructField objects isn't enough.
- Create an RDD of tuples or lists from the original RDD;
- Create the schema represented by a StructType matching the structure of the tuples or lists in the RDD created in step 1;
- Apply the schema to the RDD via the createDataFrame method provided by SparkSession.
Also, the first argument to createDataFrame is a list of rows, not a list of values for one row, so a single flat list will cause errors. Wrapping each row in a dict that explicitly maps column names to values is one solution, but there are others.
The result should look something like:
df_test = spark.createDataFrame(
    [{'col1': 'a string', 'col2': 1}],
    schema=types.StructType([
        types.StructField('col1', types.StringType(), True),
        types.StructField('col2', types.IntegerType(), True)
    ])
)