I'm trying to import data in parquet format with a custom schema, but it returns: TypeError: option() missing 1 required positional argument: 'value'
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

ProductCustomSchema = StructType([
    StructField("id_sku", IntegerType(), True),
    StructField("flag_piece", StringType(), True),
    StructField("flag_weight", StringType(), True),
    StructField("ds_sku", StringType(), True),
    StructField("qty_pack", FloatType(), True)])

def read_parquet_(path, schema):
    return spark.read.format("parquet")\
        .option(schema)\
        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")\
        .load(path)

product_nomenclature = 'C:/Users/alexa/Downloads/product_nomenc'
product_nom = read_parquet_(product_nomenclature, ProductCustomSchema)
PySpark SQL provides methods to read a Parquet file into a DataFrame and to write a DataFrame out to Parquet files: the parquet() functions on DataFrameReader and DataFrameWriter are used to read and to write/create Parquet files, respectively.
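For illustration, a minimal sketch of both calls, assuming an existing SparkSession named spark; the paths are placeholders:

# Read a Parquet file into a DataFrame (DataFrameReader.parquet)
df = spark.read.parquet("/tmp/input_data")

# Write a DataFrame out as a Parquet file (DataFrameWriter.parquet)
df.write.parquet("/tmp/output_data")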
It's all immutable. The problem when you need to edit the data is that these data structures are immutable. You can add partitions to Parquet files, but you can't edit the data in place.
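In practice that means updates arrive as new files or partitions rather than in-place edits. A hedged sketch, assuming a DataFrame named new_rows (hypothetical) with a dt partition column:

# Append adds new Parquet files/partitions; existing files are untouched
new_rows.write.mode("append").partitionBy("dt").parquet("/tmp/output_data")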
Self-describing: in addition to data, a Parquet file contains metadata, including its schema and structure. Each file stores both the data and the standard used for accessing each record, making it easier to decouple services that write, store, and read Parquet files.
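Because the schema travels with the file, a reader can recover it without any external contract. A small sketch, again assuming a live spark session and a placeholder path:

# Print the schema embedded in the Parquet footer
spark.read.parquet("/tmp/output_data").printSchema()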
As mentioned in the comments, you should change .option(schema) to .schema(schema). option() requires you to specify a key (the name of the option you're setting) and a value (the value you want to assign to that option). You are getting the TypeError because you were passing a variable called schema to option() without specifying which option you were actually trying to set with that variable.
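Applied to the function from the question, the fix looks like this (a sketch, assuming the same spark session):

def read_parquet_(path, schema):
    return spark.read.format("parquet")\
        .schema(schema)\
        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")\
        .load(path)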
The QueryExecutionException you posted in the comments is being raised because the schema you've defined in your schema variable does not match the data in your DataFrame. If you're going to specify a custom schema, you must make sure that schema matches the data you are reading. In your example the column id_sku is stored as a BinaryType, but in your schema you're defining the column as an IntegerType. pyspark will not try to reconcile differences between the schema you provide and the actual types in the data, and an exception will be thrown.
To fix your error, make sure the schema you're defining correctly represents your data as it is stored in the parquet file (i.e. change the datatype of id_sku in your schema to BinaryType). The benefit of doing this is that you get a slight performance gain by not having to infer the file schema each time the parquet file is read.
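A sketch of the corrected schema, with id_sku matching the on-disk type described above:

from pyspark.sql.types import StructType, StructField, BinaryType, StringType, FloatType

# id_sku is declared as BinaryType to match how it is stored in the file
ProductCustomSchema = StructType([
    StructField("id_sku", BinaryType(), True),
    StructField("flag_piece", StringType(), True),
    StructField("flag_weight", StringType(), True),
    StructField("ds_sku", StringType(), True),
    StructField("qty_pack", FloatType(), True)])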