Spark Option: inferSchema vs header = true

Tags: csv, header, schema, apache-spark, apache-spark-sql

In reference to "pyspark: Difference performance for spark.read.format("csv") vs spark.read.csv":

I thought I needed .option("inferSchema", "true") and .option("header", "true") to print my headers, but apparently I could still print my csv with headers.

What is the difference between header and schema? I don't really understand the meaning of "inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default".

478

asked Jul 08 '19 01:07

user1342124

1 Answer

The header and schema are separate things.

Header:

If the csv file has a header (column names in the first row), set header=true. Spark will then use the first row of the csv file as the dataframe's column names. Setting header=false (the default) results in a dataframe with default column names: _c0, _c1, _c2, etc.

Setting this to true or false should be based on your input file.

Schema:

The schema referred to here is the set of column types. A column can be of type String, Double, Long, etc. With inferSchema=false (the default), all columns in the dataframe are strings (StringType). Depending on what you want to do, strings may not work. For example, if you want to add numbers from different columns, those columns should be of a numeric type rather than strings.

With inferSchema=true, Spark goes through the csv file and infers the type of each column. This requires an extra pass over the file, so reading with inferSchema=true is slower. In return, the dataframe will most likely have the correct schema for its input.

As an alternative to inferSchema, you can provide the schema yourself when reading the csv. This has the advantage of being faster than inferring the schema while still giving a dataframe with the correct column types. In addition, for csv files without a header row, the supplied schema also provides the column names. To provide a schema, see e.g.: Provide schema while reading csv file as a dataframe

104

answered Nov 27 '22 02:11

Shaido
