
Spark-csv data source: infer data types

I'm experimenting with the spark-csv package (https://github.com/databricks/spark-csv) for reading CSV files into Spark DataFrames.

Everything works, but all columns are assumed to be of StringType.

As shown in the Spark SQL documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html), the schema, including data types, can be inferred automatically for built-in sources such as JSON.

Can the types of columns in a CSV file be inferred automatically?
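
For context, here is a minimal sketch of the setup the question describes, using spark-csv on Spark 1.x; 'sc' is assumed to be an existing SparkContext (as in spark-shell) and the file path is a placeholder:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Default spark-csv read: every column comes back as StringType
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")   // treat the first line as column names
      .load("path/to/file.csv")   // placeholder path

    df.printSchema()              // all fields are reported as string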

asked Apr 19 '15 by Oleg Shirokikh

People also ask

Can CSV have data types?

CSV files themselves store every value as plain text, so they carry no type information. Some tools let you define data types for a CSV data source by setting special prefixes before column names.

What option can be used to automatically infer the data type of a column?

When reading files, the 'inferSchema' option does this. Spark SQL can also convert an RDD of Row objects to a DataFrame, inferring the data types.

How does Spark infer the schema?

Inferring the schema using reflection: the Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns.
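
As a short illustration of reflection-based inference (the Person case class and its data are invented for the example; 'sc' is an assumed existing SparkContext):

    import org.apache.spark.sql.SQLContext

    // The case class defines the schema: field names become column
    // names and field types become column types
    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._   // enables .toDF() on RDDs of case classes

    val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 31)))
    val df = people.toDF()

    df.printSchema()
    // root
    //  |-- name: string (nullable = true)
    //  |-- age: integer (nullable = false)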


1 Answer

Starting with Spark 2, you can use the built-in CSV reader and its 'inferSchema' option like this: getSparkSession().read().option("inferSchema", "true").csv("YOUR_CSV_PATH")
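
For completeness, here is the same idea as a minimal Scala sketch against Spark 2's built-in CSV source; the app name, master setting, and 'YOUR_CSV_PATH' are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("csv-infer-schema")   // placeholder app name
      .master("local[*]")            // run locally for the example
      .getOrCreate()

    // Spark 2's built-in CSV source replaces the spark-csv package;
    // inferSchema makes Spark scan the data and pick column types
    val df = spark.read
      .option("header", "true")      // first line holds column names
      .option("inferSchema", "true") // infer IntegerType, DoubleType, etc.
      .csv("YOUR_CSV_PATH")          // placeholder path from the answer

    df.printSchema()

Note that inferSchema requires an extra pass over the data, so for large files it can be cheaper to supply an explicit schema with .schema(...) instead.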

answered Sep 19 '22 by Olga