I'm experimenting with the spark-csv package (https://github.com/databricks/spark-csv) for reading CSV files into Spark DataFrames. Everything works, but all columns are assumed to be of StringType.
As shown in the Spark SQL documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html), for built-in sources such as JSON the schema, including data types, can be inferred automatically. Can the types of the columns in a CSV file be inferred automatically as well?
Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the data types. In addition, the Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame: the case class defines the schema of the table, and the names of the case class's constructor arguments are read via reflection and become the column names.
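As a minimal sketch of that reflection-based approach (assuming Spark 2.x and a hypothetical `Person` case class), Spark derives both column names and types from the case class:

```scala
import org.apache.spark.sql.SparkSession

// The case class defines the schema: name -> string, age -> integer.
case class Person(name: String, age: Int)

object ReflectionSchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("reflection-schema")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables .toDF() on RDDs of case classes

    // Column names and types are read from Person via reflection.
    val peopleDF = spark.sparkContext
      .parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))
      .toDF()

    peopleDF.printSchema() // root |-- name: string |-- age: integer
    spark.stop()
  }
}
```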
Starting from Spark 2, you can use the 'inferSchema' option, for example: getSparkSession().read().option("inferSchema", "true").csv("YOUR_CSV_PATH")
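A minimal sketch of that in Scala, assuming Spark 2.x and a hypothetical path "data/people.csv":

```scala
import org.apache.spark.sql.SparkSession

object CsvInferSchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-infer-schema")
      .master("local[*]")
      .getOrCreate()

    // With inferSchema, Spark scans the data and guesses column types
    // (int, double, timestamp, ...) instead of defaulting to StringType.
    val df = spark.read
      .option("header", "true")      // treat the first line as column names
      .option("inferSchema", "true") // infer column types from the data
      .csv("data/people.csv")        // hypothetical path

    df.printSchema()
    spark.stop()
  }
}
```

On Spark 1.x with the spark-csv package, the equivalent is to read via `format("com.databricks.spark.csv")` with the same `inferSchema` and `header` options and then call `load(path)`. Note that inferring the schema requires an extra pass over the data.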