I have a csv data file containing commas within a column value. For example,
value_1,value_2,value_3
AAA_A,BBB,B,CCC_C
Here, the values are "AAA_A","BBB,B","CCC_C". But, when trying to split the line by comma, it is giving me 4 values, i.e. "AAA_A","BBB","B","CCC_C".
How to get the right values after splitting the line by commas in PySpark?
Use spark-csv class from databriks.
Delimiters between quotes, by default ("), are ignored.
Example:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
For more info, review https://github.com/databricks/spark-csv
If your quote is (') instance of ("), you could configure with this class.
EDIT:
For python API:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('cars.csv')
Best regards.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With