Dealing with commas within a field in a csv file using pyspark

I have a csv data file containing commas within a column value. For example,

value_1,value_2,value_3  
AAA_A,BBB,B,CCC_C  

Here, the intended values are "AAA_A", "BBB,B", and "CCC_C". But when I split the line by comma, I get 4 values instead: "AAA_A", "BBB", "B", "CCC_C".

How can I get the correct values when splitting the line by commas in PySpark?
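For illustration, assuming the file actually wraps the embedded comma in double quotes (as standard CSV does), the Python standard library's csv module shows the difference between a naive split and a quote-aware parser:

```python
import csv
import io

# Hypothetical raw line based on the question's values, with the
# embedded comma protected by double quotes.
line = 'AAA_A,"BBB,B",CCC_C'

# Naive split on commas breaks the quoted field into two pieces.
naive = line.split(",")
print(naive)   # 4 pieces, quotes left in place

# A quote-aware CSV parser keeps "BBB,B" as one field.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['AAA_A', 'BBB,B', 'CCC_C']
```

This is exactly why splitting on "," by hand fails: the comma inside quotes is data, not a delimiter.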

sammy asked Feb 23 '16



1 Answer

Use the spark-csv package from Databricks.

Delimiters that appear between quote characters (double quotes, ", by default) are ignored.

Example:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

For more info, review https://github.com/databricks/spark-csv

If your fields are quoted with (') instead of ("), you can configure that with this package's quote option.

EDIT:

For python API:

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load('cars.csv')

Best regards.

DanielVL answered Sep 30 '22
