Dealing with commas within a field in a csv file using pyspark

I have a csv data file containing commas within a column value. For example,

value_1,value_2,value_3  
AAA_A,BBB,B,CCC_C  

Here, the intended values are "AAA_A", "BBB,B", and "CCC_C". But when I split the line by comma, I get 4 values instead: "AAA_A", "BBB", "B", "CCC_C".

How can I get the correct values when splitting the line by commas in PySpark?
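For illustration, assuming the file actually wraps the embedded comma in double quotes (as standard CSV does), the Python standard library's csv module shows the difference between a naive split and a quote-aware parser:

```python
import csv
import io

# Hypothetical raw line based on the question's values, with the
# embedded comma protected by double quotes.
line = 'AAA_A,"BBB,B",CCC_C'

# Naive split on commas breaks the quoted field into two pieces.
naive = line.split(",")
print(naive)   # 4 pieces, quotes left in place

# A quote-aware CSV parser keeps "BBB,B" as one field.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)  # ['AAA_A', 'BBB,B', 'CCC_C']
```

This is exactly why splitting on "," by hand fails: the comma inside quotes is data, not a delimiter.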

sammy asked Feb 23 '16



1 Answer

Use the spark-csv package from Databricks.

Delimiters that appear between quote characters (double quotes, ", by default) are ignored.

Example:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

For more info, review https://github.com/databricks/spark-csv

If your fields are quoted with (') instead of ("), you can configure that with this package's quote option.

EDIT:

For python API:

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load('cars.csv')

Best regards.

DanielVL answered Sep 30 '22
