I am reading a csv file in Pyspark as follows:
df_raw=spark.read.option("header","true").csv(csv_path)
However, the data file has quoted fields with embedded commas in them which should not be treated as column separators. How can I handle this in Pyspark? I know pandas can handle this, but can Spark? The version I am using is Spark 2.0.0.
Here is an example which works in Pandas but fails using Spark:
In [1]: import pandas as pd

In [2]: pdf = pd.read_csv('malformed_data.csv')

In [3]: sdf = spark.read.format("org.apache.spark.csv").csv('malformed_data.csv', header=True)

In [4]: pdf[['col12','col13','col14']]
Out[4]:
                  col12                                             col13  \
0  32 XIY "W" JK, RE LK  SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE
1                   NaN                     OUTKAST#THROOTS~WUTANG#RUNDMC

   col14
0   23.0
1    0.0

In [5]: sdf.select("col12","col13",'col14').show()
+------------------+--------------------+--------------------+
|             col12|               col13|               col14|
+------------------+--------------------+--------------------+
|  "32 XIY ""W"" JK|              RE LK"|SOMETHINGLIKEAPHE...|
|              null|OUTKAST#THROOTS~W...|                 0.0|
+------------------+--------------------+--------------------+
The contents of the file :
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17,col18,col19
80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,cyclingstats,2012-25-19,432,2023-05-17,CODERED
61670000229561918,137.12,U,8234971771,,,woodstock,,,T4,,,OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,2019-11-23,CODEBLUE
Since CSV files use the comma character "," to separate columns, values that contain commas must be handled as a special case. These fields are wrapped within double quotation marks. The first double quote signifies the beginning of the column data, and the last double quote marks the end.
Quotation marks act as text qualifiers in CSV files: they wrap together text that should be kept as a single value, as opposed to distinct values that should be split apart.
I noticed that your problematic line has escaping that uses double quotes themselves:
"32 XIY ""W"" JK, RE LK"
which should be interpreted simply as
32 XIY "W" JK, RE LK
As described in RFC 4180, page 2: "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."
That's what Excel does, for example, by default.
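For comparison, Python's standard csv module follows the same convention. Here is a minimal sketch (the sample line is taken from your file) showing how it resolves the doubled quotes and keeps the embedded comma inside the field:

import csv
import io

# The problematic field from the question: an embedded comma and
# doubled ("") quotes inside a quoted value.
line = '"32 XIY ""W"" JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE'
print(next(csv.reader(io.StringIO(line))))
# ['32 XIY "W" JK, RE LK', 'SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE']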
However, in Spark (as of Spark 2.1), escaping is done by default in a non-RFC way, using the backslash (\). To fix this you have to explicitly tell Spark to use a double quote as the escape character:
.option("quote", "\"") .option("escape", "\"")
This may explain why a comma character wasn't interpreted correctly when it appeared inside a quoted column.
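Put together, a minimal sketch of the full read call might look like this (assuming the same 'malformed_data.csv' file from the question):

df_raw = (spark.read
          .option("header", "true")
          .option("quote", "\"")    # the character used to quote field values
          .option("escape", "\"")   # embedded quotes are escaped by doubling them
          .csv("malformed_data.csv"))

df_raw.select("col12", "col13", "col14").show(truncate=False)

With these options the embedded comma in col12 stays inside the quoted value instead of splitting it across columns.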
Options for the Spark csv format are not documented well on the Apache Spark site, but here's some older documentation which I still find useful quite often:
https://github.com/databricks/spark-csv
Update Aug 2018: Spark 3.0 might change this behavior to be RFC-compliant. See SPARK-22236 for details.
For anyone doing this in Scala: Tagar's answer nearly worked for me (thank you!); all I had to do was escape the double quote when setting my option param:
.option("quote", "\"") .option("escape", "\"")
I'm using Spark 2.3, so I can confirm Tagar's solution still seems to work the same under the new release.