
How to ignore double quotes when reading CSV file in Spark?

I have a CSV file like:

col1,col2,col3,col4
"A,B","C", D"

I want to read it into a Spark data frame so that the value of every field is exactly as written in the CSV file; that is, I would like to treat the " character as a regular character and copy it like any other character.

Expected output:

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|  "A|  B"| "C"|  D"|
+----+----+----+----+

The output I get:

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A,B|   C|  D"|null|
+----+----+----+----+

In PySpark, I am reading it like this:

dfr = spark.read.format("csv").option("header", "true").option("inferSchema", "true")

I know that if I add an option like this:

dfr.option("quote", "\u0000")

I get the expected result for the example above, since the quote role is transferred from '"' to '\u0000'. But if my CSV file actually contains a '\u0000' character, I would again get a wrong result.

Therefore, my question is: how do I disable the quote option entirely, so that no character acts as a quote?

My CSV file can contain any character, and I want all characters (except commas) to simply be copied into their respective data frame cells. I wonder if there is a way to accomplish this using the escape option.
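For what it's worth, the literal parsing behavior I am after is what Python's built-in csv module calls QUOTE_NONE. This is just an illustration of the desired output on the sample data, not Spark code:

```python
import csv
import io

# The same sample file as above, as an in-memory string.
data = 'col1,col2,col3,col4\n"A,B","C", D"\n'

# quoting=csv.QUOTE_NONE makes the reader treat '"' as an ordinary
# character: fields are split on commas only, quotes are copied as-is.
rows = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONE))
print(rows[1])  # ['"A', 'B"', '"C"', ' D"']
```

Note that the leading space in ' D"' is also preserved, since nothing strips whitespace in this mode.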

asked Dec 23 '22 by someguy
1 Answer

From the documentation for pyspark.sql.DataFrameReader.csv (emphasis mine):

quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.

dfr = spark.read.csv(
    path="path/to/some/file.csv",
    header="true",
    inferSchema="true",
    quote=""
)
dfr.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#|  "A|  B"| "C"|  D"|
#+----+----+----+----+
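The same setting also works with the builder-style reader from the question. This is a sketch assuming a live SparkSession and the same hypothetical file path:

```python
# Passing an empty string to the "quote" option disables quoting,
# so '"' is read as a regular character in every field.
dfr = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("quote", "") \
    .load("path/to/some/file.csv")
```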
answered Jan 03 '23 by pault