I have a CSV file like:
col1,col2,col3,col4
"A,B","C", D"
I want to read it as a data frame in Spark, where the value of every field is exactly as written in the CSV (I would like to treat the '"' character as a regular character and copy it like any other character).
Expected output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| "A| B"| "C"| D"|
+----+----+----+----+
The output I get:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A,B| C| D"|null|
+----+----+----+----+
In PySpark I am reading it like this:
dfr = spark.read.format("csv").option("header", "true").option("inferSchema", "true")
I know that if I add an option like this:
dfr.option("quote", "\u0000")
I get the expected result in the example above, as the function of the '"' character is now performed by '\u0000'. But if my CSV file contains a '\u0000' character, I would get the wrong result there as well.
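To be concrete, the full reader chain for this workaround would look something like this (the path below is just a placeholder):
dfr = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("quote", "\u0000")  # '\u0000' now plays the role of '"'
    .load("path/to/some/file.csv")
)
dfr.show()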
Therefore, my question is: how do I disable the quote option, so that no character acts as a quote?
My CSV file can contain any character, and I want all characters (except commas) to simply be copied into their respective data frame cells. I wonder if there is a way to accomplish this using the escape option.
From the documentation for pyspark.sql.DataFrameReader.csv (emphasis mine):
quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". **If you would like to turn off quotations, you need to set an empty string.**
dfr = spark.read.csv(
    path="path/to/some/file.csv",
    header="true",
    inferSchema="true",
    quote=""  # an empty string disables quote handling
)
dfr.show()
#+----+----+----+----+
#|col1|col2|col3|col4|
#+----+----+----+----+
#| "A| B"| "C"| D"|
#+----+----+----+----+
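If you prefer the option-chaining style from the question, the same setting should work there too (a sketch, reusing the placeholder path from above):
dfr = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("quote", "")  # an empty string turns off quote handling entirely
    .load("path/to/some/file.csv")
)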