I have a CSV that looks like this:
+-----------------+-----------------+-----------------+
| Column One | Column Two | Column Three |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
| This is a value | This is a value | This is a value |
+-----------------+-----------------+-----------------+
In plain text, it actually looks like this:
Column One,Column Two,Column Three
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value
This is a value,This is a value,This is a value
My spark.read
method looks like this:
val df = spark.read
.format("csv")
.schema(schema)
.option("quote", "\"")
.option("escape", "\"")
.option("header", "true")
.option("multiLine", "true")
.option("mode", "DROPMALFORMED")
.load(inputFilePath)
When multiLine
is set to true
, the df
loads as empty. It loads fine when multiLine
is set to false
, but I need multiLine
set to true
.
If I change the name of Column Three
to ColumnThree
, and also update that in the schema
object, then it works fine. It seems like multiLine
is being applied to the header row! I was hoping that wouldn't be the case when header
is also set to true
.
Any ideas how to get around this? Should I be using the univocity
parser instead of the default commons
?
UPDATE:
I don't know why that mocked data was working fine. Here's a closer representation of the data:
CSV (Just 1 header and 1 line of data...):
Digital ISBN,Print ISBN,Title,Price,File Name,Description,Book Cover File Name
97803453308,test,This is English,29.99,qwe_1.txt,test,test
Schema & the spark.read
method:
val df = spark.read
.format("csv")
.schema(StructType(Array(
StructField("Digital ISBN", StringType, true),
StructField("Print ISBN", StringType, true),
StructField("Title", StringType, true),
StructField("File Name", StringType, true),
StructField("Price", StringType, true),
StructField("Description", StringType, true),
StructField("Book Cover File Name", StringType, true)
)))
.option("quote", "\"")
.option("escape", "\"")
.option("header", "true")
.option("multiLine", "true")
.option("mode", "DROPMALFORMED")
.load(inputFilePath)
df.show()
result in spark-shell
:
+------------+----------+-----+---------+-----+-----------+--------------------+
|Digital ISBN|Print ISBN|Title|File Name|Price|Description|Book Cover File Name|
+------------+----------+-----+---------+-----+-----------+--------------------+
+------------+----------+-----+---------+-----+-----------+--------------------+
UDPATE 2:
I think I found "what's different". When I copy the data in the CSV and save it to another CSV, it works fine. But that original CSV (which was saved by Excel), fails... The CSV saved by Excel is 1290 bytes, while the CSV I created myself (which works fine) is 1292 bytes....
UPDATE 3:
I opened the two files mentioned in Update2 in vim
and noticed that the CSV saved by Excel had ^M
instead of new lines. All of my testing prior to this was flawed because it was always comparing a CSV originally saved by Excel vs a CSV created from Sublime... Sublime wasn't showing the difference. I'm sure there's a setting or package I can install to see that, because I use Sublime as my go-to one-off file editor...
Not sure if I should close this question since the title is misleading. Then again, there's gotta be some value to someone out there lol...
I was facing the same issue with the option of multiLine being applied to the header. I solved it by adding the additional option for ignoring trailing white space.
.option("header", true)
.option("multiLine", true)
.option("ignoreTrailingWhiteSpace", true)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With