
Read CSV with linebreaks in pyspark

I want to read with pyspark a "legal" CSV (it follows RFC 4180) that has line breaks (CRLF) inside some of the rows. The following screenshot shows how it looks when opened in Notepad++:

[Image: the CSV file opened in Notepad++]

I try to read it with sqlCtx.read.load using format='com.databricks.spark.csv', and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark version 2.1.0.2.

Is there any option or alternative way of reading the CSV that lets me read these two lines as a single row?

mjimcua asked Sep 14 '17 12:09




2 Answers

You can use "csv" as the format instead of the Databricks CSV package; the latter now redirects to the default Spark reader. But that's only a hint :)

Spark 2.2 added a new option, wholeFile. If you write this:

spark.read.option("wholeFile", "true").csv("file.csv")

it will read the whole file and handle multiline CSV records.

There is no such option in Spark 2.1. You can read the file using sparkContext.wholeTextFiles and parse it yourself, or just use a newer version.
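A minimal sketch of the wholeTextFiles route for Spark 2.1, assuming Python 3 and files small enough to fit in one executor's memory. The helper names parse_csv_text and read_csv_whole_files are hypothetical; only wholeTextFiles and flatMap are Spark calls, and the parsing is delegated to Python's csv module:

```python
import csv
import io


def parse_csv_text(text):
    """Parse one whole CSV document; quoted fields may contain line breaks."""
    # newline="" keeps StringIO from translating CRLF, so csv.reader sees
    # the raw line endings and keeps them inside quoted fields.
    return list(csv.reader(io.StringIO(text, newline="")))


def read_csv_whole_files(sc, path):
    """Hypothetical helper: return an RDD of rows (lists of strings)."""
    # wholeTextFiles yields one (file_path, file_content) pair per file,
    # so a record is never cut in two at a line break.
    return sc.wholeTextFiles(path).flatMap(lambda pair: parse_csv_text(pair[1]))
```

Called as read_csv_whole_files(spark.sparkContext, "file.csv"), this returns every record, including the header row if the file has one, so the header would still need to be filtered out before building a DataFrame.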

T. Gawęda answered Oct 14 '22 08:10


wholeFile does not exist (anymore?) in the Spark API documentation: https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html

This solution will work:

spark.read.option("multiLine", "true").csv("file.csv")

From the api documentation:

multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false
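As a slightly fuller sketch, assuming Spark 2.2 or later and a file whose first record is a header (the question does not say whether it is), the option can be wrapped like this. The helper name read_multiline_csv and the demo function are illustrative only:

```python
def read_multiline_csv(spark, path):
    """Hypothetical helper: read a CSV whose quoted fields contain line breaks."""
    return (
        spark.read
        .option("header", "true")     # assumption: first record holds column names
        .option("multiLine", "true")  # a quoted field may span several lines
        .csv(path)
    )


def demo():
    # Imported here so the helper above can be defined without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multiline-csv-demo").getOrCreate()
    df = read_multiline_csv(spark, "file.csv")
    df.show(truncate=False)
    spark.stop()
```

If the file escapes quotes the RFC 4180 way (by doubling them), adding .option("escape", '"') may also be needed, since Spark's default escape character is a backslash.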

Jurrit answered Oct 14 '22 08:10