Read CSV with linebreaks in pyspark

Tags: python-3.x, csv, apache-spark, pyspark

I want to use PySpark to read a "legal" CSV file (it follows RFC 4180) that has line breaks (CRLF) inside some of the rows. This screenshot shows how it looks when opened in Notepad++:

[Image: https://i.stack.imgur.com/FfA5F.png]

I tried reading it with sqlCtx.read.load using format='com.databricks.spark.csv', and the resulting dataset shows two rows instead of one in these specific cases. I am using Spark version 2.1.0.2.

Is there an option, or an alternative way of reading the CSV, that lets me read these two lines as a single row?

834

asked Sep 14 '17 12:09

mjimcua

2 Answers

You can use "csv" as the format instead of the Databricks CSV package; the latter now redirects to the default Spark reader. But that's only a hint :)

Spark 2.2 added a new option, wholeFile. If you write this:

spark.read.option("wholeFile", "true").csv("file.csv")

it will read the whole file and handle multiline CSV records.

There is no such option in Spark 2.1. You can read the file using sparkContext.wholeTextFiles, or just use a newer version.
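For Spark 2.1, the wholeTextFiles route could look roughly like the sketch below. The function names parse_csv_text and read_multiline_csv are made up for illustration, and sc is assumed to be an existing SparkContext. Python's csv module respects quoting, so a CRLF inside a quoted field stays part of that field instead of starting a new row.

```python
import csv
import io


def parse_csv_text(text):
    """Parse the full text of one CSV file into a list of rows.

    csv.reader honours quoting, so a line break inside a quoted field
    is kept inside that field and does not start a new record.
    """
    # newline="" leaves embedded line breaks exactly as they are in the data
    return list(csv.reader(io.StringIO(text, newline="")))


def read_multiline_csv(sc, path):
    """Return an RDD of rows (lists of strings).

    `sc` is assumed to be an existing SparkContext (hypothetical wiring).
    """
    # wholeTextFiles yields one (file_path, file_content) pair per file,
    # so each file is parsed as a unit rather than split on line breaks.
    return sc.wholeTextFiles(path).flatMap(lambda kv: parse_csv_text(kv[1]))
```

Keep in mind that wholeTextFiles loads each file completely into memory in a single task, so this only suits files that fit comfortably on one executor.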

122

answered Oct 14 '22 08:10

T. Gawęda

wholeFile does not exist (anymore?) in the Spark API documentation: https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html

This solution will work:

spark.read.option("multiLine", "true").csv("file.csv")

From the API documentation:

multiLine – parse records, which may span multiple lines. If None is set, it uses the default value, false.
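Since the question mentions RFC 4180, one related detail may be worth a sketch: RFC 4180 escapes a quote inside a quoted field by doubling it, whereas Spark's CSV reader uses a backslash as its default escape character. The helper below is hypothetical (the name read_rfc4180_csv and the header parameter are my own assumptions, and spark is assumed to be an existing SparkSession); it simply combines multiLine with a matching escape setting.

```python
def read_rfc4180_csv(spark, path, header=True):
    """Read a CSV whose quoted fields may contain line breaks.

    `spark` is assumed to be an existing SparkSession (Spark 2.2+).
    `header` is an assumption about the file; pass False if the
    first line is data rather than column names.
    """
    return (
        spark.read
        .option("multiLine", "true")   # allow records to span several lines
        .option("quote", '"')          # fields are wrapped in double quotes
        .option("escape", '"')         # RFC 4180 doubles quotes instead of using a backslash
        .option("header", str(header).lower())
        .csv(path)
    )
```

Usage would be something like df = read_rfc4180_csv(spark, "file.csv"), followed by df.count() to check that the broken rows are now read as one.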

answered Oct 14 '22 08:10

Jurrit
