I have a dataframe which has some multi-line observations:
+--------------------+----------------+
|                col1|            col2|
+--------------------+----------------+
|something1          |somethingelse1  |
|something2          |somethingelse2  |
|something3          |somethingelse3  |
|something4          |somethingelse4  |
|multiline
row                 |      somethings|
|something           |somethingall    |
+--------------------+----------------+
What I want is to save this dataframe in CSV (or txt) format. I am using the following:
df
.write
.format("csv")
.save("s3://../adf/")
But when I check the file, the multi-line observations are split across multiple lines. What I want is for the rows that contain 'multiline' observations to stay on one and the same row in the txt/csv file. I also tried saving it as a text file:
df
.as[(String,String)]
.rdd
.saveAsTextFile("s3://../adf")
but the same output was observed.
I can imagine that one way is to replace \n with something else before saving, and then apply the reverse substitution when loading the data back. But is there a way to save it in the desired way without doing any kind of transformation to the data?
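For illustration, a minimal sketch of that replace-and-reverse workaround, assuming the multi-line values live in col1 and that the placeholder <NL> never occurs in the real data:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Replace embedded newlines with a placeholder token before writing.
val escaped = df.withColumn("col1", regexp_replace(col("col1"), "\n", "<NL>"))
escaped.write.format("csv").save("s3://../adf/")

// When loading back, reverse the substitution to restore the original values.
val restored = spark.read.csv("s3://../adf/")
  .withColumn("_c0", regexp_replace(col("_c0"), "<NL>", "\n"))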
Assuming the multi-line data is properly quoted, you can parse multi-line CSV data using the univocity parser and the multiLine setting:
sparkSession.read
  .option("parserLib", "univocity")
  .option("multiLine", "true")
  .csv(file)
Note that this requires reading the entire file onto a single executor, and it may not work if your data is too large. The standard text-file reading path splits the file by lines before doing any other parsing, which prevents you from working with data records containing newlines unless there is a different record delimiter you can use. If there is not, you may need to implement a custom TextInputFormat to handle multi-line records.
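For completeness, a hedged round-trip sketch (the path is a placeholder, as in the question): Spark's CSV writer quotes fields as needed, and the quoteAll option forces quoting of every field, so embedded newlines end up inside quoted values that the multiLine reader can reassemble into single records.

// Write: force quoting so embedded newlines stay inside quoted fields.
df.write
  .option("quoteAll", "true")
  .format("csv")
  .save("s3://../adf/")

// Read back: multiLine lets the parser consume quoted newlines within a record.
val roundTrip = sparkSession.read
  .option("multiLine", "true")
  .csv("s3://../adf/")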