How to handle multi-line rows in Spark?

I have a dataframe with some multi-line observations:

+-------------+---------------+
|         col1|           col2|
+-------------+---------------+
|   something1| somethingelse1|
|   something2| somethingelse2|
|   something3| somethingelse3|
|   something4| somethingelse4|
|multiline
 row          |     somethings|
|    something|   somethingall|
+-------------+---------------+

What I want is to save this dataframe in CSV (or txt) format. I am using the following:

df.write
  .format("csv")
  .save("s3://../adf/")

But when I check the file, it separates the observations onto multiple lines. What I want is for the rows with 'multiline' observations to be on one and the same row in the txt/csv file. I tried to save it as a txt file:

df.as[(String, String)]
  .rdd
  .saveAsTextFile("s3://../adf")

but the same output was observed.

I can imagine that one way is to replace \n with something else and then, when loading the file back, apply the reverse transformation. But is there a way to save it in the desired format without doing any kind of transformation to the data?
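For reference, the kind of round-trip I mean would look roughly like this (a sketch; __NL__ is an arbitrary placeholder token and assumes it never occurs in the actual data):

import org.apache.spark.sql.functions.{col, regexp_replace}

// Hypothetical placeholder token; assumes it never appears in the data.
val NL = "__NL__"

// Before writing: replace embedded newlines in every column.
val escaped = df.columns.foldLeft(df) { (d, c) =>
  d.withColumn(c, regexp_replace(col(c), "\n", NL))
}
escaped.write.format("csv").save("s3://../adf/")

// After reading back: reverse the substitution.
val readBack = spark.read.csv("s3://../adf/")
val restored = readBack.columns.foldLeft(readBack) { (d, c) =>
  d.withColumn(c, regexp_replace(col(c), NL, "\n"))
}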

asked Sep 25 '17 by Mpizos Dimitris

People also ask

How do I run multiple lines in the Spark shell?

In the Spark shell you can wrap your multi-line Spark code in parentheses to execute it. Wrapping in parentheses allows you to paste multi-line Spark code into the shell or to write multi-line code line by line.
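For example (a sketch; data.csv is a placeholder path), pasting this into the shell works because the outer parentheses make the chained calls a single expression:

// The outer parentheses let the shell treat the chained calls as one statement.
(spark.read
  .option("header", "true")
  .csv("data.csv")
  .show())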

What is the multiLine option in PySpark?

Solution: The PySpark JSON data source API provides the multiLine option to read records that span multiple lines. By default, PySpark treats every line in a JSON file as a fully qualified record.
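The same option exists in the Scala API; a minimal sketch (people.json is a placeholder path whose records span multiple lines):

// With multiLine enabled, a single JSON record may span several physical lines;
// without it, each line must contain one complete record.
val people = spark.read
  .option("multiLine", "true")
  .json("people.json")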

How do you write multiple lines in PySpark?

You can use either a backslash or parentheses to break lines in PySpark, just as you do in Python.


1 Answer

Assuming the multi-line data is properly quoted, you can parse multi-line CSV data using the univocity parser and the multiLine setting:

sparkSession.read
  .option("parserLib", "univocity")
  .option("multiLine", "true")
  .csv(file)

Note that this requires reading the entire file onto a single executor, and it may not work if your data is too large. The standard text file reader splits the file by lines before doing any other parsing, which prevents you from working with data records containing newlines unless there is a different record delimiter you can use. If not, you may need to implement a custom TextInputFormat to handle multiline records.
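On the write side, the "properly quoted" assumption can be made explicit. A sketch using the quoteAll option, which forces every field to be quoted so embedded newlines end up inside quotes and survive a multiLine read:

// Quote every field on output so values containing newlines are enclosed
// in quotes and can be parsed back with multiLine = true.
df.write
  .option("quoteAll", "true")
  .csv("s3://../adf/")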

answered Sep 21 '22 by puhlen