I have a dataframe which has some multi-line observations:
+--------------------+----------------+
|                col1|            col2|
+--------------------+----------------+
|something1          |somethingelse1  |
|something2          |somethingelse2  |
|something3          |somethingelse3  |
|something4          |somethingelse4  |
|multiline
row                 |      somethings|
|something           |somethingall    |
+--------------------+----------------+
What I want is to save this dataframe in CSV (or txt) format. I am using the following:
df
.write
.format("csv")
.save("s3://../adf/")
But when I check the file, the multi-line observations are split across multiple lines. What I want is for the rows that contain 'multiline' observations to stay on one and the same row in the txt/csv file. I also tried saving it as a text file:
df
.as[(String,String)]
.rdd
.saveAsTextFile("s3://../adf")
but the same output was observed.
I can imagine that one way is to replace \n with something else before saving, and then apply the reverse substitution when loading the data back. But is there a way to save it in the desired way without doing any kind of transformation to the data?
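For illustration, a minimal sketch of that replace-and-reverse workaround, assuming the multi-line values live in col1 and that the placeholder <NL> never occurs in the real data:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Replace embedded newlines with a placeholder token before writing.
val escaped = df.withColumn("col1", regexp_replace(col("col1"), "\n", "<NL>"))
escaped.write.format("csv").save("s3://../adf/")

// When loading back, reverse the substitution to restore the original values.
val restored = spark.read.csv("s3://../adf/")
  .withColumn("_c0", regexp_replace(col("_c0"), "<NL>", "\n"))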
Assuming the multi-line data is properly quoted, you can parse multi-line CSV data using the univocity parser and the multiLine setting:
sparkSession.read
  .option("parserLib", "univocity")
  .option("multiLine", "true")
  .csv(file)
Note that this requires reading the entire file onto a single executor, and it may not work if your data is too large. The standard text-file reading path splits the file by lines before doing any other parsing, which prevents you from working with data records containing newlines unless there is a different record delimiter you can use. If there is not, you may need to implement a custom TextInputFormat to handle multi-line records.
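For completeness, a hedged round-trip sketch (the path is a placeholder, as in the question): Spark's CSV writer quotes fields as needed, and the quoteAll option forces quoting of every field, so embedded newlines end up inside quoted values that the multiLine reader can reassemble into single records.

// Write: force quoting so embedded newlines stay inside quoted fields.
df.write
  .option("quoteAll", "true")
  .format("csv")
  .save("s3://../adf/")

// Read back: multiLine lets the parser consume quoted newlines within a record.
val roundTrip = sparkSession.read
  .option("multiLine", "true")
  .csv("s3://../adf/")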