Replace or remove the newline character ("\n") from a Spark Dataset column value

I have the code below to read an XML file:

Dataset<Row> dataset1 = SparkConfigXMLProcessor.sparkSession.read().format("com.databricks.spark.xml")
                .option("rowTag", properties.get(EventHubConsumerConstants.IG_ORDER_TAG).toString())
                .load(properties.get("C:\\inputOrders.xml").toString());

One of the column values contains a newline character. I want to either replace it with some other character or remove it entirely. Please help.

Sudeep Singh Thakur asked Sep 19 '25 23:09


2 Answers

// Needs: import static org.apache.spark.sql.functions.*;
Dataset<Row> cleaned = dataset1.withColumn("menuitemname_clean", regexp_replace(col("menuitemname"), "[\n\r]", " "));

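The code above adds a new column, menuitemname_clean, in which every newline (\n) or carriage return (\r) in menuitemname is replaced with a space. Since withColumn returns a new Dataset, keep the result rather than discarding it.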
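
To check what that pattern does without starting a Spark session, here is a minimal plain-Java sketch. It assumes that Spark's regexp_replace follows Java regular-expression syntax, so String.replaceAll should behave the same way for these patterns. The class name NewlineToSpace, the sample string, and the second pattern are made up for illustration and are not part of the answer.

```java
class NewlineToSpace {
    // Same character class as in the answer: each single \n or \r becomes one space.
    static String perChar(String s) {
        return s.replaceAll("[\n\r]", " ");
    }

    // Hypothetical variant: treat a Windows \r\n pair as one line break,
    // so it turns into a single space instead of two.
    static String perLineBreak(String s) {
        return s.replaceAll("\r\n|[\n\r]", " ");
    }

    public static void main(String[] args) {
        String sample = "Veg\r\nBurger\nCombo"; // hypothetical column value
        // Brackets make the number of spaces visible in the output.
        System.out.println("[" + perChar(sample) + "]");
        System.out.println("[" + perLineBreak(sample) + "]");
    }
}
```

If the data may contain Windows line endings, the second pattern avoids leaving a double space where a \r\n pair used to be.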

Yawar answered Sep 22 '25 12:09


This is what I used. I usually add a tab (\t), too. Matching both \r and \n covers Unix and macOS (\n), Windows (\r\n), and classic Mac OS (\r) line endings.

Dataset<Row> newDF = dataset1.withColumn("menuitemname", regexp_replace(col("menuitemname"), "\n|\r", ""));
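
As a rough illustration of removing the characters outright, including the tab mentioned above, here is a small plain-Java sketch. It again assumes Java regex semantics carry over to Spark's regexp_replace. The class name StripControlWhitespace and the sample value are hypothetical.

```java
class StripControlWhitespace {
    // Alternation in the style of the answer, extended with a tab:
    // every \n, \r and \t is deleted rather than replaced.
    static String strip(String s) {
        return s.replaceAll("\n|\r|\t", "");
    }

    public static void main(String[] args) {
        String sample = "Veg\tBurger\r\nCombo"; // hypothetical column value
        // After stripping, nothing is left between the three words.
        System.out.println(strip(sample));
    }
}
```

Because deletion joins the neighbouring words, replacing with a space may be the better choice when the line break separated words.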
Richard Haussmann answered Sep 22 '25 14:09