Writing a CSV file using Spark and Java - handling empty values and quotes

Tags: java, csv, java-8, apache-spark, apache-spark-2.3

My data is in a Dataset<Row>, and I am trying to write it to a pipe-delimited file. I want every non-empty, non-null value to be enclosed in quotes. Empty or null values should not be quoted.

result.coalesce(1).write()
            .option("delimiter", "|")
            .option("header", "true")
            .option("nullValue", "")
            .option("quoteAll", "false")
            .csv(Location);

Expected output:

"London"||"UK"
"Delhi"|"India"
"Moscow"|"Russia"

Current Output:

London||UK
Delhi|India
Moscow|Russia

If I change "quoteAll" to "true", the output I get is:

"London"|""|"UK"
"Delhi"|"India"
"Moscow"|"Russia"

The Spark version is 2.3 and the Java version is Java 8.

asked Feb 26 '20 16:02

Ram Grandhi

2 Answers

Java answer. Escaping CSV is not just a matter of wrapping a value in " characters; you also have to handle " characters inside the strings. So let's use StringEscapeUtils and define a UDF that calls it, then apply the UDF to each column.

import org.apache.commons.text.StringEscapeUtils;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

import java.util.Arrays;

public class Test {

    void test(Dataset<Row> result, String Location) {
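        // define the UDF; null and empty values pass through unchanged
        // (calling isEmpty() on a null cell would throw a NullPointerException)
        UserDefinedFunction escape = udf(
            (String str) -> (str == null || str.isEmpty()) ? str : StringEscapeUtils.escapeCsv(str),
            DataTypes.StringType
        );
        // apply the UDF to every column, keeping the original column names
        Column[] columns = Arrays.stream(result.schema().fieldNames())
                .map(f -> escape.apply(col(f)).as(f))
                .toArray(Column[]::new);

        // save the result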
        result.select(columns)
                .coalesce(1).write()
                .option("delimiter", "|")
                .option("header", "true")
                .option("nullValue", "")
                .option("quoteAll", "false")
                .csv(Location);
    }
}

Side note: coalesce(1) is a bad call. It pulls all the data into a single partition on one executor, so you can get an executor OOM in production with a huge dataset.
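
One caveat about the UDF above: as far as I know, StringEscapeUtils.escapeCsv only adds quotes when a value contains a comma, a double quote, or a line break, so a plain value such as London would come back unquoted. If every non-empty value has to be quoted, a hand-written helper is one option. Here is a minimal sketch in plain Java; the class name CsvQuoting and the method quoteIfPresent are made up for illustration and are not part of Spark or Commons Text.

```java
public final class CsvQuoting {

    private CsvQuoting() { }

    /**
     * Wraps a non-empty value in double quotes and doubles any embedded
     * double quote. Null and empty values are returned unchanged so that
     * they stay unquoted in the output.
     */
    public static String quoteIfPresent(String value) {
        if (value == null || value.isEmpty()) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(quoteIfPresent("London"));       // "London"
        System.out.println(quoteIfPresent("say \"hi\""));  // "say ""hi"""
    }
}
```

The same expression could serve as the body of the UDF lambda in place of the escapeCsv call. Whether the CSV writer then leaves an already quoted value untouched depends on its quote settings, so it is worth checking the output on a small sample first.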

answered Oct 21 '22 05:10

Artem Aliev

EDIT & Warning: I did not see the java tag. This is a Scala solution that uses foldLeft as a loop over all the columns. If it is replaced by a Java-friendly loop, everything should work as is. I will try to come back to this at a later time.

A programmatic solution could be:

import org.apache.spark.sql.functions.{col, concat, lit, when}

val columns = result.columns
val randomColumnName = "RND" // temporary name; must not clash with an existing column

val result2 = columns.foldLeft(result) { (data, column) =>
  data
    .withColumnRenamed(column, randomColumnName)
    .withColumn(column,
      when(col(randomColumnName).isNull, "")
        .otherwise(concat(lit("\""), col(randomColumnName), lit("\"")))
    )
    .drop(randomColumnName)
}

This wraps every non-null value in " characters and replaces nulls with empty strings. If you need to keep the nulls, leave those values as they are.

Then just write it out:

result2.coalesce(1).write
  .option("delimiter", "|")
  .option("header", "true")
  .option("quoteAll", "false")
  .csv(Location)
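
The per-cell rule in the transformation above (a null becomes an empty string, everything else is wrapped in quotes) can be tried out in plain Java without Spark. The sketch below also leaves empty strings unquoted, as the question asks, which the when/otherwise expression does not do on its own. PipeRow and formatRow are hypothetical names, and embedded quotes are not escaped here, just as in the Scala concat.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public final class PipeRow {

    private PipeRow() { }

    /** Joins cells with '|', quoting only the non-null, non-empty ones. */
    public static String formatRow(String... cells) {
        return Arrays.stream(cells)
                .map(c -> (c == null || c.isEmpty()) ? "" : "\"" + c + "\"")
                .collect(Collectors.joining("|"));
    }

    public static void main(String[] args) {
        System.out.println(formatRow("London", null, "UK")); // "London"||"UK"
        System.out.println(formatRow("Delhi", "India"));     // "Delhi"|"India"
    }
}
```

Running this on a few sample rows is a quick way to confirm the expected file layout before wiring the same rule into a Spark column expression.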

answered Oct 21 '22 04:10

Saša Zejnilović
