
Spark Dataframe validating column names for parquet writes

I'm processing events using DataFrames converted from a stream of JSON events, which eventually get written out in Parquet format.

However, some of the JSON events contain spaces in their keys. I want to log such events and filter/drop them from the DataFrame before converting it to Parquet, because space and ,;{}()\n\t= are considered special characters in the Parquet schema (CatalystSchemaConverter, listed in [1] below) and are therefore not allowed in column names.

How can I perform such validation on the DataFrame's column names and drop offending events altogether, without failing the Spark Streaming job?

[1] Spark's CatalystSchemaConverter

def checkFieldName(name: String): Unit = {
  // ,;{}()\n\t= and space are special characters in Parquet schema
  checkConversionRequirement(
    !name.matches(".*[ ,;{}()\n\t=].*"),
    s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=".
             |Please use alias to rename it.
           """.stripMargin.split("\n").mkString(" ").trim
  )
}
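
For illustration, a minimal PySpark sketch of applying the same character check to a DataFrame's column names before writing; the DataFrame df, the print-based logging, and the output path are assumed/hypothetical:

import re

# Same special-character set checked by CatalystSchemaConverter
INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")

def invalid_columns(df):
    """Return the column names that Parquet would reject."""
    return [c for c in df.columns if INVALID_CHARS.search(c)]

bad = invalid_columns(df)            # df is the current micro-batch DataFrame
if bad:
    # Log and skip this batch instead of letting the write fail the job
    print("Dropping batch with invalid column names: %s" % bad)
else:
    df.write.mode("append").parquet("/tmp/output")  # hypothetical output path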
asked Jul 04 '16 by codehammer

People also ask

How do I display DataFrame column names in Spark?

You can get all the columns of a Spark DataFrame by using df.columns; it returns the column names as an Array[String].
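
In PySpark, for instance, df.columns returns a plain Python list (df is assumed to already exist):

# Inspect the column names of an existing DataFrame
print(df.columns)   # e.g. ['name', 'age']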

Are Parquet column names case-sensitive?

Although Spark SQL itself is not case-sensitive, Hive-compatible file formats such as Parquet are. Spark SQL must use a case-preserving schema when querying any table backed by files containing case-sensitive field names, or queries may not return accurate results.
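
As an illustration, Spark SQL's name matching can be made case-sensitive via the spark.sql.caseSensitive setting (shown here only as a sketch; the default is false):

# Make Spark SQL match column names case-sensitively
spark.conf.set("spark.sql.caseSensitive", "true")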

How do I create a Spark DataFrame with column names?

To do this, first create a list of data and a list of column names, then pass the zipped data and the column names to the spark.createDataFrame() method, which creates the DataFrame.
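
A minimal PySpark sketch of that approach (the data and column names below are invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

names = ["alice", "bob"]
ages = [34, 29]

# Zip the data lists into rows and pass the column names alongside them
df = spark.createDataFrame(list(zip(names, ages)), ["name", "age"])
df.show()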


2 Answers

For everyone hitting this in pyspark: it even happened to me after renaming the columns. One way I got this to work, after some iterations, is the following:

file = "/opt/myfile.parquet"
df = spark.read.parquet(file)

# Strip spaces from every column name
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", ""))

# Re-read the file, forcing the sanitized schema onto the data
df = spark.read.schema(df.schema).parquet(file)

answered Sep 22 '22 by Jan C. Schäfer


You can use a regex to replace all invalid characters with an underscore before you write to Parquet, and strip accents from the column names as well.

Here is a normalize function that does this, in both Scala and Python:

Scala

/**
  * Normalize column name by replacing invalid characters with underscore
  * and strips accents
  *
  * @param columns dataframe column names list
  * @return the list of normalized column names
  */
def normalize(columns: Seq[String]): Seq[String] = {
  columns.map { c =>
    org.apache.commons.lang3.StringUtils.stripAccents(c.replaceAll("[ ,;{}()\n\t=]+", "_"))
  }
}

// using the function
val df2 = df.toDF(normalize(df.columns):_*)

Python

import unicodedata
import re

def normalize(column: str) -> str:
    """
    Normalize a column name by replacing invalid characters with an underscore,
    stripping accents and lowercasing
    :param column: column name
    :return: normalized column name
    """
    n = re.sub(r"[ ,;{}()\n\t=]+", '_', column.lower())
    return unicodedata.normalize('NFKD', n).encode('ASCII', 'ignore').decode()


# using the function
df = df.toDF(*map(normalize, df.columns))
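
For example, with an invented column name (the result shown is what the function above produces):

# Hypothetical column name, just to show the effect
normalize("Événement Date;(UTC)")   # returns 'evenement_date_utc_'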

answered Sep 24 '22 by blackbishop