 

Is there a way to add extra metadata for Spark dataframes?

Is it possible to add extra metadata to DataFrames?

Reason

I have Spark DataFrames for which I need to keep extra information. Example: A DataFrame, for which I want to "remember" the highest used index in an Integer id column.

Current solution

I use a separate DataFrame to store this information. Of course, keeping this information separately is tedious and error-prone.
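
Roughly like this, as a sketch (df and its Integer id column stand in for my real data):

import org.apache.spark.sql.functions.max

// The "real" DataFrame plus a second, single-row DataFrame whose only job is
// to remember the highest id seen so far; the two must be kept in sync by hand.
val maxIdDF = df.agg(max("id").as("maxId"))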

Is there a better solution to store such extra information on DataFrames?

asked Sep 17 '15 by Martin Senne



2 Answers

To expand and Scala-fy nealmcb's answer (the question was tagged scala, not python, so I don't think this answer will be off-topic or redundant), suppose you have a DataFrame:

import org.apache.spark.sql
// in spark-shell, where sc and the sqlContext implicits needed for toDF are already in scope
val df = sc.parallelize(Seq.fill(100) { scala.util.Random.nextInt() }).toDF("randInt")

And some way to get the max or whatever you want to memoize on the DataFrame:

val randIntMax = df.rdd.map { case sql.Row(randInt: Int) => randInt }.reduce(math.max)

sql.types.Metadata can only hold strings, booleans, longs, doubles, other Metadata structures, and arrays of those. So we have to use a Long:

val metadata = new sql.types.MetadataBuilder().putLong("columnMax", randIntMax).build()

DataFrame.withColumn() actually has an overload that permits supplying a metadata argument at the end, but it's inexplicably marked [private], so we just do what it does — use Column.as(alias, metadata):

val newColumn = df.col("randInt").as("randInt_withMax", metadata)
val dfWithMax = df.withColumn("randInt_withMax", newColumn)

dfWithMax now has (a column with) the metadata you want!

dfWithMax.schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
> randInt: metadata={}
> randInt_withMax: metadata={"columnMax":2094414111}

Or programmatically and type-safely (sort of; Metadata.getLong() and others do not return Option and may throw a "key not found" exception):

dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")
> res29: Long = 2094414111

Attaching the max to a column makes sense in your case, but in the general case of attaching metadata to a DataFrame and not a column in particular, it appears you'd have to take the wrapper route described by the other answers.
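
A bare-bones sketch of what such a wrapper could look like (DataFrameWithMeta is an invented name, not a Spark API), reusing df and randIntMax from above:

import org.apache.spark.sql.DataFrame

// Hypothetical wrapper: carry the DataFrame and an arbitrary metadata map side
// by side; every transformation has to go through it (or rebuild it) so the
// metadata is not lost along the way.
case class DataFrameWithMeta(df: DataFrame, meta: Map[String, Any]) {
  def transform(f: DataFrame => DataFrame): DataFrameWithMeta = copy(df = f(df))
}

val wrapped = DataFrameWithMeta(df, Map("columnMax" -> randIntMax))
val positiveOnly = wrapped.transform(_.filter(df("randInt") > 0))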

answered Sep 22 '22 by chbrown


As of Spark 1.2, StructType schemas have a metadata attribute which can hold an arbitrary mapping/dictionary of information for each column in a DataFrame, e.g. (when used with the separate spark-csv library):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

customSchema = StructType([
  StructField("cat_id", IntegerType(), True,
    {'description': "Unique id, primary key"}),
  StructField("cat_title", StringType(), True,
    {'description': "Name of the category, with underscores"}) ])

categoryDumpDF = (sqlContext.read.format('com.databricks.spark.csv')
 .options(header='false')
 .load(csvFilename, schema = customSchema) )

f = categoryDumpDF.schema.fields
["%s (%s): %s" % (t.name, t.dataType, t.metadata) for t in f]

["cat_id (IntegerType): {u'description': u'Unique id, primary key'}",
 "cat_title (StringType): {u'description': u'Name of the category, with underscores.'}"]

This was added in [SPARK-3569] Add metadata field to StructField - ASF JIRA, and was designed for use in Machine Learning pipelines to track information about the features stored in columns, such as whether a feature is categorical or continuous, the number of categories, and the category-to-index map. See the SPARK-3569: Add metadata field to StructField design document.

I'd like to see this used more widely, e.g. for descriptions and documentation of columns, the unit of measurement used in the column, coordinate axis information, etc.

Issues include how to appropriately preserve or manipulate the metadata information when the column is transformed, how to handle multiple sorts of metadata, how to make it all extensible, etc.
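
As a small illustration of the first point (a Scala sketch, since the question is tagged scala; the df, the "price" column, and its unit metadata are all invented for the example), a column derived from one that carries metadata comes back with empty metadata, so the metadata has to be re-attached by hand:

import org.apache.spark.sql.functions.round
import org.apache.spark.sql.types.MetadataBuilder

// Assumed: df is some DataFrame with a numeric "price" column.
val priceMeta = new MetadataBuilder().putString("unit", "EUR").build()
val tagged = df.withColumn("price", df("price").as("price", priceMeta))

// round() produces a fresh column with empty metadata, so the unit is copied
// over explicitly via Column.as if it should survive the transformation.
val rounded = tagged.withColumn("price_rounded",
  round(tagged("price"), 2).as("price_rounded", tagged.schema("price").metadata))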

For the benefit of those thinking of expanding this functionality in Spark DataFrames, here are some analogous discussions around Pandas.

For example, see xray - bring the labeled data power of pandas to the physical sciences, which supports metadata for labeled arrays.

And see the discussion of metadata for Pandas at Allow custom metadata to be attached to panel/df/series? · Issue #2485 · pydata/pandas.

See also discussion related to units: ENH: unit of measurement / physical quantities · Issue #10349 · pydata/pandas

answered Sep 22 '22 by nealmcb