I have a DataFrame df with this schema:
root
 |-- person.name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
When I do df.select("person.name") I obviously fetch the sub-field name from person. How can I select the top-level column person.name instead?
For a column name that contains a .(dot), you can use the ` character to enclose the column name:
df.select("`person.name`")
This selects the outer column person.name: string (nullable = true).
In contrast,
df.select("person.name")
selects the sub-field name from the struct column:
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
If you have the column name in a variable, you can just prepend and append the ` character:
"`" + columnName + "`"
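If you build column references dynamically, a tiny helper (the name escape_col is mine, not a Spark API) keeps the quoting in one place:

```python
def escape_col(name):
    # Wrap a column name in backticks so Spark parses the dot as a
    # literal character instead of struct-field access.
    return f"`{name}`"

print(escape_col("person.name"))  # `person.name`
```

You would then write df.select(escape_col("person.name")) instead of concatenating backticks by hand.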
I hope this was helpful!
My answer provides a working code snippet that illustrates the problem of having dots in column names and explains how you can easily remove dots from column names.
Let's create a DataFrame with some sample data:
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("person.name", StringType(), True),
    StructField("person", StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)]))
])
data = [
    ("charles", Row("chuck", 42)),
    ("larry", Row("chipper", 48))
]
df = spark.createDataFrame(data, schema)
df.show()
+-----------+-------------+
|person.name| person|
+-----------+-------------+
| charles| [chuck, 42]|
| larry|[chipper, 48]|
+-----------+-------------+
Let's illustrate that selecting person.name returns different results depending on whether backticks are used.
cols = ["person.name", "person", "person.name", "`person.name`"]
df.select(cols).show()
+-------+-------------+-------+-----------+
|   name|       person|   name|person.name|
+-------+-------------+-------+-----------+
|  chuck|  [chuck, 42]|  chuck|    charles|
|chipper|[chipper, 48]|chipper|      larry|
+-------+-------------+-------+-----------+
You definitely don't want to write or maintain code that changes results based on the presence of backticks. It's always better to replace all the dots with underscores when starting the analysis.
clean_df = df.toDF(*(c.replace('.', '_') for c in df.columns))
clean_df.select("person_name", "person.name", "person.age").show()
+-----------+-------+---+
|person_name|   name|age|
+-----------+-------+---+
|    charles|  chuck| 42|
|      larry|chipper| 48|
+-----------+-------+---+
This post explains how and why to avoid dots in PySpark columns names in more detail.
To access a column name containing a period with Spark SQL, enclose it in backticks there as well:
spark.sql("select `person.name` from person_table")
Note: person_table is a temporary view created from df (e.g. df.createOrReplaceTempView("person_table"); registerTempTable is the older, deprecated API for the same thing).
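The same backtick rule applies when the query string is assembled programmatically. A minimal sketch (the helper name dotted_col_query is hypothetical; checking the string needs no Spark session):

```python
def dotted_col_query(table, column):
    # Backticks tell Spark SQL to treat the whole dotted string as a
    # single column name rather than table-or-struct access.
    return f"SELECT `{column}` FROM {table}"

print(dotted_col_query("person_table", "person.name"))
# SELECT `person.name` FROM person_table
```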