
DataFrame columns names conflict with .(dot)

I have a DataFrame df which has this schema:

root
 |-- person.name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)

When I do df.select("person.name") I obviously fetch the sub-field name from person. How can I select the top-level column person.name instead?

asked Feb 28 '18 by belka



3 Answers

For a column name that contains a dot (.), you can enclose the name in backtick (`) characters:

df.select("`person.name`") 

This selects the top-level string column person.name: string (nullable = true)

Without backticks, df.select("person.name") is resolved as struct field access, so it returns the name field nested inside the person struct:

root
 |-- name: string (nullable = true)

If you have the column name in a variable, you can simply prepend and append the ` character:

"`" + columnName + "`"
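As a minimal sketch of that wrapping step (plain Python, no Spark session needed; the helper name escape_col is my own):

```python
def escape_col(name: str) -> str:
    """Wrap a column name in backticks so Spark treats the dot literally."""
    return "`" + name + "`"

print(escape_col("person.name"))  # `person.name`
```

The escaped string can then be passed to select() or withColumn() exactly as shown above.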

I hope this was helpful!

answered Oct 22 '22 by koiralo


My answer provides a working code snippet that illustrates the problem of having dots in column names and explains how you can easily remove dots from column names.

Let's create a DataFrame with some sample data:

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("person.name", StringType(), True),
    StructField("person", StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)]))
])
data = [
    ("charles", Row("chuck", 42)),
    ("larry", Row("chipper", 48))
]
df = spark.createDataFrame(data, schema)
df.show()
+-----------+-------------+
|person.name|       person|
+-----------+-------------+
|    charles|  [chuck, 42]|
|      larry|[chipper, 48]|
+-----------+-------------+

Let's illustrate that selecting person.name returns different results depending on whether backticks are used:

cols = ["person.name", "person", "`person.name`"]
df.select(cols).show()
+-------+-------------+-----------+
|   name|       person|person.name|
+-------+-------------+-----------+
|  chuck|  [chuck, 42]|    charles|
|chipper|[chipper, 48]|      larry|
+-------+-------------+-----------+

You definitely don't want to write or maintain code whose results change based on the presence of backticks. It's better to replace all the dots with underscores at the start of the analysis.

clean_df = df.toDF(*(c.replace('.', '_') for c in df.columns))
clean_df.select("person_name", "person.name", "person.age").show()
+-----------+-------+---+
|person_name|   name|age|
+-----------+-------+---+
|    charles|  chuck| 42|
|      larry|chipper| 48|
+-----------+-------+---+
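The rename logic itself is plain Python and can be pulled into a reusable helper (a sketch; the function name is my own):

```python
def dots_to_underscores(columns):
    """Return the column names with every dot replaced by an underscore."""
    return [c.replace(".", "_") for c in columns]

print(dots_to_underscores(["person.name", "person"]))  # ['person_name', 'person']
```

With a Spark DataFrame in hand, this would be applied as df.toDF(*dots_to_underscores(df.columns)), which is exactly what the snippet above does inline.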

This post explains how and why to avoid dots in PySpark column names in more detail.

answered Oct 22 '22 by Powers


Using Spark SQL, you can access the nested name field of the person struct directly:

spark.sql("select person.name from person_table")

To select the literal person.name column instead, enclose it in backticks: select `person.name` from person_table.

Note: person_table is a temporary view registered on df (via registerTempTable, or createOrReplaceTempView in newer Spark versions).

answered Oct 22 '22 by Srini GL