I have a DataFrame df with this schema:
root
 |-- person.name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
When I do df.select("person.name") I obviously fetch the sub-field name from person. How can I select the top-level column person.name instead?
For a column name that contains a .(dot), you can use the ` character to enclose the column name:
df.select("`person.name`")
This selects the outer column person.name: string (nullable = true).
In contrast,
df.select("person.name")
selects the sub-field name from the struct column:
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
If you have the column name in a variable, you can just prepend and append the ` character:
"`" + columnName + "`"
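If you build column references dynamically, a tiny helper (the name escape_col is mine, not a Spark API) keeps the quoting in one place:

```python
def escape_col(name):
    # Wrap a column name in backticks so Spark parses the dot as a
    # literal character instead of struct-field access.
    return f"`{name}`"

print(escape_col("person.name"))  # `person.name`
```

You would then write df.select(escape_col("person.name")) instead of concatenating backticks by hand.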
I hope this was helpful!
My answer provides a working code snippet that illustrates the problem of having dots in column names and explains how you can easily remove dots from column names.
Let's create a DataFrame with some sample data:
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("person.name", StringType(), True),
    StructField("person", StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)]))
])
data = [
    ("charles", Row("chuck", 42)),
    ("larry", Row("chipper", 48))
]
df = spark.createDataFrame(data, schema)
df.show()
+-----------+-------------+
|person.name| person|
+-----------+-------------+
| charles| [chuck, 42]|
| larry|[chipper, 48]|
+-----------+-------------+
Let's illustrate that selecting person.name returns different results depending on whether backticks are used.
cols = ["person.name", "person", "person.name", "`person.name`"]
df.select(cols).show()
+-------+-------------+-------+-----------+
|   name|       person|   name|person.name|
+-------+-------------+-------+-----------+
|  chuck|  [chuck, 42]|  chuck|    charles|
|chipper|[chipper, 48]|chipper|      larry|
+-------+-------------+-------+-----------+
You definitely don't want to write or maintain code that changes results based on the presence of backticks. It's always better to replace all the dots with underscores when starting the analysis.
clean_df = df.toDF(*(c.replace('.', '_') for c in df.columns))
clean_df.select("person_name", "person.name", "person.age").show()
+-----------+-------+---+
|person_name|   name|age|
+-----------+-------+---+
|    charles|  chuck| 42|
|      larry|chipper| 48|
+-----------+-------+---+
This post explains how and why to avoid dots in PySpark columns names in more detail.
To access a column name containing a period with Spark SQL, enclose it in backticks there as well:
spark.sql("select `person.name` from person_table")
Note: person_table is a temporary view created from df (e.g. df.createOrReplaceTempView("person_table"); registerTempTable is the older, deprecated API for the same thing).
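The same backtick rule applies when the query string is assembled programmatically. A minimal sketch (the helper name dotted_col_query is hypothetical; checking the string needs no Spark session):

```python
def dotted_col_query(table, column):
    # Backticks tell Spark SQL to treat the whole dotted string as a
    # single column name rather than table-or-struct access.
    return f"SELECT `{column}` FROM {table}"

print(dotted_col_query("person_table", "person.name"))
# SELECT `person.name` FROM person_table
```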