
Spark dataframe column naming conventions / restrictions

I have run into issues multiple times now with the default naming (as imported from the .csv files I receive) of my (Py)Spark column names. Things that seem to trip up Spark are MixedCase and characters like . or - in column names. So I decided to find out which column names are actually safe, and found the following:

This website seems to advise using lowercase-only names:

Hive stores the table, field names in lowercase in Hive Metastore. Spark preserves the case of the field name in Dataframe, Parquet Files. When a table is created/accessed using Spark SQL, Case Sensitivity is preserved by Spark storing the details in Table Properties (in hive metastore). This results in a weird behavior when parquet records are accessed thru Spark SQL using Hive Metastore.

Amazon Athena seems to confirm this, and adds that "_" is the only safe special character:

... but Spark requires lowercase table and column names.

Athena table, view, database, and column names cannot contain special characters, other than underscore (_).

What I take from this is that, wherever possible, I should use only lowercase column names, with _ as the separator between words, to ensure maximum cross-compatibility with tools that might appear in my Spark workflow. Is this correct? Is there any reason to prefer a space over an underscore? Is there anything else to consider?

I realize that in many cases I might be overdoing it by renaming all columns to the above scheme - however, I'd rather avoid running into naming-related trouble in the middle of a project, since I sometimes find these errors hard to debug.
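As a quick sanity check, the conservative rule described above (lowercase letters, digits, and underscores only) can be expressed as a small validation helper. This is a plain-Python sketch; the function name and regex are my own, not taken from Spark or Athena:

```python
import re

# Hypothetical helper: accepts only names that start with a lowercase letter
# and contain nothing but lowercase letters, digits and underscores.
SAFE_NAME = re.compile(r'^[a-z][a-z0-9_]*$')

def is_safe_column_name(name: str) -> bool:
    """Return True if `name` follows the conservative lowercase_with_underscores rule."""
    return bool(SAFE_NAME.match(name))

print(is_safe_column_name("total_sales"))   # True
print(is_safe_column_name("Total Sales"))   # False: mixed case and a space
print(is_safe_column_name("price.usd"))     # False: contains a dot
```

Running such a check over df.columns before writing to Parquet or registering a table would flag problematic names early instead of mid-pipeline.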

Thomas asked Oct 26 '18 14:10


1 Answer

When saving a file in Parquet format, you cannot use spaces or certain special characters in column names. I ran into similar problems reading from CSV and writing to Parquet. The following code solved it for me:

# Column headers: lower case + remove spaces and the following characters: ,;{}()=  
newColumns = []
problematic_chars = ',;{}()='
for column in df.columns:
    column = column.lower()
    column = column.replace(' ', '_')
    for c in problematic_chars:
        column = column.replace(c, '')
    newColumns.append(column)
df = df.toDF(*newColumns)

So yes, if your goal is maximum cross-compatibility, you should make sure your column names are all lowercase, with _ as the only separator.
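For completeness, the same normalisation can be written as a one-pass regex instead of a character-by-character loop. This is a plain-Python sketch (the function name is my own); the resulting list can be passed to df.toDF(*...) just as in the loop version above:

```python
import re

def sanitize_column(name: str) -> str:
    """Lowercase, replace spaces with underscores, drop all other unsafe characters."""
    name = name.lower().replace(' ', '_')
    # Keep only lowercase letters, digits and underscores.
    return re.sub(r'[^a-z0-9_]', '', name)

columns = ['First Name', 'Price (USD)', 'order.id']
print([sanitize_column(c) for c in columns])
# ['first_name', 'price_usd', 'orderid']
```

Note that this drops characters like "." outright rather than replacing them with "_"; if you want dots and dashes to become underscores instead, extend the replace step before the re.sub call.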

PythonSherpa answered Nov 04 '22 00:11