I use Apache Spark as an ETL tool to fetch tables from Oracle into Elasticsearch. I have an issue with numeric columns that Spark recognizes as decimal, whereas Elasticsearch doesn't accept the decimal type; so I convert each decimal column into double, which Elasticsearch does accept:
dataFrame = dataFrame.select(
    [col(name) if 'decimal' not in colType else col(name).cast('double')
     for name, colType in dataFrame.dtypes]
)
The problem with this is that every numeric column ends up as double, whether or not it actually holds fractional values. My question: is there any way to detect whether a column should be converted to an integer type or to double?
You can retrieve all column names whose datatype is DecimalType() from the schema of the DataFrame; see below for an example (tested on Spark 2.4.0).
Update: just use df.dtypes, which is enough to retrieve the information.
from pyspark.sql.functions import col

df = spark.createDataFrame([(1, 12.3, 1.5, 'test', 13.23)], ['i1', 'd2', 'f3', 's4', 'd5'])

# cast two columns to DecimalType to mimic the Oracle source
df = df.withColumn('d2', col('d2').astype('decimal(10,1)')) \
       .withColumn('d5', col('d5').astype('decimal(10,2)'))
# DataFrame[i1: bigint, d2: decimal(10,1), f3: double, s4: string, d5: decimal(10,2)]

# pick out the columns whose dtype string starts with 'decimal'
decimal_cols = [f[0] for f in df.dtypes if f[1].startswith('decimal')]
print(decimal_cols)
# ['d2', 'd5']
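To address the integer-vs-double part of the question: one option (a sketch of my own, not tested against an Oracle source) is to look at the declared scale of each DecimalType field in df.schema. A decimal with scale 0 can only hold whole numbers, so it can be cast to bigint; anything else goes to double. Note that this inspects the declared schema, not the actual values, so a column declared with a nonzero scale is treated as fractional even if every stored value happens to be whole.

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

def decimal_to_numeric(df):
    """Hypothetical helper: cast decimal(p,0) columns to bigint, other decimals to double."""
    exprs = []
    for field in df.schema.fields:
        if isinstance(field.dataType, DecimalType):
            target = 'bigint' if field.dataType.scale == 0 else 'double'
            exprs.append(col(field.name).cast(target).alias(field.name))
        else:
            exprs.append(col(field.name))
    return df.select(*exprs)

# decimal_to_numeric(df)
# DataFrame[i1: bigint, d2: double, f3: double, s4: string, d5: double]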
Just a follow-up: the above method will not work for array, struct and other nested data structures. If the field names inside the struct don't contain characters like spaces, dots etc., you can use the type string from df.dtypes directly:
import re
from pyspark.sql.functions import array, struct, col

# rewrite every decimal(p,s) inside a dtype string to double
decimal_to_double = lambda x: re.sub(r'decimal\(\d+,\d+\)', 'double', x)

df1 = df.withColumn('a6', array('d2','d5')).withColumn('s7', struct('i1','d2'))
# DataFrame[i1: bigint, d2: decimal(10,1), f3: double, s4: string, d5: decimal(10,2), a6: array<decimal(11,2)>, s7: struct<i1:bigint,d2:decimal(10,1)>]

df1.select(*[col(d[0]).astype(decimal_to_double(d[1])) if 'decimal' in d[1] else col(d[0]) for d in df1.dtypes])
# DataFrame[i1: bigint, d2: double, f3: double, s4: string, d5: double, a6: array<double>, s7: struct<i1:bigint,d2:double>]
However, if any field names inside a StructType() contain spaces, dots etc., the above method might not work, because those characters break the dtype strings. In that case, I suggest checking df.schema.jsonValue()['fields'] and manipulating the JSON representation of the schema to do the dtype transformation.
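As a rough sketch of that idea (my own illustration, same decimal-to-double goal as above): walk the dict returned by df1.schema.jsonValue(), replace every decimal(p,s) type string with double, rebuild the schema with StructType.fromJson, and cast each top-level column to its rewritten type. No field names are parsed out of a dtype string, so spaces and dots inside struct fields don't get in the way.

import re
from pyspark.sql.functions import col
from pyspark.sql.types import StructType

def replace_decimal(node):
    """Recursively replace any 'decimal(p,s)' type string in the schema JSON with 'double'."""
    if isinstance(node, dict):
        return {k: replace_decimal(v) for k, v in node.items()}
    if isinstance(node, list):
        return [replace_decimal(v) for v in node]
    if isinstance(node, str) and re.fullmatch(r'decimal\(\d+,\d+\)', node):
        return 'double'
    return node

new_schema = StructType.fromJson(replace_decimal(df1.schema.jsonValue()))

# cast each top-level column to its rewritten type (the problematic names are assumed
# to live inside structs, so plain col() on the top-level names is fine)
df2 = df1.select([col(f.name).cast(f.dataType).alias(f.name) for f in new_schema.fields])
# DataFrame[i1: bigint, d2: double, f3: double, s4: string, d5: double, a6: array<double>, s7: struct<i1:bigint,d2:double>]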