
How do I increase decimal precision in Spark?

I have a large DataFrame made up of ~550 columns of doubles and two columns of longs (ids). The 550 columns are read in from a CSV, and I add the two id columns. The only other things I do with the data are convert some of the CSV strings to doubles ("Inf" -> "0", then cast the column to double) and replace NaNs with 0:

// Replace the strings "Inf" and "NaN" with "0", then cast the column to double
df = df.withColumn(col.name + "temp",
       regexp_replace(
         regexp_replace(df(col.name), "Inf", "0"),
         "NaN", "0").cast(DoubleType))
df = df.drop(col.name).withColumnRenamed(col.name + "temp", col.name)
// Add the two id columns
df = df.withColumn("timeId", monotonically_increasing_id().cast(LongType))
df = df.withColumn("patId", lit(num).cast(LongType))
// Replace any remaining nulls with 0
df = df.na.fill(0)
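Outside Spark, the per-value cleaning above amounts to the following minimal sketch in plain Python (the `clean` helper name is my own, not from the question; the real code operates column-wise on the DataFrame):

```python
import re

def clean(raw: str) -> float:
    # Mirror the two regexp_replace calls: substitute the substrings
    # "Inf" and "NaN" with "0", then cast the result to a double.
    cleaned = re.sub("NaN", "0", re.sub("Inf", "0", raw))
    return float(cleaned)

print(clean("Inf"))   # 0.0
print(clean("3.14"))  # 3.14
```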

When I do a count, I get the following error:

IllegalArgumentException: requirement failed: Decimal precision 6 exceeds max precision 5

There are hundreds of thousands of rows, and I'm reading the data in from multiple CSVs. How do I increase the decimal precision? Is there something else that could be going on? I only get this error when reading in some of the CSVs. Could they have more decimal digits than the others?

asked May 31 '17 by Ross Lewis

People also ask

How do you define decimals in spark?

Class DecimalType: a Decimal that must have fixed precision (the maximum total number of digits) and scale (the number of digits to the right of the decimal point). The precision can be up to 38; the scale can also be up to 38 (but must be less than or equal to the precision). The default precision and scale is (10, 0).
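To make precision and scale concrete, here is a small plain-Python check of whether a value fits a given DecimalType(precision, scale). This is a sketch of the rule, not Spark's actual overflow check, and `fits_decimal` is my own helper name:

```python
from decimal import Decimal

def fits_decimal(value: str, precision: int, scale: int) -> bool:
    # A decimal(p, s) column stores at most `precision` total digits,
    # with at most `scale` of them to the right of the decimal point.
    d = Decimal(value)
    digits = len(d.as_tuple().digits)      # total significant digits
    frac = max(-d.as_tuple().exponent, 0)  # digits after the point
    return frac <= scale and digits - frac <= precision - scale

print(fits_decimal("123.45", 5, 2))   # True
print(fits_decimal("1234.56", 5, 2))  # False: needs precision 6
```

The second call mirrors the question's error: a value needing 6 digits of precision does not fit a column capped at 5.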

How do you set decimal places in Pyspark?

You can use format_number to format a number to the desired number of decimal places, as stated in the official API documentation: it formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.
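Outside Spark, the rounding and digit grouping that format_number performs can be sketched with Python's built-in string formatting (an illustration of the output shape, not the Spark implementation):

```python
x = 1234567.891
# Comma grouping plus rounding to 2 decimal places,
# analogous to format_number(col, 2) producing a string
formatted = format(x, ",.2f")
print(formatted)  # 1,234,567.89
```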

What is decimal of precision?

Precision is the number of digits in a number. Scale is the number of digits to the right of the decimal point in a number. For example, the number 123.45 has a precision of 5 and a scale of 2. In SQL Server, the default maximum precision of numeric and decimal data types is 38.
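The 123.45 example can be checked directly with Python's standard decimal module (total digits give the precision, fractional digits the scale):

```python
from decimal import Decimal

d = Decimal("123.45")
precision = len(d.as_tuple().digits)  # total number of digits
scale = -d.as_tuple().exponent        # digits right of the decimal point
print(precision, scale)  # 5 2
```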

What is truncate false in spark?

The following answer applies to a Spark Streaming application, but the option works the same way in DataFrame.show(): by setting "truncate" to false, you tell the output sink to display the full column contents instead of cutting long values short.
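For intuition, Spark's default display cuts long cell values to about 20 characters and marks the cut with "...", and truncate=false skips that step. A rough mimic in plain Python (the `truncate_cell` name and the exact cutoff arithmetic are my assumptions, not copied from Spark's source):

```python
def truncate_cell(value: str, truncate: int = 20) -> str:
    # truncate <= 0 behaves like truncate=false: show the full value
    if truncate <= 0 or len(value) <= truncate:
        return value
    # Otherwise cut to the limit, reserving room for the "..." marker
    return value[: truncate - 3] + "..."

print(truncate_cell("short"))              # short
print(truncate_cell("a" * 30, truncate=0)) # full 30-character value
```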


1 Answer

I think the error is pretty self-explanatory: you need to be using a DecimalType, not a DoubleType.

Try this:

...
.cast(DecimalType(6)))  // precision 6, scale 0 (the default scale)

Read on:

https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/types/DecimalType.html

http://spark.apache.org/docs/2.0.2/api/python/_modules/pyspark/sql/types.html

datatype for handling big numbers in pyspark

answered Oct 05 '22 by rawkintrevo