I have a large DataFrame made up of ~550 columns of doubles and two columns of longs (ids). The 550 columns are read in from a csv, and I add the two id columns. The only other things I do with the data are to change some of the csv data from strings to doubles ("Inf" -> "0", then cast the column to double) and to replace NaNs with 0:
df = df.withColumn(col.name + "temp",
  regexp_replace(
    regexp_replace(df(col.name), "Inf", "0"),
    "NaN", "0").cast(DoubleType))
df = df.drop(col.name).withColumnRenamed(col.name + "temp", col.name)
df = df.withColumn("timeId", monotonically_increasing_id.cast(LongType))
df = df.withColumn("patId", lit(num).cast(LongType))
df = df.na.fill(0)
When I do a count, I get the following error:
IllegalArgumentException: requirement failed: Decimal precision 6 exceeds max precision 5
There are hundreds of thousands of rows, and I'm reading in the data from multiple csvs. How do I increase the decimal precision? Is there something else that could be going on? I am only getting this error when I read in some of the csvs. Could they have more decimals than the others?
Class DecimalType: A Decimal that must have fixed precision (the maximum number of digits) and scale (the number of digits on the right side of the dot). The precision can be up to 38, and the scale can also be up to 38 (less than or equal to the precision). The default precision and scale is (10, 0).
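To make precision and scale concrete, here is a minimal Scala sketch (the column name and values are made up for illustration): decimal(7,2) allows at most 7 digits in total, 2 of them after the dot, and a value needing more simply does not fit.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val spark = SparkSession.builder.master("local[*]").appName("decimal-demo").getOrCreate()
import spark.implicits._

// decimal(7,2): up to 7 digits total, 2 of them after the decimal point.
val demo = Seq("12345.67", "123456.78").toDF("raw")
  .withColumn("amount", col("raw").cast(DecimalType(7, 2)))
demo.show()
// 12345.67 fits; 123456.78 would need precision 8, so the cast yields null
// (rather than throwing) when ANSI mode is off, the default in older Spark versions.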
You can use format_number to format a number to the desired decimal places, as stated in the official API documentation: Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.
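A minimal sketch of that, assuming `spark` is in scope (as in spark-shell); the literal value is made up:
import org.apache.spark.sql.functions.{format_number, lit}

// Round to 2 decimal places with grouping separators, e.g. 1234567.891 -> "1,234,567.89".
// The result is a StringType column, so it is for display, not arithmetic.
val pretty = spark.range(1).select(format_number(lit(1234567.891), 2).as("amount_fmt"))
pretty.show(truncate = false)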
Precision is the number of digits in a number. Scale is the number of digits to the right of the decimal point in a number. For example, the number 123.45 has a precision of 5 and a scale of 2. In SQL Server, the default maximum precision of numeric and decimal data types is 38.
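Sticking with the 123.45 example, a short Scala sketch (again assuming `spark` is in scope, as in spark-shell):
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DecimalType

// 123.45: precision 5 (total digits), scale 2 (digits after the dot).
spark.range(1).select(
  lit("123.45").cast(DecimalType(5, 2)).as("fits"),       // 123.45
  lit("123.45").cast(DecimalType(4, 2)).as("overflows")   // null: (4,2) leaves room for only 2 integer digits
).show()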
This answer applies to a Spark Structured Streaming application: by setting the console sink's "truncate" option to false, you can tell the output sink to display full column values instead of cutting them off.
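A minimal sketch of that option, assuming `streamingDf` is an existing streaming DataFrame:
// Print full cell contents instead of truncating them to 20 characters.
val query = streamingDf.writeStream
  .format("console")
  .option("truncate", "false")
  .start()
query.awaitTermination()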
I think the error is pretty self-explanatory: you need to be using a DecimalType, not a DoubleType.
Try this:
...
.cast(DecimalType(6, 0)))
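Dropped into the question's cleaning loop, the suggestion would look roughly like this (a sketch; whether precision 6 and scale 0 actually match your data is for you to verify):
df = df.withColumn(col.name + "temp",
  regexp_replace(
    regexp_replace(df(col.name), "Inf", "0"),
    "NaN", "0").cast(DecimalType(6, 0)))
df = df.drop(col.name).withColumnRenamed(col.name + "temp", col.name)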
Read on:
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/types/DecimalType.html
http://spark.apache.org/docs/2.0.2/api/python/_modules/pyspark/sql/types.html
datatype for handling big numbers in pyspark