
PySpark .toPandas() results in object column where a numeric one is expected

I extract data from our data warehouse, store it in a Parquet file, and load all the Parquet files into a Spark DataFrame. So far so good. However, when I try to plot this using the pandas .plot() function, it throws a "TypeError: Empty 'DataFrame': no numeric data to plot".

So I started investigating backwards from my source, and I think the cast to decimal in my initial SQL statement is one of the issues. But I have no clue how to fix this. I thought a fillna(0) would do the trick, but it doesn't.

STEP 1: Define the SQL statement to extract the data

mpr_sql = """
select 
CAST(DATE_KEY  AS INTEGER) AS DATE_KEY ,
CAST(AMD  AS INTEGER) AS AMD ,
CAST(AMD_2  AS DECIMAL(12,2)) AS AMD_2 ,
CAST(AMD_3  AS DECIMAL(12,2)) AS AMD_3 ,
CAST(AMD_4  AS DECIMAL(12,2)) AS AMD_4 ,
CAST(AMD_0  AS DECIMAL(12,2)) AS AMD_0 
"""

STEP 2: Create a Spark DataFrame from the extracted data

df1 = sqlContext.load(source="jdbc",
                      driver="com.teradata.jdbc.TeraDriver",
                      url=db_url,
                      user=db_user,
                      TMODE="TERA",
                      password=db_pwd,
                      dbtable="(" + mpr_sql + ") a")

STEP 3: Store the Spark DataFrame in a Parquet file with 10 partitions

df1.coalesce(10).write.parquet("./mpr"+month+"sorted.parquet")
df = sqlContext.read.parquet('./mpr*sorted.parquet')

STEP 4: Look at the Spark DataFrame schema (it shows decimal(12,2))

df.printSchema()
root
 |-- DATE_KEY: integer (nullable = true)
 |-- AMD: integer (nullable = true)
 |-- AMD_2: decimal(12,2) (nullable = true)
 |-- AMD_3: decimal(12,2) (nullable = true)
 |-- AMD_4: decimal(12,2) (nullable = true)
 |-- AMD_0: decimal(12,2) (nullable = true)

STEP 5: Convert the Spark DataFrame into a pandas DataFrame and replace any nulls with 0 (via fillna(0))

pdf=df.fillna(0).toPandas()
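Note: fillna(0) only replaces null values; it leaves the column types untouched, so the decimal columns stay decimal. A quick check (assuming the df from step 3):

# Illustration: fillna(0) replaces nulls but the schema is unchanged,
# so AMD_4 is still decimal(12,2) and will become object in pandas.
df.fillna(0).printSchema()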

STEP 6: Look at the pandas DataFrame info for the relevant columns. AMD is correct (integer), but AMD_4 is of type object, where I expected a double or float or something like that (sorry, I always forget the right type). And since AMD_4 is a non-numeric type, I cannot use it for plotting.

pdf[['AMD','AMD_4']].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 20140101 to 20150801
Data columns (total 2 columns):
AMD         20 non-null int64
AMD_4       20 non-null object
dtypes: int64(1), object(1)
memory usage: 480.0+ bytes

So my questions are:

  1. Why is AMD_4 (and the other AMD_x columns not shown here) of type object, while AMD is of type int64?
  2. Or, in other words, how can I convert the AMD_x columns to a float/double/decimal kind of type?
asked Nov 02 '15 by Geoffrey Stoel



2 Answers

I had the same problem, and then I figured out the reason.

During the conversion, data types are mapped: int/long -> int64, double -> float64, string -> object. Any unknown data type is converted to the object type.

A pandas DataFrame has no decimal data type, so all columns of decimal type are converted to object.
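You can see this with plain pandas, outside of Spark (a minimal illustration with made-up values, not the asker's data):

import decimal
import pandas as pd

# decimal.Decimal values are stored as generic Python objects,
# so pandas reports the column dtype as 'object'.
pdf = pd.DataFrame({'AMD_4': [decimal.Decimal('1.50'), decimal.Decimal('2.25')]})
print(pdf.dtypes)  # AMD_4    object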

If you convert all decimal columns to double before applying toPandas(), all your numerical data will be ready to use.

from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

# Cast the decimal columns to double before converting to pandas
df = (df.withColumn('AMD_4', col('AMD_4').cast(DoubleType()))
        .withColumn('AMD_2', col('AMD_2').cast(DoubleType())))
pdf = df.toPandas()

In pdf, the AMD_4 and AMD_2 columns will now be numeric (float64).
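If there are many decimal columns, you can also cast them all in one pass by reading the names from the schema (a sketch along the same lines, not from the original answer):

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, DoubleType

# Find every decimal column via the schema and cast each one to double.
decimal_cols = [f.name for f in df.schema.fields
                if isinstance(f.dataType, DecimalType)]
for c in decimal_cols:
    df = df.withColumn(c, col(c).cast(DoubleType()))
pdf = df.toPandas()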

answered Nov 14 '22 by Gary Liu


First, check pdf.isnull().sum():

1. It should be all zeros. If for some reason some column count returns na or NaN, you can always use pandas fillna():

pdf = df.fillna(0).toPandas()
pdf = pdf.fillna(0)

or

pdf=df.toPandas().fillna(0)

2. If all were zeros, then check where the type mismatch is with:

pdf.applymap(lambda x: isinstance(x, (int, float)))  

And correct it, for example as sketched below.
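One possible way to act on that mask (a sketch, assuming pdf comes from the steps above): find the columns containing non-numeric values and coerce them with pd.to_numeric.

import pandas as pd

# Columns where at least one cell is not an int/float
mask = pdf.applymap(lambda x: isinstance(x, (int, float)))
bad_cols = mask.columns[~mask.all()]

# Coerce those columns to numeric; unparseable values become NaN
pdf[bad_cols] = pdf[bad_cols].apply(pd.to_numeric, errors='coerce')
pdf = pdf.fillna(0)  # refill anything the coercion turned into NaN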

answered Nov 15 '22 by WoodChopper