I extract data from our data warehouse, store it in a Parquet file and load all the Parquet files into a Spark DataFrame. So far so good. However, when I try to plot it using the pandas.plot() function, it throws a "TypeError: Empty 'DataFrame': no numeric data to plot".
So I started investigating backwards from my source, and I think the cast to decimal in my initial SQL statement is one of the issues. But I have no clue how to fix it. I thought a fillna(0) would do the trick, but it doesn't.
STEP 1: Define the SQL statement to extract the data
mpr_sql = """
select
CAST(DATE_KEY AS INTEGER) AS DATE_KEY ,
CAST(AMD AS INTEGER) AS AMD ,
CAST(AMD_2 AS DECIMAL(12,2)) AS AMD_2 ,
CAST(AMD_3 AS DECIMAL(12,2)) AS AMD_3 ,
CAST(AMD_4 AS DECIMAL(12,2)) AS AMD_4 ,
CAST(AMD_0 AS DECIMAL(12,2)) AS AMD_0
"""
STEP 2: Create a spark dataframe from the extracted data
df1 = sqlContext.load(source="jdbc",
                      driver="com.teradata.jdbc.TeraDriver",
                      url=db_url,
                      user=db_user,
                      TMODE="TERA",
                      password=db_pwd,
                      dbtable="(" + mpr_sql + ") a")
STEP 3: Store the spark dataframe in a parquet file with 10 partitions
df1.coalesce(10).write.parquet("./mpr"+month+"sorted.parquet")
df = sqlContext.read.parquet('./mpr*sorted.parquet')
STEP 4: look at the spark dataframe schema (it shows decimal(12,2))
df.printSchema()
root
|-- DATE_KEY: integer (nullable = true)
|-- AMD: integer (nullable = true)
|-- AMD_2: decimal(12,2) (nullable = true)
|-- AMD_3: decimal(12,2) (nullable = true)
|-- AMD_4: decimal(12,2) (nullable = true)
|-- AMD_0: decimal(12,2) (nullable = true)
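As a side note, the decimal columns can also be listed programmatically by inspecting the schema (a minimal sketch, not part of the original steps):

from pyspark.sql.types import DecimalType

# List the columns whose Spark type is DecimalType
decimal_cols = [f.name for f in df.schema.fields
                if isinstance(f.dataType, DecimalType)]
print(decimal_cols)  # ['AMD_2', 'AMD_3', 'AMD_4', 'AMD_0']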
STEP 5: convert the spark dataframe into a pandas dataframe and replace any nulls with 0 (using fillna(0))
pdf=df.fillna(0).toPandas()
STEP 6: look at the pandas dataframe info for the relevant columns. AMD is correct (int64), but AMD_4 is of type object, where I expected a double or float or something like that (sorry, I always forget the right type). And since AMD_4 is a non-numeric type, I cannot plot it.
pdf[['AMD','AMD_4']].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 20140101 to 20150801
Data columns (total 2 columns):
AMD 20 non-null int64
AMD_4 20 non-null object
dtypes: int64(1), object(1)
memory usage: 480.0+ bytes
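Digging further, the same object dtype can be reproduced in plain pandas without Spark (a minimal sketch): pandas stores Python decimal.Decimal values with dtype object.

import pandas as pd
from decimal import Decimal

# pandas has no decimal dtype, so Decimal values land in an 'object' column
s = pd.Series([Decimal('1.10'), Decimal('2.20')])
print(s.dtype)  # object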
So my questions are: why are the decimal(12,2) columns converted to object in the pandas DataFrame instead of a numeric type, and how can I fix this so the data can be plotted?
I had the same problem, and then I figured out the reason.
During the conversion, data types are coerced: int/long -> int64, double -> float64, string -> object. Any data type without a pandas equivalent is converted to object.
A pandas DataFrame has no decimal data type, so all columns of decimal type are converted to object.
If you convert all decimal columns to double type before applying toPandas(), you will have all numerical data ready to use.
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

df = df.withColumn('AMD_4', col('AMD_4').cast(DoubleType())) \
       .withColumn('AMD_2', col('AMD_2').cast(DoubleType()))
pdf = df.toPandas()
In the resulting pdf, AMD_4 and AMD_2 will now be numeric.
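If you have several decimal columns, a generic variant of the same fix (a sketch assuming the schema shown above) casts every DecimalType column in one pass:

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, DoubleType

# Cast every decimal column to double so toPandas() produces float64 columns
for f in df.schema.fields:
    if isinstance(f.dataType, DecimalType):
        df = df.withColumn(f.name, col(f.name).cast(DoubleType()))

pdf = df.toPandas()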
First check pdf.isnull().sum():

1. It should be all zeros. If for some reason a column count returns NA or NaN, you can always use pandas' fillna():

pdf = df.fillna(0).toPandas()
pdf = pdf.fillna(0)

or

pdf = df.toPandas().fillna(0)

2. If all counts were zero, then check where the type mismatch is (see the sketch below) with:

pdf.applymap(lambda x: isinstance(x, (int, float)))

and correct it.
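One way to correct it on the pandas side (a sketch using the column names from the question): float() accepts decimal.Decimal, so astype(float) turns the object column into float64.

# Convert the object column of Decimals to float64, then plot
pdf['AMD_4'] = pdf['AMD_4'].astype(float)
pdf[['AMD', 'AMD_4']].plot()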