
Pandas dataframe in pyspark to hive

How to send a pandas dataframe to a hive table?

I know if I have a spark dataframe, I can register it to a temporary table using

df.registerTempTable("table_name")
sqlContext.sql("create table table_name2 as select * from table_name")

but when I try to call registerTempTable on the pandas DataFrame, I get the below error:

AttributeError: 'DataFrame' object has no attribute 'registerTempTable'

Is there a way for me to register a pandas DataFrame as a temp table, or to convert it to a Spark DataFrame and register that, so that I can send it back to Hive?

asked Apr 28 '16 by thenakulchawla

2 Answers

I guess you are trying to use a pandas DataFrame instead of a Spark DataFrame.

A pandas DataFrame has no registerTempTable method.

You can create a Spark DataFrame from the pandas one instead.

UPDATE:

I've tested it under Cloudera (with the Anaconda parcel installed, which includes the pandas module).

Make sure that you have set PYSPARK_PYTHON to your Anaconda Python installation (or another one containing the pandas module) on all your Spark workers (usually in spark-conf/spark-env.sh).
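For example, a spark-env.sh entry might look like this (the Anaconda parcel path below is an assumption; point it at whichever Python installation actually has pandas on your workers):

```shell
# spark-conf/spark-env.sh on each worker
# Path is illustrative -- adjust to your environment
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python
```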

Here is result of my test:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=list('ABC'))
>>> sdf = sqlContext.createDataFrame(df)
>>> sdf.show()
+---+---+---+
|  A|  B|  C|
+---+---+---+
| 98| 33| 75|
| 91| 57| 80|
| 20| 87| 85|
| 20| 61| 37|
| 96| 64| 60|
| 79| 45| 82|
| 82| 16| 22|
| 77| 34| 65|
| 74| 18| 17|
| 71| 57| 60|
+---+---+---+

>>> sdf.printSchema()
root
 |-- A: long (nullable = true)
 |-- B: long (nullable = true)
 |-- C: long (nullable = true)
answered Oct 24 '22 by MaxU - stop WAR against UA

First you need to convert the pandas DataFrame to a Spark DataFrame:

from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
df = hive_context.createDataFrame(pd_df)

Then you can create a temp table, which is held in memory:

df.registerTempTable('tmp')

Now you can use Hive QL to save the data into Hive:

hive_context.sql("""insert overwrite table target partition(p='p') select a, b from tmp""")

Note: hive_context must stay the same object throughout; the temp table is registered on that specific context.

answered Oct 24 '22 by Ming.Xu