
How to convert a Spark SQL DataFrame to a NumPy array?

I'm using PySpark and have imported a Hive table into a DataFrame.

df = sqlContext.sql("from hive_table select *") 

I need help converting this df to a NumPy array. You may assume hive_table has only one column.

Can you suggest how to do this? Thank you in advance.

user2763088 asked May 16 '26 22:05


1 Answer

You can:

sqlContext.range(0, 10).toPandas().values  # .reshape(-1) for 1d array
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])

but it is unlikely that you really want to. The resulting array is local to the driver node, so it is rarely useful. If you are looking for some kind of distributed array-like data structure, there are a number of possible choices for Apache Spark:

  • pyspark.mllib.linalg.distributed which provides a number of distributed matrix classes.
  • sparkit-learn ArrayRDD.

and independent of Apache Spark:

  • Dask dask.array.
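
For the one-column case in the question, the `.reshape(-1)` hint in the snippet above can be sketched as follows. This is a minimal illustration that assumes only NumPy: `column_2d` is a hypothetical stand-in, built by hand, for the (10, 1) array that the `toPandas().values` call returns, so the sketch runs without a Spark session.

```python
import numpy as np

# Hypothetical stand-in for the (10, 1) array shown in the answer;
# in real use this would come from df.toPandas().values on the driver.
column_2d = np.arange(10).reshape(-1, 1)

# reshape(-1) lets NumPy infer the length and yields a 1-D array.
flat = column_2d.reshape(-1)

print(column_2d.shape)  # (10, 1)
print(flat.shape)       # (10,)
```

The same caveat applies as in the answer: the whole column has to fit in driver memory before it can become a local array.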
zero323 answered May 18 '26 10:05


