extracting numpy array from Pyspark Dataframe

Tags:

I have a dataframe gi_man_df where group can be n:

+------------------+-----------------+--------+--------------+
|           group  |           number|rand_int|   rand_double|
+------------------+-----------------+--------+--------------+
|          'GI_MAN'|                7|       3|         124.2|
|          'GI_MAN'|                7|      10|        121.15|
|          'GI_MAN'|                7|      11|         129.0|
|          'GI_MAN'|                7|      12|         125.0|
|          'GI_MAN'|                7|      13|         125.0|
|          'GI_MAN'|                7|      21|         127.0|
|          'GI_MAN'|                7|      22|         126.0|
+------------------+-----------------+--------+--------------+

and I am expecting a numpy nd_array i.e, gi_man_array:

[[[124.2],[121.15],[129.0],[125.0],[125.0],[127.0],[126.0]]]

where rand_double values after applying pivot.

I tried the following 2 approaches:
FIRST: I pivot the gi_man_df as follows:

gi_man_pivot = gi_man_df.groupBy("number").pivot('rand_int').sum("rand_double")

and the output I got is:

Row(number=7, group=u'GI_MAN', 3=124.2, 10=121.15, 11=129.0, 12=125.0, 13=125.0, 21=127.0, 23=126.0)

but here the problem is to get the desired output, I can't convert it to matrix then convert again to numpy array.

SECOND: I created the vector in the dataframe itself using:

assembler = VectorAssembler(inputCols=["rand_double"],outputCol="rand_double_vector")

gi_man_vector = assembler.transform(gi_man_df)
gi_man_vector.show(7)

and I got the following output:

+----------------+-----------------+--------+--------------+--------------+
|           group|           number|rand_int|   rand_double| rand_dbl_Vect|
+----------------+-----------------+--------+--------------+--------------+
|          GI_MAN|                7|       3|         124.2|       [124.2]|
|          GI_MAN|                7|      10|        121.15|      [121.15]|
|          GI_MAN|                7|      11|         129.0|       [129.0]|
|          GI_MAN|                7|      12|         125.0|       [125.0]|
|          GI_MAN|                7|      13|         125.0|       [125.0]|
|          GI_MAN|                7|      21|         127.0|       [127.0]|
|          GI_MAN|                7|      22|         126.0|       [126.0]|
+----------------+-----------------+--------+--------------+--------------+

but problem here is I can't pivot it on rand_dbl_Vect.

So my question is:
1. Is any of the 2 approaches is correct way of achieving the desired output, if so then how can I proceed further to get the desired result?
2. What other way I can proceed with so the code is optimal and performance is good?

338

asked Feb 08 '17 14:02

Uday Shankar Singh

1 Answers

This

import numpy as np
np.array(gi_man_df.select('rand_double').collect())

produces

array([[ 124.2 ],
       [ 121.15],
       .........])

194

answered Sep 27 '22 16:09

data_steve

Related questions
                            
                                Evaluate all pair combinations of rows of two tensors in tensorflow
                            
                                How to load an image and show the image using keras?
                            
                                Can i set float128 as the standard float-array in numpy
                            
                                numpy.shape gives inconsistent responses - why?
                            
                                Why does numpy.r_ use brackets instead of parentheses?
                            
                                Precision of numpy array lost after tolist
                            
                                Creating same random number sequence in Python, NumPy and R
                            
                                np.isnan on arrays of dtype "object"
                            
                                numpy 1D array: mask elements that repeat more than n times
                            
                                matplotlib show() doesn't work twice
                            
                                How to import NumPy in the Python shell
                            
                                Avoid overflow when adding numpy arrays
                            
                                Finding the index of a numpy array in a list
                            
                                Build a basic cube with numpy?
                            
                                NumPy types with underscore: `int_`, `float_`, etc
                            
                                How to eliminate the extra minus sign when rounding negative numbers towards zero in numpy?
                            
                                Flatten numpy array with sub-arrays of different dimensions
                            
                                Numpy, BLAS and CUBLAS
                            
                                Is it possible to np.concatenate memory-mapped files?
                            
                                geodesic distance transform in python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

extracting numpy array from Pyspark Dataframe

Tags:

numpy

apache-spark

pyspark

spark-dataframe

apache-spark-mllib

Uday Shankar Singh

People also ask

1 Answers

data_steve

Recent Activity

Donate For Us