I have a dataframe gi_man_df where group can be n:
+------------------+-----------------+--------+--------------+
| group | number|rand_int| rand_double|
+------------------+-----------------+--------+--------------+
| 'GI_MAN'| 7| 3| 124.2|
| 'GI_MAN'| 7| 10| 121.15|
| 'GI_MAN'| 7| 11| 129.0|
| 'GI_MAN'| 7| 12| 125.0|
| 'GI_MAN'| 7| 13| 125.0|
| 'GI_MAN'| 7| 21| 127.0|
| 'GI_MAN'| 7| 22| 126.0|
+------------------+-----------------+--------+--------------+
and I am expecting a numpy nd_array i.e, gi_man_array:
[[[124.2],[121.15],[129.0],[125.0],[125.0],[127.0],[126.0]]]
where rand_double values after applying pivot.
I tried the following 2 approaches:
FIRST: I pivot the gi_man_df as follows:
gi_man_pivot = gi_man_df.groupBy("number").pivot('rand_int').sum("rand_double")
and the output I got is:
Row(number=7, group=u'GI_MAN', 3=124.2, 10=121.15, 11=129.0, 12=125.0, 13=125.0, 21=127.0, 23=126.0)
but here the problem is to get the desired output, I can't convert it to matrix then convert again to numpy array.
SECOND: I created the vector in the dataframe itself using:
assembler = VectorAssembler(inputCols=["rand_double"],outputCol="rand_double_vector")
gi_man_vector = assembler.transform(gi_man_df)
gi_man_vector.show(7)
and I got the following output:
+----------------+-----------------+--------+--------------+--------------+
| group| number|rand_int| rand_double| rand_dbl_Vect|
+----------------+-----------------+--------+--------------+--------------+
| GI_MAN| 7| 3| 124.2| [124.2]|
| GI_MAN| 7| 10| 121.15| [121.15]|
| GI_MAN| 7| 11| 129.0| [129.0]|
| GI_MAN| 7| 12| 125.0| [125.0]|
| GI_MAN| 7| 13| 125.0| [125.0]|
| GI_MAN| 7| 21| 127.0| [127.0]|
| GI_MAN| 7| 22| 126.0| [126.0]|
+----------------+-----------------+--------+--------------+--------------+
but problem here is I can't pivot it on rand_dbl_Vect.
So my question is:
1. Is any of the 2 approaches is correct way of achieving the desired output, if so then how can I proceed further to get the desired result?
2. What other way I can proceed with so the code is optimal and performance is good?
You can convert pandas DataFrame to Numpy array by using to_numpy() , to_records() , index() , and values() methods.
pyspark.sql.functions. explode (col)[source] Returns a new row for each element in the given array or map. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise.
This
import numpy as np
np.array(gi_man_df.select('rand_double').collect())
produces
array([[ 124.2 ],
[ 121.15],
.........])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With