I have a numpy matrix:
arr = np.array([[2, 3], [2, 8], [2, 3], [4, 5]])
I need to create a PySpark DataFrame from arr. I cannot enter the values manually, because the length and values of arr change dynamically, so I need to convert arr into a DataFrame.
I tried the following code, without success:
df= sqlContext.createDataFrame(arr,["A", "B"])
However, I get the following error:
TypeError: Can not infer schema for type: <type 'numpy.ndarray'>
import numpy as np

# sample data
arr = np.array([[2, 3], [2, 8], [2, 3], [4, 5]])

# distribute the rows, then cast each numpy.int64 to a plain Python int
# so that Spark's schema inference can handle the values
rdd1 = sc.parallelize(arr)
rdd2 = rdd1.map(lambda x: [int(i) for i in x])

df = rdd2.toDF(["A", "B"])
df.show()
Output is:
+---+---+
| A| B|
+---+---+
| 2| 3|
| 2| 8|
| 2| 3|
| 4| 5|
+---+---+
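
The cast inside the map is the key step: Spark's schema inference (at least in older versions) does not recognize numpy scalar types such as numpy.int64, which is why each value is converted to a plain Python int first. If you would rather not rely on inference at all, rdd2.toDF also accepts an explicit schema. A minimal sketch, reusing rdd2 from above:

from pyspark.sql.types import StructType, StructField, LongType

# explicit schema: Spark skips the sampling pass it would
# otherwise use to infer the column types
schema = StructType([
    StructField("A", LongType(), True),
    StructField("B", LongType(), True),
])
df = rdd2.toDF(schema)
df.show()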
No need to use the RDD API. Simply:
mat = np.random.random((10,3))
cols = ["ColA","ColB","ColC"]
df = spark.createDataFrame(mat.tolist(), cols)
df.show()
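
The tolist() call does the same job as the int cast above: it turns the numpy scalars into native Python floats that schema inference understands. If pandas is available, another route is to pass a pandas DataFrame straight to spark.createDataFrame, which accepts one directly. A minimal sketch under that assumption:

import numpy as np
import pandas as pd

mat = np.random.random((10, 3))
cols = ["ColA", "ColB", "ColC"]

# createDataFrame maps the pandas dtypes (float64 here) to
# the corresponding Spark types (DoubleType)
df = spark.createDataFrame(pd.DataFrame(mat, columns=cols))
df.show()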