PySpark - Create DataFrame from Numpy Matrix

I have a numpy matrix:

arr = np.array([[2,3], [2,8], [2,3],[4,5]])

I need to create a PySpark DataFrame from arr. I cannot enter the values manually because the length and values of arr change dynamically, so I need to convert arr into a DataFrame.

I tried the following code, without success.

df= sqlContext.createDataFrame(arr,["A", "B"])

However, I get the following error.

TypeError: Can not infer schema for type: <type 'numpy.ndarray'>
Bryce Ramgovind asked Jan 11 '18 12:01

2 Answers

import numpy as np

# sample data
arr = np.array([[2, 3], [2, 8], [2, 3], [4, 5]])

# sc is an existing SparkContext; cast each numpy.int64 element to a
# plain Python int so Spark can infer the schema
rdd1 = sc.parallelize(arr)
rdd2 = rdd1.map(lambda x: [int(i) for i in x])
df = rdd2.toDF(["A", "B"])
df.show()

Output is:

+---+---+
|  A|  B|
+---+---+
|  2|  3|
|  2|  8|
|  2|  3|
|  4|  5|
+---+---+
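The `int(i)` cast is the key step: Spark cannot infer a schema from `numpy.int64` values, which is what triggers the `TypeError` in the question. A sketch of the same conversion going through pandas instead of an RDD (assuming pandas is installed alongside PySpark; `spark.createDataFrame` accepts a pandas DataFrame directly):

```python
import numpy as np
import pandas as pd

arr = np.array([[2, 3], [2, 8], [2, 3], [4, 5]])

# pandas handles the numpy dtypes, and Spark can build a
# DataFrame straight from a pandas DataFrame
pdf = pd.DataFrame(arr, columns=["A", "B"])
# df = spark.createDataFrame(pdf)  # run inside an active Spark session
```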
1.618 answered Nov 15 '22 04:11


No need to use the RDD API. Simply:

mat = np.random.random((10,3))
cols = ["ColA","ColB","ColC"]
df = spark.createDataFrame(mat.tolist(), cols)
df.show()
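The `.tolist()` call is what makes this work: it converts the numpy values to plain Python floats, which Spark can infer as DoubleType. If you want to skip inference altogether, you can also pass an explicit schema; a sketch, assuming the standard `pyspark.sql.types` API:

```python
import numpy as np

mat = np.random.random((10, 3))
rows = mat.tolist()  # nested lists of plain Python floats

# Inside a Spark session, an explicit schema avoids type inference:
# from pyspark.sql.types import StructType, StructField, DoubleType
# schema = StructType([StructField(c, DoubleType(), True)
#                      for c in ["ColA", "ColB", "ColC"]])
# df = spark.createDataFrame(rows, schema)
```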
Azmisov answered Nov 15 '22 06:11