I have a big numpy array. Its shape is (800, 224, 224, 3), which means there are 800 images (224 × 224) with 3 channels. For distributed deep learning in Spark, I want to convert this numpy array to a Spark DataFrame.
My method is to use VectorAssembler to create a vector of all columns (features), i.e. to go from this:
+------+------+
|col_1 | col_2|
+------+------+
|0.1434|0.1434|
|0.1434|0.1451|
|0.1434|0.1467|
|0.3046|0.3046|
|0.3046|0.3304|
|0.3249|0.3046|
|0.3249|0.3304|
|0.3258|0.3258|
|0.3258|0.3263|
|0.3258|0.3307|
+------+------+
to this:
+-------------+
| feature |
+-------------+
|0.1434,0.1434|
|0.1434,0.1451|
|0.1434,0.1467|
|0.3046,0.3046|
|0.3046,0.3304|
|0.3249,0.3046|
|0.3249,0.3304|
|0.3258,0.3258|
|0.3258,0.3263|
|0.3258,0.3307|
+-------------+
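For reference, a minimal sketch of that VectorAssembler step (the column names here are placeholders; the real DataFrame would have one column per pixel value):

from pyspark.ml.feature import VectorAssembler

# Placeholder input columns; in practice there would be 224*224*3 of them
assembler = VectorAssembler(inputCols=["col_1", "col_2"], outputCol="feature")
vector_df = assembler.transform(df).select("feature")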
But the number of columns is really large (224 × 224 × 3 = 150,528 per image), so listing them all is impractical...
I also tried to convert the numpy array to an RDD directly, but I got an 'out of memory' error. On a single machine, my job works fine with this numpy array.
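For context, the direct RDD attempt looked roughly like this:

# Roughly what the direct conversion looked like; parallelize() ships the
# whole array through the driver, which is where it ran out of memory
rdd = spark.sparkContext.parallelize(numpy_arr.reshape(800, -1))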
You should be able to convert the numpy array directly to a Spark DataFrame, without going through a CSV file. You could try something like the code below:
from pyspark.ml.linalg import Vectors

num_rows = 800
# Flatten each (224, 224, 3) image into one 150,528-element dense vector,
# wrapped in a one-element tuple so each row has a single "features" column
arr = [(Vectors.dense(x),) for x in numpy_arr.reshape(num_rows, -1)]
df = spark.createDataFrame(arr, ["features"])
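As a quick sanity check (assuming the conversion succeeded), you can verify the schema and row count:

df.printSchema()   # should show: features: vector (nullable = true)
print(df.count())  # expected: 800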