Big numpy array to spark dataframe

I have a big numpy array. Its shape is (800, 224, 224, 3), i.e. 800 images of size 224 * 224 with 3 channels. For distributed deep learning in Spark, I want to convert the numpy array to a Spark DataFrame.

My method is:

  1. Converted the numpy array to a CSV file
  2. Loaded the CSV and created a Spark DataFrame with 150528 columns (224 * 224 * 3)
  3. Used VectorAssembler to collect all the columns into a single vector column (features), as sketched after the tables below
  4. Planned to reshape the output of step 3, but step 3 already failed because the computation was too heavy

In order to make a vector from this:

+------+------+
|col_1 | col_2|
+------+------+
|0.1434|0.1434|
|0.1434|0.1451|
|0.1434|0.1467|
|0.3046|0.3046|
|0.3046|0.3304|
|0.3249|0.3046|
|0.3249|0.3304|
|0.3258|0.3258|
|0.3258|0.3263|
|0.3258|0.3307|
+------+------+

to this:

+-------------+
|   feature   |
+-------------+
|0.1434,0.1434|
|0.1434,0.1451|
|0.1434,0.1467|
|0.3046,0.3046|
|0.3046,0.3304|
|0.3249,0.3046|
|0.3249,0.3304|
|0.3258,0.3258|
|0.3258,0.3263|
|0.3258,0.3307|
+-------------+

But the number of columns is really too large...
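For reference, step 3 looked roughly like this. It is only a sketch: 'df' stands for the DataFrame loaded from the CSV, and the exact column names are an assumption.

from pyspark.ml.feature import VectorAssembler

# Sketch of step 3. Assumes 'df' is the DataFrame loaded from the CSV,
# with one numeric column per pixel value (150528 columns in total).
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
features_df = assembler.transform(df).select("features")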

I also tried to convert the numpy array to an RDD directly (sketched below), but I got an 'out of memory' error. On a single machine, my job works fine with this numpy array.
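Roughly, that direct attempt was the following (a sketch, assuming 'sc' is the active SparkContext):

# Sketch of the direct conversion that ran out of memory. Iterating
# numpy_arr yields one (224, 224, 3) image per element, and all of them
# are serialized from the driver at once.
rdd = sc.parallelize(numpy_arr)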

asked Jan 29 '23 by 주은혜

1 Answer

You should be able to convert the numpy array directly to a Spark DataFrame, without going through a CSV file. You could try something like the code below:

from pyspark.ml.linalg import Vectors

num_rows = 800
# Flatten each (224, 224, 3) image into a 150528-element dense vector and
# wrap it in a 1-tuple so that each row has a single 'features' column.
# 'spark' is assumed to be an existing SparkSession.
arr = [(Vectors.dense(x),) for x in numpy_arr.reshape(num_rows, -1)]
df = spark.createDataFrame(arr, ["features"])
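You can then do a quick sanity check on the result (the expected values in the comments follow from the shapes above):

# Quick sanity check: one vector column, 800 rows of length-150528 vectors.
df.printSchema()                     # features: vector (nullable = true)
print(df.count())                    # expect 800
print(len(df.first()["features"]))   # expect 150528 (= 224 * 224 * 3)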
answered Jan 31 '23 by Shaido