
Efficiently load CSV coordinate format (COO) input into a local matrix in Spark

I want to convert CSV coordinate format (COO) data into a local matrix. Currently I first convert the data to a CoordinateMatrix and then convert that to a LocalMatrix. Is there a better way to do this?

Example data:

0,5,5.486978435
0,3,0.438472867
0,0,6.128832321
0,7,5.295923198
0,1,7.738270234

Code:

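import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Without a schema every CSV column is read as a string: row index, column index, value
val loadG = sqlContext.read.option("header", "false").csv("file.csv").rdd
  .map(r => MatrixEntry(r.getString(0).toLong, r.getString(1).toLong, r.getString(2).toDouble))
val G = new CoordinateMatrix(loadG)

val matrixG = G.toBlockMatrix().toLocalMatrix()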
Ardit Meti asked Jan 30 '18 15:01



1 Answer

A LocalMatrix is stored on a single machine and therefore does not make use of Spark's strengths. In other words, using Spark to build it is possible but somewhat wasteful.

The easiest way to get the CSV file to a LocalMatrix is to first read the CSV with Scala, not Spark:

import scala.io.Source

// Parse each "row,col,value" line into an (Int, Int, Double) tuple
val entries = Source.fromFile("data.csv").getLines()
  .map(_.split(","))
  .map(a => (a(0).toInt, a(1).toInt, a(2).toDouble))
  .toSeq
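The one-liner above leaves the file handle open and fails on blank lines or stray whitespace. A slightly more defensive variant could look like the sketch below. The helper name readCoo is hypothetical, and the sketch assumes every non-blank line has exactly three comma-separated fields.

```scala
import scala.io.Source

// Hypothetical helper: reads "row,col,value" lines and closes the file afterwards.
def readCoo(path: String): List[(Int, Int, Double)] = {
  val source = Source.fromFile(path)
  try {
    source.getLines()
      .map(_.trim)
      .filter(_.nonEmpty)   // skip blank lines
      .map(_.split(","))
      .map(a => (a(0).trim.toInt, a(1).trim.toInt, a(2).trim.toDouble))
      .toList               // force evaluation before the file is closed
  } finally {
    source.close()
  }
}
```

Calling readCoo("data.csv") would then give the same entries as above, already materialized as a List.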

The SparseMatrix variant of LocalMatrix has a method, fromCOO, for building a matrix from COO-formatted data. It needs the number of rows and columns. Because the matrix is sparse, trailing rows or columns that contain only zeros never appear in the data, so in most cases the dimensions should be specified by hand. If that is not possible, they can be estimated from the highest indices in the data as follows:

val numRows = entries.map(_._1).max + 1
val numCols = entries.map(_._2).max + 1
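To see why inferring the dimensions can go wrong, consider a small made-up case. The data below is hypothetical and stands for a matrix that is meant to be 3 x 4 but whose last row and last column hold no non-zero values:

```scala
// Hypothetical COO entries for a matrix intended to be 3 x 4,
// where row 2 and column 3 contain only zeros and so are never listed.
val sample = Seq((0, 0, 1.5), (1, 2, 2.5))

// Inferring the shape from the largest indices only sees the stored entries
val inferredRows = sample.map(_._1).max + 1
val inferredCols = sample.map(_._2).max + 1

println(s"inferred: $inferredRows x $inferredCols") // prints "inferred: 2 x 3"
```

The inferred shape is 2 x 3 instead of 3 x 4, which is why passing the known dimensions explicitly is usually the safer choice.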

Then create the matrix:

import org.apache.spark.mllib.linalg.SparseMatrix
val matrixG = SparseMatrix.fromCOO(numRows, numCols, entries)

The matrix is stored in CSC (compressed sparse column) format on the local machine. Printing the matrix built from the example input above yields the following output:

1 x 8 CSCMatrix
(0,0) 6.128832321
(0,1) 7.738270234
(0,3) 0.438472867
(0,5) 5.486978435
(0,7) 5.295923198
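For readers unfamiliar with CSC, the following is a rough sketch of how COO triples map onto the three CSC arrays (column pointers, row indices, values). It is plain Scala and not Spark's implementation; the object CooToCsc is a hypothetical name, and the sketch assumes there are no duplicate coordinates in the input.

```scala
// Hypothetical helper, not part of Spark: converts COO triples to CSC arrays.
object CooToCsc {
  /** Returns (colPtrs, rowIndices, values) for a matrix with numCols columns. */
  def convert(numCols: Int, entries: Seq[(Int, Int, Double)]): (Array[Int], Array[Int], Array[Double]) = {
    // CSC orders entries by column first, then by row within each column
    val sorted = entries.sortBy { case (row, col, _) => (col, row) }
    // Count entries per column, then accumulate so that
    // colPtrs(j + 1) - colPtrs(j) is the number of stored entries in column j
    val colPtrs = new Array[Int](numCols + 1)
    sorted.foreach { case (_, col, _) => colPtrs(col + 1) += 1 }
    for (j <- 1 to numCols) colPtrs(j) += colPtrs(j - 1)
    (colPtrs, sorted.map(_._1).toArray, sorted.map(_._3).toArray)
  }
}

// The example data from the question
val cooEntries = Seq(
  (0, 5, 5.486978435), (0, 3, 0.438472867), (0, 0, 6.128832321),
  (0, 7, 5.295923198), (0, 1, 7.738270234))
val (colPtrs, rowIndices, values) = CooToCsc.convert(8, cooEntries)

println(colPtrs.mkString(","))    // prints "0,1,2,2,3,3,4,4,5"
println(rowIndices.mkString(",")) // prints "0,0,0,0,0"
```

The values come out in column order (6.128832321, 7.738270234, 0.438472867, 5.486978435, 5.295923198), which matches the order of the printed matrix above.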
Shaido answered Sep 25 '22 17:09
