Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Strongly typed access to csv in scala?

I would like to access csv files in scala in a strongly typed manner. For example, as I read each line of the csv, it is automatically parsed and represented as a tuple with the appropriate types. I could specify the types beforehand in some sort of schema that is passed to the parser. Are there any libraries that exist for doing this? If not, how could I go about implementing this functionality on my own?

like image 853
mushroom Avatar asked Jun 15 '13 17:06

mushroom


2 Answers

product-collections appears to be a good fit for your requirements:

scala> val data = CsvParser[String,Int,Double].parseFile("sample.csv")
data: com.github.marklister.collections.immutable.CollSeq3[String,Int,Double] = 
CollSeq((Jan,10,22.33),
        (Feb,20,44.2),
        (Mar,25,55.1))

product-collections uses opencsv under the hood.

A CollSeq3 is an IndexedSeq[Product3[T1,T2,T3]] and also a Product3[Seq[T1],Seq[T2],Seq[T3]] with a little sugar. I am the author of product-collections.

Here's a link to the io page of the scaladoc

Product3 is essentially a tuple of arity 3.

like image 57
Mark Lister Avatar answered Sep 28 '22 16:09

Mark Lister


If your content has double-quotes to enclose other double quotes, commas and newlines, I would definitely use a library like opencsv that deals properly with special characters. Typically you end up with Iterator[Array[String]]. Then you use Iterator.map or collect to transform each Array[String] into your tuples dealing with type conversions errors there. If you need to do process the input without loading all in memory, you then keep working with the iterator, otherwise you can convert to a Vector or List and close the input stream.

So it may look like this:

val reader = new CSVReader(new FileReader(filename))
val iter = reader.iterator()
val typed = iter collect {
  case Array(double, int, string) => (double.toDouble, int.toInt, string)
}
// do more work with typed
// close reader in a finally block

Depending on how you need to deal with errors, you can return Left for errors and Right for success tuples to separate the errors from the correct rows. Also, I sometimes wrap of all this using scala-arm for closing resources. So my data maybe wrapped into the resource.ManagedResource monad so that I can use input coming from multiple files.

Finally, although you want to work with tuples, I have found that it is usually clearer to have a case class that is appropriate for the problem and then write a method that creates that case class object from an Array[String].

like image 31
huynhjl Avatar answered Sep 28 '22 15:09

huynhjl