I would like to access csv files in scala in a strongly typed manner. For example, as I read each line of the csv, it is automatically parsed and represented as a tuple with the appropriate types. I could specify the types beforehand in some sort of schema that is passed to the parser. Are there any libraries that exist for doing this? If not, how could I go about implementing this functionality on my own?
product-collections appears to be a good fit for your requirements:
scala> val data = CsvParser[String,Int,Double].parseFile("sample.csv")
data: com.github.marklister.collections.immutable.CollSeq3[String,Int,Double] =
CollSeq((Jan,10,22.33),
(Feb,20,44.2),
(Mar,25,55.1))
product-collections uses opencsv under the hood.
A CollSeq3
is an IndexedSeq[Product3[T1,T2,T3]]
and also a Product3[Seq[T1],Seq[T2],Seq[T3]]
with a little sugar. I am the author of product-collections.
Here's a link to the io page of the scaladoc
Product3 is essentially a tuple of arity 3.
If your content has double-quotes to enclose other double quotes, commas and newlines, I would definitely use a library like opencsv that deals properly with special characters. Typically you end up with Iterator[Array[String]]
. Then you use Iterator.map
or collect
to transform each Array[String]
into your tuples dealing with type conversions errors there. If you need to do process the input without loading all in memory, you then keep working with the iterator, otherwise you can convert to a Vector
or List
and close the input stream.
So it may look like this:
val reader = new CSVReader(new FileReader(filename))
val iter = reader.iterator()
val typed = iter collect {
case Array(double, int, string) => (double.toDouble, int.toInt, string)
}
// do more work with typed
// close reader in a finally block
Depending on how you need to deal with errors, you can return Left
for errors and Right
for success tuples to separate the errors from the correct rows. Also, I sometimes wrap of all this using scala-arm for closing resources. So my data maybe wrapped into the resource.ManagedResource
monad so that I can use input coming from multiple files.
Finally, although you want to work with tuples, I have found that it is usually clearer to have a case class that is appropriate for the problem and then write a method that creates that case class object from an Array[String]
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With