Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's a simple (Scala only) way to read in and then write out a small .csv file passing through a List[List[String]]?

Tags:

parsing

csv

scala

I have just received a bunch of messy data files in CSV (Comma Separated Value) format. I need to do some normal clean up, validation and filtering work on the data set. I will be doing the clean up in Scala (2.11.7).

In my search to find a solution for both directions, input parsing and output composing, mostly I found lots of ill informed tangents, including one from the "Scala Cookbook", on the input parsing side. And most of those focused on the terribly erroneous solution "use String.split(",")" to get a CSV line back as a List[String]. And I found almost nothing on the composing output side.

What kind of nice simple Scala code snippets exist which can easily do the above described CSV round trip? I'd like to avoid importing an entire library just to pick up these two functions (and using a Java library is not an acceptable option for my business requirements at this time).

like image 725
chaotic3quilibrium Avatar asked Sep 09 '15 20:09

chaotic3quilibrium


1 Answers

UPDATE 2020/08/30: Please use the Scala library, kantan.csv, for the most accurate and correct implementation of RFC 4180 which defines the .csv MIME-type.

While I enjoyed the learning process I experienced creating the solution below, please refrain from using it as I have found a number of issues with it especially at scale. To avoid the obvious technical debt arising from my solution below, choosing a well-maintained RFC driven Scala native solution should be how you take care of your current and future clients.


I have created specific CSV related functions from which a more general solution can be composed.

It turns out that attempting to parse a CSV file is quite tricky due to anomalies around both the comma (,) and the double quote ("). The rules for a CSV are if a column value contains either a comma or a quote, the entire value must be placed in double quotes. And if any double quotes appear in the value, each double quote must be escaped by inserting an additional double quote in front of the existing double quote. This is one of the reasons why the oft cited StringOps.split(",") method simply doesn't work unless one can guarantee they will never encounter a file using the comma/double quote escaping rules. And that's a very unreasonable guarantee.

Additionally, consider that there can be characters between a valid comma separator and the start of a single double quote. Or there can be characters between a final double quote and the next comma or the end of the line. The rules to address this is for those outside-the-double-quote-bounds values to be discarded. This is yet another reason a simple StringOps.split(",") is not only an insufficient answer, but actually incorrect.


One final note about a unexpected behavior I found using StringOps.split(","). Do you know what value result has in this code snippet?:

val result = ",,".split(",")

If you guessed "result references an Array[String] containing three elements of which each is an empty String", you would be incorrect. result references an empty Array[String]. And for me, an empty Array[String] isn't the answer I was expecting or needed. So, for the love of all that is Holy, please Please PLEASE put the final nail in StringOps.split(",") coffin!


So, let's start with the already read in file which is being presented as a List[String]. Below in object Parser is a general solution with two functions; fromLine and fromLines. The latter function, fromLines, is provided for convenience and merely maps across the former function, fromLine.

object Parser {
  def fromLine(line: String): List[String] = {
    def recursive(
        lineRemaining: String
      , isWithinDoubleQuotes: Boolean
      , valueAccumulator: String
      , accumulator: List[String]
    ): List[String] = {
      if (lineRemaining.isEmpty)
        valueAccumulator :: accumulator
      else
        if (lineRemaining.head == '"')
          if (isWithinDoubleQuotes)
            if (lineRemaining.tail.nonEmpty && lineRemaining.tail.head == '"')
              //escaped double quote
              recursive(lineRemaining.drop(2), true, valueAccumulator + '"', accumulator)
            else
              //end of double quote pair (ignore whatever's between here and the next comma)
              recursive(lineRemaining.dropWhile(_ != ','), false, valueAccumulator, accumulator)
          else
            //start of a double quote pair (ignore whatever's in valueAccumulator)
            recursive(lineRemaining.drop(1), true, "", accumulator)
        else
          if (isWithinDoubleQuotes)
            //scan to next double quote
            recursive(
                lineRemaining.dropWhile(_ != '"')
              , true
              , valueAccumulator + lineRemaining.takeWhile(_ != '"')
              , accumulator
            )
          else
            if (lineRemaining.head == ',')
              //advance to next field value
              recursive(
                  lineRemaining.drop(1)
                , false
                , ""
                , valueAccumulator :: accumulator
              )
            else
              //scan to next double quote or comma
              recursive(
                  lineRemaining.dropWhile(char => (char != '"') && (char != ','))
                , false
                , valueAccumulator + lineRemaining.takeWhile(char => (char != '"') && (char != ','))
                , accumulator
              )
    }
    if (line.nonEmpty)
      recursive(line, false, "", Nil).reverse
    else
      Nil
  }
  
  def fromLines(lines: List[String]): List[List[String]] =
    lines.map(fromLine)
}

To validate the above code works for all the various weird input scenarios, some test cases need to be created. So, using the Eclipse ScalaIDE Worksheet, I created a simple set of test cases where I could visually verify the results. Here's the Worksheet contents.

  val testRowsHardcoded: List[String] = {
    val superTrickyTestCase = {
      val dqx1 = '"'
      val dqx2 = dqx1.toString + dqx1.toString
      s"${dqx1}${dqx2}a${dqx2} , ${dqx2}1${dqx1} , ${dqx1}${dqx2}b${dqx2} , ${dqx2}2${dqx1} , ${dqx1}${dqx2}c${dqx2} , ${dqx2}3${dqx1}"
    }
    val nonTrickyTestCases =
"""
,,
a,b,c
a,,b,,c
 a, b, c
a ,b ,c
 a , b , c
"a,1","b,2","c,2"
"a"",""1","b"",""2","c"",""2"
 "a"" , ""1" , "b"" , ""2" , "c"",""2"
""".split("\n").tail.toList
   (superTrickyTestCase :: nonTrickyTestCases.reverse).reverse
  }
  val parsedLines =
    Parser.fromLines(testRowsHardcoded)
  parsedLines.map(_.mkString("|")).mkString("\n")

I visually verifyed the tests completed correctly and had left me with decomposed accurate raw strings. So, I now had what I needed for the input parsing side so I could begin my data refining.

After data refining was completed, I needed to be able to compose output so I could send my refined data back out reapplying all the CSV encoding rules.

So, let's start with a List[List[String]] as the source of the refinements. Below in object Composer is a general solution with two functions; toLine and toLines. The latter function, toLines, is provided for convenience and merely maps across the former function, toLine.

object Composer {
  def toLine(line: List[String]): String = {
    def encode(value: String): String = {
      if ((value.indexOf(',') < 0) && (value.indexOf('"') < 0))
        //no commas or double quotes, so nothing to encode
        value
      else
        //found a comma or a double quote,
        //  so double all the double quotes
        //  and then surround the whole result with double quotes
        "\"" + value.replace("\"", "\"\"") + "\""
    }
    if (line.nonEmpty)
      line.map(encode(_)).mkString(",")
    else
      ""
  }

  def toLines(lines: List[List[String]]): List[String] =
    lines.map(toLine)
}

To validate the above code works for all the various weird input scenarios, I reused the test cases I used for Parser. Again, using the Eclipse ScalaIDE Worksheet, I added a bit more code below my existing code where I could visually verify the results. Here's the the code I added:

val composedLines =
  Composer.toLines(parsedLines)
composedLines.mkString("\n")
val parsedLines2 =
  Parser.fromLines(composedLines)
parsedLines == parsedLines2

When the Scala WorkSheet is saved, it executes its contents. The very last line should show a value of "true". It is the result of round tripping all the test cases through the parser, through the composer and back through the parser.

BTW, it turns out there is a ton of variation around the definition of a "CSV file". So, here's the source for the rules the code above enforces.

PS. Thanks to @dhg pointing it out, there is a CSV Scala library which handles parsing CSVs, just in case you want something which is likely more robust and has more options than my Scala code snippets above.

like image 69
chaotic3quilibrium Avatar answered Nov 14 '22 20:11

chaotic3quilibrium