I have just received a bunch of messy data files in CSV (comma-separated values) format. I need to do some routine clean-up, validation, and filtering work on the data set. I will be doing this work in Scala (2.11.7).
In searching for a solution covering both directions, parsing input and composing output, I mostly found ill-informed tangents, including one from the "Scala Cookbook", on the input-parsing side. Most of those focused on the terribly erroneous advice "use String.split(",") to get a CSV line back as a List[String]". And I found almost nothing on the output-composing side.
What nice, simple Scala code snippets exist that can easily handle the CSV round trip described above? I'd like to avoid importing an entire library just to pick up these two functions (and using a Java library is not an acceptable option for my business requirements at this time).
UPDATE 2020/08/30: Please use the Scala library kantan.csv for the most accurate and correct implementation of RFC 4180, which defines the .csv MIME type.
While I enjoyed the learning process I experienced creating the solution below, please refrain from using it, as I have found a number of issues with it, especially at scale. To avoid the obvious technical debt arising from my solution below, choosing a well-maintained, RFC-driven, Scala-native solution should be how you take care of your current and future clients.
I have created specific CSV related functions from which a more general solution can be composed.
It turns out that attempting to parse a CSV file is quite tricky due to anomalies around both the comma (,) and the double quote ("). The rules are: if a column value contains either a comma or a double quote, the entire value must be enclosed in double quotes; and each double quote appearing within the value must be escaped by inserting an additional double quote immediately in front of it. This is one of the reasons why the oft-cited StringOps.split(",") method simply doesn't work, unless one can guarantee never encountering a file that uses the comma/double-quote escaping rules. And that's a very unreasonable guarantee.
Additionally, consider that there can be characters between a valid comma separator and the start of an opening double quote, or between a closing double quote and the next comma (or the end of the line). The rule for handling this is that those outside-the-double-quote-bounds characters are discarded. This is yet another reason a simple StringOps.split(",") is not only an insufficient answer, but actually incorrect.
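A minimal illustration of the first failure mode, using a made-up line with one quoted field:

```scala
// A field containing a quoted comma, exactly as it would appear in a CSV file:
val line = "\"a,1\",b"

// split(",") blindly cuts at every comma and knows nothing about the
// double-quote rules, so the quoted field a,1 is torn in half:
val fields = line.split(",").toList
// fields is List("\"a", "1\"", "b") - three broken pieces instead of
// the two values a,1 and b
```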
One final note about an unexpected behavior I found using StringOps.split(","). Do you know what value result has in this code snippet?
val result = ",,".split(",")
If you guessed "result references an Array[String] containing three elements, each an empty String", you would be incorrect. result references an empty Array[String]. And for me, an empty Array[String] isn't the answer I was expecting or needed. So, for the love of all that is holy, please Please PLEASE put the final nail in the StringOps.split(",") coffin!
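For the curious, this behavior comes from Java's String.split, which removes trailing empty strings by default; passing a negative limit keeps them. A short sketch (though even with the limit, split still cannot honor the quoting rules above):

```scala
// By default, Java's String.split removes trailing empty strings,
// so a line consisting of nothing but separators vanishes entirely:
val defaultSplit = ",,".split(",")
// defaultSplit.length is 0, not 3

// A negative limit keeps the trailing empty strings:
val keepEmpties = ",,".split(",", -1)
// keepEmpties is Array("", "", "") - the three empty fields
```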
So, let's start with the already-read-in file, presented as a List[String]. Below, object Parser is a general solution with two functions, fromLine and fromLines. The latter, fromLines, is provided for convenience and merely maps across the former, fromLine.
object Parser {
  def fromLine(line: String): List[String] = {
    def recursive(
        lineRemaining: String
      , isWithinDoubleQuotes: Boolean
      , valueAccumulator: String
      , accumulator: List[String]
    ): List[String] = {
      if (lineRemaining.isEmpty)
        valueAccumulator :: accumulator
      else if (lineRemaining.head == '"')
        if (isWithinDoubleQuotes)
          if (lineRemaining.tail.nonEmpty && lineRemaining.tail.head == '"')
            //escaped double quote
            recursive(lineRemaining.drop(2), true, valueAccumulator + '"', accumulator)
          else
            //end of double quote pair (ignore whatever's between here and the next comma)
            recursive(lineRemaining.dropWhile(_ != ','), false, valueAccumulator, accumulator)
        else
          //start of a double quote pair (ignore whatever's in valueAccumulator)
          recursive(lineRemaining.drop(1), true, "", accumulator)
      else if (isWithinDoubleQuotes)
        //scan to next double quote
        recursive(
            lineRemaining.dropWhile(_ != '"')
          , true
          , valueAccumulator + lineRemaining.takeWhile(_ != '"')
          , accumulator
        )
      else if (lineRemaining.head == ',')
        //advance to next field value
        recursive(
            lineRemaining.drop(1)
          , false
          , ""
          , valueAccumulator :: accumulator
        )
      else
        //scan to next double quote or comma
        recursive(
            lineRemaining.dropWhile(char => (char != '"') && (char != ','))
          , false
          , valueAccumulator + lineRemaining.takeWhile(char => (char != '"') && (char != ','))
          , accumulator
        )
    }

    if (line.nonEmpty)
      recursive(line, false, "", Nil).reverse
    else
      Nil
  }

  def fromLines(lines: List[String]): List[List[String]] =
    lines.map(fromLine)
}
To validate that the above code works for all the various weird input scenarios, some test cases need to be created. So, using the Eclipse ScalaIDE Worksheet, I created a simple set of test cases where I could visually verify the results. Here are the Worksheet contents.
val testRowsHardcoded: List[String] = {
  val superTrickyTestCase = {
    val dqx1 = '"'
    val dqx2 = dqx1.toString + dqx1.toString
    s"${dqx1}${dqx2}a${dqx2} , ${dqx2}1${dqx1} , ${dqx1}${dqx2}b${dqx2} , ${dqx2}2${dqx1} , ${dqx1}${dqx2}c${dqx2} , ${dqx2}3${dqx1}"
  }
  val nonTrickyTestCases =
    """
,,
a,b,c
a,,b,,c
a, b, c
a ,b ,c
a , b , c
"a,1","b,2","c,2"
"a"",""1","b"",""2","c"",""2"
"a"" , ""1" , "b"" , ""2" , "c"",""2"
""".split("\n").tail.toList
  (superTrickyTestCase :: nonTrickyTestCases.reverse).reverse
}
val parsedLines =
  Parser.fromLines(testRowsHardcoded)
parsedLines.map(_.mkString("|")).mkString("\n")
I visually verified the tests completed correctly, leaving me with accurately decomposed raw strings. So I now had what I needed on the input-parsing side and could begin my data refining.
After data refining was completed, I needed to be able to compose output so I could send my refined data back out reapplying all the CSV encoding rules.
So, let's start with a List[List[String]] as the source of the refinements. Below, object Composer is a general solution with two functions, toLine and toLines. The latter, toLines, is provided for convenience and merely maps across the former, toLine.
object Composer {
  def toLine(line: List[String]): String = {
    def encode(value: String): String = {
      if ((value.indexOf(',') < 0) && (value.indexOf('"') < 0))
        //no commas or double quotes, so nothing to encode
        value
      else
        //found a comma or a double quote,
        // so double all the double quotes
        // and then surround the whole result with double quotes
        "\"" + value.replace("\"", "\"\"") + "\""
    }

    if (line.nonEmpty)
      line.map(encode).mkString(",")
    else
      ""
  }

  def toLines(lines: List[List[String]]): List[String] =
    lines.map(toLine)
}
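To see the encoding rule in isolation, here is a standalone copy of the encode logic from Composer above, applied to a few made-up sample values:

```scala
// Standalone copy of Composer.toLine's encode helper:
def encode(value: String): String =
  if ((value.indexOf(',') < 0) && (value.indexOf('"') < 0))
    value //no commas or double quotes, so nothing to encode
  else
    "\"" + value.replace("\"", "\"\"") + "\""

val plain = encode("abc")   // abc - passed through unchanged
val comma = encode("a,1")   // "a,1" - wrapped because of the comma
val quote = encode("b\"2")  // "b""2" - quote doubled, then wrapped
```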
To validate that the above code works for all the various weird input scenarios, I reused the test cases I used for Parser. Again, using the Eclipse ScalaIDE Worksheet, I added a bit more code below my existing code where I could visually verify the results. Here's the code I added:
val composedLines =
  Composer.toLines(parsedLines)
composedLines.mkString("\n")
val parsedLines2 =
  Parser.fromLines(composedLines)
parsedLines == parsedLines2
When the Scala Worksheet is saved, it executes its contents. The very last line should show a value of "true". It is the result of round-tripping all the test cases through the parser, through the composer, and back through the parser.
BTW, it turns out there is a ton of variation around the definition of a "CSV file". So, here's the source for the rules the code above enforces.
PS. Thanks to @dhg for pointing out that there is a CSV Scala library which handles parsing CSVs, just in case you want something which is likely more robust and has more options than my Scala code snippets above.