Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse CSV, ignoring commas inside string literals in VBA?

I have a VBA application that runs every day. It checks a folder where CSVs are downloaded automatically, and adds their contents to a database. When parsing them, I realized that certain values had commas as a part of their name. These values were contained in string literals.

So I'm trying to figure out how to parse this CSV and ignore commas that are contained in string literals. For example...

1,2,3,"This should,be one part",5,6,7 Should return 

1
2
3
"This should,be one part"
5
6
7

I have been using VBA's split() function, because I don't wanna reinvent the wheel, but if I have to I guess I'll do something else.

Any suggestions would be appreciated.

like image 923
Tom Avatar asked Jul 21 '11 18:07

Tom


1 Answers

The first way to solve this problem is to look at the structure of the line from the csv file (int,int,"String literal, will have at most one comma", etc). A naive solution would be (Assuming that the line don't have any semicolons)

Function splitLine1(line As String) As String()

   Dim temp() As String
   'Splits the line in three. The string delimited by " will be at temp(1)
   temp = Split(line, Chr(34)) 'chr(34) = "

   'Replaces the commas in the numeric fields by semicolons
   temp(0) = Replace(temp(0), ",", ";")
   temp(2) = Replace(temp(2), ",", ";")

   'Joins the temp array with quotes and then splits the result using the semicolons
   splitLine1 = Split(Join(temp, Chr(34)), ";")

End Function

This function only solves this particular problem. Another way to do the job is using the regular expression object from VBScript.

Function splitLine2(line As String) As String()

    Dim regex As Object
    Set regex = CreateObject("vbscript.regexp")
    regex.IgnoreCase = True
    regex.Global = True

    'This pattern matches only commas outside quotes
    'Pattern = ",(?=([^"]*"[^"]*")*(?![^"]*"))"
    regex.Pattern = ",(?=([^" & Chr(34) & "]*" & Chr(34) & "[^" & Chr(34) & "]*" & Chr(34) & ")*(?![^" & Chr(34) & "]*" & Chr(34) & "))"

    'regex.replaces will replace the commas outside quotes with semicolons and then the
    'Split function will split the result based on the semicollons
    splitLine2 = Split(regex.Replace(line, ";"), ";")

End Function

This method seems much more cryptic, but does not deppends on the structure of the line

You can read more about regular expressions patterns in VBScript Here

like image 162
kb_sou Avatar answered Oct 15 '22 07:10

kb_sou