I have a CSV file where each row defines a room in a given building. Along with the room, each row has a floor field. What I want to extract is all floors in all buildings.
My file looks like this...
"u_floor","u_room","name" 0,"00BDF","AIRPORT TEST " 0,0,"BRICKER HALL, JOHN W " 0,3,"BRICKER HALL, JOHN W " 0,5,"BRICKER HALL, JOHN W " 0,6,"BRICKER HALL, JOHN W " 0,7,"BRICKER HALL, JOHN W " 0,8,"BRICKER HALL, JOHN W " 0,9,"BRICKER HALL, JOHN W " 0,19,"BRICKER HALL, JOHN W " 0,20,"BRICKER HALL, JOHN W " 0,21,"BRICKER HALL, JOHN W " 0,25,"BRICKER HALL, JOHN W " 0,27,"BRICKER HALL, JOHN W " 0,29,"BRICKER HALL, JOHN W " 0,35,"BRICKER HALL, JOHN W " 0,45,"BRICKER HALL, JOHN W " 0,59,"BRICKER HALL, JOHN W " 0,60,"BRICKER HALL, JOHN W " 0,61,"BRICKER HALL, JOHN W " 0,63,"BRICKER HALL, JOHN W " 0,"0006M","BRICKER HALL, JOHN W " 0,"0008A","BRICKER HALL, JOHN W " 0,"0008B","BRICKER HALL, JOHN W " 0,"0008C","BRICKER HALL, JOHN W " 0,"0008D","BRICKER HALL, JOHN W " 0,"0008E","BRICKER HALL, JOHN W " 0,"0008F","BRICKER HALL, JOHN W " 0,"0008G","BRICKER HALL, JOHN W " 0,"0008H","BRICKER HALL, JOHN W "
What I want is a de-duplicated list of all floors in all buildings.
I am using cat, awk, sort and uniq to obtain this list, but I am having a problem with the "," in the building name field, such as "BRICKER HALL, JOHN W", which throws off my entire CSV generation.
cat Buildings.csv | awk -F, '{print $1","$2}' | sort | uniq > Floors.csv
How can I get awk to split on the comma but ignore a comma between the "" of a field? Alternatively, does someone have a better solution?
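For illustration, here is what a plain -F, split does with one of the quoted rows; awk sees four fields because the comma inside the quotes also splits:

echo '0,3,"BRICKER HALL, JOHN W "' | awk -F, '{print NF; print $3}'

This prints 4 and then "BRICKER HALL, i.e. the name field torn in half.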
Based on the answer suggesting an awk CSV parser, I was able to get to this solution:
cat Buildings.csv | awk -f csv.awk | awk -F" -> 2|" '{print $2}' | awk -F"|" '{print $2","$3}' | sort | uniq > floors.csv
There we use the csv.awk program first, and then split on " -> 2|", which is based on the formatting of the csv.awk program's output: it prints the original line followed by " -> #", where # is the column count parsed from the CSV. The print $2 therefore keeps only the parsed CSV contents. From there I can split the csv.awk result on "|", which is what it replaces the commas with. Then sort, uniq, pipe out to a file, and done!
Thanks for the help.
Re: handling a comma in the data while writing to a CSV. For data fields that contain a comma, you should just be able to wrap them in double quotes. Fields containing line breaks (CRLF), double quotes, or commas should be enclosed in double quotes (this is the convention codified in RFC 4180).
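For example, using the question's columns (the second row is a hypothetical building name, added only to show that an embedded double quote is escaped by doubling it):

0,3,"BRICKER HALL, JOHN W"
0,4,"THE ""OVAL"" ANNEX"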
Another option is csvquote; installation instructions are in its README. With it you can pretend that AWK natively supports CSV files. (You can use this same trick with other UNIX line-oriented tools: head, tail and sort don't understand CSV either, but if you wrap them in csvquote you will be able to handle embedded line breaks correctly.)
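A minimal sketch of that wrapping, assuming csvquote is installed and reusing the Buildings.csv/Floors.csv names from the question; csvquote temporarily replaces the ambiguous commas inside quoted fields with nonprinting characters, and csvquote -u restores them on the way out:

csvquote Buildings.csv | awk -F, '{print $1","$3}' | sort | uniq | csvquote -u > Floors.csv

Here $3 is the building name, since the separators inside the quotes are hidden while awk does its field split.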
You can use AWK to quickly look at a column of data in a CSV file.
For a CSV file, the FPAT value is FPAT = "([^,]+)|(\"[^\"]+\")". Using the data abc,"pqr,mno": the first grouped expression matches anything that is not a comma, so it takes "abc" as data and then fails at the first occurrence of a comma; the second grouped expression then matches the double-quoted string, so "pqr,mno" is kept as a single field.
gawk -v FPAT='[^,]*|"[^"]*"' '{print $1 "," $3}' Buildings.csv | sort | uniq
This is an awesome GNU Awk 4 extension, where you define a field pattern instead of a field-separator pattern. Does wonders for CSV. (docs)
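As a quick check of the field pattern on the sample data above:

echo 'abc,"pqr,mno"' | gawk -v FPAT='[^,]*|"[^"]*"' '{print NF; print $2}'

This prints 2 and then "pqr,mno" (quotes still attached), confirming that the quoted comma no longer splits the field.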
ETA (thanks mitchus): To remove the surrounding quotes, use gsub("^\"|\"$","",$3); if there are more fields than just $3 to process that way, just loop through them, as in the sketch below.
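A short sketch of that loop, reusing the Buildings.csv/Floors.csv names from the question:

gawk -v FPAT='[^,]*|"[^"]*"' '{
    for (i = 1; i <= NF; i++)   # strip the surrounding quotes from every field
        gsub(/^"|"$/, "", $i)
    print $1 "," $3             # floor and building name
}' Buildings.csv | sort | uniq > Floors.csv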
Note this simple approach is not tolerant of malformed input, nor of some possible special characters between quotes – covering all of those would go beyond the scope of a neat one-liner.