What's the best (most efficient) way to parse a tab-delimited file in Ruby?
CSV stands for “Comma-Separated Values”. It's a common data format that consists of rows with values separated by commas, and it's used for exporting and importing data. For example, you can export your Gmail contacts as a CSV file, and you can also import them using the same format.
A tab-separated values (TSV) file is a text format whose primary function is to store data in a table structure where each record in the table is recorded as one line of the text file.
In Python, you can use the csv module to parse tab-separated value files easily:

    import csv

    with open("tab-separated-values", newline="") as tsv:
        # You can also use delimiter="\t" rather than giving a dialect.
        for line in csv.reader(tsv, dialect="excel-tab"):
            ...

Here line is a list of the values on the current row for each iteration.
The Ruby CSV library lets you specify the field delimiter. Since Ruby 1.9, the standard library's CSV is based on FasterCSV. Something like this would work:
    require "csv"

    parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")
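Since the question asks about efficiency, note that CSV.read loads the whole file into memory, while CSV.foreach reads it one row at a time. The following is a minimal sketch of the streaming variant, assuming the file has a header line; the helper name each_tsv_row and the sample data are made up for illustration.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: yields each row of a tab-delimited file as a Hash
# keyed by the header line, reading one row at a time.
def each_tsv_row(path)
  CSV.foreach(path, col_sep: "\t", headers: true) do |row|
    yield row.to_h
  end
end

# Usage with a throwaway file so the sketch is self-contained.
Tempfile.create(["people", ".tsv"]) do |f|
  f.write("name\tcity\nAda\tLondon\nLinus\tHelsinki\n")
  f.flush
  each_tsv_row(f.path) { |row| puts "#{row['name']} lives in #{row['city']}" }
end
```

This still uses CSV's quoting rules, so it suits files that were written by a CSV-aware tool with a tab delimiter.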
The rules for TSV are actually a bit different from CSV. The main difference is that CSV has provisions for sticking a comma inside a field and then using quotation characters and escaping quotes inside a field. I wrote a quick example to show how the simple response fails:
    require 'csv'

    # Double quotes so that \t is a real tab character.
    line = "boogie\ttime\tis \"now\""

    begin
      fields = CSV.parse_line(line, col_sep: "\t")
      puts "parsed correctly"
    rescue CSV::MalformedCSVError
      puts "failed to parse line"
    end

    begin
      fields = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
      puts "parsed correctly with random quote char"
    rescue CSV::MalformedCSVError
      puts "failed to parse line with random quote char"
    end

    # Output:
    # failed to parse line
    # parsed correctly with random quote char
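To make the quoting rules concrete, here is a small sketch of how Ruby's CSV treats quotes in ordinary comma-separated input with its default options. It is only an illustration; the helper name csv_fields is hypothetical.

```ruby
require "csv"

# Hypothetical helper: parse one comma-separated line with CSV's default options.
def csv_fields(line)
  CSV.parse_line(line)
end

# A quoted field may contain the delimiter.
p csv_fields('a,"b,c",d')         # => ["a", "b,c", "d"]

# Inside a quoted field, a doubled quote stands for one literal quote.
p csv_fields('a,"say ""hi""",d')  # => ["a", "say \"hi\"", "d"]

# A quote in the middle of an unquoted field is rejected.
begin
  csv_fields('a,is "now",d')
rescue CSV::MalformedCSVError => e
  puts "rejected: #{e.message}"
end
```

Strict TSV has none of these rules, which is why a field such as is "now" is valid TSV but trips up the CSV parser.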
If you want to use the CSV library, you could use a quote character that you don't expect to see in your file (as the example shows), but you could also use a simpler approach like the StrictTsv class shown below to get the same effect without having to worry about field quoting.
    # The main parse method is mostly borrowed from a tweet by @JEG2
    class StrictTsv
      attr_reader :filepath

      def initialize(filepath)
        @filepath = filepath
      end

      def parse
        File.open(filepath) do |f|
          headers = f.gets.strip.split("\t")
          f.each do |line|
            # chomp so the last field does not keep the trailing newline
            fields = Hash[headers.zip(line.chomp.split("\t"))]
            yield fields
          end
        end
      end
    end

    # Example Usage
    tsv = StrictTsv.new("your_file.tsv")
    tsv.parse do |row|
      puts row['named field']
    end
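One detail worth knowing if you split lines by hand: String#split drops trailing empty strings by default, so a row whose last columns are empty comes back short. Below is a minimal sketch, assuming you want every column preserved; the helper names split_tsv_line and tsv_row are made up for illustration.

```ruby
# Hypothetical helper: split one TSV line into fields, keeping empty ones.
# The -1 limit tells String#split not to drop trailing empty strings.
def split_tsv_line(line)
  line.chomp.split("\t", -1)
end

# Hypothetical helper: pair a header list with the fields of one line.
def tsv_row(headers, line)
  headers.zip(split_tsv_line(line)).to_h
end

line = "Ada\t\t\n"                    # name present, city and phone empty
p line.chomp.split("\t")              # => ["Ada"]
p split_tsv_line(line)                # => ["Ada", "", ""]
p tsv_row(%w[name city phone], line)  # every header maps to a string, possibly empty
```

With the default split, the missing trailing fields would show up as nil after zipping with the headers; with the -1 limit they are empty strings instead.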
The choice between the CSV library and something stricter depends on who is sending you the file and whether they adhere to the strict TSV standard.
Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values