
What's the best way to parse a tab-delimited file in Ruby?

Tags:

ruby

tsv

What's the best (most efficient) way to parse a tab-delimited file in Ruby?

mbm asked Dec 10 '10 01:12



2 Answers

The Ruby CSV library lets you specify the field delimiter. In Ruby 1.9 and later, the standard CSV library is based on FasterCSV. Something like this would work:

    require "csv"

    parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")
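Since the question asks about efficiency: CSV.read loads the whole file into memory, so for large files it may be worth streaming rows with CSV.foreach instead. A minimal sketch, assuming a hypothetical helper name (each_tsv_row) and made-up demo data; CSV quoting rules still apply to every field here:

```ruby
require "csv"
require "tempfile"

# Hypothetical helper (not part of the CSV library): yields each row of a
# tab-delimited file as an Array, reading the file one row at a time.
def each_tsv_row(path, &block)
  CSV.foreach(path, col_sep: "\t", &block)
end

# Self-contained demo using a temporary file with made-up data.
Tempfile.create(["demo", ".tsv"]) do |f|
  f.write("name\tcity\nAda\tLondon\n")
  f.flush
  each_tsv_row(f.path) { |row| p row }
end
```

Passing headers: true to CSV.foreach would yield CSV::Row objects keyed by the first line instead of plain arrays, if named access is more convenient.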
jergason answered Oct 09 '22 10:10


The rules for TSV are actually a bit different from CSV. The main difference is that CSV lets a field contain the delimiter by wrapping the field in quotation characters, and it escapes any quotes inside such a field. TSV has no quoting at all, so a quote character is just ordinary data. I wrote a quick example to show how the simple approach fails:

    require 'csv'

    # Double quotes so that \t is a real tab character
    line = "boogie\ttime\tis \"now\""

    begin
      fields = CSV.parse_line(line, col_sep: "\t")
      puts "parsed correctly"
    rescue CSV::MalformedCSVError
      puts "failed to parse line"
    end

    begin
      fields = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
      puts "parsed correctly with random quote char"
    rescue CSV::MalformedCSVError
      puts "failed to parse line with random quote char"
    end

    # Output:
    # failed to parse line
    # parsed correctly with random quote char

If you want to use the CSV library, you could use a random quote character that you don't expect to see in your file (the example shows this). You could also use a simpler approach like the StrictTsv class shown below to get the same effect without having to worry about field quotations.

    # The main parse method is mostly borrowed from a tweet by @JEG2
    class StrictTsv
      attr_reader :filepath

      def initialize(filepath)
        @filepath = filepath
      end

      # Yields each data row as a Hash keyed by the header names.
      def parse
        File.open(filepath) do |f|
          headers = f.gets.chomp.split("\t")
          f.each do |line|
            # chomp so the last field does not keep the trailing newline
            fields = Hash[headers.zip(line.chomp.split("\t"))]
            yield fields
          end
        end
      end
    end

    # Example Usage
    tsv = StrictTsv.new("your_file.tsv")
    tsv.parse do |row|
      puts row['named field']
    end
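To see what plain splitting does with the line that tripped up the CSV parser, here is a small sketch. The helper name split_tsv_line is made up for illustration, and the -1 limit is my own addition rather than something from the tweet; it keeps trailing empty fields that String#split would otherwise drop.

```ruby
# Hypothetical helper: split one TSV line into fields with no quote handling.
def split_tsv_line(line)
  # chomp removes the trailing newline; the -1 limit keeps trailing empty fields
  line.chomp.split("\t", -1)
end

# Quote characters pass through untouched as ordinary data.
p split_tsv_line("boogie\ttime\tis \"now\"\n")

# Empty trailing columns are preserved as empty strings.
p split_tsv_line("a\t\t\n")
```

Whether to keep trailing empty fields depends on the files you receive. When rows are zipped against headers, a missing trailing field becomes nil rather than an empty string.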

The choice between the CSV library and something more strict depends on who is sending you the file and whether they adhere to the strict TSV standard.

Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values

mmmries answered Oct 09 '22 12:10