I need to figure out which delimiter is being used in a CSV file (comma, space, or semicolon) in my Ruby project. I know there is a Sniffer class in Python's csv module that can be used to guess a given file's delimiter. Is there anything similar in Ruby? Any kind of help or idea is greatly appreciated.
It looks like the Python implementation just checks a few dialects: excel or excel_tab. So a simple implementation that just checks for "," or "\t" is:
COMMON_DELIMITERS = ['","', "\"\t\""].freeze

def sniff(path)
  # Read only the first line; the block form closes the file afterwards.
  first_line = File.open(path, &:gets)
  return unless first_line

  snif = {}
  COMMON_DELIMITERS.each do |delim|
    snif[delim] = first_line.count(delim)
  end
  snif = snif.sort { |a, b| b[1] <=> a[1] }
  snif[0][0] if snif.size > 0
end
Note: that would return the full delimiter it finds, e.g. ",", so to get just , you could change snif[0][0] to snif[0][0][1].
Also, I'm using count(delim) because it is a little faster, but if you added a delimiter that is composed of two (or more) characters of the same type, like --, then it would count each occurrence twice (or more) when weighing the type. This is because String#count treats its argument as a set of characters rather than as a substring. In that case, it may be better to use scan(delim).length.
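To make the difference between count and scan concrete, here is a small sketch. The sample strings and the `||` delimiter are made up for illustration (they are not from the answer above); the point is only to show how the two methods weigh the same input.

```ruby
# String#count counts characters that appear in the given set,
# while String#scan finds non-overlapping matches of the whole substring.
line = %("a","b","c")
quoted_comma = '","'

count_total = line.count(quoted_comma)        # every " and every , is counted
scan_total  = line.scan(quoted_comma).length  # only whole "," sequences

puts count_total # 6 quotes + 2 commas
puts scan_total

# A hypothetical two-character delimiter made of the same character.
doubled = "a||b||c"
doubled_count = doubled.count("||")       # counts each | individually
doubled_scan  = doubled.scan("||").length # counts each || once

puts doubled_count
puts doubled_scan
```

With count, the quoted-comma candidate scores on every quote in the line, so scan(delim).length is probably the safer choice whenever candidates share characters.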
Here is Gary S. Weaver's answer as we are using it in production. It is a good solution that works well.
class ColSepSniffer
  NoColumnSeparatorFound = Class.new(StandardError)
  EmptyFile = Class.new(StandardError)

  COMMON_DELIMITERS = [
    '","',
    '"|"',
    '";"'
  ].freeze

  def initialize(path:)
    @path = path
  end

  def self.find(path)
    new(path: path).find
  end

  def find
    fail EmptyFile unless first

    if valid?
      delimiters[0][0][1]
    else
      fail NoColumnSeparatorFound
    end
  end

  private

  # True when at least one candidate delimiter was found on the first line.
  def valid?
    !delimiters.collect(&:last).reduce(:+).zero?
  end

  # delimiters          #=> [["\"|\"", 54], ["\",\"", 0], ["\";\"", 0]]
  # delimiters[0]       #=> ["\"|\"", 54]
  # delimiters[0][0]    #=> "\"|\""
  # delimiters[0][0][1] #=> "|"
  def delimiters
    @delimiters ||= COMMON_DELIMITERS.inject({}, &count).sort(&most_found)
  end

  def most_found
    ->(a, b) { b[1] <=> a[1] }
  end

  def count
    ->(hash, delimiter) { hash[delimiter] = first.count(delimiter); hash }
  end

  def first
    @first ||= file.first
  end

  def file
    @file ||= File.open(@path)
  end
end
Spec
require "spec_helper"

describe ColSepSniffer do
  describe ".find" do
    subject(:find) { described_class.find(path) }

    let(:path) { "./spec/fixtures/google/products.csv" }

    context "when , delimiter" do
      it "returns separator" do
        expect(find).to eq(',')
      end
    end

    context "when ; delimiter" do
      let(:path) { "./spec/fixtures/google/products_with_semi_colon_seperator.csv" }

      it "returns separator" do
        expect(find).to eq(';')
      end
    end

    context "when | delimiter" do
      let(:path) { "./spec/fixtures/google/products_with_bar_seperator.csv" }

      it "returns separator" do
        expect(find).to eq('|')
      end
    end

    context "when empty file" do
      it "raises error" do
        expect(File).to receive(:open) { [] }
        expect { find }.to raise_error(described_class::EmptyFile)
      end
    end

    context "when no column separator is found" do
      it "raises error" do
        expect(File).to receive(:open) { [''] }
        expect { find }.to raise_error(described_class::NoColumnSeparatorFound)
      end
    end
  end
end
I'm not aware of any sniffer implementation in the CSV library included in Ruby 1.9. It will try to auto-discover the row separator, but the column separator is assumed to be a comma by default.
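As a quick illustration of that default, the sketch below parses the same made-up string twice with the standard CSV library: once with the defaults and once with an explicit col_sep. The sample data is an assumption for demonstration only.

```ruby
require "csv"

# Semicolon-separated data with Windows-style line endings.
data = "name;price\r\nwidget;9.99\r\n"

# Defaults: the row separator is detected, but columns are split on commas,
# so each row comes back as a single field.
default_rows = CSV.parse(data)

# Passing col_sep explicitly splits the columns as intended.
semicolon_rows = CSV.parse(data, col_sep: ";")

p default_rows
p semicolon_rows
```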
One idea would be to try parsing a sample number of rows (5% of total maybe?) using each of the possible separators. Whichever separator results in the same number of columns most consistently is probably the correct separator.
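A rough sketch of that idea follows. The method name guess_col_sep, the candidate list, and the scoring rule (most lines agreeing on a column count greater than one) are my own assumptions rather than anything built into Ruby, and Enumerable#tally needs Ruby 2.7 or newer.

```ruby
require "csv"

# Hypothetical candidate list; adjust to the separators you expect.
CANDIDATE_SEPARATORS = [",", ";", "\t", "|", " "].freeze

# Returns the candidate whose column count is most consistent across the
# sampled lines, or nil if no candidate yields more than one column.
def guess_col_sep(lines, candidates: CANDIDATE_SEPARATORS)
  return nil if lines.empty?

  scores = candidates.map do |sep|
    widths = lines.map do |line|
      begin
        (CSV.parse_line(line, col_sep: sep) || []).length
      rescue CSV::MalformedCSVError
        0 # an unparsable line counts as zero columns for this separator
      end
    end
    # Most common column count and how many lines share it.
    common_width, hits = widths.tally.max_by { |_, n| n }
    [sep, common_width, hits]
  end

  usable = scores.select { |_, width, _| width > 1 }
  best = usable.max_by { |_, width, hits| [hits, width] }
  best && best[0]
end

sample = [
  "id;name;price\n",
  "1;Widget, large;9.99\n",
  "2;Gadget;19.50\n"
]
puts guess_col_sep(sample).inspect
```

For a real file you could pass in a sample of lines, for example File.foreach(path).first(20), instead of the hard-coded array.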