Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ruby how to merge two CSV files with slightly different headers

Tags:

merge

ruby

csv

I have two CSV files with some common headers and others that only appear in one or in the other, for example:

# csv_1.csv
H1,H2,H3
V11,V22,V33
V14,V25,V35
# csv_2.csv
H1,H4
V1a,V4b
V1c,V4d

I would like to merge both and obtain a new CSV file that combines all the information for the previous CSV files. Injecting new columns when needed, and feeding the new cells with null values.

Result example:

H1,H2,H3,H4
V11,V22,V33,
V14,V25,V35,
V1a,,,V4b
V1c,,,V4d 
like image 550
fguillen Avatar asked Aug 01 '14 17:08

fguillen


3 Answers

Challenge accepted :)

#!/usr/bin/env ruby
require "csv"

module MergeCsv
  class << self
    def run(csv_paths)
      csv_files = csv_paths.map { |p| CSV.read(p, headers: true) }
      merge(csv_files)
    end

    private

    def merge(csv_files)
      headers    = csv_files.flat_map(&:headers).uniq.sort
      hash_array = csv_files.flat_map(&method(:csv_to_hash_array))

      CSV.generate do |merged_csv|
        merged_csv << headers

        hash_array.each do |row|
          merged_csv << row.values_at(*headers)
        end
      end
    end

    # Probably not the most performant way, but easy
    def csv_to_hash_array(csv)
      csv.to_a[1..-1].map { |row| csv.headers.zip(row).to_h }
    end
  end
end

if(ARGV.length == 0)
  puts "Use: ruby merge_csv.rb <file_path_csv_1> <file_path_csv_2>"
  exit 1
end

puts MergeCsv.run(ARGV)
like image 120
ichigolas Avatar answered Oct 24 '22 04:10

ichigolas


I have the answer, I just wanted to help people that is looking for the same solution

require "csv"

module MergeCsv
  def self.run(csv_1_path, csv_2_path)
    merge(File.read(csv_1_path), File.read(csv_2_path))
  end

  def self.merge(csv_1, csv_2)
    csv_1_table = CSV.parse(csv_1, :headers => true)
    csv_2_table = CSV.parse(csv_2, :headers => true)

    return csv_2_table.to_csv if csv_1_table.headers.empty?
    return csv_1_table.to_csv if csv_2_table.headers.empty?

    headers_in_1_not_in_2 = csv_1_table.headers - csv_2_table.headers
    headers_in_1_not_in_2.each do |header_in_1_not_in_2|
      csv_2_table[header_in_1_not_in_2] = nil
    end

    headers_in_2_not_in_1 = csv_2_table.headers - csv_1_table.headers
    headers_in_2_not_in_1.each do |header_in_2_not_in_1|
      csv_1_table[header_in_2_not_in_1] = nil
    end

    csv_2_table.each do |csv_2_row|
      csv_1_table << csv_1_table.headers.map { |csv_1_header| csv_2_row[csv_1_header] }
    end

    csv_1_table.to_csv
  end
end

if(ARGV.length != 2)
  puts "Use: ruby merge_csv.rb <file_path_csv_1> <file_path_csv_2>"
  exit 1
end

puts MergeCsv.run(ARGV[0], ARGV[1])

And execute it from the console this way:

$ ruby merge_csv.rb csv_1.csv csv_2.csv 

Any other, maybe cleaner, solution is welcome.

like image 20
fguillen Avatar answered Oct 24 '22 04:10

fguillen


Simplied first answer:

How to use it:

    listPart_A = CSV.read(csv_path_A, headers:true)

    listPart_B = CSV.read(csv_path_B, headers:true)

    listPart_C = CSV.read(csv_path_C, headers:true)

    list = merge(listPart_A,listPart_B,listPart_C)

Function:

def merge(*csvs)
    headers = csvs.map {|csv| csv.headers }.flatten.compact.uniq.sort
    csvs.flat_map(&method(:csv_to_hash_array))
end

def csv_to_hash_array(csv)
    csv.to_a[1..-1].map do |row|
      Hash[csv.headers.zip(row)]
    end
end
like image 44
Thomas Avatar answered Oct 24 '22 04:10

Thomas