
A Perl script to process a CSV file, aggregating properties spread over multiple records

Tags: csv, perl

Sorry for the vague question, I'm struggling to think how to better word it!

I have a CSV file that looks a little like this, only a lot bigger:

550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4

The values in the first column are ID numbers and the second column could be described as a property (for want of a better word...). For example, the ID number 550672 has properties 1, 2, 3 and 4. Can anyone point me towards how to produce strings like that for all the ID numbers? My ideal output would be a new CSV file which looks something like:

550672,1;2;3;4
656372,1;2
766153,1;4

etc.

I am very much a Perl baby (only 3 days old!) so I would really appreciate direction rather than an outright solution. I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best I can, although I think I've been held back by not knowing what to search for. I am able to read in and parse CSV files (I even got as far as removing duplicate values!) but that is where it drops off for me. Any help would be greatly appreciated!

user1597452 asked Dec 15 '22 19:12



2 Answers

I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will give you a good learning experience.

It is best to use Text::CSV whenever you are processing CSV data, as all the debugging has already been done for you.

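use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new;

open my $fh, '<', 'data.txt' or die $!;

# Hash of arrays: each ID maps to the list of its properties, in file order
my %data;
while (my $line = <$fh>) {
  $csv->parse($line) or die "Invalid data line";
  my ($key, $val) = $csv->fields;
  # The array for a new ID is created automatically on the first push
  push @{ $data{$key} }, $val;
}

# Print one line per ID, with its properties joined by semicolons
for my $id (sort keys %data) {
  printf "%s,%s\n", $id, join ';', @{ $data{$id} };
}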

output

550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
Borodin answered May 13 '23 23:05



Firstly, props for seeking an approach rather than a solution. As you've probably already found with Perl, There Is More Than One Way To Do It.

The approach I would take would be:

use strict;  # will save you big time in the long run

my %ids      # Use a hash table with the id as the key to accumulate the properties
open a file handle on csv or die
while (read another line from the file handle){
  split line into ID and property variable  # google the split function
  append new property to existing properties for this id in the hash table  # If it doesn't exist already, it will be created
}

foreach my $key (keys %ids) {
  deduplicate properties
  print/display/do whatever you need to do with the result
}
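
For when you have had a go yourself, here is one way the outline above might look as real Perl. It is only a sketch: the helper name group_properties is made up for illustration, the input is passed in as a list of lines rather than read from a file, and a plain split on commas is assumed to be good enough (it would not cope with quoted fields, which is what Text::CSV is for).

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of "id,property" lines and returns a
# hash reference mapping each ID to an array of its distinct properties.
sub group_properties {
    my @lines = @_;
    my %ids;

    for my $line (@lines) {
        chomp $line;
        next unless length $line;              # skip blank lines
        my ($id, $property) = split /,/, $line;
        push @{ $ids{$id} }, $property;        # array is created on first use
    }

    # Second pass, in memory: drop repeated properties for each ID,
    # keeping the first occurrence so the original order is preserved.
    for my $id (keys %ids) {
        my %seen;
        @{ $ids{$id} } = grep { !$seen{$_}++ } @{ $ids{$id} };
    }

    return \%ids;
}

my $grouped = group_properties("550672,1\n", "656372,1\n", "550672,2\n", "550672,1\n");
for my $id (sort keys %$grouped) {
    print "$id,", join(';', @{ $grouped->{$id} }), "\n";
}
```

The grep with a %seen hash is a common Perl idiom for removing duplicates while keeping order; the repeated "550672,1" line in the sample input is there to show it being dropped.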

This approach means you will need to iterate over the whole data set twice (the second time in memory), so depending on the size of the data set that may be a problem. A more sophisticated approach would be to use a hash table of hash tables to do the deduplication in the initial step, but depending on how quickly you want or need to get it working, that may not be worthwhile in the first instance.
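
A rough sketch of the hash-of-hashes idea, under the same assumptions as before (the helper name group_unique is hypothetical, input arrives as a list of lines, and a simple split on commas is enough). Because hash keys are unique, a repeated property just overwrites itself and no separate deduplication step is needed. The trade-off is that a hash does not remember insertion order, so this sketch assumes the properties are numbers and sorts them numerically on output.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash reference of the form
# { id => { property => 1, ... }, ... }.
sub group_unique {
    my @lines = @_;
    my %ids;

    for my $line (@lines) {
        chomp $line;
        next unless length $line;              # skip blank lines
        my ($id, $property) = split /,/, $line;
        $ids{$id}{$property} = 1;              # setting the same key twice is harmless
    }

    return \%ids;
}

my $grouped = group_unique("550672,2\n", "550672,1\n", "550672,2\n", "656372,1\n");
for my $id (sort keys %$grouped) {
    # Assumes numeric properties; use a plain sort for string properties
    my @properties = sort { $a <=> $b } keys %{ $grouped->{$id} };
    print "$id,", join(';', @properties), "\n";
}
```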

Check out this question for a discussion on how to do the deduplication.

TaninDirect answered May 14 '23 00:05
