 

How to merge two csv files in powershell with same headers and discard duplicate rows

Tags:

powershell

I am collecting performance counters from NetApp Performance Manager software (OPM). OPM saves 30 days' worth of data in a MySQL database, so I have to run two queries to retrieve the data:

  1. Run the first query on the 30th of every month and save the result to a CSV file.
  2. Run the second query on the 1st of every month and save the result to a CSV file.

I then merge the two CSV files so that no data is lost when a month has 31 days.

Both files look like below:

"Time","objid","cpuBusy","cifsOps","avgLatency"
"2016:06:04 00:04","72","50.6196","2069.11","7622.1"
"2016:06:04 00:09","72","30.2233","2075.94","7633.27"
"2016:06:04 00:14","72","35.2559","1980.64","8352.17"

When I merge the two CSV files with the code below, I get duplicate rows containing data from the same date/time.

@(Import-Csv au2004npsa003-mm-business.csv) + @(Import-Csv au2004npsa003-nn-business.csv) | Export-Csv joined.csv -NoTypeInformation

How can I merge the two CSV files without getting duplicate data? I have tried Select-Object -Unique; however, it returns just one row.

Nitish asked Oct 19 '25 12:10

1 Answer

As for why Select-Object -Unique didn't work:

  • Select-Object -Unique, when given instances of reference types (other than strings), compares their .ToString() values in order to determine uniqueness.

  • [pscustomobject] instances, such as the ones Import-Csv creates, regrettably return the empty string from their .ToString() method.

    • This long-standing bug, still present as of PowerShell (Core) 7.2, was originally reported in GitHub issue #6163.

Thus, all input objects compare the same, and only the first input object is ever returned.
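The effect of the bug is easy to see directly. The following sketch (object values are made up, mimicking Import-Csv's output) shows two distinct rows collapsing into one, per the .ToString() behavior described above:

```powershell
# Two distinct custom objects, structured like Import-Csv output:
$a = [pscustomobject] @{ Time = '2016:06:04 00:04'; cpuBusy = '50.6196' }
$b = [pscustomobject] @{ Time = '2016:06:04 00:09'; cpuBusy = '30.2233' }

# On versions affected by the bug, both stringify to the empty string:
$a.ToString()   # empty string

# Select-Object -Unique therefore considers them equal
# and only the first object is returned:
($a, $b | Select-Object -Unique).Count   # 1, not 2
```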

S9uare's helpful Select-Object -Property * -Unique approach overcomes this problem by forcing all properties to be compared individually, but comes with a performance caveat:
The input objects are effectively recreated, and comparing all property values is overkill in this case, because comparing Time values would suffice; with large input files, processing can take a long time.
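If you want to stay with objects while deduplicating on the Time column alone, one option (a sketch of my own, not part of the answer above) is to group by Time and keep only the first row of each group:

```powershell
# Dedupe on the Time column only; the first row wins for each timestamp.
# File names are the ones from the question.
@(Import-Csv au2004npsa003-mm-business.csv) + @(Import-Csv au2004npsa003-nn-business.csv) |
  Group-Object -Property Time |
    ForEach-Object { $_.Group[0] } |
      Export-Csv joined.csv -NoTypeInformation
```

This avoids comparing every property, though Group-Object still collects all rows in memory before emitting groups.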


Since the data at hand comes from CSV files, the performance problem can be sidestepped with plain string processing, using Get-Content rather than Import-Csv:

Get-Content au2004npsa003-mm-business.csv, au2004npsa003-nn-business.csv |
  Select-Object -Unique |
    Set-Content -Encoding ASCII joined.csv

Note that I'm using -Encoding ASCII to mimic Export-Csv's default behavior; change as needed.

With input objects that are strings, Select-Object -Unique works as expected - and is faster.
Note, however, that with large input files you may run out of memory, given that Select-Object needs to build up an in-memory data structure containing all rows in order to determine uniqueness.
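If memory or speed becomes a concern, a streaming variant (a sketch, assuming the same file names) can replace Select-Object -Unique with a hash set, so each line is compared via a single O(1) lookup rather than pairwise:

```powershell
# HashSet[string]::Add() returns $true only for lines not seen before,
# so Where-Object passes each distinct line through exactly once.
$seen = [System.Collections.Generic.HashSet[string]]::new()
Get-Content au2004npsa003-mm-business.csv, au2004npsa003-nn-business.csv |
  Where-Object { $seen.Add($_) } |
    Set-Content -Encoding ASCII joined.csv
```

The identical header line of the second file is dropped automatically, since it has already been seen in the first file.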

mklement0 answered Oct 21 '25 14:10


