Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

replace thousands separators in csv with regex

I'm running into problems trying to pull the thousands separators out of some currency values in a set of files. The "bad" values are delimited with commas and double quotes. There are other values in there that are < $1000 that present no issue.

Example of existing file:

"12,345.67",12.34,"123,456.78",1.00,"123,456,789.12"

Example of desired file (thousands separators removed):

"12345.67",12.34,"123456.78",1.00,"123456789.12"

I found a regex expression for matching the numbers with separators that works great, but I'm having trouble with the -replace operator. The replacement value is confusing me. I read about $& and I'm wondering if I should use that here. I tried $_, but that pulls out ALL my commas. Do I have to use $matches somehow?

Here's my code:

$Files = Get-ChildItem *input.csv
foreach ($file in $Files)
    {
        $file |
        Get-Content | #assume that I can't use -raw
        % {$_ -replace '"[\d]{1,3}(,[\d]{3})*(\.[\d]+)?"', ("$&" -replace ',','')} | #this is my problem
        out-file output.csv -append -encoding ascii
    }
like image 631
astraljack Avatar asked Jun 27 '16 17:06

astraljack


3 Answers

Tony Hinkle's comment is the answer: don't use regex for this (at least not directly on the CSV file).

Your CSV is valid, so you should parse it as such, work on the objects (change the text if you want), then write a new CSV.

Import-Csv -Path .\my.csv | ForEach-Object {
    $_ | ForEach-Object {
        $_ -replace ',',''
    }
} | Export-Csv -Path .\my_new.csv

(this code needs work, specifically the middle as the row will have each column as a property, not an array, but a more complete version of your CSV would make that easier to demonstrate)

like image 86
briantist Avatar answered Nov 17 '22 20:11

briantist


You can try with this regex:

,(?=(\d{3},?)+(?:\.\d{1,3})?")

See Live Demo or in powershell:

% {$_ -replace ',(?=(\d{3},?)+(?:\.\d{1,3})?")','' }

But it's more about the challenge that regex can bring. For proper work, use @briantist answer which is the clean way to do this.

like image 28
Thomas Ayoub Avatar answered Nov 17 '22 22:11

Thomas Ayoub


I would use a simpler regex, and use capture groups instead of the entire capture. I have tested the follow regular expression with your input and found no issues.

% {$_ -replace '([\d]),([\d])','$1$2' }

eg. Find all commas with a number before and after (so that the weird mixed splits dont matter) and replace the comma entirely.

This would have problems if your input has a scenario without that odd mixing of quotes and no quotes.

like image 3
ConstantineK Avatar answered Nov 17 '22 22:11

ConstantineK