I'm scratching my head over a CSV file I cannot parse correctly because of its many errors. I extracted a sample you can download here: Test CSV File
Main errors (or what generated an error) are:
I first tried to clean the data line by line with regular expressions before loading it into R, but I couldn't solve the problem and it was too slow (the file is 200 MB).
So I decided to use a CSV parser under Node.js with the following code:
'use strict';

const Fs = require('fs');
const Csv = require('csv');

let input = 'data_stack.csv';
let readStream = Fs.createReadStream(input);
let option = {delimiter: ',', quote: '"', escape: '"', relax: true};

let parser = Csv.parse(option).on('data', (data) => {
    console.log(data);
});

readStream.pipe(parser);
But the parser still fails on some rows. (I did manage to skip the empty lines by adding skip_empty_lines: true to the options.) I don't know how to make this CSV clean, either with R or with Node.js.
Any help?
EDIT:
Following @Danny_ds's solution, I can now parse the file correctly, but I cannot stringify it back correctly.
With console.log() I get a proper object, but when I try to stringify it I don't get a clean CSV (there are still line breaks and empty rows).
Here is the code I'm using:
'use strict';

const Fs = require('fs');
const Csv = require('csv');

let input = 'data_stack.csv';
let output = 'data_output.csv';

let readStream = Fs.createReadStream(input);
let writeStream = Fs.createWriteStream(output);

let opt = {delimiter: ',', quote: '"', escape: '"', relax: true, skip_empty_lines: true};

let transformer = Csv.transform(data => {
    let dirty = data.toString();
    let replace = dirty.replace(/\r\n"/g, '\r\n').replace(/"\r\n/g, '\r\n').replace(/""/g, '"');
    return replace;
});

let parser = Csv.parse(opt);
let stringifier = Csv.stringify();

readStream.pipe(transformer).pipe(parser).pipe(stringifier).pipe(writeStream);
EDIT 2:
Here is the final code that works:
'use strict';

const Fs = require('fs');
const Csv = require('csv');

let input = 'data_stack.csv';
let output = 'data_output.csv';

let readStream = Fs.createReadStream(input);
let writeStream = Fs.createWriteStream(output);

let opt = {delimiter: ',', quote: '"', escape: '"', relax: true, skip_empty_lines: true};

let transformer = Csv.transform(data => {
    let dirty = data.toString();
    let replace = dirty
        .replace(/\r\n"/g, '\r\n')
        .replace(/"\r\n/g, '\r\n')
        .replace(/""/g, '"');
    return replace;
});

let parser = Csv.parse(opt);

let cleaner = Csv.transform(data => {
    let clean = data.map(l => {
        if (l.length > 100 || l[0] === '+') {
            return "Encoding issue";
        }
        return l;
    });
    return clean;
});

let stringifier = Csv.stringify();

readStream.pipe(transformer).pipe(parser).pipe(cleaner).pipe(stringifier).pipe(writeStream);
Thanks to everyone!
I don't know how to make this CSV clean, either with R or with Node.js.
Actually, it is not as bad as it looks.
This file can easily be converted to a valid csv using the following steps:
- Replace all "" with "
- Replace all \n" with \n
- Replace all "\n with \n

With \n meaning a newline, not the characters "\n" which also appear in your file.
Note that in your example file \n is actually \r\n (0x0d, 0x0a), so depending on the software you use you may need to replace \n with \r\n in the above examples. Also, in your example there is a newline after the last row, so a quote as the last character will be replaced too, but you might want to check this in the original file.
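Applied with JavaScript's String.replace, the steps might look like this sketch; the dirty input line here is made up for illustration, and the replacements run in the same order as the transform used later in the question:

```javascript
// Sketch of the three replacement steps on a made-up dirty line,
// using \r\n line endings as in the sample file.
const dirty = 'a,""x"",b\r\n"\r\n"c,d\r\n';

const clean = dirty
    .replace(/\r\n"/g, '\r\n')  // drop a stray quote right after a newline
    .replace(/"\r\n/g, '\r\n')  // drop a stray quote right before a newline
    .replace(/""/g, '"');       // un-double the escaped quotes

console.log(clean); // → 'a,"x",b\r\n\r\nc,d\r\n'
```

On this input the order doesn't change the result, but running the newline replacements before un-doubling the quotes avoids turning an escaped "" next to a newline into a quote that the newline rules would then miss.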
This should produce a valid csv file:
There will still be multiline fields, but that was probably intended. But now those are properly quoted and any decent csv parser should be able to handle multiline fields.
It looks like the original data has had an extra pass for escaping quote characters:
If the original fields contained a comma they were quoted, and if those fields already contained quotes, the quotes were escaped with another quote - which is the right way to do it.
But then every row containing a quote seems to have been quoted again (effectively converting the whole row into one quoted field), and all the quotes inside that row were escaped with another quote.
Obviously, something also went wrong with the multiline fields: quotes were added between the lines too, which is not correct.
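A tiny sketch of that extra escaping pass, with made-up field values:

```javascript
// Correct CSV escaping: a field containing a comma and quotes is
// wrapped in quotes, and the inner quotes are doubled.
const field = 'Final test, with a "comma"';
const correct = '"' + field.replace(/"/g, '""') + '"';
// correct === '"Final test, with a ""comma"""'

// The extra pass this file seems to have had: the already-escaped row
// is quoted and escaped a second time, turning it into one big field.
const row = 'a,' + correct + ',b';
const doubleEscaped = '"' + row.replace(/"/g, '""') + '"';
console.log(doubleEscaped);
// → '"a,""Final test, with a """"comma"""""",b"'
```

The runs of four and six quotes in the result match the pattern visible in the parsed rows further down.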
The data is not too messed up to work with. There is a clear pattern.
General steps:
Step 1 above is the most important. If you apply it, the problems with newlines, empty rows, quotes and commas disappear. If you look at the data you can see that columns 7, 8 and 9 contain mixed data, but it is always delimited by two or more quotes, e.g.:
good,clean,data,here,"""<-BEGINNING OF FIELD DATA> Oh no
++\n\n<br/>whats happening,, in here, pages of chinese
characters etc END OF FIELD ->""",more,clean,data
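That "delimited by two quotes" pattern is what the non-greedy match in the working example below relies on; here is a minimal standalone sketch with a made-up line:

```javascript
// Extract the ""-delimited mixed fields with a non-greedy match.
// [\s\S] matches any character including newlines, and *? stops at
// the first closing "" instead of the last one.
const sample = 'good,clean,""field with,\ncommas and\nnewlines"",more';
const matches = sample.match(/""[\s\S]*?""/g);
console.log(matches);
// → [ '""field with,\ncommas and\nnewlines""' ]
```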
Here is a working example based on the file provided:
const fs = require('fs');

fs.readFile('./data_stack.csv', (e, data) => {
    // Take out fields that are delimited with double+ quotes
    var dirty = data.toString();
    var matches = dirty.match(/""[\s\S]*?""/g);
    matches.forEach((m, i) => {
        dirty = dirty.replace(m, "<REPL-" + i + ">");
    });
    var cleanData = dirty
        .split('\n') // get lines
        // ignore first line with column names
        .filter((l, i) => i > 0)
        // remove first and last quotation mark if they exist
        .map(l => l[0] === '"' ? l.substring(1, l.length - 2) : l)
        // split into columns
        .map(l => l.split(','))
        // put the replaced fields back (columns 7, 8 and 9)
        .map(col => {
            if (col.length > 9) {
                col[7] = returnField(col[7]);
                col[8] = returnField(col[8]);
                col[9] = returnField(col[9]);
            }
            return col;

            function returnField(f) {
                if (f) {
                    var repls = f.match(/<.*?>/g);
                    if (repls)
                        repls.forEach(m => {
                            var num = +m.split('-')[1].split('>')[0];
                            f = f.replace(m, matches[num]);
                        });
                }
                return f;
            }
        });
    return cleanData;
});
Data looks pretty clean. All rows produce the expected number of columns matching the header (last 2 rows shown):
...,
[ '19403',
'560e348d2adaffa66f72bfc9',
'done',
'276',
'2015-10-02T07:38:53.172Z',
'20151002',
'560e31f69cd6d5059668ee16',
'""560e336ef3214201030bf7b5""',
'a+�a��a+�a+�a��a+�a��a+�a��',
'',
'560e2e362adaffa66f72bd99',
'55f8f041b971644d7d861502',
'foo',
'foo',
'[email protected]',
'bar.com' ],
[ '20388',
'560ce1a467cf15ab2cf03482',
'update',
'231',
'2015-10-01T07:32:52.077Z',
'20151001',
'560ce1387494620118c1617a',
'""""""Final test, with a comma""""""',
'',
'',
'55e6dff9b45b14570417a908',
'55e6e00fb45b14570417a92f',
'foo',
'foo',
'[email protected]',
'bar.com' ],