Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to parse a dirty CSV with Node.js?

I'm scratching my head on a CSV file I cannot parse correctly, due to many errors. I extracted a sample you can download here: Test CSV File

Main errors (or what generated an error) are:

  • Quotes & commas (many errors when trying to parse the file with R)
  • Empty rows
  • Unexpected line break inside a field

I first decided to use Regular Expression line by line to clean the data before loading them into R but couldn't solve the problem and it was two slow (200Mo file)

So I decided to use a CSV parser under Node.js with the following code:

'use strict';

const Fs  = require('fs');
const Csv = require('csv');

let input       = 'data_stack.csv';
let readStream  = Fs.createReadStream(input);
let option      = {delimiter: ',', quote: '"', escape: '"', relax: true};

let parser = Csv.parse(option).on('data', (data) => {
    console.log(data)
});

readStream.pipe(parser)

But:

  • Some rows are parsed correctly (array of strings)
  • Some are not parsed (all fields are one string)
  • Some rows are still empty (can be solve by adding skip_empty_lines: true to the options)
  • I don't know how to handle the unexpected line break.

I don't know how to make this CSV clean, neither with R nor with Node.js.

Any help?

EDIT:

Following @Danny_ds solution, I can parse it correctly. Now I cannot stringify it back correctly.

with console.log(); I get a proper object but when I'm trying to stringify it, I don't get a clean CSV (still have line break and empty rows).

Here is the code I'm using:

'use strict';

const Fs  = require('fs');
const Csv = require('csv');


let input  = 'data_stack.csv';
let output = 'data_output.csv';

let readStream  = Fs.createReadStream(input);
let writeStream = Fs.createWriteStream(output);

let opt  = {delimiter: ',', quote: '"', escape: '"', relax: true, skip_empty_lines: true};


let transformer = Csv.transform(data => {
    let dirty = data.toString();
    let replace = dirty.replace(/\r\n"/g, '\r\n').replace(/"\r\n/g, '\r\n').replace(/""/g, '"');

    return replace;
});

let parser = Csv.parse(opt);
let stringifier = Csv.stringify();

readStream.pipe(transformer).pipe(parser).pipe(stringifier).pipe(writeStream);

EDIT 2:

Here is the final code that works:

'use strict';

const Fs  = require('fs');
const Csv = require('csv');


let input  = 'data_stack.csv';
let output = 'data_output.csv';

let readStream  = Fs.createReadStream(input);
let writeStream = Fs.createWriteStream(output);

let opt  = {delimiter: ',', quote: '"', escape: '"', relax: true, skip_empty_lines: true};


let transformer = Csv.transform(data => {
    let dirty = data.toString();
    let replace = dirty
        .replace(/\r\n"/g, '\r\n')
        .replace(/"\r\n/g, '\r\n')
        .replace(/""/g, '"');

    return replace;
});

let parser = Csv.parse(opt);

let cleaner = Csv.transform(data => {
    let clean = data.map(l => {
        if (l.length > 100 || l[0] === '+') {
            return l = "Encoding issue";
        }
        return l;
    });
    return clean;
});

let stringifier = Csv.stringify();

readStream.pipe(transformer).pipe(parser).pipe(cleaner).pipe(stringifier).pipe(writeStream);

Thanks to everyone!

like image 632
Synleb Avatar asked Jan 23 '16 13:01

Synleb


People also ask

How do I parse a csv file in node JS?

You will use the fs module's createReadStream() method to read the data from the CSV file and create a readable stream. You will then pipe the stream to another stream initialized with the csv-parse module to parse the chunks of data. Once the chunks of data have been parsed, you can log them in the console.

Are CSV files easy to parse?

Parsing CSV files in Python is quite easy. Python has an inbuilt CSV library which provides the functionality of both readings and writing the data from and to CSV files. There are a variety of formats available for CSV files in the library which makes data processing user-friendly.


2 Answers

I don't know how to make this CSV clean, neither with R nor with Node.js.

Actually, it is not as bad as it looks.

This file can easily be converted to a valid csv using the following steps:

  • replace all "" with ".
  • replace all \n" with \n.
  • replace all "\n with \n.

With \n meaning a newline, not the characters "\n" which also appear in your file.

Note that in your example file \n is actually \r\n (0x0d, 0x0a), so depending on the software you use you may need to replace \n in \r\n in the above examples. Also, in your example there is a newline after the last row, so a quote as the last character will be replaced too, but you might want to check this in the original file.

This should produce a valid csv file:

enter image description here enter image description here

There will still be multiline fields, but that was probably intended. But now those are properly quoted and any decent csv parser should be able to handle multiline fields.


It looks like the original data has had an extra pass for escaping quote characters:

  • If the original fields contained a , they were quoted, and if those fields already contained quotes, the quotes were escaped with another quote - which is the right way to do.

  • But then all rows containing a quote seem to have been quoted again (actually converting those rows to one quoted field), and all the quotes inside that row were escaped with another quote.

  • Obviously, something went wrong with the multiline fields. Quotes were added between the multiple lines too, which is not the right way to do.

like image 127
Danny_ds Avatar answered Oct 30 '22 15:10

Danny_ds


The data is not too messed up to work with. There is a clear pattern.

General steps:

  1. Temporarily remove mixed format inner fields (beginning with double(or more) quotes and having all kinds of characters.
  2. Remove quotes from start and end of quoted lines giving clean CSV
  3. Split data into columns
  4. Replace removed fields

Step 1 above is the most important. If you apply this then the problems with new lines, empty rows and quotes and commas disappear. If you look in the data you can see columns 7, 8 and 9 contain mixed data. But it is always delimited by 2 quotes or more. e.g.

good,clean,data,here,"""<-BEGINNING OF FIELD DATA> Oh no
++\n\n<br/>whats happening,, in here, pages of chinese
characters etc END OF FIELD ->""",more,clean,data

Here is a working example based on the file provided:

fs.readFile('./data_stack.csv', (e, data) => {

    // Take out fields that are delimited with double+ quotes
    var dirty = data.toString();
    var matches = dirty.match(/""[\s\S]*?""/g);
    matches.forEach((m,i) => {
        dirty = dirty.replace(m, "<REPL-" + i + ">");
    });

    var cleanData =   dirty
        .split('\n') // get lines

        // ignore first line with column names
        .filter((l, i) => i > 0)

        // remove first and last quotation mark if exists
        .map(l => l[0] === '"' ? l.substring(1, l.length-2) : l) // remove quotes from quoted lines

        // split into columns
        .map(l => l.split(','))

        // return replaced fields back to data (columsn 7,8 and 9)
        .map(col => {

            if (col.length > 9) {
                col[7] = returnField(col[7]);
                col[8] = returnField(col[8]);
                col[9] = returnField(col[9]);
            }
            return col;

            function returnField(f) {
                if (f) {
                    var repls = f.match(/<.*?>/g)
                    if (repls)
                        repls.forEach(m => {
                            var num = +m.split('-')[1].split('>')[0];
                            f = f.replace(m, matches[num]);
                        });
                }
                return f;
            }
        })

    return cleanData
});

Result:

Data looks pretty clean. All rows produce the expected number of columns matching the header (last 2 rows shown):

  ...,
  [ '19403',
    '560e348d2adaffa66f72bfc9',
    'done',
    '276',
    '2015-10-02T07:38:53.172Z',
    '20151002',
    '560e31f69cd6d5059668ee16',
    '""560e336ef3214201030bf7b5""',
    'a+�a��a+�a+�a��a+�a��a+�a��',
    '',
    '560e2e362adaffa66f72bd99',
    '55f8f041b971644d7d861502',
    'foo',
    'foo',
    '[email protected]',
    'bar.com' ],
  [ '20388',
    '560ce1a467cf15ab2cf03482',
    'update',
    '231',
    '2015-10-01T07:32:52.077Z',
    '20151001',
    '560ce1387494620118c1617a',
    '""""""Final test, with a comma""""""',
    '',
    '',
    '55e6dff9b45b14570417a908',
    '55e6e00fb45b14570417a92f',
    'foo',
    'foo',
    '[email protected]',
    'bar.com' ],
like image 25
chriskelly Avatar answered Oct 30 '22 14:10

chriskelly