I'm scratching my head over a CSV file I cannot parse correctly because of its many errors. I extracted a sample you can download here: Test CSV File
Main errors (or what generated an error) are:
I first tried to clean the data line by line with regular expressions before loading it into R, but I couldn't solve the problem and it was too slow (the file is 200 MB).
So I decided to use a CSV parser under Node.js with the following code:
'use strict';

const Fs = require('fs');
const Csv = require('csv');

let input = 'data_stack.csv';
let readStream = Fs.createReadStream(input);
let option = {delimiter: ',', quote: '"', escape: '"', relax: true};

let parser = Csv.parse(option).on('data', (data) => {
    console.log(data);
});

readStream.pipe(parser);
But the parser still fails on some rows. (I did manage to skip the empty lines by adding skip_empty_lines: true to the options.) I don't know how to make this CSV clean, either with R or with Node.js.
Any help?
EDIT:
Following @Danny_ds's solution, I can now parse the file correctly, but I cannot stringify it back correctly.
With console.log() I get a proper object, but when I try to stringify it I don't get a clean CSV (there are still line breaks and empty rows).
Here is the code I'm using:
'use strict';

const Fs = require('fs');
const Csv = require('csv');

let input = 'data_stack.csv';
let output = 'data_output.csv';

let readStream = Fs.createReadStream(input);
let writeStream = Fs.createWriteStream(output);

let opt = {delimiter: ',', quote: '"', escape: '"', relax: true, skip_empty_lines: true};

let transformer = Csv.transform(data => {
    let dirty = data.toString();
    let replace = dirty.replace(/\r\n"/g, '\r\n').replace(/"\r\n/g, '\r\n').replace(/""/g, '"');
    return replace;
});

let parser = Csv.parse(opt);
let stringifier = Csv.stringify();

readStream.pipe(transformer).pipe(parser).pipe(stringifier).pipe(writeStream);
EDIT 2:
Here is the final code that works:
'use strict';

const Fs = require('fs');
const Csv = require('csv');

let input = 'data_stack.csv';
let output = 'data_output.csv';

let readStream = Fs.createReadStream(input);
let writeStream = Fs.createWriteStream(output);

let opt = {delimiter: ',', quote: '"', escape: '"', relax: true, skip_empty_lines: true};

let transformer = Csv.transform(data => {
    let dirty = data.toString();
    let replace = dirty
        .replace(/\r\n"/g, '\r\n')
        .replace(/"\r\n/g, '\r\n')
        .replace(/""/g, '"');
    return replace;
});

let parser = Csv.parse(opt);

let cleaner = Csv.transform(data => {
    let clean = data.map(l => {
        if (l.length > 100 || l[0] === '+') {
            return "Encoding issue";
        }
        return l;
    });
    return clean;
});

let stringifier = Csv.stringify();

readStream.pipe(transformer).pipe(parser).pipe(cleaner).pipe(stringifier).pipe(writeStream);
Thanks to everyone!
I don't know how to make this CSV clean, either with R or with Node.js.
Actually, it is not as bad as it looks.
This file can easily be converted to a valid csv using the following steps:
- Replace all "" with "
- Replace all \n" with \n
- Replace all "\n with \n

With \n meaning a newline, not the characters "\n" which also appear in your file.
Note that in your example file \n is actually \r\n (0x0d, 0x0a), so depending on the software you use you may need to replace \n with \r\n in the above examples. Also, in your example there is a newline after the last row, so a quote as the last character will be replaced too, but you might want to check this in the original file.
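Applied with JavaScript's String.replace, the steps might look like this sketch; the dirty input line here is made up for illustration, and the replacements run in the same order as the transform used later in the question:

```javascript
// Sketch of the three replacement steps on a made-up dirty line,
// using \r\n line endings as in the sample file.
const dirty = 'a,""x"",b\r\n"\r\n"c,d\r\n';

const clean = dirty
    .replace(/\r\n"/g, '\r\n')  // drop a stray quote right after a newline
    .replace(/"\r\n/g, '\r\n')  // drop a stray quote right before a newline
    .replace(/""/g, '"');       // un-double the escaped quotes

console.log(clean); // → 'a,"x",b\r\n\r\nc,d\r\n'
```

On this input the order doesn't change the result, but running the newline replacements before un-doubling the quotes avoids turning an escaped "" next to a newline into a quote that the newline rules would then miss.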
This should produce a valid csv file:
There will still be multiline fields, but that was probably intended. But now those are properly quoted and any decent csv parser should be able to handle multiline fields.
It looks like the original data has had an extra pass for escaping quote characters:
If the original fields contained a comma they were quoted, and if those fields already contained quotes, the quotes were escaped with another quote - which is the right way to do it.
But then every row containing a quote seems to have been quoted again (effectively converting the whole row into one quoted field), and all the quotes inside that row were escaped with another quote.
Obviously, something also went wrong with the multiline fields: quotes were added between the lines too, which is not correct.
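A tiny sketch of that extra escaping pass, with made-up field values:

```javascript
// Correct CSV escaping: a field containing a comma and quotes is
// wrapped in quotes, and the inner quotes are doubled.
const field = 'Final test, with a "comma"';
const correct = '"' + field.replace(/"/g, '""') + '"';
// correct === '"Final test, with a ""comma"""'

// The extra pass this file seems to have had: the already-escaped row
// is quoted and escaped a second time, turning it into one big field.
const row = 'a,' + correct + ',b';
const doubleEscaped = '"' + row.replace(/"/g, '""') + '"';
console.log(doubleEscaped);
// → '"a,""Final test, with a """"comma"""""",b"'
```

The runs of four and six quotes in the result match the pattern visible in the parsed rows further down.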
The data is not too messed up to work with. There is a clear pattern.
General steps:
Step 1 above is the most important. If you apply it, the problems with newlines, empty rows, quotes and commas disappear. If you look at the data you can see that columns 7, 8 and 9 contain mixed data, but it is always delimited by two or more quotes, e.g.:
good,clean,data,here,"""<-BEGINNING OF FIELD DATA> Oh no
++\n\n<br/>whats happening,, in here, pages of chinese
characters etc END OF FIELD ->""",more,clean,data
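That "delimited by two quotes" pattern is what the non-greedy match in the working example below relies on; here is a minimal standalone sketch with a made-up line:

```javascript
// Extract the ""-delimited mixed fields with a non-greedy match.
// [\s\S] matches any character including newlines, and *? stops at
// the first closing "" instead of the last one.
const sample = 'good,clean,""field with,\ncommas and\nnewlines"",more';
const matches = sample.match(/""[\s\S]*?""/g);
console.log(matches);
// → [ '""field with,\ncommas and\nnewlines""' ]
```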
Here is a working example based on the file provided:
const fs = require('fs');

fs.readFile('./data_stack.csv', (e, data) => {
    // Take out fields that are delimited with double+ quotes
    var dirty = data.toString();
    var matches = dirty.match(/""[\s\S]*?""/g);
    matches.forEach((m, i) => {
        dirty = dirty.replace(m, "<REPL-" + i + ">");
    });
    var cleanData = dirty
        .split('\n') // get lines
        // ignore first line with column names
        .filter((l, i) => i > 0)
        // remove first and last quotation mark if they exist
        .map(l => l[0] === '"' ? l.substring(1, l.length - 2) : l)
        // split into columns
        .map(l => l.split(','))
        // put the replaced fields back (columns 7, 8 and 9)
        .map(col => {
            if (col.length > 9) {
                col[7] = returnField(col[7]);
                col[8] = returnField(col[8]);
                col[9] = returnField(col[9]);
            }
            return col;

            function returnField(f) {
                if (f) {
                    var repls = f.match(/<.*?>/g);
                    if (repls)
                        repls.forEach(m => {
                            var num = +m.split('-')[1].split('>')[0];
                            f = f.replace(m, matches[num]);
                        });
                }
                return f;
            }
        });
    return cleanData;
});
Data looks pretty clean. All rows produce the expected number of columns matching the header (last 2 rows shown):
...,
[ '19403',
'560e348d2adaffa66f72bfc9',
'done',
'276',
'2015-10-02T07:38:53.172Z',
'20151002',
'560e31f69cd6d5059668ee16',
'""560e336ef3214201030bf7b5""',
'a+�a��a+�a+�a��a+�a��a+�a��',
'',
'560e2e362adaffa66f72bd99',
'55f8f041b971644d7d861502',
'foo',
'foo',
'[email protected]',
'bar.com' ],
[ '20388',
'560ce1a467cf15ab2cf03482',
'update',
'231',
'2015-10-01T07:32:52.077Z',
'20151001',
'560ce1387494620118c1617a',
'""""""Final test, with a comma""""""',
'',
'',
'55e6dff9b45b14570417a908',
'55e6e00fb45b14570417a92f',
'foo',
'foo',
'[email protected]',
'bar.com' ],