Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

LOAD DATA from CSV file where doublequote was used as the escape character

I have a bunch of CSV data that I need to load into a MySQL database. Well, CSV-ish, perhaps. (edit: actually, it looks like the stuff described in RFC 4180)

Each row is a list of comma-separated doublequoted strings. To escape any doublequotes that appear within a column value, double doublequotes are used. Backslashes are allowed to represent themselves.

For example, the line:

"", "\wave\", ""hello,"" said the vicar", "what are ""scare-quotes"" good for?", "I'm reading ""Bossypants"""

if parsed into JSON should be:

[ "", "\\wave\\", "\"hello,\" said the vicar", "what are \"scare-quotes\" good for?", "I'm reading \"Bossypants\"" ]

I'm trying to use the LOAD DATA to read the CSV in, but I'm running into some weird behaviour.


As an example, consider if I have a simple two column table

shell% mysql exampledb -e "describe person"
+-------+-----------+------+-----+---------+-------+
| Field | Type      | Null | Key | Default | Extra |
+-------+-----------+------+-----+---------+-------+
| ID    | int(11)   | YES  |     | NULL    |       |
| UID   | char(255) | YES  |     | NULL    |       |
+-------+-----------+------+-----+---------+-------+
shell%

If the first non-header line of my input file ends on "":

shell% cat temp-1.csv
"ID","UID"
"9",""
"0","Steve the Pirate"
"1","\Alpha"
"2","Hoban ""Wash"" Washburne"
"3","Pastor Veal"
"4","Tucker"
"10",""
"5","Simon"
"6","Sonny"
"7","Wat\"

I can either load every non-header line but the first:

mysql> DELETE FROM person;
Query OK, 0 rows affected (0.00 sec)

mysql> LOAD DATA
          LOCAL INFILE 'temp-1.csv'
          INTO TABLE person
          FIELDS
            TERMINATED BY ','
            ENCLOSED BY '"'
            ESCAPED BY '"'
          LINES
            TERMINATED BY '\n'
          IGNORE 1 LINES
       ;
Query OK, 9 rows affected (0.00 sec)
Records: 9  Deleted: 0  Skipped: 0  Warnings: 0

mysql> SELECT * FROM person;
+------+------------------------+
| ID   | UID                    |
+------+------------------------+
|    0 | Steve the Pirate       |
|   10 |                        |
|    1 | \Alpha                 |
|    2 | Hoban "Wash" Washburne |
|    3 | Pastor Veal            |
|    4 | Tucker                 |
|    5 | Simon                  |
|    6 | Sonny                  |
|    7 | Wat\                   |
+------+------------------------+
9 rows in set (0.00 sec)

Or I can load all lines including the header:

mysql> DELETE FROM person;
Query OK, 9 rows affected (0.00 sec)

mysql> LOAD DATA
          LOCAL INFILE 'temp-1.csv'
          INTO TABLE person
          FIELDS
            TERMINATED BY ','
            ENCLOSED BY '"'
            ESCAPED BY '"'
          LINES
            TERMINATED BY '\n'
          IGNORE 0 LINES
       ;
Query OK, 11 rows affected, 1 warning (0.01 sec)
Records: 11  Deleted: 0  Skipped: 0  Warnings: 1

mysql> show warnings;
+---------+------+--------------------------------------------------------+
| Level   | Code | Message                                                |
+---------+------+--------------------------------------------------------+
| Warning | 1366 | Incorrect integer value: 'ID' for column 'ID' at row 1 |
+---------+------+--------------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM person;
+------+------------------------+
| ID   | UID                    |
+------+------------------------+
|    0 | UID                    |
|    9 |                        |
|    0 | Steve the Pirate       |
|   10 |                        |
|    1 | \Alpha                 |
|    2 | Hoban "Wash" Washburne |
|    3 | Pastor Veal            |
|    4 | Tucker                 |
|    5 | Simon                  |
|    6 | Sonny                  |
|    7 | Wat\                   |
+------+------------------------+
11 rows in set (0.00 sec)

If no lines of my input file end on "":

shell% cat temp-2.csv
"ID","UID"
"0","Steve the Pirate"
"1","\Alpha"
"2","Hoban ""Wash"" Washburne"
"3","Pastor Veal"
"4","Tucker"
"5","Simon"
"6","Sonny"
"7","Wat\"

then I can either load no lines:

mysql> DELETE FROM person;
Query OK, 11 rows affected (0.00 sec)

mysql> LOAD DATA
          LOCAL INFILE 'temp-2.csv'
          INTO TABLE person
          FIELDS
            TERMINATED BY ','
            ENCLOSED BY '"'
            ESCAPED BY '"'
          LINES
            TERMINATED BY '\n'
          IGNORE 1 LINES
       ;
Query OK, 0 rows affected (0.00 sec)
Records: 0  Deleted: 0  Skipped: 0  Warnings: 0

mysql> SELECT * FROM person;
Empty set (0.00 sec)

Or I can load all the lines including the header:

mysql> DELETE FROM person;
Query OK, 0 rows affected (0.00 sec)

mysql> LOAD DATA
          LOCAL INFILE 'temp-2.csv'
          INTO TABLE person
          FIELDS
            TERMINATED BY ','
            ENCLOSED BY '"'
            ESCAPED BY '"'
          LINES
            TERMINATED BY '\n'
          IGNORE 0 LINES
       ;
Query OK, 9 rows affected, 1 warning (0.03 sec)
Records: 9  Deleted: 0  Skipped: 0  Warnings: 1

mysql> show warnings;
+---------+------+--------------------------------------------------------+
| Level   | Code | Message                                                |
+---------+------+--------------------------------------------------------+
| Warning | 1366 | Incorrect integer value: 'ID' for column 'ID' at row 1 |
+---------+------+--------------------------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT * FROM person;
+------+------------------------+
| ID   | UID                    |
+------+------------------------+
|    0 | UID                    |
|    0 | Steve the Pirate       |
|    1 | \Alpha                 |
|    2 | Hoban "Wash" Washburne |
|    3 | Pastor Veal            |
|    4 | Tucker                 |
|    5 | Simon                  |
|    6 | Sonny                  |
|    7 | Wat\                   |
+------+------------------------+
9 rows in set (0.00 sec)

So now that I've discovered many ways to do it wrong, how can I use LOAD DATA to import the data from these files into my database?

like image 692
rampion Avatar asked Jun 11 '13 20:06

rampion


People also ask

How do I escape characters in a CSV file?

By default, the escape character is a " (double quote) for CSV-formatted files. If you want to use a different escape character, use the ESCAPE clause of COPY , CREATE EXTERNAL TABLE or gpload to declare a different escape character.

How do I escape a quote from a CSV file?

There are 2 accepted ways of escaping double-quotes in a CSV file. One is using a 2 consecutive double-quotes to denote 1 literal double-quote in the data. The alternative is using a backslash and a single double-quote.

How do you handle double quotes in CSV?

Yes. You can import double quotation marks using CSV files and import maps by escaping the double quotation marks. To escape the double quotation marks, enclose them within another double quotation mark.

How do I escape a semicolon in a CSV file?

Actually, the only value to escape is double quotes symbol. All other cell content gets into it and displayed correctly in Excel. Checked with various versions of Excel and ODBC CSV parsers in Cyrillic locale under Windows. Save this answer.


1 Answers

According to the documentation for LOAD DATA, treating doubled double quotes as a double quote is the default:

If the field begins with the ENCLOSED BY character, instances of that character are recognized as terminating a field value only if followed by the field or line TERMINATED BY sequence. To avoid ambiguity, occurrences of the ENCLOSED BY character within a field value can be doubled and are interpreted as a single instance of the character. For example, if ENCLOSED BY '"' is specified, quotation marks are handled as shown here:

"The ""BIG"" boss"  -> The "BIG" boss
The "BIG" boss      -> The "BIG" boss
The ""BIG"" boss    -> The ""BIG"" boss

So all I need to do is disable interpreting \ as an escape character, by using ESCAPED BY ''.

LOAD DATA
  LOCAL INFILE 'temp-1.csv'
  INTO TABLE person
  FIELDS
    TERMINATED BY ','
    ENCLOSED BY '"'
    ESCAPED BY ''
  LINES
    TERMINATED BY '\n'
  IGNORE 1 LINES
;
like image 73
rampion Avatar answered Oct 18 '22 16:10

rampion