Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL*Loader: Dealing with delimiter characters in data

I am loading some data to Oracle via SQLLDR. The source file is "pipe delimited".

FIELDS TERMINATED BY '|'

But some records contain pipe character in data, and not as separator. So it breaks correct loading of records as it understands indata pipe characters as field terminator.

Can you point me a direction to solve this issue?

Data file is about 9 GB, so it is hard to edit manually.

For example,

Loaded row:

ABC|1234567|STR 9 R 25|98734959,32|28.12.2011

Rejected Row:

DE4|2346543|WE| 454|956584,84|28.11.2011

Error:

Rejected - Error on table HSX, column DATE_N.
ORA-01847: day of month must be between 1 and last day of month

DATE_N column is the last one.

like image 603
bonsvr Avatar asked Jan 18 '23 17:01

bonsvr


2 Answers

You could not use any separator, and do something like:

field FILLER,
col1 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\1')",
col2 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\2')",
col3 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\3')",
col4 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\4')",
col5 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\5')",
col6 EXPRESSION "REGEXP_REPLACE(:field,'^([^|]*)\\|([^|]*)\\|(.*)\\|([^|]*)\\|([^|]*)\\|([^|]*)$', '\\6')"

This regexp takes six capture groups (inside parentheses) separated by a vertical bar (I had to escape it because otherwise it means OR in regexp). All groups except the third cannot contain a vertical bar ([^|]*), the third group may contain anything (.*), and the regexp must span from beginning to end of the line (^ and $).

This way we are sure that the third group will eat all superfluous separators. This only works because you've only one field that may contain separators. If you want to proofcheck you can for example specify that the fourth group starts with a digit (include \d at the beginning of the fourth parenthesized block).

I have doubled all backslashes because we are inside a double-quoted expression, but I am not really sure that I ought to.

like image 189
Benoit Avatar answered Jan 28 '23 04:01

Benoit


It looks to me that it's not really possible for SQL*Loader to handle your file because of the third field which: can contain the delimiter, is not surrounded by quotes and is of a variable length. Instead, if the data you provide is an accurate example then I can provide a sample workaround. First, create a table with one column of VARCHAR2 with length the same as the maximum length of any one line in your file. Then just load the entire file into this table. From there you can extract each column with a query such as:

with CTE as
       (select 'ABC|1234567|STR 9 R 25|98734959,32|28.12.2011' as CTETXT
          from dual
        union all
        select 'DE4|2346543|WE| 454|956584,84|28.11.2011' from dual)
select substr(CTETXT, 1, instr(CTETXT, '|') - 1) as COL1
      ,substr(CTETXT
             ,instr(CTETXT, '|', 1, 1) + 1
             ,instr(CTETXT, '|', 1, 2) - instr(CTETXT, '|', 1, 1) - 1)
         as COL2
      ,substr(CTETXT
             ,instr(CTETXT, '|', 1, 2) + 1
             ,instr(CTETXT, '|', -1, 1) - instr(CTETXT, '|', 1, 2) - 1)
         as COL3
      ,substr(CTETXT, instr(CTETXT, '|', -1, 1) + 1) as COL4
  from CTE

It's not perfect (though it may be adaptable to SQL*Loader) but would need a bit of work if you have more columns or if your third field is not what I think it is. But, it's a start.

like image 30
John Doyle Avatar answered Jan 28 '23 06:01

John Doyle