Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Import XML files to PostgreSQL

I do have a lot of XML files I would like to import in the table xml_data:

create table xml_data(result xml);

To do this I have a simple bash script with loop:

#!/bin/sh
FILES=/folder/with/xml/files/*.xml
for f in $FILES
do
  psql psql -d mydb -h myhost -U usr -c \'\copy xml_data from $f \'
done

However this will try to import each line of every file as separate row. This leads to error:

ERROR:  invalid XML content
CONTEXT:  COPY address_results, line 1, column result: "<?xml version="1.0" encoding="UTF-8"?>"

I understand why it fails, but cannot figure out how to make \copy to import the whole file at once into single row.

like image 388
Tomas Greif Avatar asked Sep 25 '13 14:09

Tomas Greif


2 Answers

Necromancing: For those that need a working example:

DO $$
   DECLARE myxml xml;
BEGIN

myxml := XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('MyData.xml'), 'UTF8'));

DROP TABLE IF EXISTS mytable;
CREATE TEMP TABLE mytable AS 

SELECT 
     (xpath('//ID/text()', x))[1]::text AS id
    ,(xpath('//Name/text()', x))[1]::text AS Name 
    ,(xpath('//RFC/text()', x))[1]::text AS RFC
    ,(xpath('//Text/text()', x))[1]::text AS Text
    ,(xpath('//Desc/text()', x))[1]::text AS Desc
FROM unnest(xpath('//record', myxml)) x
;

END$$;


SELECT * FROM mytable;

Or with less noise

SELECT 
     (xpath('//ID/text()', myTempTable.myXmlColumn))[1]::text AS id
    ,(xpath('//Name/text()', myTempTable.myXmlColumn))[1]::text AS Name 
    ,(xpath('//RFC/text()', myTempTable.myXmlColumn))[1]::text AS RFC
    ,(xpath('//Text/text()', myTempTable.myXmlColumn))[1]::text AS Text
    ,(xpath('//Desc/text()', myTempTable.myXmlColumn))[1]::text AS Desc
    ,myTempTable.myXmlColumn as myXmlElement
FROM unnest(
    xpath
    (    '//record'
        ,XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('MyData.xml'), 'UTF8'))
    )
) AS myTempTable(myXmlColumn)
;

With this example XML file (MyData.xml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data-set>
    <record>
        <ID>1</ID>
        <Name>A</Name>
        <RFC>RFC 1035[1]</RFC>
        <Text>Address record</Text>
        <Desc>Returns a 32-bit IPv4 address, most commonly used to map hostnames to an IP address of the host, but it is also used for DNSBLs, storing subnet masks in RFC 1101, etc.</Desc>
    </record>
    <record>
        <ID>2</ID>
        <Name>NS</Name>
        <RFC>RFC 1035[1]</RFC>
        <Text>Name server record</Text>
        <Desc>Delegates a DNS zone to use the given authoritative name servers</Desc>
    </record>
</data-set>

Note:
MyData.xml needs to be in the PG_Data directory (the parent-directory of the pg_stat directory).
e.g. /var/lib/postgresql/9.3/main/MyData.xml
This requires PostGreSQL 9.1+

Overall, you can achive it fileless, like this:

SELECT 
     (xpath('//ID/text()', myTempTable.myXmlColumn))[1]::text AS id
    ,(xpath('//Name/text()', myTempTable.myXmlColumn))[1]::text AS Name 
    ,(xpath('//RFC/text()', myTempTable.myXmlColumn))[1]::text AS RFC
    ,(xpath('//Text/text()', myTempTable.myXmlColumn))[1]::text AS Text
    ,(xpath('//Desc/text()', myTempTable.myXmlColumn))[1]::text AS Desc
    ,myTempTable.myXmlColumn as myXmlElement 
    -- Source: https://en.wikipedia.org/wiki/List_of_DNS_record_types
FROM unnest(xpath('//record', 
 CAST('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data-set>
    <record>
        <ID>1</ID>
        <Name>A</Name>
        <RFC>RFC 1035[1]</RFC>
        <Text>Address record</Text>
        <Desc>Returns a 32-bit IPv4 address, most commonly used to map hostnames to an IP address of the host, but it is also used for DNSBLs, storing subnet masks in RFC 1101, etc.</Desc>
    </record>
    <record>
        <ID>2</ID>
        <Name>NS</Name>
        <RFC>RFC 1035[1]</RFC>
        <Text>Name server record</Text>
        <Desc>Delegates a DNS zone to use the given authoritative name servers</Desc>
    </record>
</data-set>
' AS xml)   
)) AS myTempTable(myXmlColumn)
;

Note that unlike in MS-SQL, xpath text() returns NULL on a NULL value, and not an empty string.
If for whatever reason you need to explicitly check for the existence of NULL, you can use [not(@xsi:nil="true")], to which you need to pass an array of namespaces, because otherwise, you get an error (however, you can omit all namespaces but xsi).

SELECT 
     (xpath('//xmlEncodeTest[1]/text()', myTempTable.myXmlColumn))[1]::text AS c1

    ,(
    xpath('//xmlEncodeTest[1][not(@xsi:nil="true")]/text()', myTempTable.myXmlColumn
    ,
    ARRAY[
        -- ARRAY['xmlns','http://www.w3.org/1999/xhtml'], -- defaultns
        ARRAY['xsi','http://www.w3.org/2001/XMLSchema-instance'],
        ARRAY['xsd','http://www.w3.org/2001/XMLSchema'],        
        ARRAY['svg','http://www.w3.org/2000/svg'],
        ARRAY['xsl','http://www.w3.org/1999/XSL/Transform']
    ]
    )
    )[1]::text AS c22


    ,(xpath('//nixda[1]/text()', myTempTable.myXmlColumn))[1]::text AS c2 
    --,myTempTable.myXmlColumn as myXmlElement
    ,xmlexists('//xmlEncodeTest[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1e
    ,xmlexists('//nixda[1]' PASSING BY REF myTempTable.myXmlColumn) AS c2e
    ,xmlexists('//xmlEncodeTestAbc[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1ea
FROM unnest(xpath('//row', 
     CAST('<?xml version="1.0" encoding="utf-8"?>
    <table xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <row>
        <xmlEncodeTest xsi:nil="true" />
        <nixda>noob</nixda>
      </row>
    </table>
    ' AS xml)   
    )
) AS myTempTable(myXmlColumn)
;

You can also check if a field is contained in an XML-text, by doing

 ,xmlexists('//xmlEncodeTest[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1e

for example when you pass an XML-value to a stored-procedure/function for CRUD. (see above)

Also, note that the correct way to pass a null-value in XML is <elementName xsi:nil="true" /> and not <elementName /> or nothing. There is no correct way to pass NULL in attributes (you can only omit the attribute, but then it gets difficult/slow to infer the number of columns and their names in a large dataset).

e.g.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<table>
    <row column1="a" column2="3" />
    <row column1="b" column2="4" column3="true" />
</table>

(is more compact, but very bad if you need to import it, especially if from XML-files with multiple GB of data - see a wonderful example of that in the stackoverflow data dump)

SELECT 
     myTempTable.myXmlColumn
    ,(xpath('//@column1', myTempTable.myXmlColumn))[1]::text AS c1
    ,(xpath('//@column2', myTempTable.myXmlColumn))[1]::text AS c2
    ,(xpath('//@column3', myTempTable.myXmlColumn))[1]::text AS c3
    ,xmlexists('//@column3' PASSING BY REF myTempTable.myXmlColumn) AS c3e
    ,case when (xpath('//@column3', myTempTable.myXmlColumn))[1]::text is null then 1 else 0 end AS is_null 
FROM unnest(xpath('//row', '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<table>
    <row column1="a" column2="3" />
    <row column1="b" column2="4" column3="true" />
</table>'
))  AS myTempTable(myXmlColumn) 
like image 125
Stefan Steiger Avatar answered Sep 20 '22 01:09

Stefan Steiger


I would try a different approach: read the XML file directly into variable inside a plpgsql function and proceed from there. Should be a lot faster and a lot more robust.

CREATE OR REPLACE FUNCTION f_sync_from_xml()
  RETURNS boolean AS
$BODY$
DECLARE
    myxml    xml;
    datafile text := 'path/to/my_file.xml';
BEGIN
   myxml := pg_read_file(datafile, 0, 100000000);  -- arbitrary 100 MB max.

   CREATE TEMP TABLE tmp AS
   SELECT (xpath('//some_id/text()', x))[1]::text AS id
   FROM   unnest(xpath('/xml/path/to/datum', myxml)) x;
   ...

You need superuser privileges, and file must be local to the DB server, in an accessible directory.
Complete code example with more explanation and links:

  • XML data to PostgreSQL database
like image 45
Erwin Brandstetter Avatar answered Sep 19 '22 01:09

Erwin Brandstetter