 

Insert 1000+ nodes and attributes with XMLStarlet - Running Slow

This is a matter of efficiency rather than troubleshooting. I have the following code snippet:

# -q suppresses error output; fo -R tries to recover what is parsable
# from malformed XML
xmlstarlet -q fo -R <<<"$xml_content" |
    # Delete xml_data
    xmlstarlet ed -d "$xml_data" |
    # Delete index
    xmlstarlet ed -d "$xml_index" |
    # Delete specific objects
    xmlstarlet ed -d "$xml_nodes/objects" |
    # Append new node
    xmlstarlet ed -s "$xml_nodes" -t elem -n subnode -v "Hello World" |
        # Add x attribute to node
        xmlstarlet ed -i "($xml_nodes)[last()]" -t attr -n x -v "0" |
        # Add y attribute to node
        xmlstarlet ed -i "($xml_nodes)[last()]" -t attr -n y -v "0" |
        # Add z attribute to node
        xmlstarlet ed -i "($xml_nodes)[last()]" -t attr -n z -v "1" \
            > "$output_file"
  • The variable $xml_content holds the XML tree of contents and nodes, read from a 472.6 MB file with the cat command.

  • The variable $output_file, as its name indicates, contains the path to the output file.

  • The remaining variables contain the corresponding XPath expressions I want to edit.
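
For concreteness, the variables look something like this (the values here are made up for illustration; the real paths and XPaths point into the 472.6 MB document):

# Hypothetical example values; the real paths and XPaths differ
xml_content=$(cat huge.xml)    # entire 472.6 MB document in a variable
output_file="result.xml"
xml_data="/root/data"
xml_index="/root/index"
xml_nodes="/root/node"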

According to this brief article, which helped me come up with this code:

This is a bit inefficient since the xml file is parsed and written twice.

In my case, it is parsed and written more than twice (eventually, in a loop, over 1000 times).

So, with the above script, the execution time of that short fragment alone is 4 minutes and 7 seconds.

Assuming the excessive, repetitive piping, together with the file size, is why the code runs slowly, every additional subnode I insert or delete will make it execute even slower.

I apologise in advance if I sound monotonous by reiterating myself or by bringing up an old and probably already answered topic; however, I'm really keen to understand in detail how xmlstarlet works with large XML documents.


UPDATE

As claimed by @Cyrus in his prior answer:

Those two xmlstarlets should do the job:

xmlstarlet -q fo -R <<<"$xml_content" |\
  xmlstarlet ed \
    -d "$xml_data" \
    -d "$xml_index" \
    -d "$xml_nodes/objects" \
    -s "$xml_nodes" -t elem -n subnode -v "Hello World" \
    -i "($xml_nodes)[last()]" -t attr -n x -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n y -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n z -v "1" > "$output_file"

This produced the following errors:

  • -:691.84: Attribute x redefined
  • -:691.84: Attribute z redefined
  • -:495981.9: xmlSAX2Characters: huge text node: out of memory
  • -:495981.9: Extra content at the end of the document

I honestly don't know how these errors were produced, because I changed the code so often while testing various scenarios and potential alternatives; however, this is what did the trick for me:

# Pass 1: delete the old nodes and append the new subnode (in place, -L)
xmlstarlet ed --omit-decl -L \
    -d "$xml_data" \
    -d "$xml_index" \
    -d "$xml_nodes/objects" \
    -s "$xml_nodes" -t elem -n subnode -v "Hello World" \
    "$temp_xml_file"

# Pass 2: add the attributes to the subnode created in pass 1
xmlstarlet ed --omit-decl -L \
    -i "($xml_nodes)[last()]" -t attr -n x -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n y -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n z -v "1" \
    "$temp_xml_file"

Regarding the actual data that is inserted, this is what I have at the beginning:

...
<node>
    <subnode>A</subnode>
    <subnode>B</subnode>
    <objects>1</objects>
    <objects>2</objects>
    <objects>3</objects>
    ...
</node>
...

Executing the above (split) code gives me what I want:

...
<node>
    <subnode>A</subnode>
    <subnode>B</subnode>
    <subnode x="0" y="0" z="1">Hello World</subnode>
</node>
...

By splitting them, xmlstarlet is able to insert the attributes into the newly created node; otherwise it adds them to the last() instance of the selected XPath before the --subnode is even created. To some extent this is still inefficient; nevertheless, the code now runs in less than a minute.

The following code,

xmlstarlet ed --omit-decl -L \
    -d "$xml_data" \
    -d "$xml_index" \
    -d "$xml_nodes/objects" \
    -s "$xml_nodes" -t elem -n subnode -v "Hello World" \
    -i "($xml_nodes)[last()]" -t attr -n x -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n y -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n z -v "1" \
    "$temp_xml_file"

however, gives me this:

...
<node>
    <subnode>A</subnode>
    <subnode x="0" y="0" z="1">B</subnode>
    <subnode>Hello World</subnode>
</node>
...

By joining the xmlstarlet calls into one, as in this post also answered by @Cyrus, it somehow first adds the attributes and then creates the --subnode whose text content is Hello World.

  • Can anyone explain why this strange behaviour is happening?

This is another reference, which states that "every edit operation is performed in sequence".

The above article explains exactly what I'm looking for, yet I cannot manage to make it all work in a single xmlstarlet ed invocation. Alternatively, I tried:

  • Replacing ($xml_nodes)[last()] with $xml_nodes[text() = 'Hello World']
  • Using $prev (or $xstar:prev) as the argument to -i, like in this answer.
  • The temporary-element-name trick: using -r to rename the temp node after the attributes are added.

All of the above insert the --subnode but leave the new element without attributes.
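
For completeness, this is the $prev form I tried (the form documented in the xmlstarlet ed manual; deletions omitted for brevity). Note that $prev must be single-quoted so the shell does not expand it. On my 1.6.1 build it still left the attributes off:

xmlstarlet ed --omit-decl -L \
    -s "$xml_nodes" -t elem -n subnode -v "Hello World" \
    -i '$prev' -t attr -n x -v "0" \
    -i '$prev' -t attr -n y -v "0" \
    -i '$prev' -t attr -n z -v "1" \
    "$temp_xml_file"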

Note: I run XMLStarlet 1.6.1 on OS X El Capitan v 10.11.3


BONUS

As I mentioned at the beginning, I wish to use a loop along these lines:

list="$(tr -d '\r' < "$names")"

for name in $list; do
    xmlstarlet ed --omit-decl -L \
    -d "$xml_data" \
    -d "$xml_index" \
    -d "$xml_nodes/objects" \
    -s "$xml_nodes" -t elem -n subnode -v "$name" \
    -i "($xml_nodes)[last()]" -t attr -n x -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n y -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n z -v "1" \
    "$temp_xml_file"
done

The $list contains over a thousand different names, which need to be added with their respective attributes. The --value of each attribute may also vary with every iteration. Given the above model:

  • What is the fastest and most accurate version of such a loop, given that the attributes must be added correctly to the corresponding node? (One batching idea is sketched after the note below.)

  • Would it be faster to create the list of nodes in an external txt file and later add those xml elements (inside the txt file) into another XML file? If so, how? Perhaps with sed or grep?

Regarding the last question, I refer to something like this. The node where the xml from the txt should be added has to be specific, e.g. selectable by XPath at least because I want to edit certain nodes only.

Note: The above model is just an example. The actual loop will add 26 --subnodes per iteration and 3 or 4 attributes per --subnode. That's why it's important for xmlstarlet to add the attributes properly and not to some other element; they have to be added in order.
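
The batching idea mentioned in the first bullet above is to accumulate all the -s operations into a single argument list, so the file is parsed and written once instead of once per name. This is only a hypothetical sketch, and because of the attribute-placement quirk described earlier, the -i attribute operations would still have to go in a second, similar pass:

# Hypothetical sketch: build one argument list so xmlstarlet parses and
# writes the file a single time. The attribute-placement quirk still
# applies, so the -i operations would go in a second pass.
args=()
while IFS= read -r name; do
    args+=( -s "$xml_nodes" -t elem -n subnode -v "$name" )
done < <(tr -d '\r' < "$names")

xmlstarlet ed --omit-decl -L "${args[@]}" "$temp_xml_file"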

asked Nov 11 '17 by Ava Barbilla

2 Answers

Why not use parallel (or sem) so that you can parallelise the job over the number of cores available on the machine? The code I use parses an array with two variables, which I declare local just to make sure the processes are isolated.

# Note: `local` is only valid inside a function
for array in "${listofarrays[@]}"; do
    local var1; local var2
    IFS=, read -r var1 var2 <<< "$array"
    sem -j +0 <code goes here>
done
sem --wait
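
For instance, with a concrete command in place of the placeholder (a hypothetical illustration; chunks/*.xml stands in for whatever inputs the jobs operate on):

# Hypothetical illustration: one sem job per file, using all cores (-j +0)
for f in chunks/*.xml; do
    sem -j +0 xmlstarlet sel -t -v 'count(//subnode)' -n "$f"
done
sem --wait    # block until every queued job has finished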
answered Nov 18 '22 by user1747036


unbuffer might help

unbuffer comes from the expect package.

To build a pipe from two commands a and z:

unbuffer a | z

To build a pipe from three (or more) commands a, b, z, add the -p option inside the pipe:

unbuffer a | unbuffer -p b | z

source: stolen from stackexchange
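
Applied to a pipeline like the one in the question, the pattern would look something like this (a hypothetical adaptation; input.xml stands in for the source document):

# Hypothetical adaptation: unbuffer the first stage, add -p inside the
# pipe, and leave the final stage as-is
unbuffer xmlstarlet fo -R input.xml \
    | unbuffer -p xmlstarlet ed -d "$xml_data" \
    | xmlstarlet ed -s "$xml_nodes" -t elem -n subnode -v "Hello World" \
    > "$output_file"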

answered Nov 18 '22 by Mila Nautikus