This is a matter of efficiency rather than troubleshooting. I have the following code snippet:
# The -R flag recovers malformed XML
xmlstarlet -q fo -R <<<"$xml_content" | \
# Delete xml_data
xmlstarlet ed -d "$xml_data" | \
# Delete index
xmlstarlet ed -d "$xml_index" | \
# Delete specific objects
xmlstarlet ed -d "$xml_nodes/objects" | \
# Append new node
xmlstarlet ed -s "$xml_nodes" -t elem -n subnode -v "Hello World" | \
# Add x attribute to node
xmlstarlet ed -i "($xml_nodes)[last()]" -t attr -n x -v "0" | \
# Add y attribute to node
xmlstarlet ed -i "($xml_nodes)[last()]" -t attr -n y -v "0" | \
# Add z attribute to node
xmlstarlet ed -i "($xml_nodes)[last()]" -t attr -n z -v "1" \
> "$output_file"
The variable $xml_content holds the XML tree of contents and nodes read (via cat) from a 472.6 MB file. The variable $output_file, as its name indicates, holds the path to the output file.
According to the brief article that this code is based on:
This is a bit inefficient since the xml file is parsed and written twice.
In my case, it is parsed and written far more than twice (eventually, in a loop, over 1000 times).
With the script above, the execution time of that short fragment alone is 4 minutes and 7 seconds. Assuming the repetitive piping, combined with the file size, is why the code runs slowly, every additional subnode I insert or delete will only make it slower.
I apologise in advance if I'm reiterating an old and probably already-answered topic, but I'm keen to understand in detail how xmlstarlet
behaves with large XML documents.
UPDATE
As claimed by @Cyrus in his prior answer:
Those two xmlstarlets should do the job:
xmlstarlet -q fo -R <<<"$xml_content" | \
xmlstarlet ed \
  -d "$xml_data" \
  -d "$xml_index" \
  -d "$xml_nodes/objects" \
  -s "$xml_nodes" -t elem -n subnode -v "Hello World" \
  -i "($xml_nodes)[last()]" -t attr -n x -v "0" \
  -i "($xml_nodes)[last()]" -t attr -n y -v "0" \
  -i "($xml_nodes)[last()]" -t attr -n z -v "1" \
  > "$output_file"
This produced the following errors:
-:691.84: Attribute x redefined
-:691.84: Attribute z redefined
-:495981.9: xmlSAX2Characters: huge text node: out of memory
-:495981.9: Extra content at the end of the document
I honestly don't know how these errors were produced, because I changed the code so often while testing various scenarios and potential alternatives; however, this is what did the trick for me:
xmlstarlet ed --omit-decl -L \
-d "$xml_data" \
-d "$xml_index" \
-d "$xml_nodes/objects" \
-s "$xml_nodes" -t elem -n subnode -v "Hello World" \
"$temp_xml_file"
xmlstarlet ed --omit-decl -L \
-i "($xml_nodes)[last()]" -t attr -n x -v "0" \
-i "($xml_nodes)[last()]" -t attr -n y -v "0" \
-i "($xml_nodes)[last()]" -t attr -n z -v "1" \
"$temp_xml_file"
Regarding the actual data
that is inserted, this is what I have at the beginning:
...
<node>
<subnode>A</subnode>
<subnode>B</subnode>
<objects>1</objects>
<objects>2</objects>
<objects>3</objects>
...
</node>
...
Executing the above (split) code gives me what I want:
...
<node>
<subnode>A</subnode>
<subnode>B</subnode>
<subnode x="0" y="0" z="1">Hello World</subnode>
</node>
...
By splitting them, xmlstarlet
is able to insert the attributes
into the newly created node; otherwise it adds them to the last()
instance of the selected XPath before the --subnode
is even created. To some extent this is still inefficient; nevertheless, the code now runs in under a minute.
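A tiny, self-contained reproduction may make the ordering issue easier to see. This is only a sketch of the split (two-pass) approach described above; the temp file, the sample XML and the XPaths are made up for illustration:

```shell
# Skip gracefully on machines without xmlstarlet.
command -v xmlstarlet >/dev/null 2>&1 || { echo "xmlstarlet not installed"; exit 0; }

demo=$(mktemp /tmp/demo.XXXXXX)
cat > "$demo" <<'EOF'
<root><node><subnode>A</subnode><subnode>B</subnode><objects>1</objects></node></root>
EOF

# Pass 1: delete the unwanted elements and append the new subnode.
xmlstarlet ed --omit-decl -L \
  -d "/root/node/objects" \
  -s "/root/node" -t elem -n subnode -v "Hello World" \
  "$demo"

# Pass 2: the new subnode exists on disk now, so (...)[last()] selects it.
xmlstarlet ed --omit-decl -L \
  -i "(/root/node/subnode)[last()]" -t attr -n x -v "0" \
  -i "(/root/node/subnode)[last()]" -t attr -n y -v "0" \
  -i "(/root/node/subnode)[last()]" -t attr -n z -v "1" \
  "$demo"

cat "$demo"
```

Run against the sample input, the second pass attaches x, y and z to the "Hello World" subnode rather than to B, matching the behaviour reported above.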
The following code,
xmlstarlet ed --omit-decl -L \
-d "$xml_data" \
-d "$xml_index" \
-d "$xml_nodes/objects" \
-s "$xml_nodes" -t elem -n subnode -v "Hello World" \
-i "($xml_nodes)[last()]" -t attr -n x -v "0" \
-i "($xml_nodes)[last()]" -t attr -n y -v "0" \
-i "($xml_nodes)[last()]" -t attr -n z -v "1" \
"$temp_xml_file"
however, gives me this:
...
<node>
<subnode>A</subnode>
<subnode x="0" y="0" z="1">B</subnode>
<subnode>Hello World</subnode>
</node>
...
By joining the xmlstarlets
into one, as in this other post also answered by @Cyrus, it somehow first adds the attributes
and then creates the --subnode
whose innerText
is Hello World.
This is another reference which states that "every edit operation is performed in sequence".
That article describes exactly the behaviour I'm looking for, yet I cannot make it work in a single xmlstarlet ed \
invocation. Alternatively, I tried:
- replacing ($xml_nodes)[last()]
with $xml_nodes[text() = 'Hello World']
- $prev
(or $xstar:prev
) as the argument to -i
like in this answer
- -r
to rename a temporary node after the attr
are added
All of the above insert the --subnode
but leave the new element without attributes.
Note: I run XMLStarlet 1.6.1 on OS X El Capitan v 10.11.3
BONUS
As I mentioned at the beginning, I wish to use a loop
along these lines:
list="$(tr -d '\r' < "$names")"
for name in $list; do
xmlstarlet ed --omit-decl -L \
-d "$xml_data" \
-d "$xml_index" \
-d "$xml_nodes/objects" \
-s "$xml_nodes" -t elem -n subnode -v "$name" \
-i "($xml_nodes)[last()]" -t attr -n x -v "0" \
-i "($xml_nodes)[last()]" -t attr -n y -v "0" \
-i "($xml_nodes)[last()]" -t attr -n z -v "1" \
"$temp_xml_file"
done
The $list
contains over a thousand different names, each of which needs to be added with its respective attributes.
The --value
of each attribute may also vary with every iteration of the loop.
Given the above model:
What is the fastest and most accurate version of such a loop
that still adds the attributes to the correct node?
Would it be faster to create the list of nodes in an external txt file and later add those xml elements (inside the txt file) into another XML file? If yes, how? Perhaps with sed
or grep
?
Regarding the last question, I refer to something like this. The node where the xml
from the txt should be added has to be specific, i.e. selectable by XPath at least, because I want to edit certain nodes only.
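For the sed idea, one possibility is a hypothetical sketch using sed's POSIX "r" command, which copies a file in after every line matching a pattern. The file names and the fragment below are placeholders; note that sed matches text lines, not XPath, so this only works when the target element can be identified by a regex on one line and the file's formatting is predictable:

```shell
# frag and xml are throwaway files standing in for the pre-built txt
# fragment and the target XML document.
frag=$(mktemp)
xml=$(mktemp)
printf '<subnode x="0" y="0" z="1">Hello World</subnode>\n' > "$frag"
printf '<root>\n<node>\n<subnode>A</subnode>\n</node>\n</root>\n' > "$xml"

# "r" appends the fragment file after each line matching /<node>/;
# the closing </node> line does not match, so nothing is duplicated.
sed "/<node>/r $frag" "$xml"
```

Because sed streams the file once without building a DOM, this is far faster than repeated xmlstarlet invocations, but it knows nothing about XML well-formedness, so the result should be validated (e.g. with xmlstarlet val) afterwards.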
Note: The above model is just an example. The actual loop
will add 26 --subnodes
per iteration and 3 or 4 attr
per --subnode.
That's why it's important for xmlstarlet
to add the attr
to the correct element and not to some other one. They have to be added in order.
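One way to speed the loop up, following the same split trick that already worked above: collect all the -s operations into one invocation and all the -i operations into a second, so the large file is parsed and rewritten only twice in total, no matter how many names there are. This is only a sketch; the tiny input, the names and the XPaths are placeholders, and it assumes each name is unique so that matching on text() picks the right subnode (per-name attribute values would simply be substituted where the "0"/"1" literals are):

```shell
command -v xmlstarlet >/dev/null 2>&1 || { echo "xmlstarlet not installed"; exit 0; }

batch=$(mktemp /tmp/batch.XXXXXX)
printf '<root><node><subnode>A</subnode></node></root>' > "$batch"

names="Alpha Beta Gamma"    # would come from: tr -d '\r' < "$names"

# Pass 1: one invocation appends every subnode (file parsed once).
set --
for name in $names; do
  set -- "$@" -s "/root/node" -t elem -n subnode -v "$name"
done
xmlstarlet ed --omit-decl -L "$@" "$batch"

# Pass 2: the elements exist now, so each can be matched by its text
# and given its own attribute values (file parsed a second time).
set --
for name in $names; do
  set -- "$@" \
    -i "/root/node/subnode[text()='$name']" -t attr -n x -v "0" \
    -i "/root/node/subnode[text()='$name']" -t attr -n y -v "0" \
    -i "/root/node/subnode[text()='$name']" -t attr -n z -v "1"
done
xmlstarlet ed --omit-decl -L "$@" "$batch"

cat "$batch"
```

Building the argument list with `set --` keeps this POSIX-portable; on bash, an array with `args+=(...)` reads more naturally. Either way the cost becomes two parses of the big file instead of one per name.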
Why not use parallel (or sem) so that you can parallelise the job over the number of cores available on the machine? The code I use parses an array of two variables, which I read into local variables just to make sure processes are isolated.
for array in "${listofarrays[@]}"; do
  local var1 var2              # local is only valid inside a function
  IFS=, read -r var1 var2 <<< "$array"
  sem -j +0 <code goes here>
done
sem --wait
unbuffer, from the expect package, might help.
To build a pipe from two commands a, z:
unbuffer a | z
To build a pipe from three (or more) commands a, b, z, add the -p option inside the pipe:
unbuffer a | unbuffer -p b | z
source: stolen from stackexchange