Press Ctrl+Alt+T to open a terminal window, then use the cd command to navigate to the directory containing your XML file.
Definition. XML parsing is the process of reading an XML document and providing an interface to the user application for accessing the document. An XML parser is a software apparatus that accomplishes such tasks.
This is really just an explanation of Yuzem's answer, but I didn't feel like this much editing should be done to someone else's answer, and comments don't allow formatting, so...
rdom () { local IFS=\> ; read -d \< E C ;}
Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:
read_dom () {
local IFS=\>
read -d \< ENTITY CONTENT
}
Okay, so it defines a function called read_dom. The first line makes IFS (the internal field separator) local to this function and changes it to >. That means that when you read data, instead of being split automatically on spaces, tabs or newlines, it gets split on '>'. The next line says to read input from stdin and, instead of stopping at a newline, to stop when you see a '<' character (the -d flag sets the delimiter). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:
<tag>value</tag>
The first call to read_dom gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. Read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That gets split by the IFS into the two fields 'tag' and 'value'. Read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>'. That gets split by the IFS into the two fields '/tag' and ''. Read then assigns the variables like: ENTITY=/tag and CONTENT= (empty). Note that read returns a non-zero status whenever it reaches the end of the file before finding the delimiter, so this third call already returns non-zero even though it sets the variables, and any further call fails as well. In a while loop, that means the body does not run for the final closing tag.
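To see the sequence of reads for yourself, here is a minimal sketch that runs read_dom over the one-line sample and prints what each successful call assigned. The scratch file and the calls and leftover variables are made up for this illustration. In bash, read returns non-zero when it reaches end of file before finding the delimiter, so the loop body runs for the first two reads only, while the variables still hold what the last read saw.

```shell
#!/usr/bin/env bash
# read_dom as defined in the text above.
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

# Hypothetical scratch file holding the sample, with no trailing newline.
sample=$(mktemp)
printf '%s' '<tag>value</tag>' > "$sample"

calls=0
while read_dom; do
    calls=$((calls + 1))
    printf "call %d: ENTITY='%s' CONTENT='%s'\n" "$calls" "$ENTITY" "$CONTENT"
done < "$sample"

# The read that ended the loop hit end of file before a '<', but it still
# assigned the variables from the text it consumed.
leftover="ENTITY='${ENTITY}' CONTENT='${CONTENT}'"
printf 'after loop: %s\n' "$leftover"
rm -f "$sample"
```

If the last closing tag matters to you, check ENTITY once more after the loop ends.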
Now his while loop cleaned up a bit to match the above:
while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        # Quote the expansion so whitespace in the title is preserved
        echo "$CONTENT"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echoes the content of the tag. The fourth line exits. If it wasn't the title entity, the loop repeats from the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).
Now given the following (similar to what you get from listing a bucket on S3) for input.xml:
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>sth-items</Name>
<IsTruncated>false</IsTruncated>
<Contents>
<Key>[email protected]</Key>
<LastModified>2011-07-25T22:23:04.000Z</LastModified>
<ETag>"0032a28286680abee71aed5d059c6a09"</ETag>
<Size>1785</Size>
<StorageClass>STANDARD</StorageClass>
</Contents>
</ListBucketResult>
and the following loop:
while read_dom; do
echo "$ENTITY => $CONTENT"
done < input.xml
You should get:
=>
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" =>
Name => sth-items
/Name =>
IsTruncated => false
/IsTruncated =>
Contents =>
Key => [email protected]
/Key =>
LastModified => 2011-07-25T22:23:04.000Z
/LastModified =>
ETag => "0032a28286680abee71aed5d059c6a09"
/ETag =>
Size => 1785
/Size =>
StorageClass => STANDARD
/StorageClass =>
/Contents =>
So if we wrote a while loop like Yuzem's:
while read_dom; do
    if [[ $ENTITY = "Key" ]] ; then
        # Quote the expansion so keys containing spaces or globs print intact
        echo "$CONTENT"
    fi
done < input.xml
We'd get a listing of all the files in the S3 bucket.
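If you want to do something with the keys rather than just print them, one option is to collect them into a bash array. The following is only a sketch: the scratch file, the two key names and the keys array are invented for the example, and it carries the same limitations as read_dom itself (no handling of comments, CDATA or entities).

```shell
#!/usr/bin/env bash
read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

# Hypothetical listing with two made-up keys, not real S3 output.
listing=$(mktemp)
cat > "$listing" <<'EOF'
<ListBucketResult>
<Contents>
<Key>first.png</Key>
</Contents>
<Contents>
<Key>second.png</Key>
</Contents>
</ListBucketResult>
EOF

# Append the content of every Key element to the array.
keys=()
while read_dom; do
    if [[ $ENTITY = "Key" ]]; then
        keys+=("$CONTENT")
    fi
done < "$listing"
rm -f "$listing"

printf '%s\n' "${keys[@]}"
```

Because the loop reads from a redirect rather than a pipe, it runs in the current shell and the array is still populated after done.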
EDIT
If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function like:
read_dom () {
ORIGINAL_IFS=$IFS
IFS=\>
read -d \< ENTITY CONTENT
IFS=$ORIGINAL_IFS
}
Otherwise, any line splitting you do later in the script will be messed up.
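Another way to keep IFS from leaking, sketched here as an alternative and not something from the original answer, is to put the assignment directly in front of read. In bash, an assignment prefixed to a regular builtin such as read applies only for the duration of that command, so there is nothing to save or restore. The scratch file and the ifs_before and ifs_state variables are made up for the demonstration.

```shell
#!/usr/bin/env bash
read_dom () {
    # The IFS assignment applies to this single read command only.
    IFS=\> read -d \< ENTITY CONTENT
}

ifs_before=$IFS

# Hypothetical scratch file with the one-line sample.
sample=$(mktemp)
printf '%s' '<tag>value</tag>' > "$sample"
{
    read_dom    # consumes the empty text before the first '<'
    read_dom    # reads 'tag>value'
} < "$sample"
rm -f "$sample"

# Confirm the shell's own IFS was not touched.
if [[ $IFS = "$ifs_before" ]]; then
    ifs_state="unchanged"
else
    ifs_state="changed"
fi
printf '%s => %s (IFS %s)\n' "$ENTITY" "$CONTENT" "$ifs_state"
```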
EDIT 2
To split out attribute name/value pairs you can augment read_dom() like so:
read_dom () {
local IFS=\>
read -d \< ENTITY CONTENT
local ret=$?
TAG_NAME=${ENTITY%% *}
ATTRIBUTES=${ENTITY#* }
return $ret
}
Then write your function to parse and get the data you want like this:
parse_dom () {
if [[ $TAG_NAME = "foo" ]] ; then
eval local $ATTRIBUTES
echo "foo size is: $size"
elif [[ $TAG_NAME = "bar" ]] ; then
eval local $ATTRIBUTES
echo "bar type is: $type"
fi
}
Then while you read_dom, call parse_dom:
while read_dom; do
parse_dom
done
Then given the following example markup:
<example>
<bar size="bar_size" type="metal">bars content</bar>
<foo size="1789" type="unknown">foos content</foo>
</example>
You should get this output:
$ cat example.xml | ./bash_xml.sh
bar type is: metal
foo size is: 1789
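One caveat with eval local $ATTRIBUTES is that it executes whatever text appears in the attribute string, which is risky if the XML comes from somewhere you don't control. As a sketch of an alternative, the hypothetical helper get_attr below pulls a single attribute value out with a bash regular expression instead. It assumes double-quoted values with no embedded double quotes, and the name type_attr is made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of attribute $2 found in string $1.
# Returns non-zero if the attribute is not present.
get_attr () {
    local re="(^|[[:space:]])$2=\"([^\"]*)\""
    [[ $1 =~ $re ]] || return 1
    printf '%s\n' "${BASH_REMATCH[2]}"
}

# Attribute string as read_dom would leave it for the foo element.
ATTRIBUTES='size="1789" type="unknown"'

size=$(get_attr "$ATTRIBUTES" size)
type_attr=$(get_attr "$ATTRIBUTES" type)
echo "foo size is: $size"
echo "foo type is: $type_attr"
```

Nothing from the document is ever evaluated as shell code this way, at the cost of one call per attribute you care about.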
EDIT 3
Another user said they were having problems with it on FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom like:
read_dom () {
local IFS=\>
read -d \< ENTITY CONTENT
local RET=$?
TAG_NAME=${ENTITY%% *}
ATTRIBUTES=${ENTITY#* }
return $RET
}
I don't see any reason why that shouldn't work.
You can do that very easily using only bash. You only have to add this function:
rdom () { local IFS=\> ; read -d \< E C ;}
Now you can use rdom like read, but for HTML documents. When called, rdom will assign the element to the variable E and the content to the variable C.
For example, to do what you wanted to do:
while rdom; do
    if [[ $E = title ]]; then
        # Quote the expansion so whitespace in the title is preserved
        echo "$C"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
Command-line tools that can be called from shell scripts include:
4xpath - command-line wrapper around Python's 4Suite package
XMLStarlet
xpath - command-line wrapper around Perl's XPath library
sudo apt-get install libxml-xpath-perl
Xidel - Works with URLs as well as files. Also works with JSON
I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
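As a rough illustration of the xmllint route, the sketch below queries the example markup from the earlier answer with XPath expressions. It assumes your xmllint build supports the --xpath option (older libxml2 releases may not), and the scratch file and variable names are made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical scratch file holding the example markup from the earlier answer.
xml_file=$(mktemp)
cat > "$xml_file" <<'EOF'
<example>
<bar size="bar_size" type="metal">bars content</bar>
<foo size="1789" type="unknown">foos content</foo>
</example>
EOF

if command -v xmllint >/dev/null 2>&1; then
    # string() returns the text value of the first matching node.
    foo_size=$(xmllint --xpath 'string(/example/foo/@size)' "$xml_file")
    bar_text=$(xmllint --xpath 'string(/example/bar)' "$xml_file")
    echo "foo size: $foo_size"
    echo "bar text: $bar_text"
else
    echo "xmllint not found; it ships with the libxml2 command-line tools" >&2
fi
rm -f "$xml_file"
```

Unlike the read-based approach, a real parser copes with comments, CDATA sections and attributes in any order. Documents that declare a default namespace, such as XHTML, need extra handling in the XPath expression.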
You can use the xpath utility. It's installed with the Perl XML-XPath package.
Usage:
/usr/bin/xpath [filename] query
or XMLStarlet. To install it on openSUSE use:
sudo zypper install xmlstarlet
or try cnf xml on other platforms.
This is sufficient...
xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt
Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.