
Extract xml tag value using awk command

Tags: shell, unix, aix, xml, awk

I have an XML file like the one below:

<root>    
<FIToFICstmrDrctDbt>
            <GrpHdr>
                <MsgId>A</MsgId>
                <CreDtTm>2001-12-17T09:30:47</CreDtTm>
                <NbOfTxs>0</NbOfTxs>
                <TtlIntrBkSttlmAmt Ccy="EUR">0.0</TtlIntrBkSttlmAmt>
                <IntrBkSttlmDt>1967-08-13</IntrBkSttlmDt>
                <SttlmInf>
                    <SttlmMtd>CLRG</SttlmMtd>
                    <ClrSys>
                        <Prtry>xx</Prtry>
                    </ClrSys>
                </SttlmInf>
                <InstgAgt>
                    <FinInstnId>
                        <BIC>AAAAAAAAAAA</BIC>
                    </FinInstnId>
                </InstgAgt>
            </GrpHdr>
    </FIToFICstmrDrctDbt>
</root>

I need to extract the value of each tag into a separate variable using awk. How can I do it?

user1929905 asked Dec 27 '12 11:12



2 Answers

You can use awk as shown below. However, this is NOT a robust solution: it will fail if the XML is not laid out as expected, e.g. if there are multiple elements on the same line.

$ dt=$(awk -F '[<>]' '/IntrBkSttlmDt/{print $3}' file)
$ echo "$dt"
1967-08-13
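
To get several values into separate variables, the same field-splitting idea can be wrapped in a small shell function. This is only a sketch under the same fragile assumption (one element per line); the file name `sample.xml`, the helper `xml_value`, and the variable names are made up for illustration and are not part of the original answer.

```shell
# Hypothetical sample file, trimmed from the XML in the question
cat > sample.xml <<'EOF'
<root>
<GrpHdr>
    <MsgId>A</MsgId>
    <TtlIntrBkSttlmAmt Ccy="EUR">0.0</TtlIntrBkSttlmAmt>
    <IntrBkSttlmDt>1967-08-13</IntrBkSttlmDt>
</GrpHdr>
</root>
EOF

# xml_value TAG FILE (hypothetical helper): print the text of the first
# line shaped like <TAG ...>text</TAG>. Assumes one element per line.
xml_value() {
    awk -F '[<>]' -v tag="$1" '
        { split($2, parts, " ") }   # parts[1] is the tag name without attributes
        parts[1] == tag && NF >= 4 { print $3; exit }
    ' "$2"
}

msg_id=$(xml_value MsgId sample.xml)
amount=$(xml_value TtlIntrBkSttlmAmt sample.xml)
settle_date=$(xml_value IntrBkSttlmDt sample.xml)

echo "$msg_id $amount $settle_date"   # prints: A 0.0 1967-08-13
```

Comparing only the first word of `$2` lets the match ignore attributes such as `Ccy="EUR"`, and the `NF >= 4` check skips container elements like `<GrpHdr>` that have no text on the same line.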

I suggest you use a proper XML processing tool, such as xmllint.

$ dt=$(xmllint --shell file <<< "cat //IntrBkSttlmDt/text()" | grep -v "^/ >")
$ echo "$dt"
1967-08-13
dogbane answered Sep 27 '22 20:09


The following gawk command uses a regex as the record separator (RS) so that it matches the XML tags: anything that starts with a <, is followed by at least one character other than >, and ends with a > is treated as a tag. Gawk stores the text matched by RS in the RT variable, and the text that precedes each tag becomes the record, which gawk assigns to $0.

gawk 'BEGIN { RS="<[^>]+>" } { print RT, $0 }' myfile
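
Building on the same RS/RT idea, a possible refinement is to print a name=value pair whenever a record with non-blank text is terminated by a closing tag. This is a sketch, not part of the original answer: it needs gawk (RT is a gawk extension, so it may not be available on a stock AIX awk), and the file name `sample2.xml` and the variable `pairs` are made up for illustration.

```shell
# Hypothetical sample with several elements on one line
cat > sample2.xml <<'EOF'
<GrpHdr><MsgId>A</MsgId><NbOfTxs>0</NbOfTxs></GrpHdr>
EOF

# For each record ended by a closing tag and holding non-blank text,
# strip "</" and ">" from RT to get the tag name, then print name=value.
pairs=$(gawk 'BEGIN { RS = "<[^>]+>" }
    RT ~ /^<\// && $0 ~ /[^[:space:]]/ {
        name = substr(RT, 3, length(RT) - 3)
        print name "=" $0
    }' sample2.xml)

printf '%s\n' "$pairs"
# prints:
# MsgId=A
# NbOfTxs=0
```

Because records are delimited by tags rather than by newlines, this variant does not depend on each element sitting on its own line, although it is still not a real XML parser (comments, CDATA, and entities are not handled).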
Michael Hamilton answered Sep 27 '22 21:09