
Extracting and Storing XML Data with libXML/XPath

Tags: xml, perl, xpath

use XML::LibXML;
use Data::Dumper; 

#parsing file
my $dom = XML::LibXML->new->parse_file('sample.xml');

my $context = XML::LibXML::XPathContext->new( $dom->documentElement()  );
$context->registerNs('u', 'http://uniprot.org/uniprot');

#print file to make sure it looks ok
print $dom, "\n";

    #finds shortnames
    my $sn = $context->findnodes('//u:shortName');
    print 'ShortName: '.$sn, "\n";

    #finds dbReference ids that are of type EC
    my $ids = $context->findnodes('//u:dbReference[@type="EC"]/@id');   
    my $number =()= $ids =~ /\./gi;
    print 'EC Values: '.$ids, "\n";

    #finds sequences that have a length
    my $seq = $context->findnodes('//u:sequence[@length>1]');
    $seq =~ s/" "/"\n"/;
    print 'Sequence: '.$seq, "\n";

I currently have this code, which runs on this XML file containing 10 entries (https://www.dropbox.com/s/dq8ir9f22cnfwrz/Sample.xml). As of now, it extracts the shortName, dbReference, and sequence values of all 10 entries and prints them lumped together. What I would like instead is a shortName, dbReference, and sequence for each entry in the XML file. Is it possible to have the script look up these data one entry at a time? My end goal is to format them in a specific way for output.

I was thinking of adding code that runs before this and extracts only the entries, then passes them to the rest of the code for data extraction.

Thanks

bforcer asked Nov 22 '25 10:11


2 Answers

First query for the set of entry nodes (in list context, findnodes returns a list of nodes):

my @entries = $context->findnodes('//u:entry');

Then, for each entry, run a relative XPath expression with findnodes(expression, context-node) or findvalue(expression, context-node), passing the entry node as the second argument. For example:

foreach my $entry (@entries) {
    my $entryName = $context->findvalue('u:name', $entry);
    ...
}
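
The second argument only matters when the expression is relative. The following is a minimal sketch of that difference, not part of the original answer; the two-entry document is made up for illustration and is only loosely shaped like the sample file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up document with two entries (an assumption, not the real Sample.xml).
my $dom = XML::LibXML->load_xml( string => <<'XML' );
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry><name>A</name><sequence length="4">MKVL</sequence></entry>
  <entry><name>B</name><sequence length="3">MAA</sequence></entry>
</uniprot>
XML

my $context = XML::LibXML::XPathContext->new($dom);
$context->registerNs( 'u', 'http://uniprot.org/uniprot' );

# Take the first entry as the context node.
my ($first) = $context->findnodes('//u:entry');

# Absolute path: evaluation starts at the document root, so the context
# node has no effect and the sequences of every entry are matched.
my @all = $context->findnodes( '//u:sequence', $first );

# Relative path: evaluation starts at $first, so only its own descendants match.
my @own = $context->findnodes( './/u:sequence', $first );

printf "absolute: %d, relative: %d\n", scalar @all, scalar @own;
```

An expression that starts with // inside the loop is therefore the likely cause whenever a per-entry query returns data from the whole file.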

Here is an attempt using your code:

use strict;
use warnings;
use XML::LibXML;

# Parse the file
my $dom = XML::LibXML->new->parse_file('sample.xml');

my $context = XML::LibXML::XPathContext->new( $dom->documentElement() );
$context->registerNs('u', 'http://uniprot.org/uniprot');

# One iteration per <entry>; every query below is relative to $entry
my @entries = $context->findnodes('//u:entry');
foreach my $entry (@entries) {

    my $entryName  = $context->findvalue('u:name', $entry);
    my @shortNames = $context->findnodes('.//u:shortName', $entry);
    my @dbRefs     = $context->findnodes('.//u:dbReference[@type="EC"]/@id', $entry);
    # $entry must be passed here too, otherwise every sequence in the file matches
    my $sequence   = $context->findvalue('.//u:sequence[@length>1]', $entry);

    print "============================================================\n";
    print "\nName: ".$entryName."\n";

    print "\nShort Names: \n";
    my $i = 0;
    foreach my $shortName (@shortNames) {
        print ++$i.': '.$shortName->textContent, "\n";
    }

    print "\nEC Values: \n";
    $i = 0;
    foreach my $dbRef (@dbRefs) {
        print ++$i.': '.$dbRef->nodeValue, "\n";
    }

    # Replace spaces with newlines (the quoted pattern s/" "/"\n"/ never matched)
    $sequence =~ s/ /\n/g;
    print "\nSequence: ".$sequence, "\n";
}
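
Since the end goal is custom formatting, it may help to store each entry's values before printing them. Below is a minimal sketch of that idea, not part of the original answer: extract_entries is a hypothetical helper name, the hash keys are arbitrary, and the inline XML is a made-up stand-in for the sample file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns a reference to an array holding one hash per
# <entry>, so each record can be formatted independently afterwards.
sub extract_entries {
    my ($xml) = @_;

    my $dom     = XML::LibXML->load_xml( string => $xml );
    my $context = XML::LibXML::XPathContext->new($dom);
    $context->registerNs( 'u', 'http://uniprot.org/uniprot' );

    my @records;
    for my $entry ( $context->findnodes('//u:entry') ) {
        push @records, {
            name        => $context->findvalue( 'u:name', $entry ),
            short_names => [ map { $_->textContent }
                  $context->findnodes( './/u:shortName', $entry ) ],
            ec_ids      => [ map { $_->nodeValue }
                  $context->findnodes( './/u:dbReference[@type="EC"]/@id', $entry ) ],
            sequence    => $context->findvalue( './/u:sequence[@length>1]', $entry ),
        };
    }
    return \@records;
}

# Made-up two-entry document (an assumption, not the real Sample.xml).
my $sample = <<'XML';
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <name>ENTRY_ONE</name>
    <protein><recommendedName><shortName>E1</shortName></recommendedName></protein>
    <dbReference type="EC" id="1.1.1.1"/>
    <dbReference type="PDB" id="1ABC"/>
    <sequence length="4">MKVL</sequence>
  </entry>
  <entry>
    <name>ENTRY_TWO</name>
    <dbReference type="EC" id="2.7.1.1"/>
    <sequence length="3">MAA</sequence>
  </entry>
</uniprot>
XML

# Formatting is now a separate step that works on plain Perl data.
for my $rec ( @{ extract_entries($sample) } ) {
    printf "%s | short: %s | EC: %s | seq: %s\n",
        $rec->{name},
        join( ',', @{ $rec->{short_names} } ) || '-',
        join( ',', @{ $rec->{ec_ids} } ),
        $rec->{sequence};
}
```

Keeping extraction and formatting apart means the output layout can change without touching the XPath queries.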
helderdarocha answered Nov 25 '25 06:11


It looks like the sequence is your primary interest, so you just need to iterate over the nodes returned by findnodes:

for my $seq ($context->findnodes('//u:sequence[@length>1]')) {
    print 'Sequence @length: '.$seq->getAttribute('length'). "\n";
    # ...
}

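Then you just need to pull the other values relative to this node by passing it as the context node. In the code below, the leading .. step climbs from the sequence to its parent, and //u:shortName then searches that parent's descendants. For background on namespaces, see the PerlMonks post: XML::LibXML and namespaces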

for my $seq ($context->findnodes('//u:sequence[@length>1]')) {
    print 'Sequence @length: '.$seq->getAttribute('length'). "\n";

    my @sn = $context->findnodes('..//u:shortName', $seq);
    print '  ShortName Count: '.@sn. "\n";

    my @ids = $context->findnodes('..//u:dbReference[@type="EC"]/@id', $seq);   
    print '  EC Values Count: '.@ids. "\n";
}

Output (note that not every sequence has a shortName):

Sequence @length: 323
  ShortName Count: 5
  EC Values Count: 7
Sequence @length: 503
  ShortName Count: 0
  EC Values Count: 5
Sequence @length: 323
  ShortName Count: 3
  EC Values Count: 4
Sequence @length: 490
  ShortName Count: 0
  EC Values Count: 4
Sequence @length: 490
  ShortName Count: 0
  EC Values Count: 4
Sequence @length: 323
  ShortName Count: 3
  EC Values Count: 3
Sequence @length: 323
  ShortName Count: 3
  EC Values Count: 3
Sequence @length: 539
  ShortName Count: 2
  EC Values Count: 3
Sequence @length: 494
  ShortName Count: 1
  EC Values Count: 3
Sequence @length: 277
  ShortName Count: 0
  EC Values Count: 3

For additional tips on how to construct XPaths, check out: XPath Examples

Miller answered Nov 25 '25 07:11


