use XML::LibXML;
use Data::Dumper;
#parsing file
my $dom = XML::LibXML->new->parse_file('sample.xml');
my $context = XML::LibXML::XPathContext->new( $dom->documentElement() );
$context->registerNs('u', 'http://uniprot.org/uniprot');
#print file to make sure it looks ok
print $dom, "\n";
#finds shortnames
my $sn = $context->findnodes('//u:shortName');
print 'ShortName: '.$sn, "\n";
#finds dbReference ids that are of type EC
my $ids = $context->findnodes('//u:dbReference[@type="EC"]/@id');
my $number =()= $ids =~ /\./gi;
print 'EC Values: '.$ids, "\n";
#finds sequences that have a length
my $seq = $context->findnodes('//u:sequence[@length>1]');
$seq =~ s/" "/"\n"/;
print 'Sequence: '.$seq, "\n";
I currently have this code, which runs on this XML file containing 10 entries (https://www.dropbox.com/s/dq8ir9f22cnfwrz/Sample.xml). As of now, it extracts the shortName, dbReference, and sequence values of all 10 entries and prints them lumped together. What I would like is a shortName, dbReference, and sequence for each entry in the XML file. Is it possible to have the script look up these data one entry at a time? My end goal is to format them in a specific way for output.
I was thinking of adding code that runs before this to extract only the entries and then pass them to the rest of the code for data extraction.
Thanks
You need to query for a node-set (which returns a collection):
my @entries = $context->findnodes('//u:entry');
Then, for each node you run a contextual XPath expression findnodes(expression, context-node), passing the node as the second argument, for example:
foreach my $entry (@entries) {
    my $entryName = $context->findnodes('u:name', $entry);
    ...
}
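As a self-contained sketch of that contextual query, the snippet below runs against a small inline document rather than the real file. The two entries, their names, and the helper `summarize_entries` are made up for illustration; only the namespace URI and the XPath expressions come from the question.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical two-entry document; only the namespace URI is taken from the question.
my $xml = <<'XML';
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry><name>ENTRY_A</name><shortName>A1</shortName><shortName>A2</shortName></entry>
  <entry><name>ENTRY_B</name><shortName>B1</shortName></entry>
</uniprot>
XML

my $dom     = XML::LibXML->load_xml(string => $xml);
my $context = XML::LibXML::XPathContext->new($dom->documentElement);
$context->registerNs('u', 'http://uniprot.org/uniprot');

# Returns one "name: shortName,shortName" string per entry (illustrative helper).
sub summarize_entries {
    my ($xpc) = @_;
    my @lines;
    for my $entry ($xpc->findnodes('//u:entry')) {
        # A relative path plus the second argument keeps each query inside $entry.
        my $name  = $xpc->findvalue('u:name', $entry);
        my @short = map { $_->textContent } $xpc->findnodes('.//u:shortName', $entry);
        push @lines, "$name: " . join(',', @short);
    }
    return @lines;
}

print "$_\n" for summarize_entries($context);
```

Because the short names are collected per entry, they no longer run together across entries the way they do with a single document-wide `//u:shortName` query.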
Here is an attempt using your code:
use strict;
use warnings;
use XML::LibXML;

# parse the file and register the UniProt namespace
my $dom = XML::LibXML->new->parse_file('sample.xml');
my $context = XML::LibXML::XPathContext->new( $dom->documentElement() );
$context->registerNs('u', 'http://uniprot.org/uniprot');

# one node per entry; every query below is evaluated relative to $entry
my @entries = $context->findnodes('//u:entry');
foreach my $entry (@entries) {
    my $entryName  = $context->findvalue('u:name', $entry);
    my @shortNames = $context->findnodes('.//u:shortName', $entry);
    my @dbRefs     = $context->findnodes('.//u:dbReference[@type="EC"]/@id', $entry);
    my $sequence   = $context->findvalue('.//u:sequence[@length>1]', $entry);

    print "============================================================\n";
    print "\nName: ".$entryName."\n";

    print "\nShort Names: \n";
    my $i = 0;
    foreach my $shortName (@shortNames) {
        print ++$i.': '.$shortName->textContent, "\n";
    }

    print "\nEC Values: \n";
    $i = 0;
    foreach my $dbRef (@dbRefs) {
        print ++$i.': '.$dbRef->nodeValue, "\n";
    }

    # put each space-separated chunk of the sequence on its own line
    $sequence =~ s/ /\n/g;
    print "\nSequence: ".$sequence, "\n";
}
It looks like //sequence is your primary interest, so you just need to iterate over the values returned by findnodes:
for my $seq ($context->findnodes('//u:sequence[@length>1]')) {
    print 'Sequence @length: '.$seq->getAttribute('length'). "\n";
    # ...
}
Then you just need to pull the other values relative to this node. To find out how to do that, just google XML::LibXML Namespace and the third result is a perlmonks post: XML::LibXML and namespaces
for my $seq ($context->findnodes('//u:sequence[@length>1]')) {
    print 'Sequence @length: '.$seq->getAttribute('length'). "\n";
    my @sn = $context->findnodes('..//u:shortName', $seq);
    print ' ShortName Count: '.@sn. "\n";
    my @ids = $context->findnodes('..//u:dbReference[@type="EC"]/@id', $seq);
    print ' EC Values Count: '.@ids. "\n";
}
Output (Note, not every seq has a shortName):
Sequence @length: 323
ShortName Count: 5
EC Values Count: 7
Sequence @length: 503
ShortName Count: 0
EC Values Count: 5
Sequence @length: 323
ShortName Count: 3
EC Values Count: 4
Sequence @length: 490
ShortName Count: 0
EC Values Count: 4
Sequence @length: 490
ShortName Count: 0
EC Values Count: 4
Sequence @length: 323
ShortName Count: 3
EC Values Count: 3
Sequence @length: 323
ShortName Count: 3
EC Values Count: 3
Sequence @length: 539
ShortName Count: 2
EC Values Count: 3
Sequence @length: 494
ShortName Count: 1
EC Values Count: 3
Sequence @length: 277
ShortName Count: 0
EC Values Count: 3
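The `..//` prefix in the loop above matters: it steps up from the sequence node to its parent entry before searching, whereas `.//` would search only inside the sequence element itself. Here is a minimal sketch of the difference, assuming a made-up two-entry document; the element contents and the helper `counts_per_sequence` are hypothetical, not taken from the real file.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document; the second entry deliberately has no shortName.
my $xml = <<'XML';
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <shortName>A1</shortName>
    <dbReference type="EC" id="1.1.1.1"/>
    <dbReference type="PDB" id="1ABC"/>
    <sequence length="5">MKTAY</sequence>
  </entry>
  <entry>
    <dbReference type="EC" id="2.7.1.1"/>
    <sequence length="4">MSEQ</sequence>
  </entry>
</uniprot>
XML

my $dom     = XML::LibXML->load_xml(string => $xml);
my $context = XML::LibXML::XPathContext->new($dom->documentElement);
$context->registerNs('u', 'http://uniprot.org/uniprot');

# For each sequence, returns "length inside-count shortName-count EC-count".
sub counts_per_sequence {
    my ($xpc) = @_;
    my @rows;
    for my $seq ($xpc->findnodes('//u:sequence[@length>1]')) {
        my @inside = $xpc->findnodes('.//u:shortName', $seq);   # searches below <sequence> only
        my @sn     = $xpc->findnodes('..//u:shortName', $seq);  # searches the parent entry
        my @ids    = $xpc->findnodes('..//u:dbReference[@type="EC"]/@id', $seq);
        push @rows, join(' ', $seq->getAttribute('length'),
                              scalar(@inside), scalar(@sn), scalar(@ids));
    }
    return @rows;
}

print "$_\n" for counts_per_sequence($context);
```

The `.//` query finds nothing in either case because a sequence element has no shortName descendants, while the `..//` queries count only the nodes that belong to the same entry as the sequence.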
For additional tips on how to construct XPaths, check out: XPath Examples