
Metadata Harvesting

I'm trying to use the metadata harvesting package https://pypi.python.org/pypi/pyoai to harvest the data on this site https://www.duo.uio.no/oai/request?verb=Identify

I tried the example on the pyoai site, but it did not work: when I run it I get an error. The code is:

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

URL = 'http://uni.edu/ir/oaipmh'
registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)

for record in client.listRecords(metadataPrefix='oai_dc'):
    print record

This is the stack trace:

Traceback (most recent call last):
  File "/Users/arashsaidi/PycharmProjects/get-new-DUO/get-files.py", line 8, in <module>
    for record in client.listRecords(metadataPrefix='oai_dc'):
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/common.py", line 115, in method
    return obj(self, **kw)
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/common.py", line 110, in __call__
    return bound_self.handleVerb(self._verb, kw)
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/client.py", line 65, in handleVerb
    kw, self.makeRequestErrorHandling(verb=verb, **kw))    
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/client.py", line 273, in makeRequestErrorHandling
    raise error.XMLSyntaxError(kw)
oaipmh.error.XMLSyntaxError: {'verb': 'ListRecords', 'metadataPrefix': 'oai_dc'}

I need to get access to all the files on the page I have linked to above plus generate an additional file with some metadata.

Any suggestions?

Arash Saidi asked Dec 22 '14 16:12


2 Answers

I ended up using the Sickle package, which I found easier to use and much better documented.

The code below gets all the sets and then retrieves the records of each set. With more than 30,000 records to deal with, this seemed like the best solution, since working set by set gives more control. I hope this helps others. I have no idea why libraries use OAI; it does not seem like a good way to organize data to me...

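# Excerpt from a method of a larger class (hence `self` and the indentation).
# Requires: from sickle import Sickle
        # create a Sickle client for the OAI endpoint
        sickle = Sickle('http://www.duo.uio.no/oai/request')
        sets = sickle.ListSets()  # gets all sets
        for recs in sets:
            # each rec is a (field name, values) pair describing the set
            for rec in recs:
                if rec[0] == 'setSpec':
                    try:
                        # self.spec_list is defined elsewhere in the class (not shown)
                        print rec[1][0], self.spec_list[rec[1][0]]
                        records = sickle.ListRecords(metadataPrefix='xoai', set=rec[1][0], ignore_deleted=True)
                        # defined elsewhere in the class (not shown)
                        self.write_file_and_metadata()
                    except Exception as e:
                        # simple exception handling if not possible to retrieve record
                        print('Exception: {}'.format(e))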
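
For readers who want a self-contained version of the set-by-set approach described above, here is a minimal Python 3 sketch. It is not the code from this answer: the function name `harvest_by_set`, the CSV layout, and the choice of the `title` field are assumptions made for illustration, and `title` is only expected to be present with the `oai_dc` prefix. The `client` argument is assumed to behave like a Sickle instance (`ListSets()` yielding objects with a `setSpec` attribute, `ListRecords()` yielding records with `header.identifier` and a `metadata` dict of lists).

```python
import csv


def harvest_by_set(client, metadata_prefix, out_path):
    """Harvest records one set at a time and write a CSV summary.

    `client` is anything exposing Sickle-style ListSets() and ListRecords()
    methods (an assumption of this sketch). Returns the number of rows written.
    """
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["set", "identifier", "title"])
        for oai_set in client.ListSets():
            spec = oai_set.setSpec
            try:
                records = client.ListRecords(
                    metadataPrefix=metadata_prefix, set=spec, ignore_deleted=True
                )
                for record in records:
                    # Metadata is treated as a dict of lists; "title" may be
                    # missing in formats other than oai_dc.
                    titles = record.metadata.get("title") or [""]
                    writer.writerow([spec, record.header.identifier, titles[0]])
                    written += 1
            except Exception as exc:
                # An empty or failing set should not stop the whole harvest.
                print("Skipping set {}: {}".format(spec, exc))
    return written
```

With Sickle installed, this could presumably be called as `harvest_by_set(Sickle('https://www.duo.uio.no/oai/request'), 'oai_dc', 'metadata.csv')`, where the output file name is arbitrary.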
Arash Saidi answered Oct 05 '22 23:10


The example URL from the pyoai site (http://uni.edu/ir/oaipmh) appears to be dead: it returns a 404, so the client receives a response it cannot parse as XML, which would explain the XMLSyntaxError.
Nevertheless, you should be able to get the data from your website like this:

from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader

URL = 'https://www.duo.uio.no/oai/request'
registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)

# identify info
identify = client.identify()
print "Repository name: {0}".format(identify.repositoryName())
print "Base URL: {0}".format(identify.baseURL())
print "Protocol version: {0}".format(identify.protocolVersion())
print "Granularity: {0}".format(identify.granularity())
print "Compression: {0}".format(identify.compression())
print "Deleted record: {0}".format(identify.deletedRecord())

# list records
records = client.listRecords(metadataPrefix='oai_dc')
for record in records:
    # do something with the record
    pass

# list metadata formats
formats = client.listMetadataFormats()
for f in formats:
    # do something with f
    pass
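
As a rough idea of what "do something with the record" could look like: pyoai's `listRecords` yields `(header, metadata, about)` tuples, and the sketch below flattens them into plain dicts that could later be written to a metadata file. The helper names `record_to_row` and `collect_rows` are hypothetical, and picking only the first `title` value is an assumption made to keep the example short.

```python
def record_to_row(header, metadata):
    """Flatten one pyoai (header, metadata) pair into a plain dict."""
    row = {
        "identifier": header.identifier(),
        "deleted": header.isDeleted(),
        "title": "",
    }
    # Deleted records carry a header but no metadata payload.
    if metadata is not None:
        titles = metadata.getField("title")
        if titles:
            row["title"] = titles[0]
    return row


def collect_rows(records):
    """Turn an iterable of (header, metadata, about) tuples into row dicts."""
    return [record_to_row(header, metadata) for header, metadata, _about in records]
```

It might be used as `rows = collect_rows(client.listRecords(metadataPrefix='oai_dc'))`. Note that this collects everything in memory, so for a very large repository it may be preferable to process each record inside the loop instead.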
bosnjak answered Oct 05 '22 22:10