I am working on a project involving processing a large volume of XBRL documents (> 1m separate files). I am totally new to XBRL and feeling quite lost at the moment.
I have data relating to those XBRL documents in a separate MySQL database and I would like to add the XBRL data into MySQL to store everything in one db.
What are the best methods to go about transferring data from the XBRL docs into MySQL?
Are there any bulk processing libraries available for it?
I've been looking for tutorials on those issues but couldn't find anything providing a basic introduction, just a lot of high level info.
The natural paradigm in theory for storing XBRL in a database would be OLAP, because XBRL is about data cubes. OLAP on top of a relational database would be called ROLAP.
This is not a trivial problem, because facts taken from a large number of taxonomies can form a very large and sparse cube (for SEC filings it's 10k+ dimensions), and also because creating an SQL schema requires knowing the taxonomies before any import. If new taxonomies come up, one needs to re-ETL everything. This doesn't make relational databases suitable as a general solution.
If the filings share the same taxonomy and the taxonomy is very simple though (as in: not too many dimensions), it is possible to come up with an ad-hoc mapping to store all facts in a single table with many rows in the ROLAP sense (facts to rows, aspects to columns). Some vendors are specialized in storing non-dimensional XBRL facts, in which case traditional SQL (or "post-SQL" that scale with rows) offerings work well.
Some vendors create a table for each XBRL hypercube in the taxonomy, with a schema derived from the definition network but different for each hypercube. This can lead to a lot of tables in the database, and requires a lot of joins for queries involving multiple hypercubes.
Some other vendors make assumptions about the underlying XBRL structure, or about the kind of queries that their users need to run. Restricting the scope of the problem allows finding specific architectures or SQL schemas that can also do the job for these specific needs.
Finally, to import large amounts of filings, it is possible to build generic mappings on top of NoSQL data stores rather than relational databases. Large numbers of facts with a varying number of dimensions fit in large collections of semi-structured documents, and networks fit well in a hierarchical format.
First you need to realize that XBRL documents (instances) contain many different types of information. For example: it can contain daily pricing information for investment funds, but also quarterly VAT reports or credit worthiness information. XBRL is a standard way of communicating, but the contents have their own (XBRL standardized) taxonomies. For example: there is a Dutch Taxonomy, on which the Dutch revenue agency is built (with its own taxonomy), on which there is a specific taxonomy for filing VAT reports. These taxonomies are defined using XSD, Xlink and linkbase. Think of it as the concept of a Dictionairy: the way dictionairies are built is the same everywhere (use each letter of the alphabet to make 'chapters', sort words alphabetically, etc etc), but a Greek dictionairy uses its own alphabet, its own words and its own language to explain the contents.
So if you only use one or a few different types of XBRL documents (that share the same taxonomies), you could create a mapping from these taxonomies to your own (database) objects. If you have a wider range of taxonomies, you'll have to create a more generic solution that can 'import' taxonomies. That will be quite a challenge (which is the reason there aren't many tools available on the market).
If you(r company) can afford it, I recommend looking into existing tools like Altova's MapForce. That way there's no need to learn XBRL, XSD, Xlink and linkbase just in order to start developing your own tool to parse these files, you can leverage the existing products to map the XBRL taxonomies to your database/application.
I hope you do know that MySQL is an structured data storage whereas XBRL
is just a representation to map business document into digital format. XBRL
is XML
based document, which implies it is unstructured and the data you require from document can or cannot occur in that specific document. It can contain any other additional information too. XSD
defines how XML can be structured and how many times any tag can occur. Now to answer your question, you could use eXistDB
, which I also used in past to store the XBRL document. However, it can be slow at times. If you require just some data from XBRL
and need it to store in MySQL database you could use XPATH
. In the following simple python code, you could grab the values of EquityTotalEndingBalance
and ReservesTotalEndingBalance
from this document.
from lxml import etree
root = etree.fromstring(open("file.xml").read())
nsmap = root.nsmap
nsmap.pop(None) # There was some error without this.
data_one = root.xpath("//iascf-pfs:EquityTotalEndingBalance/text()",namespaces=nsmap)
data_two = root.xpath("//novartis:ReservesTotalEndingBalance/text()",namespaces=nsmap)
print data_one
print data_two
This code will print the values:
['37216000000', '36862000000', '42245000000']
['35903000000', '35558000000', '40971000000']
So how can you solve your problem then?
Either you would have to choose a NoSQL
based XML document storage like eXistDB
and write Xpath to get the particular data.
You can manually parse the XBRL document like above and fire the XPath directly and store the data.
The complexity will arise if you need to consume all data from all type of document. So you would have to limit on what you will be consuming from those documents.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With