We want to create a local data repository where our lab members can organize, search, access, catalog, and reference our data. I think CKAN can do all of these things; however, I'm not sure how it will cope with the data we actually have (I could be wrong, which is why I'm asking).
Our lab is procuring a lot of data for internal use. We would like to catalog and organize this data within our group (maybe with CKAN?) so that people can push data to the catalog, pull it back out, and use it. Use cases include access control (ACLs) on the data, a web interface, search, browsing, and adding, deleting, and updating datasets. While CKAN looks like a very good fit for this, the problem is the data, and more specifically the amount of it, that we are trying to handle.
We want to catalog anything from terabytes of images (200k+ images), geospatial data in various formats, Twitter streams (TBs of JSON data), database dump files, binary data, machine learning models, etc. I wouldn't think it would be reasonable to add 100k 64MB JSON files as resources to a CKAN dataset, or is it? We realize we aren't going to be able to search within this JSON/image/geo data, which is fine. But we would like to be able to find out whether we have particular data available (e.g. we search "twitter 2015-02-03"), a type of metadata search if you will. Using a local file store within CKAN, what happens if a user requests 200k images? Would the system become unresponsive while it is answering these requests?
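To make the metadata search concrete, here is a minimal sketch of the kind of query I have in mind, using CKAN's Action API `package_search` endpoint. The host name `ckan.example.lab` and the helper names are placeholders I made up for illustration; only the endpoint path and the `q` and `rows` parameters come from the CKAN API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query, rows=10):
    """Build a package_search URL for a free-text metadata query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def search_datasets(base_url, query, rows=10):
    """Run the search and return the names of matching datasets.

    Needs a reachable CKAN instance, so it is not called below.
    """
    with urlopen(build_search_url(base_url, query, rows)) as response:
        body = json.load(response)
    return [dataset["name"] for dataset in body["result"]["results"]]


# Hypothetical host; replace with a real CKAN instance.
url = build_search_url("http://ckan.example.lab", "twitter 2015-02-03")
print(url)
```

This only touches dataset metadata (titles, descriptions, tags), which is all we need; it never reads the contents of the files themselves.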
I've seen CKAN used on datahub.io, and the vast majority of what is hosted there is small CSV files and small 2-3MB zip files, with no more than 20 or 30 individual files within a dataset.
So is CKAN capable of doing what we want? If it isn't, are there any suggestions for alternatives?
Edit: more specific questions instead of discussion:
I have looked around and googled for information on this topic, but I haven't seen a deployed system with any significant amount of data.
We're using CKAN at the Natural History Museum (data.nhm.ac.uk) for some pretty hefty research datasets (our main specimen collection has 2.8 million records), and it's handling them very well. We have had to extend CKAN with some custom plugins to make this possible, but they're open source and available on GitHub.
Our datasolr extension moves querying of large datasets into Solr, which handles indexing and searching big datasets better than PostgreSQL (on our infrastructure anyway): https://github.com/NaturalHistoryMuseum/ckanext-datasolr.
To prevent CKAN falling over when users download big files, we moved the packaging and download to a separate service and task queue.
https://github.com/NaturalHistoryMuseum/ckanext-ckanpackager https://github.com/NaturalHistoryMuseum/ckanpackager
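As a rough illustration of the idea only (this is not the ckanpackager code, and every name in it is hypothetical), the pattern is that the web request just enqueues a job and returns, while separate workers do the slow packaging. A minimal in-process sketch using Python's standard `queue` and `threading` modules:

```python
import queue
import threading


def package_resource(resource_id):
    """Stand-in for the slow work (zipping files, exporting records)."""
    return f"packaged:{resource_id}"


def worker(jobs, results):
    """Pull resource ids off the queue until a None sentinel arrives."""
    while True:
        resource_id = jobs.get()
        if resource_id is None:
            jobs.task_done()
            break
        results.append(package_resource(resource_id))
        jobs.task_done()


def run_jobs(resource_ids, num_workers=2):
    """Process all requested resources on a small pool of worker threads."""
    jobs = queue.Queue()
    results = []
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for resource_id in resource_ids:
        jobs.put(resource_id)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker so each one exits
    for thread in threads:
        thread.join()
    return results


print(sorted(run_jobs(["images-2015", "specimens", "tweets"])))
```

In a real deployment the queue and workers would live in a separate process or service, so that a large download request never ties up the CKAN web workers.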
So yes, CKAN with a few contributed plugins can definitely handle larger datasets. We haven't tested it with TB+ datasets yet, but we will next year when we use CKAN to release some phylogenetic data.
Yes :)
But there are extensions to use or build.
Take a look at the extensions built for CKAN Galleries (http://datashades.com/ckan-galleries/). We built that specifically for image and video assets that are referenced at the record level of a dataset resource.
There is an S3 cloud connector for object storage if needed.
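Separately from any particular connector extension, a CKAN resource can simply be a link to a file that lives elsewhere, so the catalog holds only metadata. The sketch below builds the kind of payload you would send to the Action API's `resource_create`; the dataset id, bucket URL, and helper name are made up for illustration, and authentication and the actual HTTP call are left out.

```python
import json


def build_link_resource(package_id, name, url, fmt, description=""):
    """Build a resource_create payload that links to externally stored data."""
    return {
        "package_id": package_id,  # id or name of an existing dataset
        "name": name,
        "url": url,  # where the bytes actually live
        "format": fmt,
        "description": description,
    }


# Hypothetical dataset and bucket names.
payload = build_link_resource(
    package_id="twitter-stream-2015",
    name="Tweets 2015-02-03",
    url="https://example-bucket.s3.amazonaws.com/twitter/2015-02-03.json.gz",
    fmt="JSON",
)
print(json.dumps(payload, indent=2))
```

With this approach CKAN answers the "do we have it?" question, and the heavy downloads go straight to object storage.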
We've started to look at various ways to extend CKAN so it can provide enterprise data storage and management for all types of data: very large, real-time, IoT-specific, Linked Data, etc.
I think in some cases these will be addressed by adding the concept of 'resource containers' to CKAN. In some sense both file store and data store are examples of such resource container extensions.
Using AWS's API Gateway service, we are looking at ways to present the request methods for data stored in third-party solutions (via external integration) as if they were no different from other CKAN resources.
Although not everyone is there just yet, when you use infrastructure as software, which AWS enables, you can build some really neat stuff which looks like software running on a traditional web stack but is actually making use of S3, Lambda, temporary relational DBs and API Gateway to do some very heavy lifting.
We aim to open source the approach taken for such work as open architecture as it matures. We've started this already by publishing scripts used to build supercomputer clusters on AWS. You can find those here: https://github.com/DataShades/awscloud-hpc