I'm about to start working with data that is ~500 GB in size. I'd like to be able to access small components of the data at any given time with Python. I'm considering using PyTables or MongoDB with PyMongo (or Hadoop - thanks Drahkar). Are there other file structures/DBs that I should consider?
Some of the operations I'll be doing are computing distances from one point to another and extracting data based on indices from boolean tests, and the like. The results may eventually go online for a website, but at the moment the data is intended to be used only on a desktop for analysis.
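To make the access pattern concrete, here is a rough sketch of what I have in mind with PyTables; the file name, table path, and column names are just placeholders, not my actual schema:

```python
# Minimal sketch (placeholders only): an HDF5 file "points.h5" with a
# table /points holding float columns "x" and "y".
import numpy as np
import tables

with tables.open_file("points.h5", mode="r") as h5:
    pts = h5.root.points

    # Pull a small slice instead of loading the whole 500 GB table.
    chunk = pts[0:100000]

    # Boolean test evaluated inside the file; only matching rows come back.
    nearby = pts.read_where("(x > 0.0) & (x < 10.0)")

    # Distance from each selected point to a reference point.
    ref = np.array([5.0, 5.0])
    d = np.hypot(nearby["x"] - ref[0], nearby["y"] - ref[1])
```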
Cheers
If you are seriously looking at Big Data processing, I would highly suggest looking into Hadoop; one provider is Cloudera ( http://www.cloudera.com/ ). It is a very powerful platform with many tools for data processing. Many languages, including Python, have modules for accessing the data, and a Hadoop cluster can do a significant amount of the processing for you once you have built the various MapReduce, Hive, and HBase jobs.
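For example, here is a rough sketch of a Python mapper for Hadoop Streaming; the input layout (tab-separated "id x y" lines) and the reference point are made up for illustration, not taken from your data:

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming sketch; assumes each input line is
# "id<TAB>x<TAB>y" and a hypothetical fixed reference point.
import sys
import math

REF_X, REF_Y = 5.0, 5.0  # hypothetical reference point

for line in sys.stdin:
    try:
        point_id, x, y = line.strip().split("\t")
        dist = math.hypot(float(x) - REF_X, float(y) - REF_Y)
    except ValueError:
        continue  # skip malformed lines
    # Emit key<TAB>value so Hadoop can shuffle/sort by point id.
    sys.stdout.write("%s\t%.6f\n" % (point_id, dist))
```

You would hand a script like this to Hadoop Streaming with the usual -input, -output, and -mapper options; the exact streaming jar path depends on your distribution.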