I have been trying to process a good chunk of data (a few GB), but my personal computer struggles to do it in a reasonable time span, so I was wondering what options I have. I was using Python's csv.reader, but it was painfully slow even to fetch 200,000 lines. I then migrated the data to an SQLite database, which retrieved results a bit faster and without using as much memory, but slowness was still a major issue.
So, again: what options do I have to process this data? I was considering Amazon's spot instances, which seem useful for this kind of purpose, but maybe there are other solutions to explore.
Supposing that spot instances are a good option, and considering I have never used them before, what can I expect from them? Does anyone have experience using them for this kind of thing? If so, what is your workflow? I thought I could find a few blog posts detailing workflows for scientific computing, image processing, or that kind of thing, but I didn't find anything, so if you can explain a bit or point out some links, I'd appreciate it.
Thanks in advance.
read_csv(chunksize): One way to process large files is to read the entries in chunks of reasonable size, each of which is read into memory and processed before the next chunk is read. The chunksize parameter specifies the size of each chunk as a number of rows.
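As a minimal sketch (the file name data.csv and the column name value are just assumptions for illustration), chunked processing with pandas could look like this:

import pandas as pd

total = 0
# Read data.csv in chunks of 100,000 rows; each chunk is a regular DataFrame.
for chunk in pd.read_csv('data.csv', chunksize=100000):
    # Process each chunk before the next one is loaded,
    # e.g. accumulate a running sum of the 'value' column.
    total += chunk['value'].sum()

print(total)

This keeps memory usage bounded by the chunk size rather than the full file size.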
The 1-gram dataset expands to 27 GB on disk, which is quite a sizable quantity of data to read into Python. As one lump, Python can handle gigabytes of data easily, but once that data is broken up into Python objects and processed, things get a lot slower and less memory efficient.
Python is considered one of the best data science tools for big data jobs. Python and big data are a good fit when you need to integrate data analysis with web apps, or statistical code with a production database.
If you have to use Python, you can try dumbo, which allows you to run Hadoop programs in Python. It's very easy to get started with. You can then write your own code to do Hadoop streaming and process your big data. Do check its short tutorial: https://github.com/klbostee/dumbo/wiki/Short-tutorial
A similar one from Yelp is mrjob: https://github.com/Yelp/mrjob
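To give a sense of what that style of code looks like, here is a minimal word-count sketch using mrjob (dumbo's API differs slightly, but the mapper/reducer structure is the same idea):

from mrjob.job import MRJob

class MRWordCount(MRJob):
    # The mapper receives one input line at a time and emits (word, 1) pairs.
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    # The reducer receives all counts emitted for a given word and sums them.
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

You can run it locally on a file for testing (python wordcount.py input.txt) or pass -r hadoop to submit it to a Hadoop cluster via streaming.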
I would try to use numpy to work with your large datasets locally. Numpy arrays should use less memory than csv.reader, and computation times should be much faster when using vectorised numpy functions.
However, there may be a memory problem when reading the file: numpy.loadtxt and numpy.genfromtxt also consume a lot of memory when reading files. If this is a problem, some (brand new) alternative parser engines are compared here. According to this post, the new parser in pandas (a library built on top of numpy) seems to be an option.
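As a minimal sketch (assuming a purely numeric CSV file, here called data.csv), loading into a numpy array and doing the work with vectorised operations could look like this:

import numpy as np

# Load a numeric CSV into a 2-D array (this still reads everything into memory).
data = np.loadtxt('data.csv', delimiter=',')

# Vectorised operations run in compiled code and avoid slow per-row Python loops,
# e.g. per-column means and a filter on the first column.
col_means = data.mean(axis=0)
filtered = data[data[:, 0] > 0]
print(col_means, filtered.shape)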
As mentioned in the comments, I would also suggest storing your data in a binary format like HDF5 once you have read your files. Loading the data from an HDF5 file is really fast in my experience (it would be interesting to know how fast it is compared to sqlite in your case). The simplest way I know to save your array as HDF5 is with pandas:
import pandas as pd

# read the CSV once, with whatever parsing options you need
data = pd.read_csv(filename, options...)

# write the DataFrame into an HDF5 store on disk
store = pd.HDFStore('data.h5')
store['mydata'] = data
store.close()
Loading your data back is then as simple as:
import pandas as pd
store = pd.HDFStore('data.h5')
data = store['mydata']
store.close()