Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python or database?

Tags:

python

sql

i am reading a csv file into a list of a list in python. it is around 100mb right now. in a couple of years that file will go to 2-5gigs. i am doing lots of log calculations on the data. the 100mb file is taking the script around 1 minute to do. after the script does a lot of fiddling with the data, it creates URL's that point to google charts and then downloads the charts locally.

can i continue to use python on a 2gig file or should i move the data into a database?

like image 327
Alex Gordon Avatar asked Dec 08 '22 02:12

Alex Gordon


1 Answers

I don't know exactly what you are doing. But a database will just change how the data is stored. and in fact it might take longer since most reasonable databases may have constraints put on columns and additional processing for the checks. In many cases having the whole file local, going through and doing calculations is going to be more efficient than querying and writing it back to the database (subject to disk speeds, network and database contention, etc...). But in some cases the database may speed things up, especially because if you do indexing it is easy to get subsets of the data.

Anyway you mentioned logs, so before you go database crazy I have the following ideas for you to check out. Anyway I'm not sure if you have to keep going through every log since the beginning of time to download charts and you expect it to grow to 2 GB or if eventually you are expecting 2 GB of traffic per day/week.

  1. ARCHIVING -- you can archive old logs, say every few months. Copy the production logs to an archive location and clear the live logs out. This will keep the file size reasonable. If you are wasting time accessing the file to find the small piece you need then this will solve your issue.

  2. You might want to consider converting to Java or C. Especially on loops and calculations you might see a factor of 30 or more speedup. This will probably reduce the time immediately. But over time as data creeps up, some day this will slow down as well. if you have no bound on the amount of data, eventually even hand optimized Assembly by the world's greatest programmer will be too slow. But it might give you 10x the time...

  3. You also may want to think about figuring out the bottleneck (is it disk access, is it cpu time) and based on that figuring out a scheme to do this task in parallel. If it is processing, look into multi-threading (and eventually multiple computers), if it is disk access consider splitting the file among multiple machines...It really depends on your situation. But I suspect archiving might eliminate the need here.

  4. As was suggested, if you are doing the same calculations over and over again, then just store them. Whether you use a database or a file this will give you a huge speedup.

  5. If you are downloading stuff and that is a bottleneck, look into conditional gets using the if modified request. Then only download changed items. If you are just processing new charts then ignore this suggestion.

  6. Oh and if you are sequentially reading a giant log file, looking for a specific place in the log line by line, just make another file storing the last file location you worked with and then do a seek each run.

  7. Before an entire database, you may want to think of SQLite.

  8. Finally a "couple of years" seems like a long time in programmer time. Even if it is just 2, a lot can change. Maybe your department/division will be laid off. Maybe you will have moved on and your boss. Maybe the system will be replaced by something else. Maybe there will no longer be a need for what you are doing. If it was 6 months I'd say fix it. but for a couple of years, in most cases, I'd say just use the solution you have now and once it gets too slow then look to do something else. You could make a comment in the code with your thoughts on the issue and even an e-mail to your boss so he knows it as well. But as long as it works and will continue doing so for a reasonable amount of time, I would consider it "done" for now. No matter what solution you pick, if data grows unbounded you will need to reconsider it. Adding more machines, more disk space, new algorithms/systems/developments. Solving it for a "couple of years" is probably pretty good.

like image 166
Cervo Avatar answered Dec 23 '22 08:12

Cervo