
multithreading and multiprocessing questions for quant site

I'm starting to sketch out the structure of a quantitative-finance information service site written in Python (3.x I hope) and have come to the conclusion -- correct me if I'm wrong -- that I'm going to have to use both the eventlet networking library and the multiprocessing library.

One part of the system is basically a cron job that runs in the background: after market close it examines stock market and other financial data, does machine learning and quant calculations, then puts the predictions in a simple database or perhaps even a flat comma-delimited file. (Data is thus passed between sections of the system via file.)
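
For concreteness, here is roughly the skeleton I have in mind for that nightly job (every name and path below is just a placeholder):

    # nightly_job.py -- hypothetical skeleton of the after-close batch job,
    # meant to be kicked off by cron each evening.
    import csv

    def scrape_market_data(symbols):
        # Placeholder: fetch raw data for each symbol from the web.
        return {s: {"close": 0.0} for s in symbols}

    def predict(data):
        # Placeholder: the machine learning / quant calculations.
        return {symbol: 0.5 for symbol in data}

    def save_predictions(predictions, path="predictions.csv"):
        # Write the results where the viewing side can pick them up.
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["symbol", "score"])
            for symbol, score in predictions.items():
                writer.writerow([symbol, score])

    if __name__ == "__main__":
        watch_list = ["AAPL", "MSFT"]  # hypothetical watch list
        save_predictions(predict(scrape_market_data(watch_list)))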

I understand that eventlet can be used for the nonblocking I/O, so that Beautiful Soup or Scrapy can scrape info from a lot of sites simultaneously (sort of), and that the multiprocessing library can let the machine learning / quant algorithms do the calculations on all the stock data in parallel, as separate processes.
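
Concretely, I'm picturing something like this two-stage sketch (the URLs and the "calculation" are placeholders; I've only monkey-patched the socket layer, since I gather full patching can interact badly with multiprocessing):

    # Stage 1: eventlet green threads overlap the network waits.
    # Stage 2: multiprocessing spreads CPU-bound work across cores.
    import eventlet
    eventlet.monkey_patch(socket=True, select=True)  # patch I/O only

    import urllib.request
    from multiprocessing import Pool

    URLS = ["https://example.com/a", "https://example.com/b"]  # placeholders

    def fetch(url):
        # Many of these can wait on network I/O at once inside one process.
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()

    def crunch(page):
        # Stand-in for a CPU-bound quant/ML calculation.
        return len(page)

    if __name__ == "__main__":
        green_pool = eventlet.GreenPool(20)
        pages = list(green_pool.imap(fetch, URLS))
        with Pool() as procs:  # separate OS processes, so the GIL doesn't bite
            print(procs.map(crunch, pages))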

To view the predictions, users would log on to the other part of the system, built with Flask, which would access the database and display the predictions.
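
Something along these lines, say (the table and column names are made up):

    # Minimal sketch of the viewing side.
    import sqlite3
    from flask import Flask, jsonify

    app = Flask(__name__)
    DB_PATH = "predictions.db"  # placeholder path

    @app.route("/predictions")
    def predictions():
        # Read whatever the nightly job stored and hand it to the user.
        conn = sqlite3.connect(DB_PATH)
        try:
            rows = conn.execute(
                "SELECT symbol, score FROM predictions ORDER BY score DESC"
            ).fetchall()
        finally:
            conn.close()
        return jsonify([{"symbol": s, "score": sc} for s, sc in rows])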

I presume all these libraries and the mixed thread/multiprocessing routines get along with each other? I'm going to use pythonanywhere.com as the host, and they appear to have quite a few "batteries included." Of course, once testing is finished, I'll probably have to increase the number of "workers" to power the final deployed site.

Any pitfalls in mixing threads and multiprocessing in something this complicated?

asked Oct 31 '22 by user2953747


1 Answer

Just some general thoughts that couldn't fit into the comments section:

  1. Scrapy already handles concurrent network requests via Twisted, so you may not need eventlet at all. Of course, this depends on how exactly you are doing the scraping and what exactly you need to scrape. From what I tried a long time ago (maybe I'm totally wrong), if you needed Selenium to scrape JavaScript-rendered responses, it was hard to do that concurrently with Scrapy. But if you are just doing GET requests with urllib or something similar (e.g. to APIs), then I think Scrapy alone is sufficient; see the spider sketch after this list.

  2. I agree with your comment: the web scraping part is always going to be pretty failure-prone, so you definitely want to separate the scraping from the predictive parts. You need to take into account failed scrapes (e.g. what if the website is down, or what if you are getting erroneous data?), clean up all the data before stuffing it into your own database, and then (separately) run the machine learning on that data.
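
To illustrate point 1, a minimal Scrapy spider already gives you concurrency knobs out of the box (the URLs and the CSS selector here are made up):

    import scrapy

    class PriceSpider(scrapy.Spider):
        name = "prices"
        start_urls = [
            "https://example.com/quotes/AAPL",  # placeholders
            "https://example.com/quotes/MSFT",
        ]
        # Twisted drives these requests concurrently; you tune settings
        # instead of bolting another event loop (eventlet) on top.
        custom_settings = {"CONCURRENT_REQUESTS": 16, "DOWNLOAD_DELAY": 0.5}

        def parse(self, response):
            yield {
                "url": response.url,
                "price": response.css("span.price::text").get(),
            }

Run it with something like scrapy runspider price_spider.py -o prices.json and Twisted schedules the downloads concurrently for you.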

But one important thing here is that you definitely need a database between the scraping and the machine learning stages (you cannot just pass the data in memory or via CSV as you suggested). There are countless reasons; a couple are:

  • It saves on your scrapes (you won't need to download multiple days of data every time, just the most recent day).
  • It gives you backup and historical data in case your web sources are no longer available (e.g. say your source of info only gives you the last 365 days, but you suddenly want 700 days: you'll want to have saved the data from your previous scrapes somewhere).
  • It will be much faster, better, and less flaky: a correctly indexed database will probably be just as important, if not more important, than any sort of parallel processing of your machine learning algorithm (see the sketch below).
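
As a minimal illustration of that intermediate store (SQLite, this schema, and the sample row are all just placeholders):

    import sqlite3

    conn = sqlite3.connect("market_data.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS daily_prices (
            symbol     TEXT NOT NULL,
            trade_date TEXT NOT NULL,  -- ISO date of the observation
            close      REAL,
            PRIMARY KEY (symbol, trade_date)  -- one row per symbol per day
        )
    """)
    # The index the last bullet is about: date-range scans stay fast.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_prices_date ON daily_prices (trade_date)"
    )
    # INSERT OR REPLACE means a failed day can simply be re-scraped.
    conn.execute(
        "INSERT OR REPLACE INTO daily_prices VALUES (?, ?, ?)",
        ("AAPL", "2022-10-31", 153.34),
    )
    conn.commit()
    conn.close()

The composite primary key means re-running a scrape updates rows instead of duplicating them, which covers the "save on your scrapes" point above.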

Also, by the way, Django works really well with Scrapy...

answered Nov 09 '22 by conrad