I'm starting to sketch out the structure of a quantitative-finance information service site written in Python (3.x I hope) and have come to the conclusion -- correct me if I'm wrong -- that I'm going to have to use both the eventlet networking library and the multiprocessing library.
One part of the system is basically a cron job that runs in the background: after the market closes it examines stock market and other financial data, does machine learning and quant calculations, then puts the predictions in a simple database or perhaps even a flat comma-delimited file. (Data is thus passed between sections of the system via a file.)
I understand that eventlet can be used for non-blocking I/O, so that Beautiful Soup or Scrapy can scrape the info from a lot of sites (more or less) simultaneously, and that the multiprocessing library can let the machine learning / quant algorithms run the calculations on all the stock data in parallel as separate processes.
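Roughly what I have in mind for those two stages is sketched below (made-up URLs and a stub model, not working code; I'd run the stages one after the other in the nightly job so the eventlet part and the multiprocessing part never run in the same step):

```python
import eventlet
from eventlet.green.urllib.request import urlopen  # green (non-blocking) urlopen
from multiprocessing import Pool

# Stage 1: I/O-bound -- fetch many pages concurrently with green threads.
URLS = [
    "https://example.com/quotes/AAPL",  # made-up URLs
    "https://example.com/quotes/MSFT",
]

def fetch(url):
    return url, urlopen(url).read()

def scrape_all():
    pool = eventlet.GreenPool(50)          # up to 50 concurrent fetches
    return dict(pool.imap(fetch, URLS))

# Stage 2: CPU-bound -- run the quant/ML calculations in separate processes.
def predict(item):
    ticker, history = item
    # ... machine learning / quant model would go here ...
    return ticker, 0.0                     # stub prediction

def predict_all(histories):
    with Pool() as procs:                  # one worker per CPU core by default
        return dict(procs.map(predict, histories.items()))
```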
To view the predictions, users would log on to the other part of the system, built with Flask, which would read the database and display the predictions.
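The viewing side would be something small along these lines (just a sketch; the database path, table, and column names are placeholders):

```python
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)
DB_PATH = "predictions.db"  # placeholder path

@app.route("/predictions")
def predictions():
    # Read back the predictions written by the nightly job.
    con = sqlite3.connect(DB_PATH)
    try:
        rows = con.execute(
            "SELECT ticker, prediction, run_date FROM predictions "
            "ORDER BY run_date DESC"
        ).fetchall()
    finally:
        con.close()
    return jsonify(
        predictions=[{"ticker": t, "prediction": p, "run_date": d} for t, p, d in rows]
    )
```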
Can I presume that all these libraries and mixed thread/multiprocessing routines will get along with each other? I'm going to use pythonanywhere.com as the host, and they appear to have quite a few "batteries included." Of course, once testing is finished, I'll probably have to upgrade the number of "workers" to power the final deployed site.
Any pitfalls in mixing threads and multiprocessing in something this complicated?
Just some general thoughts that couldn't fit into the comments section:
Scrapy already has ways to process concurrent network requests via Twisted, which means you may not need eventlet at all. Of course, this depends on how exactly you are doing the scraping and what exactly you need to scrape. From what I tried a long time ago (maybe I'm totally wrong), if you, say, needed Selenium to scrape JavaScript-rendered responses, then it's hard to do that concurrently with Scrapy. But if you are just doing GET requests with urllib or something (e.g. to APIs), then I think Scrapy alone is sufficient.
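A bare-bones example of what I mean (spider name, URLs, and selectors are made up; Scrapy's scheduler issues the requests concurrently via Twisted, throttled by CONCURRENT_REQUESTS):

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # Made-up endpoints; Scrapy fetches them concurrently, no eventlet needed.
    start_urls = [
        "https://example.com/quotes/AAPL",
        "https://example.com/quotes/MSFT",
    ]
    custom_settings = {"CONCURRENT_REQUESTS": 16}

    def parse(self, response):
        yield {
            "ticker": response.url.rsplit("/", 1)[-1],
            "price": response.css("span.price::text").get(),
        }
```

Run it with `scrapy crawl quotes -o raw_quotes.json` and the concurrency comes for free.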
I agree with your comment: the web scraping part is always going to be pretty failure-prone, so you definitely want to separate the scraping and predictive parts. You need to take failed scrapes into account (e.g. what if the website is down, or you are getting erroneous data?), clean up all the data before stuffing the cleaned data into your own database, and then (separately) run the machine learning stuff on that data.
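Something along these lines for the cleaning step (the field names are just examples, and raw_rows stands for whatever your spider produced):

```python
def clean_row(raw):
    """Return a validated (ticker, price) tuple, or None if the scrape was bad."""
    try:
        ticker = raw["ticker"].strip().upper()
        price = float(raw["price"].replace(",", ""))
    except (KeyError, AttributeError, ValueError):
        return None          # missing field or unparseable number
    if not ticker or price <= 0:
        return None          # obviously erroneous data
    return ticker, price

# raw_rows: the list of dicts produced by the spider / scrape run
cleaned = [row for row in (clean_row(r) for r in raw_rows) if row is not None]
```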
But one important thing here is that you definitely need a database between the scraping and the machine learning (you cannot just pass the data in memory or via CSV like you suggested). There are countless reasons; a couple are: a database lets you recover from failed or partial scrape runs without losing what you have already collected, and it lets you rerun the models on the accumulated history without scraping everything again.
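For example, even a single SQLite file gives you that separation (the schema is just illustrative, reusing `cleaned` from the snippet above):

```python
import sqlite3
from datetime import date

con = sqlite3.connect("marketdata.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS quotes (
    ticker   TEXT NOT NULL,
    quote_dt TEXT NOT NULL,        -- date of the scraped observation
    price    REAL NOT NULL,
    PRIMARY KEY (ticker, quote_dt)
);
CREATE TABLE IF NOT EXISTS predictions (
    ticker     TEXT NOT NULL,
    run_date   TEXT NOT NULL,      -- date the model was run
    prediction REAL NOT NULL,
    PRIMARY KEY (ticker, run_date)
);
""")
today = date.today().isoformat()
con.executemany(
    "INSERT OR REPLACE INTO quotes (ticker, quote_dt, price) VALUES (?, ?, ?)",
    [(ticker, today, price) for ticker, price in cleaned],
)
con.commit()
con.close()
```

The scraper only ever writes to `quotes`, and the modelling job only ever reads `quotes` and writes `predictions`, so either side can fail and be rerun independently.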
Also, by the way, Django works really well with Scrapy...