Is there a common online algorithm to classify news dynamically? I have a huge dataset of news articles classified by topic, and I treat each topic as a cluster. Now I need to classify breaking news, which will probably require generating new topics (clusters) dynamically.
The algorithm I'm using is the following:
1) I go through a group of feeds from news sites and extract the news links.
2) For each new link, I extract the content using dragnet, and then I tokenize it.
3) I compute the vector representations of all the old news and the latest article using TfidfVectorizer from sklearn.
4) I find the nearest neighbor in my dataset by computing the Euclidean distance between the latest article's vector and the vectors of all the old news.
5) If that distance is smaller than a threshold, I put the article in the cluster its neighbor belongs to. Otherwise, I create a new cluster for the breaking news.
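Steps 3-5 above can be sketched as follows. This is a minimal illustration, not the exact code: the corpus, the threshold value, and the cluster bookkeeping are all assumptions for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

THRESHOLD = 1.2  # assumed distance threshold, to be tuned on real data

# Toy stand-ins for the existing corpus and its cluster labels
old_news = ["stocks fall on rate fears", "team wins championship final"]
clusters = [0, 1]
breaking = "markets slide as rates rise"

# Re-fit TF-IDF on all documents so new vocabulary is included (step 3)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(old_news + [breaking])
old_vecs, new_vec = X[:-1], X[-1]

# Nearest neighbor by Euclidean distance (step 4)
dists = euclidean_distances(new_vec, old_vecs).ravel()
nearest = int(np.argmin(dists))

# Threshold decision: join the neighbor's cluster or open a new one (step 5)
if dists[nearest] < THRESHOLD:
    label = clusters[nearest]
else:
    label = max(clusters) + 1
```

Note that because TF-IDF rows are L2-normalized by default, documents with no shared vocabulary end up at distance √2 ≈ 1.41, which bounds how the threshold behaves.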
Each time a news item arrives, I re-fit all the data with TfidfVectorizer, because new dimensions (new vocabulary) can appear. I can't wait and re-fit once per day, because I need to detect breaking events, which can be related to unknown topics. Is there a common approach more efficient than the one I am using?
If you build the vectorization yourself, adding new data will be much easier.
There are well-known, very fast implementations of this, for example Apache Lucene. It can add new documents online, and it uses a variant of TF-IDF for search.
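If you stay in Python rather than moving to Lucene, one common way to make the vectorization incremental (my suggestion, not something the answer prescribes) is sklearn's HashingVectorizer: it hashes tokens into a fixed number of buckets, so the feature space never grows and new documents never force a re-fit. The documents below are invented for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stateless vectorizer: no fit step, fixed dimensionality.
# alternate_sign=False keeps all weights non-negative, like plain counts.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

old = vectorizer.transform(["stocks fall on rate fears"])
new = vectorizer.transform(["stocks fall again on rate fears"])

# Vectors are directly comparable even though nothing was ever fitted
sim = cosine_similarity(new, old)[0, 0]
```

The trade-off is that you lose IDF weighting; a common compromise is to keep hashed counts online and re-estimate IDF weights (e.g. with TfidfTransformer) on a slower schedule.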