Java API : downloading and calculating tf-idf for a given web page

2 Answers

You could try the Word Vector Tool - it's been a while since the latest release, but it works fine here. It should be able to perform all of the steps you mention. I've never used the crawler part myself, however.

answered Oct 16 '22 12:10

Christoph Seibert

Actually, TF-IDF is a score given to a term in a document, rather than to the whole document. If you just want the TF-IDF scores per term in a document, maybe use this method, without ever touching Lucene. If you want to create a search engine, you need to do a bit more, such as extracting the text from the given URLs, since the documents behind them will probably be HTML rather than raw text. If this is the case, consider using Solr.
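To make the "score per term in a document" point concrete, here is a minimal sketch in plain Java with no Lucene involved. It is only one common variant (relative term frequency times the natural log of N/df), and the class and method names (`TfIdfSketch`, `tokenize`, `tfIdf`) are made up for illustration. It assumes the page text has already been downloaded and stripped of HTML, and that the document being scored is itself part of the corpus.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of Lucene, Solr, or any other library.
final class TfIdfSketch {

    // Lowercases the text and splits on anything that is not a letter or digit.
    // Assumes plain text; HTML must be stripped before calling this.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns one TF-IDF score per distinct term of doc.
    // tf  = occurrences of the term in doc / number of tokens in doc
    // idf = ln(number of documents / number of documents containing the term)
    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> corpus) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : doc) {
            counts.merge(term, 1, Integer::sum);
        }

        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int df = 0;
            for (List<String> other : corpus) {
                if (other.contains(e.getKey())) {
                    df++;
                }
            }
            // Guard against df == 0 when doc is not in the corpus (assumption:
            // such a term is treated as appearing in one document).
            df = Math.max(1, df);

            double tf = (double) e.getValue() / doc.size();
            double idf = Math.log((double) corpus.size() / df);
            scores.put(e.getKey(), tf * idf);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = new ArrayList<>();
        corpus.add(tokenize("the cat sat"));
        corpus.add(tokenize("the dog barked"));

        // Print the score of every term in the first document.
        tfIdf(corpus.get(0), corpus)
                .forEach((term, score) -> System.out.println(term + "\t" + score));
    }
}
```

A term that occurs in every document gets an IDF of ln(1) = 0, so its score is 0 no matter how often it appears. Lucene's own scoring uses a different, smoothed formula, so the numbers from this sketch will not match what Lucene or Solr report.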

answered Oct 16 '22 12:10

Yuval F

Tags:

java

solr

lucene

tf-idf

Yuvi
