
Extract words out of a text file

Tags: java, text

Let's say you have a text file like this one: http://www.gutenberg.org/files/17921/17921-8.txt

Does anyone have a good algorithm, or open-source code, to extract words from a text file? How can I get all the words while skipping special characters, but keeping things like "it's" intact?

I'm working in Java. Thanks

asked Nov 09 '08 22:11 by Nathan H



1 Answer

This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don't know how to start:

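import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
// One or more word characters or apostrophes, so "it's" stays a single match.
Pattern p = Pattern.compile("[\\w']+");
Matcher m = p.matcher(input);

while (m.find()) {
    System.out.println(m.group()); // the text of the current match
}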

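The pattern [\w']+ matches one or more consecutive characters that are either word characters (letters, digits, and the underscore) or apostrophes, so a contraction such as "it's" stays in one piece. The example string is printed one word per line. Have a look at the Java Pattern class documentation to read more.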
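To apply the same idea to a whole file, something along these lines might work. This is only a sketch under a few assumptions: the class name WordExtractor and the method extractWords are made up for illustration, the file is assumed to be ISO-8859-1 encoded (a guess based on the "-8" suffix of the Gutenberg file name), and the pattern is a variant of the one above rather than the same one. By default \w in Java covers only ASCII letters, digits, and the underscore, so the variant uses \p{L} (any Unicode letter) and accepts an apostrophe only between letters. That drops numbers and stray quote marks, which may or may not be what you want.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are not from the original answer.
final class WordExtractor {

    // A run of letters, optionally followed by groups of "apostrophe + letters".
    // Matches "it's" as one word, but not a leading or trailing quote mark.
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:'\\p{L}+)*");

    // Returns every match of WORD in the text, in order of appearance.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WordExtractor <file>");
            return;
        }
        // Assumption: the file is ISO-8859-1; change the charset if yours differs.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String text = new String(bytes, StandardCharsets.ISO_8859_1);

        for (String word : extractWords(text)) {
            System.out.println(word);
        }
    }
}
```

Run it with the path to a downloaded copy of the text as the only argument. Reading the whole file into memory is fine for a single book; for much larger inputs, matching line by line would be the more careful choice.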

answered Sep 21 '22 08:09 by Tomalak