
What character can be used to parse for paragraphs with Java?

Tags:

java

I'm sure folks will get a good laugh out of this one, but for the life of me I cannot find a separator that will indicate when a new paragraph has begun in a string of text. Word and line? Easy peasy, but paragraph seems to be much harder to find. I've tried two line breaks in a row, and the Unicode representations of paragraph break and line break, with no luck.

EDIT: I apologize for the vagueness of my original question. To answer some of the questions: it is a basic text file originally created on Windows. I'm testing some code for opening and analyzing its contents with the BlackBerry JDE 4.5 using the RIM Eclipse plugin. While the source of the files will be Windows (at least for the foreseeable future) and they will be basic text, I have no control over how they are created (they come from a third-party source, and I don't have access to the way they are created).

canadiancreed asked Feb 02 '10 22:02


3 Answers

There is no such paragraph break character in common usage.

You might be able to get away with assuming that two or more line breaks in a row (with optional horizontal whitespace between them) indicate a paragraph break. But there are numerous exceptions to this "rule". For example, when a paragraph

  • is interrupted by a floating figure, or
  • contains bullet points

and then continues on ... like this one. For that kind of thing, there is probably no solution.
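For what it's worth, the "two or more line breaks, with optional horizontal whitespace" heuristic can be sketched with a regex. This is only a rough illustration and assumes standard Java SE with `java.util.regex` available (so it will not help on a platform without regex support). The class name `RegexParagraphSplitter` is made up for the example, and the exceptions mentioned above (figures, bullet points) are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not from any library: splits text on blank lines.
final class RegexParagraphSplitter {
    // One line break of any flavour. The atomic group (?>...) stops the
    // regex engine from backtracking and counting "\r\n" as two breaks.
    private static final String BREAK = "(?>\\r\\n|\\r|\\n)";

    // A paragraph break: a line break followed by one or more further
    // line breaks, each optionally preceded by spaces or tabs.
    private static final Pattern PARAGRAPH_BREAK =
            Pattern.compile("[ \\t]*" + BREAK + "(?:[ \\t]*" + BREAK + ")+");

    static List<String> split(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String chunk : PARAGRAPH_BREAK.split(text)) {
            String trimmed = chunk.trim();
            // Skip the empty chunk produced when the text starts with blank lines.
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }
}
```

Single line breaks inside a paragraph are left untouched here; only runs of two or more are treated as separators.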

EDIT per @Aiden's comment below. (It is now clear that this is not relevant to the OP, but it may be relevant to others who find the question via Google, etc.)

Instead of trying to reverse engineer paragraphs from text, perhaps you should consider specifying that your input should be in (for example) Markdown syntax; i.e. as supported by StackOverflow. The Markdown Wiki includes links to markdown parser implementations in many languages, including Java.

(This assumes that you have some control over the input format of the text you are trying to parse into paragraphs, etcetera.)

Stephen C answered Sep 26 '22 03:09


Paragraphs in plain text documents are usually separated by two or more line separators. A line separator may be a linefeed (\n), a carriage-return (\r), or a carriage-return followed by a linefeed (\r\n). These three kinds of separator are typically associated with particular operating systems (\n with Unix and Linux, \r with classic Mac OS, \r\n with Windows), but any application is free to write text using any kind of line separator. In fact, text that's been assembled from diverse sources (like a web page) may well contain two or more kinds of separator. When your app reads text, no matter what platform it's running on, it should always check for all three kinds of line separator.

BufferedReader#readLine() does that, but of course it only reads one line at a time. Simple prose will usually be returned as an alternating sequence of non-empty lines representing paragraphs, and empty lines representing the spaces between them. But don't count on it; watch for multiple empty lines, and be aware that "empty" lines may in fact contain whitespace characters like space (\u0020) and TAB (\u0009).
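Here is a minimal sketch of that line-by-line approach, assuming standard Java SE (generics, `ArrayList`, `String.isEmpty()`). The class and method names are invented for illustration. It treats any line that is empty after `trim()` as a gap, tolerates several such lines in a row, and joins the lines of a paragraph with a single space, which is a design choice you may not want.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups the lines returned by readLine() into paragraphs.
final class LineGroupingParagraphReader {
    static List<String> readParagraphs(Reader source) throws IOException {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        BufferedReader reader = new BufferedReader(source);
        String line;
        // readLine() recognises \n, \r and \r\n and strips them from the result.
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim(); // also removes spaces and tabs
            if (trimmed.isEmpty()) {
                // A blank (or whitespace-only) line closes the current paragraph.
                if (current.length() > 0) {
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(trimmed);
            }
        }
        // Flush the last paragraph if the text did not end with a blank line.
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

Pass it any `Reader`, for example a `StringReader` for in-memory text or an `InputStreamReader` wrapped around the file's stream.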

If you choose not to go with a BufferedReader, you may have to write the detection code from scratch. Java ME doesn't include regex support, so split() and java.util.Scanner are not available; and StringTokenizer makes no distinction between a single delimiter character and several in a row unless you use the returnDelims option. Then it returns the delimiters one character at a time, so you still have to write your own code to figure out what kind of separator you're looking at, if any.
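As a rough idea of what "from scratch" might look like, here is a character-by-character sketch that uses no regex at all. It deliberately sticks to `String`, `StringBuffer` and a raw `Vector`, on the assumption that those are available on the target platform; I have not tested it on a BlackBerry, and the class name is made up. Note that it also collapses any run of whitespace inside a paragraph to a single space.

```java
import java.util.Vector;

// Hypothetical helper: finds paragraph breaks by counting line breaks by hand.
final class ManualParagraphScanner {
    // Returns a Vector of String, one element per paragraph.
    static Vector split(String text) {
        Vector paragraphs = new Vector();
        StringBuffer current = new StringBuffer();
        int breaks = 0;      // line breaks seen since the last visible character
        boolean gap = false; // any whitespace seen since the last visible character
        int n = text.length();
        for (int i = 0; i < n; i++) {
            char c = text.charAt(i);
            if (c == '\r' || c == '\n') {
                // Treat "\r\n" as a single line break.
                if (c == '\r' && i + 1 < n && text.charAt(i + 1) == '\n') {
                    i++;
                }
                breaks++;
                gap = true;
            } else if (c == ' ' || c == '\t') {
                // Horizontal whitespace does not reset the break count.
                gap = true;
            } else {
                if (breaks >= 2 && current.length() > 0) {
                    // Two or more breaks since the last text: new paragraph.
                    paragraphs.addElement(current.toString());
                    current.setLength(0);
                } else if (gap && current.length() > 0) {
                    // Same paragraph: replace the whitespace run with one space.
                    current.append(' ');
                }
                current.append(c);
                breaks = 0;
                gap = false;
            }
        }
        if (current.length() > 0) {
            paragraphs.addElement(current.toString());
        }
        return paragraphs;
    }
}
```

On a modern compiler the raw `Vector` produces an unchecked warning; that is the price of keeping the code free of generics.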

Alan Moore answered Sep 23 '22 03:09


It is possible that instead of a line feed you need to look for a CR LF sequence (\r\n). Obviously, the answer depends on the text format.

Ofir answered Sep 25 '22 03:09