lexers / parsers for (un) structured text documents [closed]

Question

There are lots of parsers and lexers for scripts (i.e. structured computer languages), but I'm looking for one that can break an (almost) unstructured text document into larger sections, e.g. chapters, paragraphs, etc.

It's relatively easy for a person to identify them: where the Table of Contents or acknowledgements are, or where the main body starts. It is also possible to build rule-based systems to identify some of these (such as paragraphs).

I don't expect it to be perfect, but does anyone know of such a broad 'block-based' lexer / parser? Or could you point me in the direction of literature that may help?

Noufal Ibrahim · Accepted Answer

Many lightweight markup languages such as Markdown (which, incidentally, SO uses), reStructuredText and (arguably) POD are similar to what you're talking about. They have minimal syntax and break input down into parseable syntactic pieces. You might be able to get some information by reading about their implementations.
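As a rough illustration of the block-based idea (not taken from any of those implementations), here is a minimal Python sketch. It assumes blocks are separated by blank lines and that a heading is a single-line block starting with a known keyword. The `HEADING_RE` pattern and the function names `split_blocks`, `classify` and `segment` are made up for this example, so treat them as a starting point to tune rather than a finished parser.

```python
import re

# Hypothetical heading keywords (an assumption); adjust for your documents.
HEADING_RE = re.compile(
    r"^(chapter\s+\w+|table of contents|contents|acknowledge?ments?|preface|introduction)\b",
    re.IGNORECASE,
)


def split_blocks(text):
    """Split text into blocks separated by one or more blank lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [b.strip() for b in blocks if b.strip()]


def classify(block):
    """Label a block 'heading' if it is one line matching HEADING_RE."""
    if "\n" not in block and HEADING_RE.match(block):
        return "heading"
    return "paragraph"


def segment(text):
    """Group paragraphs under the most recent heading."""
    sections = []
    current = {"title": None, "paragraphs": []}
    for block in split_blocks(text):
        if classify(block) == "heading":
            # Close the previous section before starting a new one.
            if current["title"] is not None or current["paragraphs"]:
                sections.append(current)
            current = {"title": block, "paragraphs": []}
        else:
            current["paragraphs"].append(block)
    if current["title"] is not None or current["paragraphs"]:
        sections.append(current)
    return sections


if __name__ == "__main__":
    sample = (
        "Acknowledgements\n\nThanks to all.\n\n"
        "Chapter 1\n\nFirst line\nsecond line.\n\nAnother paragraph."
    )
    for section in segment(sample):
        print(section["title"], len(section["paragraphs"]))
```

Rules like these will misfire on some inputs (a table-of-contents entry such as "Chapter 1 ... 3" would also be labelled a heading), which is in line with not expecting the result to be perfect.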

Tags:

parsing

lexer

document

wilson32
