Preserving comments in `Text.Parsec.Token` tokenizers

I'm writing a source-to-source transformation using parsec, so I have a `LanguageDef` for my language and I build a `TokenParser` for it using `Text.Parsec.Token.makeTokenParser`:

myLanguage = LanguageDef { ...
  commentStart = "/*"
  , commentEnd = "*/"
  ...
}

-- defines 'stringLiteral', 'identifier', etc.
-- (a top-level record-wildcard pattern binding; needs RecordWildCards)
TokenParser {..} = makeTokenParser myLanguage

Unfortunately, since I defined `commentStart` and `commentEnd`, each of the parser combinators in the `TokenParser` is a lexeme parser implemented in terms of `whiteSpace`, and `whiteSpace` consumes comments along with spaces, discarding their text.

What is the right way to preserve comments in this situation?

Approaches I can think of:

  1. Don't define `commentStart` and `commentEnd`. Wrap each of the lexeme parsers in another combinator that grabs comments before parsing each token (sketched below).
  2. Implement my own version of `makeTokenParser` (or perhaps use some library that generalizes `Text.Parsec.Token`; if so, which library?)
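
Here's roughly what I mean by option 1. Names like `comment`, `withComments`, and `identifierWithComments` are placeholders of my own, not part of parsec:

import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok
import Text.Parsec.Language (emptyDef)

-- emptyDef leaves commentStart/commentEnd empty, so the generated
-- whiteSpace only consumes spaces, never comments
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser emptyDef

-- parse one /* ... */ block comment, keeping its text
comment :: Parser String
comment = try (string "/*") *> manyTill anyChar (try (string "*/"))

-- collect any comments (and surrounding spaces) in front of a token,
-- returning them alongside the token's result
withComments :: Parser a -> Parser ([String], a)
withComments p = do
  cs <- spaces *> many (comment <* spaces)
  x  <- p
  return (cs, x)

-- e.g. parsing "/* doc */ foo" gives ([" doc "], "foo")
identifierWithComments :: Parser ([String], String)
identifierWithComments = withComments (Tok.identifier lexer)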

What's the done thing in this situation?

asked Jun 26 '14 by Lambdageek



1 Answer

In principle, defining `commentStart` and `commentEnd` doesn't fit with preserving comments: to preserve them, you need to treat comments as valid parts of both the source and target languages, including them in your grammar and your AST/ADT.

That way, you can keep the text of the comment as the payload of a `Comment` constructor and output it appropriately in the target language, something like

data Statement = Comment String | Return Expression | ...

The fact that neither the source nor the target language treats the comment text as meaningful doesn't matter to your translation code: it's still data you carry through the pipeline.
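
For example (the exact constructors are whatever your languages need; these are invented for illustration):

data Expression = Var String | IntLit Integer
  deriving Show

data Statement
  = Comment String          -- carries the comment text verbatim
  | Return Expression
  deriving Show

-- when emitting the target language, comments are printed back out
render :: Statement -> String
render (Comment txt) = "/*" ++ txt ++ "*/"
render (Return e)    = "return " ++ renderExpr e ++ ";"

renderExpr :: Expression -> String
renderExpr (Var x)    = x
renderExpr (IntLit n) = show n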


The major problem with this approach: it doesn't fit well with `makeTokenParser`, and works better if you implement your source language's parser from the ground up.

I guess I'm veering towards editing `makeTokenParser` so that the comment parsers return the comment's `String` instead of `()`.
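
Roughly like this, with the delimiters hard-coded to `/*` and `*/` for brevity (a real edit would take them from the `LanguageDef`, and would handle `commentLine` and nested comments too):

import Text.Parsec
import Text.Parsec.String (Parser)

-- a whiteSpace that hands back the comment texts it consumed,
-- instead of discarding them as Text.Parsec.Token does
whiteSpace' :: Parser [String]
whiteSpace' = spaces *> many (multiLineComment <* spaces)
  where
    multiLineComment =
      try (string "/*") *> manyTill anyChar (try (string "*/"))

-- the corresponding lexeme: each token also yields the comments
-- consumed after it, mirroring Token's trailing-whitespace convention
lexeme' :: Parser a -> Parser (a, [String])
lexeme' p = (,) <$> p <*> whiteSpace'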

answered Sep 20 '22 by AndrewC