I'm writing a source-to-source transformation using Parsec, so I have a `LanguageDef` for my language and I build a `TokenParser` for it using `Text.Parsec.Token.makeTokenParser`:
```haskell
myLanguage = LanguageDef { ...
    commentStart = "/*"
  , commentEnd = "*/"
  ...
  }

-- defines 'stringLiteral', 'identifier', etc.
TokenParser {..} = makeTokenParser myLanguage
```
Unfortunately, since I defined `commentStart` and `commentEnd`, each of the parser combinators in the `TokenParser` is a lexeme parser implemented in terms of `whiteSpace`, and `whiteSpace` eats comments as well as spaces.

What is the right way to preserve comments in this situation?
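To make the problem concrete, here's a self-contained sketch of the setup (the names and the `emptyDef`-based definition are mine, standing in for the elided fields above) showing the comment being silently consumed:

```haskell
import Text.Parsec
import Text.Parsec.Token (TokenParser, makeTokenParser, identifier, whiteSpace,
                          commentStart, commentEnd)
import Text.Parsec.Language (emptyDef)

-- a minimal LanguageDef with only the comment delimiters set
lexer :: TokenParser ()
lexer = makeTokenParser emptyDef { commentStart = "/*", commentEnd = "*/" }

-- the comment is eaten along with the surrounding whitespace
demo :: Either ParseError String
demo = parse (whiteSpace lexer *> identifier lexer) "" "/* lost */ foo"
-- demo == Right "foo"; the text " lost " is gone
```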
Approaches I can think of:

1. Leave `commentStart` and `commentEnd` undefined, so `whiteSpace` never consumes comments.
2. Wrap each of the lexeme parsers in another combinator that grabs comments before parsing each token.
3. Modify `makeTokenParser` (or perhaps use some library that generalizes `Text.Parsec.Token`; if so, which library?).

What's the done thing in this situation?
In principle, defining `commentStart` and `commentEnd` doesn't fit with preserving comments: to preserve them, you need to treat comments as valid parts of both the source and target languages, including them in your grammar and your AST/ADT.

That way you can keep the text of the comment as the payload of a `Comment` constructor, and output it appropriately in the target language, something like

```haskell
data Statement = Comment String | Return Expression | ...
```

The fact that neither the source nor the target language treats the comment text as meaningful is irrelevant to your translation code.
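For instance, the translation side can then pass the payload straight through. This is a toy sketch; `Expression`, `Var`, and the output syntax are placeholders of mine:

```haskell
-- placeholder expression type for the sketch
data Expression = Var String

data Statement
  = Comment String        -- comment text carried through the AST
  | Return Expression

-- emit a statement in the (hypothetical) target language
emit :: Statement -> String
emit (Comment s)      = "/*" ++ s ++ "*/"
emit (Return (Var v)) = "return " ++ v ++ ";"
```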
The major problem with this approach: it doesn't really fit with `makeTokenParser`, and works better if you implement your source language's parser from the ground up.
I guess I'm veering towards editing `makeTokenParser` to just get the comment parsers to return the `String` instead of `()`.
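A rough sketch of what that edit amounts to (a standalone approximation, not the actual `Text.Parsec.Token` source): a `whiteSpace`-like parser that returns the comments it consumed instead of discarding them. This simplified version handles only block comments, without nesting:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- like Text.Parsec.Token's whiteSpace, but keeping comment text;
-- simplified: no line comments, no nested comments
whiteSpaceKeep :: Parser [String]
whiteSpaceKeep = skipMany space *> many (oneComment <* skipMany space)
  where
    oneComment = try (string "/*") *> manyTill anyChar (try (string "*/"))
```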