This must be a classic .NET question for anyone migrating from Java.
.NET does not seem to have a direct equivalent to java.io.StreamTokenizer; however, the JLCA provides a SupportClass that attempts to implement it. I believe the JLCA also provides a Tokenizer SupportClass that takes a String as the source. I expected StreamTokenizer to be derived from it, but it isn't.
What is the preferred way to tokenize both a Stream and a String, or is there one? How are streams tokenized in .NET? I'd like to have the flexibility that java.io.StreamTokenizer provides. Any thoughts?
Tokenization is the act of breaking up a sequence of characters into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters, such as punctuation marks, may be discarded.
In data security, tokenization has a different meaning: it works by removing the valuable data from your environment and replacing it with tokens. Most businesses hold at least some sensitive data within their systems, whether it be credit card data, medical information, Social Security numbers, or anything else that requires security and protection.
Tokenization is the process of splitting a string or text into a list of tokens. One can think of a token as a part of a larger unit: a word is a token in a sentence, and a sentence is a token in a paragraph. How does sent_tokenize work? The sent_tokenize function uses an instance of PunktSentenceTokenizer from nltk.
The simplest way to tokenize text is to use whitespace within a string as the “delimiter” of words. This can be accomplished with Python's split function, which is available on all string object instances as well as on the string built-in class itself. You can change the separator any way you need.
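As a rough illustration of the whitespace approach described above, here is a minimal Python sketch. The sample strings and variable names are made up for the example; only str.split itself comes from the text.

```python
text = "the quick  brown fox"

# With no argument, split() breaks on runs of whitespace and
# ignores leading and trailing whitespace.
words = text.split()

# The same call made through the built-in str class.
words_via_class = str.split(text)

# With an explicit separator, consecutive separators produce
# empty strings instead of being collapsed.
fields = "a,b,,c".split(",")

# The optional second argument (maxsplit) caps the number of splits.
head_tail = "key=value=more".split("=", 1)

print(words)      # ['the', 'quick', 'brown', 'fox']
print(fields)     # ['a', 'b', '', 'c']
print(head_tail)  # ['key', 'value=more']
```

Note the difference between the two modes: the no-argument form collapses repeated whitespace, while an explicit separator does not collapse repeats.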
There isn't anything in .NET that is completely equivalent to StreamTokenizer. For simple cases, you can use String.Split(), but for more advanced token parsing, you'll probably end up using System.Text.RegularExpressions.Regex.
Use System.String.Split if you need to split a string based on a collection of specific characters.
Use System.Text.RegularExpressions.Regex.Split to split based on matching patterns.
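To make the pattern-based idea concrete, here is a small sketch of a regex tokenizer that handles both a string and a stream. It is written in Python rather than C# purely for illustration; the token classes (NUMBER, WORD, SYMBOL) and the function names are hypothetical, not part of any .NET or JLCA API. The same named-group alternation idea should carry over to the .NET Regex class. This version assumes tokens never span line breaks.

```python
import io
import re

# Hypothetical token classes, chosen only for this example.
TOKEN_PATTERN = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)   # integer or decimal literal
  | (?P<WORD>[A-Za-z_]\w*)      # identifier-like word
  | (?P<SYMBOL>[^\s\w])         # any single punctuation character
""", re.VERBOSE)

def tokenize(text):
    """Yield (kind, value) pairs from a string; whitespace is skipped."""
    for match in TOKEN_PATTERN.finditer(text):
        yield match.lastgroup, match.group()

def tokenize_stream(stream):
    """Yield tokens from a text stream, reading one line at a time."""
    for line in stream:
        yield from tokenize(line)

tokens = list(tokenize("x1 = 3.14 * (y + 2)"))
stream_tokens = list(tokenize_stream(io.StringIO("a 1\nb 2\n")))

print(tokens)
print(stream_tokens)
```

Building the stream version on top of the string version means one pattern serves both sources, which is roughly the flexibility the question asks for.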
There's a tokenizer in the Nextem library -- you can see an example here: http://trac.assembla.com/nextem/browser/trunk/Examples/Parsing.n
It's implemented as a Nemerle macro, but you can write the tokenizer in Nemerle and then use it from C# easily.