
What are the methods for tokenizing strings in .NET?

This must be a classic .NET question for anyone migrating from Java.

.NET does not seem to have a direct equivalent to java.io.StreamTokenizer; however, the JLCA (Java Language Conversion Assistant) provides a SupportClass that attempts to implement it. I believe the JLCA also provides a Tokenizer SupportClass that takes a String as its source. I expected the StreamTokenizer SupportClass to be derived from it, but it isn't.

What is the preferred way to tokenize both a Stream and a String, if there is one? How are streams tokenized in .NET? I'd like to have the flexibility that java.io.StreamTokenizer provides. Any thoughts?

Jeffrey LeCours asked Sep 26 '08 20:09



3 Answers

There isn't anything in .NET that is completely equivalent to StreamTokenizer. For simple cases, you can use String.Split(), but for more advanced token parsing, you'll probably end up using System.Text.RegularExpressions.Regex.
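As a rough illustration of the Regex route, here is a minimal sketch of a tokenizer that reads from a TextReader, so the same code can serve a String (via StringReader) or a Stream (via StreamReader). The names RegexTokenizer and TokenKind and the token grammar (numbers, words, single symbols) are assumptions made up for this example; they are not part of .NET and do not reproduce java.io.StreamTokenizer's behavior.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical token categories chosen for this sketch.
public enum TokenKind { Number, Word, Symbol }

public static class RegexTokenizer
{
    // Assumed grammar: a decimal number, an identifier-like word,
    // or any single non-whitespace character. Whitespace is skipped.
    private static readonly Regex TokenPattern = new Regex(
        @"(?<Number>\d+(\.\d+)?)|(?<Word>[A-Za-z_]\w*)|(?<Symbol>\S)");

    // Reads line by line, so a token can never span a line break.
    public static IEnumerable<(TokenKind Kind, string Text)> Tokenize(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            foreach (Match m in TokenPattern.Matches(line))
            {
                TokenKind kind =
                    m.Groups["Number"].Success ? TokenKind.Number :
                    m.Groups["Word"].Success ? TokenKind.Word :
                    TokenKind.Symbol;
                yield return (kind, m.Value);
            }
        }
    }

    // Convenience overload for an in-memory string.
    public static IEnumerable<(TokenKind Kind, string Text)> Tokenize(string text)
    {
        return Tokenize(new StringReader(text));
    }
}
```

For a Stream, something like RegexTokenizer.Tokenize(new StreamReader(stream)) should work the same way. Tokenizing "x1 = 3.14 + y" with this grammar yields the tokens x1, =, 3.14, +, y.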

Peter Provost answered Oct 29 '22 19:10


Use System.String.Split if you need to split a string based on a collection of specific characters.

Use System.Text.RegularExpressions.Regex.Split to split based on matching patterns.
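The two options above can be sketched side by side. The class and method names (SplitExamples, SplitOnChars, SplitOnPattern) and the delimiter choices are only illustrative assumptions; the calls themselves are String.Split(char[], StringSplitOptions) and the static Regex.Split(string, string). Note that the class name is spelled Regex.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitExamples
{
    // Split on a fixed set of characters, dropping the empty entries
    // produced by adjacent delimiters.
    public static string[] SplitOnChars(string input)
    {
        char[] delimiters = { ' ', ',', ';' };
        return input.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    // Split on a pattern: a comma or semicolon with optional surrounding
    // whitespace, or a run of whitespace on its own.
    // Regex.Split keeps empty strings if the input starts or ends
    // with a delimiter, so callers may need to filter those out.
    public static string[] SplitOnPattern(string input)
    {
        return Regex.Split(input, @"\s*[,;]\s*|\s+");
    }
}
```

For the input "a, b;c  d", both methods return the four pieces a, b, c, d. The character-based version is simpler and cheaper; the pattern-based version is worth it once delimiters are more than a fixed set of characters.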

Vijesh VP answered Oct 29 '22 20:10


There's a tokenizer in the Nextem library -- you can see an example here: http://trac.assembla.com/nextem/browser/trunk/Examples/Parsing.n

It's implemented as a Nemerle macro, but you can write your tokenizer with it and then use the result from C# easily.

Serafina Brocious answered Oct 29 '22 19:10