
C# Tokenizer - keeping the separators [duplicate]

I am porting code from Java to C#. Part of the Java code uses StringTokenizer, and it is my understanding that the tokens it returns also include the separators (in this case +, -, /, *, (, )). I have tried the C# Split() function, but it seems to discard the separators themselves. In the end, this will parse a string and evaluate it as a calculation. I have done a lot of research and have not found any references on the topic.

Does anyone know how to get the actual separators, in the order they were encountered, to be in the split array?

Code for tokenizing:

public CalcLexer(String s)
{
    char[] seps = {'\t','\n','\r','+','-','*','/','(',')'};
    tokens = s.Split(seps);
    advance();
}

Testing:

static void Main(string[] args)
{
    CalcLexer myCalc = new CalcLexer("24+3");
    Console.ReadLine();
}

The input "24+3" results in the output "24", "3". I am looking for an output of "24", "+", "3".

In the interest of full disclosure, this project is part of a class assignment and is based on the following complete source code:

http://www.webber-labs.com/mpl/source%20code/Chapter%20Seventeen/CalcParser.java.txt
http://www.webber-labs.com/mpl/source%20code/Chapter%20Seventeen/CalcLexer.java.txt

Ipster asked Jul 15 '09 21:07

1 Answer

You can use Regex.Split with zero-width assertions. For example, the following will split on +-*/:

Regex.Split(str, @"(?=[-+*/])|(?<=[-+*/])");

Effectively this says, "split at this point if it is followed by, or preceded by, any of -+*/". The matched string itself will be zero-length, so you won't lose any part of the input string.
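As a rough sketch of how this could fit the lexer in the question, the pattern can be extended with the parentheses from the original separator list. The class name TokenSplitter and the method Tokenize are hypothetical, and treating spaces as whitespace separators is an assumption (the question's array lists only tab, newline and carriage return). Because a zero-width match at the very start or end of the input makes Regex.Split return an empty string there, the sketch also drops empty entries.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of the original CalcLexer source.
public static class TokenSplitter
{
    // Zero-width split points: just before or just after an operator or parenthesis.
    private static readonly Regex Boundary =
        new Regex(@"(?=[-+*/()])|(?<=[-+*/()])");

    // Whitespace is a separator that should NOT be kept as a token.
    // The space character is an assumption; the question lists only \t, \n, \r.
    private static readonly char[] Whitespace = { ' ', '\t', '\n', '\r' };

    public static string[] Tokenize(string input)
    {
        return Boundary.Split(input)
            // Break each piece on whitespace and discard empty pieces, including
            // the empty strings produced by matches at the start or end of input.
            .SelectMany(piece => piece.Split(Whitespace, StringSplitOptions.RemoveEmptyEntries))
            .ToArray();
    }
}
```

With this sketch, TokenSplitter.Tokenize("24+3") yields "24", "+", "3", and the result could be assigned to tokens in the CalcLexer constructor in place of s.Split(seps).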

Pavel Minaev answered Sep 30 '22 15:09
