
How to differentiate '-' operator from a negative number for a tokenizer

Tags:

c

parsing

token

I am creating an infix expression parser, and so I have to create a tokenizer. It works well, except for one thing: I do not know how to differentiate a negative number from the "-" operator.

For example, if I have:

23 / -23

The tokens should be 23, / and -23, but if I have an expression like

23-22

Then the tokens should be 23, - and 22.

I found a dirty workaround: if I encounter a "-" followed by a number, I look at the previous character, and if that character is a digit or a ')', I treat the "-" as an operator and not as part of a number. Apart from being kind of ugly, it doesn't work for expressions like

--56

where it produces the tokens - and -56 when it should produce --56.

Any suggestions?

Brendan Rius asked Oct 23 '14 13:10



1 Answer

In the first example the tokens should be 23, /, - and 23.
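To illustrate that token split, here is a minimal sketch of a tokenizer that never produces signed numbers. The names (TokenKind, Token, tokenize) are made up for this example, and it assumes integer literals and only the operators + - * /. The point is simply that '-' always comes out as its own operator token.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token types for this sketch. */
typedef enum { TOK_NUM, TOK_OP, TOK_LPAREN, TOK_RPAREN } TokenKind;

typedef struct {
    TokenKind kind;
    long value; /* meaningful when kind == TOK_NUM */
    char op;    /* meaningful when kind == TOK_OP  */
} Token;

/* Splits src into at most max tokens.
 * Returns the token count, or -1 on an unexpected character
 * or when out[] is too small.
 * A '-' is ALWAYS emitted as TOK_OP; numbers are never signed. */
int tokenize(const char *src, Token *out, int max)
{
    assert(src != NULL && out != NULL);
    int n = 0;
    while (*src != '\0') {
        if (isspace((unsigned char)*src)) { src++; continue; }
        if (n == max) return -1;
        out[n].value = 0;
        out[n].op = 0;
        if (isdigit((unsigned char)*src)) {
            /* src points at a digit, so strtol cannot consume a sign. */
            char *end;
            out[n].kind = TOK_NUM;
            out[n].value = strtol(src, &end, 10);
            src = end;
        } else if (strchr("+-*/", *src) != NULL) {
            out[n].kind = TOK_OP;
            out[n].op = *src++;
        } else if (*src == '(') {
            out[n].kind = TOK_LPAREN;
            src++;
        } else if (*src == ')') {
            out[n].kind = TOK_RPAREN;
            src++;
        } else {
            return -1;
        }
        n++;
    }
    return n;
}
```

With this, "23 / -23" yields four tokens (23, /, -, 23) and "23-22" yields three (23, -, 22), with no lookbehind at the previous character.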

The solution then is to let the parser evaluate the tokens according to the rules of precedence and associativity. A - that directly follows / has no left operand, so it cannot be a binary operator there; it binds to 23 as a unary minus instead.

If you encounter --56, it is split into -, -, 56 and the same rules take care of the problem. There is no need for special cases.
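One way to see the rules doing the work is a small recursive-descent evaluator, sketched below. This is only an illustration under some assumptions: integer arithmetic, the operators + - * / and parentheses, no overflow checks, and helper names (Parser, eval, parse_unary and so on) invented for the example. It reads characters directly instead of a token array to stay short. The only place a leading '-' is accepted is parse_unary, which is where an operand is expected, so no lookbehind is needed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical parser state: a cursor and an error flag. */
typedef struct { const char *pos; int error; } Parser;

/* Skips whitespace and returns the next character without consuming it. */
static char peek(Parser *ps)
{
    while (isspace((unsigned char)*ps->pos)) ps->pos++;
    return *ps->pos;
}

static long parse_expr(Parser *ps);

/* primary := number | '(' expr ')' */
static long parse_primary(Parser *ps)
{
    char c = peek(ps);
    if (c == '(') {
        ps->pos++;
        long v = parse_expr(ps);
        if (peek(ps) == ')') ps->pos++; else ps->error = 1;
        return v;
    }
    if (isdigit((unsigned char)c)) {
        char *end;
        long v = strtol(ps->pos, &end, 10); /* starts at a digit: no sign */
        ps->pos = end;
        return v;
    }
    ps->error = 1;
    return 0;
}

/* unary := '-' unary | primary   (handles -23 and --56 alike) */
static long parse_unary(Parser *ps)
{
    if (peek(ps) == '-') { ps->pos++; return -parse_unary(ps); }
    return parse_primary(ps);
}

/* term := unary (('*' | '/') unary)* */
static long parse_term(Parser *ps)
{
    long v = parse_unary(ps);
    for (;;) {
        char c = peek(ps);
        if (c != '*' && c != '/') return v;
        ps->pos++;
        long rhs = parse_unary(ps);
        if (c == '*') v *= rhs;
        else if (rhs != 0) v /= rhs;
        else ps->error = 1; /* division by zero */
    }
}

/* expr := term (('+' | '-') term)*   (here '-' is the binary operator) */
static long parse_expr(Parser *ps)
{
    long v = parse_term(ps);
    for (;;) {
        char c = peek(ps);
        if (c != '+' && c != '-') return v;
        ps->pos++;
        long rhs = parse_term(ps);
        v = (c == '+') ? v + rhs : v - rhs;
    }
}

/* Evaluates src; returns 1 and stores the value on success, 0 on error. */
int eval(const char *src, long *result)
{
    assert(src != NULL && result != NULL);
    Parser ps = { src, 0 };
    long v = parse_expr(&ps);
    if (ps.error || peek(&ps) != '\0') return 0;
    *result = v;
    return 1;
}
```

A '-' seen where an operand is expected is consumed by parse_unary; a '-' seen after a complete operand is consumed by parse_expr as subtraction. That distinction comes from the grammar position alone, which is why the tokenizer can stay simple.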

2501 answered Oct 13 '22 21:10