
How can I split a string into tokens?

If I have a string

'x+13.5*10x-4e1'

how can I split it into the following list of tokens?

['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']

Currently I'm using the shlex module:

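import shlex

def tokenize(expression):
    # Avoid naming the variable `str`: that would shadow the built-in
    # and make the str(token) call below fail.
    lexer = shlex.shlex(expression)
    token_list = []
    for token in lexer:
        token_list.append(str(token))
    return token_list

tokenize('x+13.5*10x-4e1')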

But this returns:

['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']

So I'm trying to split the letters from the numbers. I'm considering taking the tokens that contain both letters and numbers and splitting them somehow, but I'm not sure how to do this, or how to put the pieces back into the list with the other tokens afterwards. It's important that the tokens stay in order, and I can't have nested lists.

In an ideal world, e and E would not be recognised as letters in the same way, so

'-4e1'

would become

['-', '4e1']

but

'-4x1'

would become

['-', '4', 'x', '1']

Can anybody help?

Martin Thetford asked Aug 19 '13 11:08



2 Answers

Use the split() function from the regular expression module (re.split) to split at:

  • '\d+' -- digits (number characters) and
  • '\W+' -- non-word characters:

CODE:

import re

print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])

OUTPUT:

['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
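
The pattern is wrapped in a capturing group, which is why re.split() returns the matched digits and operators instead of discarding them. The `if i` filter then drops the empty strings that appear wherever two matches sit next to each other. A minimal sketch of the difference, with illustrative variable names that are not part of the answer above:

```python
import re

expr = 'x+13.5*10x-4e1'

# Without a capturing group, the matched digits and operators are discarded
# and only the text between matches is returned.
without_group = re.split(r'\d+|\W+', expr)

# With a capturing group, re.split also returns the matched text,
# interleaved with empty strings where two matches are adjacent.
with_group = re.split(r'(\d+|\W+)', expr)

# Dropping the empty strings leaves the tokens in their original order.
tokens = [piece for piece in with_group if piece]
print(tokens)
```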

If you don't want to split at the dot (so that a floating-point number in the expression stays in one piece), use this instead:

  • '[\d.]+' -- digit or dot characters (although this also accepts malformed input such as 13.5.5)

CODE:

print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])

OUTPUT:

['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
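
Neither pattern keeps exponent notation such as 4e1 together, which the question lists as the ideal behaviour. One possible extension, sketched here with re.findall() instead of re.split(): the function name `tokenize` and the exact pattern are assumptions made for illustration, and the sketch assumes a number is a run of digits with an optional fractional part and an optional e/E exponent.

```python
import re

# Alternatives are tried left to right at each position:
#   1. a number: digits, optional ".digits", optional exponent (e or E,
#      optional sign, digits)
#   2. a run of letters
#   3. any other single non-whitespace character (operators, stray dots)
TOKEN_PATTERN = re.compile(r'\d+(?:\.\d+)?(?:[eE][+-]?\d+)?|[A-Za-z]+|\S')

def tokenize(expression):
    """Return the tokens of `expression` as a flat list, in order."""
    return TOKEN_PATTERN.findall(expression)

print(tokenize('-4e1'))
print(tokenize('-4x1'))
print(tokenize('x+13.5*10x-4e1'))
```

Because the exponent part only matches when e or E is followed by digits, '-4x1' still splits into separate tokens, while '-4e1' keeps '4e1' whole. Unlike '[\d.]+', this pattern does not accept 13.5.5 as one number; it yields '13.5', '.', '5'.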
Peter Varo answered Sep 28 '22 16:09


Another alternative, not suggested here, is to use the nltk.tokenize module.

redrubia answered Sep 28 '22 16:09