If I have a string
'x+13.5*10x-4e1'
how can I split it into the following list of tokens?
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
Currently I'm using the shlex module:
str = 'x+13.5*10x-4e1'
lexer = shlex.shlex(str)
tokenList = []
for token in lexer:
tokenList.append(str(token))
return tokenList
But this returns:
['x', '+', '13', '.', '5', '*', '10x', '-', '4e1']
So I'm trying to split the letters from the numbers. I'm considering taking the strings that contain both letters and numbers then somehow splitting them, but not sure about how to do this or how to add them all back into the list with the others afterwards. It's important that the tokens stay in order, and I can't have nested lists.
In an ideal world, e and E would not be recognised as letters in the same way, so
'-4e1'
would become
['-', '4e1']
but
'-4x1'
would become
['-', '4', 'x', '1']
Can anybody help?
You can tokenize any string with the 'split()' function in Python. This function takes a string as an argument, and you can further set the parameter of splitting the string. However, if you don't set the parameter of the function, it takes 'space' as a default parameter to split the strings.
Q #4) How to split a string in Java without delimiter or How to split each character in Java? Answer: You just have to pass (“”) in the regEx section of the Java Split() method. This will split the entire String into individual characters.
Use the regular expression module's split()
function, to split at
'\d+'
-- digits (number characters) and'\W+'
-- non-word characters:CODE:
import re
print([i for i in re.split(r'(\d+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13', '.', '5', '*', '10', 'x', '-', '4', 'e', '1']
If you don't want to separate the dot (as a floating-point number in the expression) then you should use this:
[\d.]+
-- digit or dot characters (although this allows you to write: 13.5.5
CODE:
print([i for i in re.split(r'([\d.]+|\W+)', 'x+13.5*10x-4e1') if i])
OUTPUT:
['x', '+', '13.5', '*', '10', 'x', '-', '4', 'e', '1']
Another alternative not suggested here, is to using nltk.tokenize module
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With