I almost found the answer to this question in this thread (samplebias's answer); however, I need to split a phrase into words, digits, punctuation marks, and spaces/tabs. I also need it to preserve the order in which each of these things occurs (which the code in that thread already does).
So, what I've found is something like this:
>>> from nltk.tokenize import regexp_tokenize
>>> txt = "Today it's \t07.May 2011. Or 2.999."
>>> regexp_tokenize(txt, pattern=r'\w+([.,]\w+)*|\S+')
['Today', 'it', "'s", '07.May', '2011', '.', 'Or', '2.999', '.']
But this is the kind of list I need to yield:
['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.']
Regex has always been one of my weak points, so after a couple of hours of research I'm still stumped. Thank you!
I think that something like this should work for you. There is probably more in that regex than there needs to be, but your requirements are somewhat vague and don't exactly match up with the expected output you provided.
>>> import re
>>> txt = "Today it's \t07.May 2011. Or 2.999."
>>> p = re.compile(r"\d+|[-'a-z]+|[ ]+|\s+|[.,]+|\S+", re.I)
>>> slice_starts = [m.start() for m in p.finditer(txt)] + [None]
>>> [txt[s:e] for s, e in zip(slice_starts, slice_starts[1:])]
['Today', ' ', "it's", ' ', '\t', '07', '.', 'May', ' ', '2011', '.', ' ', 'Or', ' ', '2', '.', '999', '.']
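The slicing over the match start offsets just reconstructs the tokens so nothing between matches gets lost. If you want to keep compound tokens like '07.May' and '2.999' intact, as in your expected list, one option is to reuse your original pattern and add a single-whitespace alternative so each space and tab becomes its own token. This is only a sketch using the stdlib re module directly rather than NLTK's regexp_tokenize, so I haven't checked how it interacts with NLTK's handling of groups:
>>> import re
>>> txt = "Today it's \t07.May 2011. Or 2.999."
>>> # same idea as your pattern, with a non-capturing group and a [ \t] alternative
>>> re.findall(r"\w+(?:[.,]\w+)*|[ \t]|\S+", txt)
['Today', ' ', 'it', "'s", ' ', '\t', '07.May', ' ', '2011', '.', ' ', 'Or', ' ', '2.999', '.']
The group is non-capturing ((?:...)) so findall returns the whole match rather than just the group, and the added [ \t] alternative is what keeps each space or tab as a separate token instead of dropping it.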