
Stripping proper nouns from text

I have a df with several thousand rows of text data. I'm using spaCy to do some NLP on a single column of that df and am trying to remove proper nouns, stop words, and punctuation from my text data using the following:

tokens = []
lemma = []
pos = []

for doc in nlp.pipe(df['TIP_all_txt'].astype('unicode').values, batch_size=9845,
                        n_threads=3):
    if doc.is_parsed:
        tokens.append([n.text for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
        lemma.append([n.lemma_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
        pos.append([n.pos_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
    else:
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

df['s_tokens_all_txt'] = tokens
df['s_lemmas_all_txt'] = lemma
df['s_pos_all_txt'] = pos

df.head()

But I get this error and I'm not sure why:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-34-73578fd46847> in <module>()
      6                         n_threads=3):
      7     if doc.is_parsed:
----> 8         tokens.append([n.text for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
      9         lemma.append([n.lemma_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
     10         pos.append([n.pos_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])

<ipython-input-34-73578fd46847> in <listcomp>(.0)
      6                         n_threads=3):
      7     if doc.is_parsed:
----> 8         tokens.append([n.text for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
      9         lemma.append([n.lemma_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])
     10         pos.append([n.pos_ for n in doc if not n.is_punct and not n.is_stop and not n.is_space and not n.is_propn])

AttributeError: 'spacy.tokens.token.Token' object has no attribute 'is_propn'

If I take out the not n.is_propn the code runs as expected. I've googled around and read the spaCy documentation, but haven't been able to find an answer thus far.

LMGagne asked Mar 08 '23 05:03


2 Answers

I don't see an is_propn attribute on the Token object.

I think you should instead check whether the token's part of speech is PROPN (reference):

from spacy.parts_of_speech import PROPN

def is_proper_noun(token):
    if not token.doc.is_tagged:  # check if the document was POS-tagged
        raise ValueError('token is not POS-tagged')

    return token.pos == PROPN
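
To illustrate how this kind of check slots into the question's filter, here is a minimal sketch that runs without spaCy installed. FakeToken is a hypothetical stand-in invented for this example; it only mimics the real Token attribute names text, pos_, is_punct, is_stop and is_space, and the part-of-speech labels are assigned by hand. With real spaCy tokens, comparing tok.pos_ against the string "PROPN" should behave the same way, assuming the pipeline has POS-tagged the document.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for spacy.tokens.Token (illustration only)."""
    text: str
    pos_: str
    is_punct: bool = False
    is_stop: bool = False
    is_space: bool = False


def keep(tok):
    """Return True for tokens that survive the question's filter."""
    return not (tok.is_punct or tok.is_stop or tok.is_space
                or tok.pos_ == "PROPN")


# Hand-labelled tokens for the sentence "Alice likes the cats ."
doc = [
    FakeToken("Alice", "PROPN"),
    FakeToken("likes", "VERB"),
    FakeToken("the", "DET", is_stop=True),
    FakeToken("cats", "NOUN"),
    FakeToken(".", "PUNCT", is_punct=True),
]

kept = [tok.text for tok in doc if keep(tok)]
print(kept)
```

The proper noun, the stop word and the punctuation mark are dropped, leaving only "likes" and "cats".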

alecxe answered Mar 09 '23 17:03


Adding on to @alecxe's answer.

There's no need to

  • populate all the rows of the DataFrame in one go.
  • build separate tokens, lemmas and pos lists when populating the DataFrame.

You can try:

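rows = []  # one dict per document; df itself is left untouched

annotated_docs = nlp.pipe(df['TIP_all_txt'].astype('unicode').values,
                          batch_size=9845, n_threads=3)

for doc in annotated_docs:
    if doc.is_parsed:
        # Keep (text, lemma, POS) only for the tokens that you want.
        kept = [(tok.text, tok.lemma_, tok.pos_)
                for tok in doc if not
                (tok.is_punct or tok.is_stop
                 or tok.is_space or is_proper_noun(tok))]
        # zip(*kept) transposes the triples; guard against an empty list.
        tokens, lemmas, pos = zip(*kept) if kept else ((), (), ())
        rows.append({'tokens': tokens, 'lemmas': lemmas, 'pos': pos})

# DataFrame.append returns a copy rather than modifying in place,
# so build the result once from the collected rows.
result = pd.DataFrame(rows, columns=['tokens', 'lemmas', 'pos'])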
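
The zip(*...) idiom in the loop above transposes a list of (text, lemma, POS) triples into three parallel tuples. The sketch below shows this with made-up triples; the data and the helper name transpose_triples are assumptions for illustration only. It also shows one caveat: if every token in a document is filtered out, the list is empty and unpacking into three names raises a ValueError, which is why a guard for the empty case is useful.

```python
def transpose_triples(triples):
    """Split [(text, lemma, pos), ...] into three parallel tuples."""
    if not triples:
        # Nothing survived the filter: return three empty tuples.
        return (), (), ()
    return tuple(zip(*triples))


# Made-up triples standing in for the filtered tokens of one document.
triples = [("likes", "like", "VERB"), ("cats", "cat", "NOUN")]
tokens, lemmas, pos = transpose_triples(triples)
print(tokens, lemmas, pos)

# Without the guard, an empty list cannot be unpacked into three names.
try:
    a, b, c = zip(*[])
    unpack_failed = False
except ValueError:
    unpack_failed = True
print(unpack_failed)

# With the guard, the empty case still yields three (empty) tuples.
empty = transpose_triples([])
```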

Here's a neater pandas trick, from "How to split column of tuples in pandas dataframe?", though the DataFrame will take up more memory:

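triples = []  # one (tokens, lemmas, pos) triple per document

annotated_docs = nlp.pipe(df['TIP_all_txt'].astype('unicode').values,
                          batch_size=9845, n_threads=3)

for doc in annotated_docs:
    if doc.is_parsed:
        # Keep (text, lemma, POS) only for the tokens that you want.
        kept = [(tok.text, tok.lemma_, tok.pos_)
                for tok in doc if not
                (tok.is_punct or tok.is_stop
                 or tok.is_space or is_proper_noun(tok))]
        # Transpose so that each cell holds exactly three tuples.
        triples.append(tuple(zip(*kept)) if kept else ((), (), ()))

result = pd.DataFrame({'Tokens': triples})
# Split the column of triples into three separate columns.
result[['tokens', 'lemmas', 'pos']] = result['Tokens'].apply(pd.Series)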

alvas answered Mar 09 '23 19:03