How can I remove POS tags before slashes in nltk?

Question

This is part of my project where I need to represent the output after phrase detection like this - (a,x,b) where a, x, b are phrases. I constructed the code and got the output like this:

(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))

I want to make it just like the previous representation which means I have to remove 'CLAUSE', 'NP', 'VP', 'VBD', 'NNP' etc tags.

How to do that?

What I tried

First wrote this in a text file, tokenize and used list.remove('word'). But that is not at all helpful. I am clarifying a bit more.

My Input

(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP)) (CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))

Output will be

[Jack,loved,Peter], [Jack,stayed,in London] The output is just according to the braces and without the tags.

alexis · Accepted Answer

Since you tagged this nltk, let's use the NLTK's tree parser to process your trees. We'll read in each tree, then simply print out the leaves. Done.

>>> text ="(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))"
>>> tree = nltk.Tree.fromstring(text, read_leaf=lambda x: x.split("/")[0])
>>> print(tree.leaves())

['Jack', 'stayed', 'in', 'London']

The lambda form splits each word/tag pair and discards the tag, keeping just the word.

Multiple trees

I know, you're going to ask me how to process a whole file's worth of such trees, and some of them take more than one line. That's the job of the NLTK's BracketParseCorpusReader, but it expects terminals to be in the form (POS word) instead of word/POS. I won't bother doing it that way, since it's even easier to trick Tree.fromstring() into reading all your trees as if they're branches of a single tree:

allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
"""
wrapped = "(ROOT "+ allmytext + " )"  # Add a "root" node at the top
trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])
for tree in trees:
    print(tree.leaves())

As you see, the only difference is we added "(ROOT " and " )" around the file contents, and used a for-loop to generate the output. The loop gives us the children of the top node, i.e. the actual trees.

alvas · Answer

>>> import re
>>> clause = "(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))"
>>> pattern = r'\w+:?(?=/)'
>>> re.findall(pattern, clause)
['Jack', 'loved', 'Peter']

EDITED

For multiple clauses:

>>> import re
>>> pattern = r'\w+:?(?=/)'
>>> clauses = """(CLAUSE (NP school/NN) (VP is/VBZ situated/VBN) (NP in/IN London/NNP)) (CLAUSE (NP The/DT color/NN of/IN the/DT sky/NN) (VP is/VBZ) (NP pink/NN))"""
>>> [re.findall(pattern, clause) for clause in clauses.split(' (CLAUSE ')]
[['school', 'is', 'situated', 'in', 'London'], ['The', 'color', 'of', 'the', 'sky', 'is', 'pink']]

If clauses are separated by newline:

>>> import re
>>> pattern = r'\w+:?(?=/)'
>>> clauses = """(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
... (CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
... (CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))"""
>>> [re.findall(pattern, clause) for clause in clauses.split('
')]
[['Jack', 'loved', 'Peter'], ['Jack', 'stayed', 'in', 'London'], ['Tom', 'is', 'in', 'Kolkata']]

To join the output into a string:

>>> " ".join(['Jack', 'loved', 'Peter'])
'Jack loved Peter'

>>> clauses = [['Jack', 'loved', 'Peter'], ['Jack', 'stayed', 'in', 'London'], ['Tom', 'is', 'in', 'Kolkata']]
>>> [" ".join(cl) for cl in clauses]
['Jack loved Peter', 'Jack stayed in London', 'Tom is in Kolkata']

How can I remove POS tags before slashes in nltk?

Tags:

python

nlp

nltk

pos-tagger

What I tried

My Input

Output will be

Salah

2 Answers

Multiple trees

alexis

EDITED

alvas

Recent Activity

Donate For Us

How can I remove POS tags before slashes in nltk?

Tags:

python

nlp

nltk

pos-tagger

What I tried

My Input

Output will be

Salah

2 Answers

Multiple trees

alexis

EDITED

alvas

Related questions

Recent Activity

Donate For Us