Extract words from text file

Question

I am working with recursive neural networks and need to process my input text file (containing trees) to extract words. The input file looks like :

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

(4 (4 (4 (2 The) (4 (3 gorgeously) (3 (2 elaborate) (2 continuation)))) (2 (2 (2 of) (2 ``)) (2 (2 The) (2 (2 (2 Lord) (2 (2 of) (2 (2 the) (2 Rings)))) (2 (2 '') (2 trilogy)))))) (2 (3 (2 (2 is) (2 (2 so) (2 huge))) (2 (2 that) (3 (2 (2 (2 a) (2 column)) (2 (2 of) (2 words))) (2 (2 (2 (2 can) (1 not)) (3 adequately)) (2 (2 describe) (2 (3 (2 (2 co-writer/director) (2 (2 Peter) (3 (2 Jackson) (2 's)))) (3 (2 expanded) (2 vision))) (2 (2 of) (2 (2 (2 J.R.R.) (2 (2 Tolkien) (2 's))) (2 Middle-earth))))))))) (2 .)))

As an output I want the list of words in new text file as :

The

Rock

is

destined

...

(Ignore the spaces in between lines.)

I tried doing it in python but could not arrive at a solution. Also, I read that awk can be used for text processing but was unable to produce any working code. Any help is appreciated.

Engineero · Accepted Answer

You can use regex!

import re
my_string = # your string from above
pattern = r"\(\d\s+('?\w+)"
results = re.findall(pattern, my_string)
print(results)
# ['The',
#  'Rock',
#  'is',
#  'destined',
#  'to',
#  'be',
#  'the',
# ...

Note that re.findall will return a list of matches, so if you want to print them all out in a single sentence, you can use:

' '.join(results)

or whatever other character you want to separate words with instead of a blank space.

Breaking the regular expression pattern down we have:

pattern = r"""
           \(           # match opening parenthesis
             \d         # match a number. If the numbers can be >9, use \d+
               \s+      # match one or more white space characters
                  (     # begin capturing group (only return stuff inside these parentheses)
                   '?   # match zero or one apostrophes (so we don't miss posessives)
                   \w+  # match one or more text characters
                  )     # end capture group
           """

Ajax1234 · Answer

You can use re.findall:

import re
with open('tree_file.txt') as f, open('word_list.txt', 'a') as f1:
   f1.write('
'.join(set(re.findall("[a-zA-Z\-\.'/]+", f.read()))))

When running the code above on the text, the output is:

make
not
gorgeously
the
Conan
than
so
huge
and
co-writer/director
Peter
st
is
can
Schwarzenegger
expanded
even
trilogy
Middle-earth
Segal
continuation
column
vision
's
he
''
Damme
adequately
that
greater
Steven
Rock
Jackson
Rings
a
Tolkien
Van
be
words
going
to
new
Jean-Claud
or
elaborate
of
splash
Lord
The
Arnold
describe
destined
J.R.R.
Century

Extract words from text file

Tags:

python

awk

bkshi

2 Answers

Engineero

Ajax1234

Recent Activity

Donate For Us

Extract words from text file

Tags:

python

awk

bkshi

2 Answers

Engineero

Ajax1234

Related questions

Recent Activity

Donate For Us