
Splitting Macbeth When a Character Speaks

After sending a GET request to Project Gutenberg, I have the play Macbeth in its entirety as a string:

import requests

response = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt')
full_text = response.text
macbeth = full_text[16648:]

I split it into words:

words_raw = macbeth.split()
word_count = len(words_raw)

print("Macbeth contains {} words".format(word_count))
print("Here are some examples:", words_raw[400:460])

I then strip all punctuation and convert the strings to lowercase:

import string
punctuation = string.punctuation

words_cleaned = []

for word in words_raw:
    # remove punctuation
    word = word.strip(punctuation)
    # make lowercase
    word = word.lower()
    words_cleaned.append(word)

print("Cleaned word examples:", words_cleaned[400:460])

However, I can't strip all punctuation, because I need the periods following names/shortened names as indicators that a character is about to speak.

Excerpt from Lesson

A character speaking is indicated by an (often-abbreviated) version of their name followed by a . (period) as the first thing on a line. So for example, when Macbeth speaks it starts with "Macb." You'll need to revise how you handle punctuation, since you can't just strip all punctuation
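As a small illustration of that rule, a line can be tested for a leading speaker label with a regular expression. This is only a sketch: speaker_label and SPEAKER_RE are hypothetical names, and the pattern assumes a label is a single capitalised word (or a single digit, as for the witches) followed by a period and then more text on the same line. Labels with an internal period would need a wider pattern.

```python
import re

# Optional indentation, then a digit or a capitalised word, then a period,
# whitespace, and at least one more character on the line (assumed rule).
SPEAKER_RE = re.compile(r"^\s*(\d|[A-Z][A-Za-z]*)\.\s+\S")

def speaker_label(line):
    """Return the speaker label at the start of `line`, or None."""
    match = SPEAKER_RE.match(line)
    return match.group(1) if match else None

print(speaker_label("  1. When shall we three meet againe?"))
print(speaker_label("In Thunder, Lightning, or in Raine?"))
print(speaker_label("Actus Primus. Scoena Prima."))
```

A line consisting only of a stage direction such as "Exeunt." is not matched, because nothing follows the period.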

Slice of raw data after split()

Speaker labels (a name followed by a period, such as 'King.', 'Mal.', 'Lenox.' and 'Rosse.') appear in this slice:

Macbeth contains 17737 words
Here are some examples: ['Gashes', 'cry', 'for', 'helpe', 'King.', 'So', 'well', 'thy', 'words', 'become', 'thee,', 'as', 'thy', 'wounds,', 'They', 'smack', 'of', 'Honor', 'both:', 'Goe', 'get', 'him', 'Surgeons.', 'Enter', 'Rosse', 'and', 'Angus.', 'Who', 'comes', 'here?', 'Mal.', 'The', 'worthy', 'Thane', 'of', 'Rosse', 'Lenox.', 'What', 'a', 'haste', 'lookes', 'through', 'his', 'eyes?', 'So', 'should', 'he', 'looke,', 'that', 'seemes', 'to', 'speake', 'things', 'strange', 'Rosse.', 'God', 'saue', 'the', 'King', 'King.']


We know that Malcolm is speaking when his name appears followed by a period ('Mal.' in the slice above). The same is true for Lenox when he starts to speak ('Lenox.'). Some speaker labels are shortened; others use the full name followed immediately by a period.

List of most common names in Macbeth

["duncan", "malcolm", "donalbaine", "macbeth", "banquo", "macduff", "lenox", "rosse", "menteth", "angus", "cathnes", "fleance", "seyward", "seyton", "boy", "lady", "messenger", "wife"]

Goals

  • From the list above, identify each character's full name and any shortened forms used as speaker labels
  • Find where a character starts to speak, indicated by a period, and split there

Here's what I've tried so far

Attempt at Isolating Non Alphanumerics

print(len(words_raw))
def extra(text):
    return list(c for c in text if not c.isalnum() and not c.isspace())
weird = extra(macbeth)
weird

discard = []
for char in weird:
    if char != '.':
        discard.append(char)
print(len(weird))
print(len(discard))
print(discard)

revised_macbeth = []

for character in words_raw:
    if not character in discard:
        revised_macbeth.append(character)
print(len(revised_macbeth))

# for character in words_raw:
#     if not character.isalnum():
#         print("found: \'{}\'".format(character))

Its output:

17737
4788
3553
['?', ',', ',', '?', '-', "'", ',', "'", ',', '?', ',', '-', ':', ',', ',', ',', ',', ',', ',', ',', '?', ',', ',', ',', "'", ':', ';', ',', ',', ',', ',', ',', ':', '(', ',', ')', "'", ',', ',', "'", ':', "'", ':', '(', ')', ',', ',', "'", '(', ')', "'", ',', "'", ':', "'", ',', ',', "'", "'", ',', "'", ',', "'", ',', ',', ':', ',', "'", ',', ':', ',', ',', ',', "'", ',', "'", ',', ',', ',', ',', ',', "'", ',', '?', ',', ',', ';', ',', ':', ',', '-', "'", ',', ':', ',', ',', ':', ',', ',', ',', ':', '?', '?', ',', "'", ',', '?', ',', ',', ',', ',', ',', ',', ',', ',', "'", ',', ',', '-', ',', ',', "'", ',', ':', ',', ',', ',', ':', ',', ',', ',', ',', ':', ',', ',', ',', '?', ',', '?', ',', ',', '&', ',', ':', ',', ',', ',', '-', "'", ',', "'", "'", ':', ',', ',', ',', ',', "'", ',', ',', ',', "'", "'", '-', ':', '-', ':', ':', "'", ',', ',', ',', ',', ':', ',', '-', ',', ',', ',', ',', ':', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', "'", "'", "'", '?', ',', "'", ',', ',', "'", "'", "'", ',', "'", '?', ',', '?', ',', ':', ',', ':', '?', ',', ',', ',', ',', ',', '?', "'", "'", ',', '?', ',', ',', ',', ':', ',', ',', ',', ',', 

Comparing the raw text with the result:

print(macbeth)
The Tragedie of Macbeth

Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
  2. When the Hurley-burley's done,
When the Battaile's lost, and wonne

   3. That will be ere the set of Sunne

   1. Where the place?
  2. Vpon the Heath

   3. There to meet with Macbeth

   1. I come, Gray-Malkin
print(revised_macbeth)
['The', 'Tragedie', 'of', 'Macbeth', 'Actus', 'Primus.', 'Scoena', 'Prima.', 'Thunder', 'and', 'Lightning.', 'Enter', 'three', 'Witches.', '1.', 'When', 'shall', 'we', 'three', 'meet', 'againe?', 'In', 'Thunder,', 'Lightning,', 'or', 'in', 'Raine?', '2.', 'When', 'the', "Hurley-burley's", 'done,', 'When', 'the', "Battaile's", 'lost,', 'and', 'wonne', '3.', 'That', 'will', 'be', 'ere', 'the', 'set', 'of', 'Sunne', '1.', 'Where', 'the', 'place?', '2.', 'Vpon', 'the', 'Heath', '3.', 'There', 'to', 'meet', 'with', 'Macbeth', '1.', 'I', 'come,', 'Gray-Malkin', 'All.', 'Padock', 'calls', 'anon:', 'faire', 'is', 'foule,', 'and', 'foule', 'is', 'faire,', 'Houer', 'through', 'the', 'fogge', 'and', 'filthie', 'ayre.', 'Exeunt.', 'Scena', 'Secunda.', 'Alarum', 'within.', 'Enter', 'King,', 'Malcome,', 'Donalbaine,', 'Lenox,', 'with', 'attendants,', 'meeting', 'a', 'bleeding', 'Captaine.', 'King.', 'What', 'bloody', 'man', 'is', 'that?', 'he', 'can', 'report,', 'As', 'seemeth', 'by', 'his', 'plight,', 'of', 'the', 'Reuolt', 'The', 'newest', 'state', 'Mal.', 'This', 'is', 'the', 'Serieant,', 'Who', 'like', 'a', 'good', 'and', 'hardie', 'Souldier', 'fought', "'Gainst", 'my', 'Captiuitie:', 'Haile', 'braue', 'friend;', 'Say', 'to', 'the', 'King,', 'the', 'knowledge', 'of', 'the', 'Broyle,', 'As', 'thou', 'didst', 'leaue', 'it', 'Cap.', 'Doubtfull', 'it', 'stood,', 'As', 'two', 'spent', 'Swimmers,', 'that', 'doe', 'cling', 'together,', 'And', 'choake', 'their', 'Art:', 'The', 'me
nate. walter asked Feb 20 '26 15:02


2 Answers

You can use collections.defaultdict to group the lines by the name of the speaker, and enumerate to record the line number of each line a character speaks:

import requests, re
from collections import defaultdict
r = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt').text
d, l, keywords = defaultdict(list), None, ['Enter', 'Exit', 'Flourish', 'Thunder']
#iterate over the play lines, ignoring empty strings (generated from the split)
for i, a in filter(lambda x:x[-1], enumerate(re.split('[\n\r]+', r[r.index('Actus Primus. Scoena Prima.')+27:]))):
   #check that the line contains character dialog, not stage prompts
   if not re.findall('|'.join(keywords), a):
      #grab the name of the character and append to "d"
      #grab the name of the character and append to "d" (the := operator needs Python 3.8+)
      if (n:=re.findall(r'^\s+[A-Z](?:\.[A-Z])*[a-z]+\.(?=\s\w+)|^[A-Z](?:\.[A-Z])*[a-z\.]+\.(?=\s\w+)', a)):
         d[(l:=re.sub(r'^\s+|\.$', '', n[0]).lower())].append((i, a[len(n[0])+1:].lower()))
      elif l:
         #the line might be a continuation of a larger block of character text
         d[l].append((i, a.lower()))

print(list(d.keys())) #detected characters
print(d['macb'][:10]) #first ten occurrences of Macbeth speaking

Output:

['all', 'king', 'mal', 'cap', 'lenox', 'rosse', 'macb', 'banquo', 'mac', 'banq', 'ang', 'lady', 'mess', 'la', 'fleance', 'porter', 'macd', 'port', 'exeunt', 'ban', 'donal', 'malc', 'don', 'ross', 'seruant', 'murth', 'lords', 'mur', 'len', 'hec', 'lord', 'appar', 'musicke', 'wife', 'son', 'mes', 'doct', 'ro', 'gent', 'lad', 'ment', 'cath', 'ser', 'sey', 'seyw', 'sold', 'syw', 'y.sey']
[(137, 'so foule and faire a day i haue not seene'), (170, 'stay you imperfect speakers, tell me more:'), (171, 'by sinells death, i know i am thane of glamis,'), (172, 'but how, of cawdor? the thane of cawdor liues'), (173, 'a prosperous gentleman: and to be king,'), (174, 'stands not within the prospect of beleefe,'), (175, 'no more then to be cawdor. say from whence'), (176, 'you owe this strange intelligence, or why'), (177, 'vpon this blasted heath you stop our way'), (178, 'with such prophetique greeting?')]
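Note that the same character shows up under several keys ('macb' and 'mac', 'banquo', 'banq' and 'ban', and so on). One possible way to fold these into full names is a prefix match against the list of names from the question. This is a sketch under the assumption that every label is a prefix of the full name; resolve_speaker is a hypothetical helper, and labels that match no name or more than one name are left unresolved.

```python
def resolve_speaker(label, names):
    """Return the single full name that starts with `label`, else None."""
    label = label.lower().rstrip(".")
    matches = [name for name in names if name.startswith(label)]
    # Ambiguous labels (several matches) and unknown labels stay unresolved.
    return matches[0] if len(matches) == 1 else None

# List of names taken from the question.
names = ["duncan", "malcolm", "donalbaine", "macbeth", "banquo", "macduff",
         "lenox", "rosse", "menteth", "angus", "cathnes", "fleance",
         "seyward", "seyton", "boy", "lady", "messenger", "wife"]

for label in ["macb", "mal", "banq", "mac", "sey", "king"]:
    print(label, "->", resolve_speaker(label, names))
```

Here 'mac' and 'sey' stay unresolved because they fit two names each, and 'king' matches nothing in the list, so those cases still need a manual mapping.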

Edit: common words per character:

To filter common words per character, iterate over the sentences for each character in d, and then iterate again over the str.split results from each sentence. It is important to note that the results from the previous step will contain many stop words. My solution below gives you the option to filter these:

from collections import Counter
def common_words(character, filter_stop = False, stop_words = ()):
   if filter_stop:
      stop_words = set(filter(None, requests.get("https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords").text.split('\n')))
   #keys in "d" are lowercased speaker labels, so normalize the lookup
   w = [i for _, b in d.get(character.lower(), []) for i in re.sub(r'[:.?]+', '', b).split() if i.lower() not in stop_words]
   return Counter(w).most_common(5)

print(common_words('Macb', filter_stop=True))

Output:

[('haue', 39), ('thou', 34), ('thy', 23), ('shall', 21), ('thee', 20)]
Ajax1234 answered Feb 23 '26 05:02


Following my comment above

You might have an easier time of it if you split into lines first, and then split into words, because I expect the abbreviated character names will always be at the start of a line. Also, I notice the line is indented a couple of spaces when a new character starts speaking. That could be another thing to look for.

Split into lines:

macbeth_lines = macbeth.split('\r\n') # Because in your text lines are separated by \r\n

Then, loop over each line. If it starts with a space, remove all punctuation except periods from the first word, and remove all punctuation from the others. If it doesn't start with a space, remove all punctuation from all words. To delete the characters, we'll use str.translate(), which takes a dict mapping each input character's code point (its ord() value) to its replacement. Mapping a code point to None deletes that character, so we can build a dict that maps every punctuation character to None.

import string

# Create a dictionary for str.translate
strip_chars = {ord(punct): None for punct in string.punctuation}

# And one without the period
strip_chars_no_period = {k: v for k, v in strip_chars.items() if k != ord('.')}

macbeth_words = []
for line in macbeth_lines:
    line_words = line.split()
    line_proc_words = [] # List to see each line as it's processed
                         # Remove if not needed

    if line.startswith(" ") and line_words:
        # this line starts with a space and is not blank. Maybe it contains a name

        # Don't strip periods from the first word
        first_word = line_words[0].translate(strip_chars_no_period)

        line_proc_words.append(first_word) # Debug line

        # Save the word
        macbeth_words.append(first_word)

        # Remaining words yet to be processed in this line
        remaining_words = line_words[1:]
    else:
        # All words in the line are yet to be processed
        remaining_words = line_words

    # Process remaining words
    for other_word in remaining_words:
        # Strip punctuation
        stripped_word = other_word.translate(strip_chars)

        line_proc_words.append(stripped_word) # Debug line

        # Save to list
        macbeth_words.append(stripped_word)
    
    # Print out the line just to make sure it's correct
    print(' '.join(line_proc_words)) # Debug line

I added a line_proc_words list so that we can print each line as it's processed. The output of the code above (I ran it only for the first 100 lines) looks like so:

The Tragedie of Macbeth

Actus Primus Scoena Prima

Thunder and Lightning Enter three Witches

1. When shall we three meet againe
In Thunder Lightning or in Raine
2. When the Hurleyburleys done
When the Battailes lost and wonne

3. That will be ere the set of Sunne

1. Where the place
2. Vpon the Heath

3. There to meet with Macbeth

1. I come GrayMalkin

All. Padock calls anon faire is foule and foule is faire
Houer through the fogge and filthie ayre

Exeunt


Scena Secunda

Alarum within Enter King Malcome Donalbaine Lenox with
attendants meeting a bleeding Captaine

King. What bloody man is that he can report
As seemeth by his plight of the Reuolt
The newest state

Mal. This is the Serieant
Who like a good and hardie Souldier fought
Gainst my Captiuitie Haile braue friend
Say to the King the knowledge of the Broyle
As thou didst leaue it

Cap. Doubtfull it stood
As two spent Swimmers that doe cling together
And choake their Art The mercilesse Macdonwald
Worthie to be a Rebell for to that
The multiplying Villanies of Nature
Doe swarme vpon him from the Westerne Isles
Of Kernes and Gallowgrosses is supplyd
And Fortune on his damned Quarry smiling
Shewd like a Rebells Whore but alls too weake
For braue Macbeth well hee deserues that Name
Disdayning Fortune with his brandisht Steele
Which smoakd with bloody execution
Like Valours Minion carud out his passage
Till hee facd the Slaue
Which neur shooke hands nor bad farwell to him
Till he vnseamd him from the Naue toth Chops
And fixd his Head vpon our Battlements

King. O valiant Cousin worthy Gentleman

Cap. As whence the Sunne gins his reflection
Shipwracking Stormes and direfull Thunders
So from that Spring whence comfort seemd to come
Discomfort swells Marke King of Scotland marke
No sooner Iustice had with Valour armd
Compelld these skipping Kernes to trust their heeles
But the Norweyan Lord surueying vantage
With furbusht Armes and new supplyes of men
Began a fresh assault

King. Dismayd not this our Captaines Macbeth and
Banquoh
Cap. Yes as Sparrowes Eagles
Or the Hare the Lyon
If I say sooth I must report they were
As Cannons ouerchargd with double Cracks
So they doubly redoubled stroakes vpon the Foe
Except they meant to bathe in reeking Wounds
Or memorize another Golgotha
I cannot tell but I am faint
My Gashes cry for helpe

King. So well thy words become thee as thy wounds
They smack of Honor both Goe get him Surgeons
Enter Rosse and Angus

Who comes here
Mal. The worthy Thane of Rosse

Lenox. What a haste lookes through his eyes
So should he looke that seemes to speake things strange

Rosse. God saue the King

King. Whence camst thou worthy Thane
Rosse. From Fiffe great King
Where the Norweyan Banners flowt the Skie
And fanne our people cold
Norway himselfe with terrible numbers
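
Once macbeth_words is built, one way to split it where a character starts speaking might look like the sketch below. group_by_speaker is a hypothetical helper, and it assumes that the only words still ending in a period are speaker labels, which holds only if periods were stripped from every other word as in the loop above.

```python
from collections import defaultdict

def group_by_speaker(words):
    """Split a flat word list into per-speaker word lists.

    Assumes any word ending in "." is a speaker label.
    """
    speeches = defaultdict(list)
    current = None  # no speaker until the first label is seen
    for word in words:
        if word.endswith("."):
            current = word.rstrip(".").lower()
        elif current is not None:
            speeches[current].append(word.lower())
    return dict(speeches)

# Small sample in the same shape as macbeth_words.
sample = ["King.", "What", "bloody", "man", "is", "that",
          "Mal.", "This", "is", "the", "Serieant",
          "King.", "O", "valiant", "Cousin"]
speeches = group_by_speaker(sample)
print(speeches["king"])
print(speeches["mal"])
```

Stage directions that follow a speech (for example "Enter Rosse and Angus") would be attributed to the previous speaker with this approach, so they would need to be filtered out separately.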
Pranav Hosangadi answered Feb 23 '26 05:02