After sending a GET request to Project Gutenberg, I have the play Macbeth in its entirety as a string:
import requests
response = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt')
full_text = response.text
macbeth = full_text[16648:]
I split it into words:
words_raw = macbeth.split()
word_count = len(words_raw)
print("Macbeth contains {} words".format(word_count))
print("Here are some examples:", words_raw[400:460])
I then strip all punctuation and convert each word to lowercase with lower():
import string
punctuation = string.punctuation
words_cleaned = []
for word in words_raw:
    # remove punctuation
    word = word.strip(punctuation)
    # make lowercase
    word = word.lower()
    words_cleaned.append(word)
print("Cleaned word examples:", words_cleaned[400:460])
However, I can't strip all punctuation, because I need the periods following names/shortened names as indicators that a character is about to speak.
A character speaking is indicated by an (often-abbreviated) version of their name followed by a . (period) as the first thing on a line. So, for example, when Macbeth speaks, the line starts with "Macb." You'll need to revise how you handle punctuation, since you can't just strip all punctuation.
Macbeth contains 17737 words
Here are some examples: ['Gashes', 'cry', 'for', 'helpe', 'King.', 'So', 'well', 'thy', 'words', 'become', 'thee,', 'as', 'thy', 'wounds,', 'They', 'smack', 'of', 'Honor', 'both:', 'Goe', 'get', 'him', 'Surgeons.', 'Enter', 'Rosse', 'and', 'Angus.', 'Who', 'comes', 'here?', 'Mal.', 'The', 'worthy', 'Thane', 'of', 'Rosse', 'Lenox.', 'What', 'a', 'haste', 'lookes', 'through', 'his', 'eyes?', 'So', 'should', 'he', 'looke,', 'that', 'seemes', 'to', 'speake', 'things', 'strange', 'Rosse.', 'God', 'saue', 'the', 'King', 'King.']
We know that Malcolm is speaking when his abbreviated name appears followed by a period ('Mal.' in the output above). The same is true for Lenox when he starts to speak ('Lenox.'). Sometimes the character's name is shortened; other times the full name is used, followed immediately by a period.
["duncan", "malcolm", "donalbaine", "macbeth", "banquo", "macduff", "lenox", "rosse", "menteth", "angus", "cathnes", "fleance", "seyward", "seyton", "boy", "lady", "messenger", "wife"]
Attempt at Isolating Non Alphanumerics
print(len(words_raw))
def extra(string):
    return list(c for c in string if not c.isalnum() and not c.isspace())
weird = extra(macbeth)
weird
discard = []
for char in weird:
    if char != '.':
        discard.append(char)
print(len(weird))
print(len(discard))
print(discard)
revised_macbeth = []
for character in words_raw:
    if character not in discard:
        revised_macbeth.append(character)
print(len(revised_macbeth))
# for character in words_raw:
#     if not character.isalnum():
#         print("found: \'{}\'".format(character))
Its output:
17737
4788
3553
['?', ',', ',', '?', '-', "'", ',', "'", ',', '?', ',', '-', ':', ',', ',', ',', ',', ',', ',', ',', '?', ',', ',', ',', "'", ':', ';', ',', ',', ',', ',', ',', ':', '(', ',', ')', "'", ',', ',', "'", ':', "'", ':', '(', ')', ',', ',', "'", '(', ')', "'", ',', "'", ':', "'", ',', ',', "'", "'", ',', "'", ',', "'", ',', ',', ':', ',', "'", ',', ':', ',', ',', ',', "'", ',', "'", ',', ',', ',', ',', ',', "'", ',', '?', ',', ',', ';', ',', ':', ',', '-', "'", ',', ':', ',', ',', ':', ',', ',', ',', ':', '?', '?', ',', "'", ',', '?', ',', ',', ',', ',', ',', ',', ',', ',', "'", ',', ',', '-', ',', ',', "'", ',', ':', ',', ',', ',', ':', ',', ',', ',', ',', ':', ',', ',', ',', '?', ',', '?', ',', ',', '&', ',', ':', ',', ',', ',', '-', "'", ',', "'", "'", ':', ',', ',', ',', ',', "'", ',', ',', ',', "'", "'", '-', ':', '-', ':', ':', "'", ',', ',', ',', ',', ':', ',', '-', ',', ',', ',', ',', ':', ',', ',', ',', ',', ',', ',', ',', ',', ',', ',', "'", "'", "'", '?', ',', "'", ',', ',', "'", "'", "'", ',', "'", '?', ',', '?', ',', ':', ',', ':', '?', ',', ',', ',', ',', ',', '?', "'", "'", ',', '?', ',', ',', ',', ':', ',', ',', ',', ',',
print(macbeth)
The Tragedie of Macbeth
Actus Primus. Scoena Prima.
Thunder and Lightning. Enter three Witches.
1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
2. When the Hurley-burley's done,
When the Battaile's lost, and wonne
3. That will be ere the set of Sunne
1. Where the place?
2. Vpon the Heath
3. There to meet with Macbeth
1. I come, Gray-Malkin
print(revised_macbeth)
['The', 'Tragedie', 'of', 'Macbeth', 'Actus', 'Primus.', 'Scoena', 'Prima.', 'Thunder', 'and', 'Lightning.', 'Enter', 'three', 'Witches.', '1.', 'When', 'shall', 'we', 'three', 'meet', 'againe?', 'In', 'Thunder,', 'Lightning,', 'or', 'in', 'Raine?', '2.', 'When', 'the', "Hurley-burley's", 'done,', 'When', 'the', "Battaile's", 'lost,', 'and', 'wonne', '3.', 'That', 'will', 'be', 'ere', 'the', 'set', 'of', 'Sunne', '1.', 'Where', 'the', 'place?', '2.', 'Vpon', 'the', 'Heath', '3.', 'There', 'to', 'meet', 'with', 'Macbeth', '1.', 'I', 'come,', 'Gray-Malkin', 'All.', 'Padock', 'calls', 'anon:', 'faire', 'is', 'foule,', 'and', 'foule', 'is', 'faire,', 'Houer', 'through', 'the', 'fogge', 'and', 'filthie', 'ayre.', 'Exeunt.', 'Scena', 'Secunda.', 'Alarum', 'within.', 'Enter', 'King,', 'Malcome,', 'Donalbaine,', 'Lenox,', 'with', 'attendants,', 'meeting', 'a', 'bleeding', 'Captaine.', 'King.', 'What', 'bloody', 'man', 'is', 'that?', 'he', 'can', 'report,', 'As', 'seemeth', 'by', 'his', 'plight,', 'of', 'the', 'Reuolt', 'The', 'newest', 'state', 'Mal.', 'This', 'is', 'the', 'Serieant,', 'Who', 'like', 'a', 'good', 'and', 'hardie', 'Souldier', 'fought', "'Gainst", 'my', 'Captiuitie:', 'Haile', 'braue', 'friend;', 'Say', 'to', 'the', 'King,', 'the', 'knowledge', 'of', 'the', 'Broyle,', 'As', 'thou', 'didst', 'leaue', 'it', 'Cap.', 'Doubtfull', 'it', 'stood,', 'As', 'two', 'spent', 'Swimmers,', 'that', 'doe', 'cling', 'together,', 'And', 'choake', 'their', 'Art:', 'The', 'me
You can use collections.defaultdict to group the lines on the name of the speaker. enumerate can be used to get the line number for each occurrence of text uttered by a character:
import requests, re
from collections import defaultdict
r = requests.get('https://www.gutenberg.org/cache/epub/2264/pg2264.txt').text
d, l, keywords = defaultdict(list), None, ['Enter', 'Exit', 'Flourish', 'Thunder']
#iterate over the play lines, ignoring empty strings (generated from the split)
for i, a in filter(lambda x:x[-1], enumerate(re.split('[\n\r]+', r[r.index('Actus Primus. Scoena Prima.')+27:]))):
    #check that the line contains character dialog, not stage prompts
    if not re.findall('|'.join(keywords), a):
        #grab the name of the character and append to "d"
        if (n:=re.findall(r'^\s+[A-Z](?:\.[A-Z])*[a-z]+\.(?=\s\w+)|^[A-Z](?:\.[A-Z])*[a-z\.]+\.(?=\s\w+)', a)):
            d[(l:=re.sub(r'^\s+|\.$', '', n[0]).lower())].append((i, a[len(n[0])+1:].lower()))
        elif l:
            #the line might be a continuation of a larger block of character text
            d[l].append((i, a.lower()))
print(list(d.keys())) #detected characters
print(d['macb'][:10]) #first ten occurrences of Macbeth speaking
Output:
['all', 'king', 'mal', 'cap', 'lenox', 'rosse', 'macb', 'banquo', 'mac', 'banq', 'ang', 'lady', 'mess', 'la', 'fleance', 'porter', 'macd', 'port', 'exeunt', 'ban', 'donal', 'malc', 'don', 'ross', 'seruant', 'murth', 'lords', 'mur', 'len', 'hec', 'lord', 'appar', 'musicke', 'wife', 'son', 'mes', 'doct', 'ro', 'gent', 'lad', 'ment', 'cath', 'ser', 'sey', 'seyw', 'sold', 'syw', 'y.sey']
[(137, 'so foule and faire a day i haue not seene'), (170, 'stay you imperfect speakers, tell me more:'), (171, 'by sinells death, i know i am thane of glamis,'), (172, 'but how, of cawdor? the thane of cawdor liues'), (173, 'a prosperous gentleman: and to be king,'), (174, 'stands not within the prospect of beleefe,'), (175, 'no more then to be cawdor. say from whence'), (176, 'you owe this strange intelligence, or why'), (177, 'vpon this blasted heath you stop our way'), (178, 'with such prophetique greeting?')]
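The grouping idea can also be tried offline on a few lines before running it against the whole play. The following is only a minimal sketch under stated assumptions: the excerpt is typed in by hand, the label pattern is a simplified stand-in for the regex above (one capitalised word followed by a period), and the names SAMPLE, LABEL and group_by_speaker are made up for this illustration.

```python
import re
from collections import defaultdict

# Hand-typed excerpt (an assumption about the layout): a speaker label is
# indented and ends with a period; continuation lines carry no label.
SAMPLE = (
    "  King. What bloody man is that? he can report,\n"
    "As seemeth by his plight, of the Reuolt\n"
    "The newest state\n"
    "\n"
    "  Mal. This is the Serieant,\n"
    "Who like a good and hardie Souldier fought\n"
    "\n"
    "  King. O valiant Cousin, worthy Gentleman\n"
)

# Simplified label pattern (hypothetical): optional indent, one capitalised
# word, a period, whitespace, then the spoken text.
LABEL = re.compile(r"^\s*([A-Z][a-z]+)\.\s+(.*)$")

def group_by_speaker(text):
    """Map each lower-cased speaker label to a list of (line_number, text)."""
    spoken = defaultdict(list)
    current = None
    for number, line in enumerate(text.split("\n")):
        if not line.strip():
            continue  # skip blank lines but keep the original numbering
        match = LABEL.match(line)
        if match:
            # a new speaker starts here
            current = match.group(1).lower()
            spoken[current].append((number, match.group(2).lower()))
        elif current is not None:
            # no label: treat the line as a continuation of the current speaker
            spoken[current].append((number, line.strip().lower()))
    return spoken

groups = group_by_speaker(SAMPLE)
print(sorted(groups))   # ['king', 'mal']
print(groups["mal"])    # [(4, 'this is the serieant,'), (5, 'who like a good and hardie souldier fought')]
```

Unlike the full version, this sketch does not filter stage directions, so on the real text lines such as "Enter Rosse and Angus." would be attached to whoever spoke last.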
Edit: common words per character:
To filter common words per character, iterate over the sentences for each character in d, and then iterate again over the str.split results from each sentence. It is important to note that the results from the previous step will contain many stop words. My solution below gives you the option to filter these:
from collections import Counter
def common_words(character, filter_stop = False, stop_words = None):
    stop_words = set(stop_words or [])
    if filter_stop:
        stop_words = set(filter(None, requests.get("https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords").text.split('\n')))
    #the keys of d are lower-cased, so normalise the requested name the same way
    w = [i for _, b in d[character.lower()] for i in re.sub(r'[:.?]+', '', b).split() if i.lower() not in stop_words]
    return Counter(w).most_common(5)
print(common_words('Macb', filter_stop=True))
Output:
[('haue', 39), ('thou', 34), ('thy', 23), ('shall', 21), ('thee', 20)]
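The counting step can be checked without any network access. This is a rough sketch, not the function above: the three lines are copied from the Macbeth output shown earlier, while STOP_WORDS is a tiny hand-picked set and top_words is a hypothetical helper name, both assumptions made for illustration.

```python
import re
from collections import Counter

# (line_number, text) pairs shaped like the values stored in d.
sample_lines = [
    (137, "so foule and faire a day i haue not seene"),
    (170, "stay you imperfect speakers, tell me more:"),
    (172, "but how, of cawdor? the thane of cawdor liues"),
]

# Tiny illustrative stop-word set (an assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "and", "but", "i", "me", "not", "of", "so", "the", "you"}

def top_words(lines, n=3, stop_words=frozenset()):
    """Count words across the given lines, skipping stop words; return the top n."""
    counts = Counter()
    for _, text in lines:
        # drop punctuation, then split on whitespace
        words = re.sub(r"[^\w\s]", "", text.lower()).split()
        counts.update(w for w in words if w not in stop_words)
    return counts.most_common(n)

print(top_words(sample_lines, n=1, stop_words=STOP_WORDS))  # [('cawdor', 2)]
```

Passing the stop words in as an argument keeps the function usable with whichever list suits early-modern spelling, where words like "haue" or "thou" are not covered by a modern English list.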
Following my comment above
You might have an easier time of it if you split into lines first, and then split into words, because I expect the abbreviated character names will always be at the start of a line? Also, I notice the line is indented a couple spaces when a new character starts speaking. That could be another thing to look for.
Split into lines:
macbeth_lines = macbeth.split('\r\n') # Because in your text lines are separated by \r\n
Then, loop over each line. If it starts with a space, remove everything but periods from the first word, and remove all punctuation from the others. If it doesn't start with a space, remove all punctuation from all words. To remove the characters, we'll use str.translate(), which takes a dict mapping each input character's Unicode ordinal to its replacement. Mapping every punctuation character to None deletes it.
import string
# Create a dictionary for str.translate
strip_chars = {ord(punct): None for punct in string.punctuation}
# And one without the period
strip_chars_no_period = {k: v for k, v in strip_chars.items() if k != 46} # 46 is ord('.')
macbeth_words = []
for line in macbeth_lines:
    line_words = line.split()
    line_proc_words = [] # List to see each line as it's processed
                         # Remove if not needed
    # The "and line_words" check skips lines that contain only spaces
    if line.startswith(" ") and line_words:
        # this line starts with a space. Maybe it contains a name
        # Don't strip periods from the first word
        first_word = line_words[0].translate(strip_chars_no_period)
        line_proc_words.append(first_word) # Debug line
        # Save the word
        macbeth_words.append(first_word)
        # Remaining words yet to be processed in this line
        remaining_words = line_words[1:]
    else:
        # All words in the line are yet to be processed
        remaining_words = line_words
    # Process remaining words
    for other_word in remaining_words:
        # Strip punctuation
        stripped_word = other_word.translate(strip_chars)
        line_proc_words.append(stripped_word) # Debug line
        # Save to list
        macbeth_words.append(stripped_word)
    # Print out the line just to make sure it's correct
    print(' '.join(line_proc_words)) # Debug line
I added a line_proc_words list so that we can print each line as it's processed. The output of the code above (I ran it only for the first 100 lines) looks like so:
The Tragedie of Macbeth
Actus Primus Scoena Prima
Thunder and Lightning Enter three Witches
1. When shall we three meet againe
In Thunder Lightning or in Raine
2. When the Hurleyburleys done
When the Battailes lost and wonne
3. That will be ere the set of Sunne
1. Where the place
2. Vpon the Heath
3. There to meet with Macbeth
1. I come GrayMalkin
All. Padock calls anon faire is foule and foule is faire
Houer through the fogge and filthie ayre
Exeunt
Scena Secunda
Alarum within Enter King Malcome Donalbaine Lenox with
attendants meeting a bleeding Captaine
King. What bloody man is that he can report
As seemeth by his plight of the Reuolt
The newest state
Mal. This is the Serieant
Who like a good and hardie Souldier fought
Gainst my Captiuitie Haile braue friend
Say to the King the knowledge of the Broyle
As thou didst leaue it
Cap. Doubtfull it stood
As two spent Swimmers that doe cling together
And choake their Art The mercilesse Macdonwald
Worthie to be a Rebell for to that
The multiplying Villanies of Nature
Doe swarme vpon him from the Westerne Isles
Of Kernes and Gallowgrosses is supplyd
And Fortune on his damned Quarry smiling
Shewd like a Rebells Whore but alls too weake
For braue Macbeth well hee deserues that Name
Disdayning Fortune with his brandisht Steele
Which smoakd with bloody execution
Like Valours Minion carud out his passage
Till hee facd the Slaue
Which neur shooke hands nor bad farwell to him
Till he vnseamd him from the Naue toth Chops
And fixd his Head vpon our Battlements
King. O valiant Cousin worthy Gentleman
Cap. As whence the Sunne gins his reflection
Shipwracking Stormes and direfull Thunders
So from that Spring whence comfort seemd to come
Discomfort swells Marke King of Scotland marke
No sooner Iustice had with Valour armd
Compelld these skipping Kernes to trust their heeles
But the Norweyan Lord surueying vantage
With furbusht Armes and new supplyes of men
Began a fresh assault
King. Dismayd not this our Captaines Macbeth and
Banquoh
Cap. Yes as Sparrowes Eagles
Or the Hare the Lyon
If I say sooth I must report they were
As Cannons ouerchargd with double Cracks
So they doubly redoubled stroakes vpon the Foe
Except they meant to bathe in reeking Wounds
Or memorize another Golgotha
I cannot tell but I am faint
My Gashes cry for helpe
King. So well thy words become thee as thy wounds
They smack of Honor both Goe get him Surgeons
Enter Rosse and Angus
Who comes here
Mal. The worthy Thane of Rosse
Lenox. What a haste lookes through his eyes
So should he looke that seemes to speake things strange
Rosse. God saue the King
King. Whence camst thou worthy Thane
Rosse. From Fiffe great King
Where the Norweyan Banners flowt the Skie
And fanne our people cold
Norway himselfe with terrible numbers
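The per-line rule can also be packaged as a small function and checked on two lines from the play. This is a sketch rather than the exact code above: clean_line and the table names STRIP_ALL and STRIP_KEEP_PERIOD are invented for this example, and it assumes, as above, that a line starting with a space may begin with a speaker label.

```python
import string

# Translation tables: mapping an ordinal to None deletes that character.
STRIP_ALL = {ord(p): None for p in string.punctuation}
STRIP_KEEP_PERIOD = {k: v for k, v in STRIP_ALL.items() if k != ord(".")}

def clean_line(line):
    """Return the cleaned words of one line, keeping periods only in a leading label."""
    words = line.split()
    if not words:
        return []  # blank or whitespace-only line
    if line.startswith(" "):
        # indented line: the first word may be a speaker label such as "Mal."
        first = words[0].translate(STRIP_KEEP_PERIOD)
        return [first] + [w.translate(STRIP_ALL) for w in words[1:]]
    # unindented line: strip punctuation from every word
    return [w.translate(STRIP_ALL) for w in words]

print(clean_line("  Mal. This is the Serieant,"))
# ['Mal.', 'This', 'is', 'the', 'Serieant']
print(clean_line("'Gainst my Captiuitie: Haile braue friend;"))
# ['Gainst', 'my', 'Captiuitie', 'Haile', 'braue', 'friend']
```

Returning a list per line, instead of appending to a global list, makes it straightforward to test individual lines before processing the whole text.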