I want to extract a portion of a large string. There's a target word and an upper bound on the number of words before and after that. The extracted substring must therefore contain the target word along with the upper bound words before and after it. The before and after part can contain lesser words if the target word is closer to the beginning or end of the text.
Eample string
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
Target word: laboris
words_before: 5
words_after: 2
Should return ['veniam, quis nostrud exercitation ullamco laboris nisi ut']
I thought of a couple of possible patterns but none of them worked. I guess it can also be done by simply traversing the string front and back from the target word. However a regex would definitely make things easier. Any help would be appreciated.
You can extract a substring in the range start <= x < stop with [start:step] . If start is omitted, the range is from the beginning, and if end is omitted, the range is to the end. You can also use negative values. If start > end , no error is raised and an empty character '' is extracted.
The substr() method extracts a part of a string. The substr() method begins at a specified position, and returns a specified number of characters. The substr() method does not change the original string. To extract characters from the end of the string, use a negative start position.
Regex example to split a string into words In this example, we will split the target string at each white-space character using the \s special sequence. Let's add the + metacharacter at the end of \s . Now, The \s+ regex pattern will split the target string on the occurrence of one or more whitespace characters.
Method #2 : Using regex( findall() ) findall function returns the list after filtering the string and extracting words ignoring punctuation marks.
If you want to split words, you can use slice()
and split()
function. For example:
>>> text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, qu
is nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.".split()
>>> n = text.index('laboris')
>>> s = slice(n - 5, n + 3)
>>> text[s]
['veniam,', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut']
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With