
Python regex to extract a portion of string

I want to extract a portion of a large string. There is a target word and an upper bound on the number of words before and after it. The extracted substring must contain the target word along with up to that many words before and after it. The before and after parts can contain fewer words if the target word is close to the beginning or end of the text.

Example string

"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

Target word: laboris

words_before: 5

words_after: 2

Should return ['veniam, quis nostrud exercitation ullamco laboris nisi ut']

I thought of a couple of possible patterns, but none of them worked. I guess it could also be done by simply walking backward and forward through the string from the target word. However, a regex would definitely make things easier. Any help would be appreciated.

user2963623 asked Oct 04 '15 01:10





1 Answer

If you are happy to work word by word, you can split the text into a list of words with split() and then take a slice() around the index of the target word. For example:

>>> text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.".split()

>>> n = text.index('laboris')
>>> s = slice(max(n - 5, 0), n + 3)  # 5 words before, 2 after; max() stops a negative start from wrapping around

>>> text[s]
['veniam,', 'quis', 'nostrud', 'exercitation', 'ullamco', 'laboris', 'nisi', 'ut']
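
Since the question asks for a regex, here is a minimal sketch of one way to write it. It assumes a "word" is any run of non-whitespace characters and that the target is a plain word (so `\b` works at both ends). The name `extract_context` and its parameters are made up for illustration and are not part of any library.

```python
import re

def extract_context(text, target, words_before, words_after):
    """Find target with up to words_before / words_after neighbouring words."""
    pattern = (
        rf"(?:\S+\s+){{0,{words_before}}}"  # up to N whole words before the target
        rf"\S*\b{re.escape(target)}\b\S*"   # the target plus any attached punctuation
        rf"(?:\s+\S+){{0,{words_after}}}"   # up to M whole words after the target
    )
    return re.findall(pattern, text)

sample = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod "
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, "
    "quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo "
    "consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse "
    "cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat "
    "non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
)

print(extract_context(sample, "laboris", 5, 2))
# ['veniam, quis nostrud exercitation ullamco laboris nisi ut']

# Near the start of the text there are fewer words available before the target.
print(extract_context(sample, "ipsum", 5, 2))
# ['Lorem ipsum dolor sit']
```

The `{0,N}` quantifiers are what let the match shrink when the target sits near either end of the text. The result is a substring of the input with its original spacing and punctuation, not a list of words, and `re.findall` only reports non-overlapping matches.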
Remi Crystal answered Sep 18 '22 14:09
