Extract surrounding words in python from a string position

Let's assume I have a string:

string="""<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p> <p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>"""

and I have the positions of a phrase in this string, for example:

>>> pos = [m.start() for m in re.finditer("tells you", string)]
>>> pos
[263, 588]

I need to extract several words before and several words after each position. How can I implement this using Python and regular expressions?

E.g.:

def look_through(d, s):
    r = []
    content = readFile(d["path"])
    content = BeautifulSoup(content)
    content = content.getText()
    pos = [m.start() for m in re.finditer(s, content)]
    if pos:
        if "phrase" not in d:
            d["phrase"] = [s]
        else:
            d["phrase"].append(s)
        for p in pos:
            r.append({"content": content, "phrase": d["phrase"], "name": d["name"]})
    for b in d["decendent"] or []:
        r += look_through(b, s)
    return r

>>> dict = {
    "content": """<p>It is common for content in Arabic, Hebrew, and other languages that use right-to-left scripts to include numerals or include text from  other scripts. Both of these typically flow  left-to-right within the overall right-to-left  context. </p>""", 
    "name": "directory", 
    "decendent": [
         {
            "content": """<p>This article tells you how to write HTML where text with different writing directions is mixed <em>within a paragraph or other HTML block</em> (ie. <dfn id="term_inline">inline or phrasal</dfn> content). (A companion article <a href="/International/questions/qa-html-dir"><cite>Structural markup and right-to-left text in HTML</cite></a> tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)</p>""", 
            "name": "subdirectory", 
            "decendent": None
        }, 
        {
            "content": """It tells you how to use HTML markup for  elements such as <code class="kw">html</code>, and structural markup such as <code class="kw">p</code> or <code class="kw">div</code> and forms.)""", 
            "name": "subdirectory_two", 
            "decendent": [
                {
                    "content": "Name 4", 
                    "name": "subsubdirectory", 
                    "decendent": None
                }
            ]
        }
    ]
}

So I would like to get:

>>> look_through(dict, "tells you")
[
    { "content": "This article tells you how to", "phrase": "tells you", "name": "subdirectory" },
    { "content": "It tells you how to use", "phrase": "tells you", "name": "subdirectory_two" }
]

Thank you!

Павел Иванов asked Nov 27 '25 16:11


1 Answer

You want a "concordance" of your regexp hits: say, two words before and after the place where your regexp matched. The easiest way to do it is to break your string there and anchor your search to the endpoints of the pieces. For example, to get two words before and after index 263 (your first m.start()), you'd do:

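import re
pos = 263  # your first m.start(); text is the string from the question
m_left = re.search(r"(?:\s+\S+){,2}\s+\S*$", text[:pos])
m_right = re.search(r"^\S*\s+(?:\S+\s+){,2}", text[pos:])
# m_right's offsets are relative to text[pos:], so add pos back
print(text[m_left.start():pos + m_right.end()])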

The first expression should be read from the end of the string backwards: it anchors at the end ($), possibly skips a partial word if the match started mid-word (\S*), skips some spaces (\s+), and then matches up to two ({,2}) space-word sequences, \s+\S+. It's "up to two" rather than exactly two because if we reach the beginning of the string, we want to return a short match.

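The second regexp does the same in the forward direction: it anchors at the start (^) of the rest of the string, skips a partial word and some spaces, and then reads up to two word-space sequences to the right.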

For a concordance you'd probably want to start reading right after the end of the regexp match, not the beginning. In that case, use m.end() as the beginning of the second string.
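
Here is a minimal sketch of that variant, run on a made-up one-line sample rather than the question's HTML. The two patterns are slightly adjusted versions of the idea above (my assumption, not a fixed recipe): the whitespace is moved to the other side of each word, so the very first word of the text can be picked up and the snippet comes back without surrounding spaces.

```python
import re

# Made-up sample text, used only for illustration.
text = "This article tells you how to write HTML"
m = re.search(r"tells you", text)

# Left context: anchor at the end of everything before the hit.
# \S* absorbs a partial word if the hit starts mid-word.
left = re.search(r"(?:\S+\s+){,2}\S*$", text[:m.start()])

# Right context: anchor at the start of everything after the hit.
right = re.search(r"^\S*(?:\s+\S+){,2}", text[m.end():])

# left.start() is already an index into text (the slice starts at 0),
# but right.end() counts from m.end(), so that offset is added back.
snippet = text[left.start():m.end() + right.end()]
print(snippet)  # This article tells you how to
```

Both patterns can match the empty string, so neither search returns None, even when the hit sits at the very start or end of the text.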

It's pretty obvious how to use this with a list of regexp matches, I think.
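
In case it helps, here is one way that loop might look, as a sketch only. It assumes each node's "content" is already plain text (for instance after BeautifulSoup's getText(), as in the question) and that the nodes use the question's "name" and "decendent" keys. The helper name concordance, the parameter n, and the small sample tree are made up for illustration.

```python
import re

def concordance(text, pattern, n=2):
    """Return one snippet per hit: up to n words either side of the match."""
    left_re = re.compile(r"(?:\S+\s+){,%d}\S*$" % n)
    right_re = re.compile(r"^\S*(?:\s+\S+){,%d}" % n)
    snippets = []
    for m in re.finditer(pattern, text):
        left = left_re.search(text[:m.start()])    # words before the hit
        right = right_re.search(text[m.end():])    # words after the hit
        # right.end() is relative to text[m.end():], so shift it by m.end()
        snippets.append(text[left.start():m.end() + right.end()])
    return snippets

def look_through(node, pattern, n=2):
    """Walk the nested dicts and collect a snippet for every hit."""
    results = [
        {"content": s, "phrase": pattern, "name": node["name"]}
        for s in concordance(node["content"], pattern, n)
    ]
    for child in node["decendent"] or []:
        results += look_through(child, pattern, n)
    return results

# Hypothetical, already-stripped sample data in the question's layout.
tree = {
    "content": "It is common for content in Arabic",
    "name": "directory",
    "decendent": [
        {"content": "This article tells you how to write HTML",
         "name": "subdirectory", "decendent": None},
        {"content": "It tells you how to use HTML markup",
         "name": "subdirectory_two", "decendent": []},
    ],
}

result = look_through(tree, "tells you")
for row in result:
    print(row["name"], "->", row["content"])
# subdirectory -> This article tells you how to
# subdirectory_two -> It tells you how to
```

Slicing the text for every hit copies it each time, which is fine for short documents; for very long texts you would probably want to limit each search to a window around the hit.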

alexis answered Nov 29 '25 05:11


