
Trying to find a large string between a start point and end point using regex

I have a big chunk of text that I'm checking for a specific pattern, which looks essentially like this:

     unique_options_search = new Set([
            "updates_EO_LTB",
            "us_history",
            "uslegacy",

etc., etc., etc.

        ]);

      $input.typeahead({
        source: [...unique_options_search],
        autoSelect: false,
        afterSelect: function(value) 

My text variable is named 'html_page' and my start and end points look like this:

start = "new Set(["
end = "]);"

I thought I could find what I want with this one-liner:

r = re.findall("start(.+?)end",html_page,re.MULTILINE)

However, it's not returning anything at all. What is wrong here? I saw other examples online that worked fine.

ASH asked Dec 20 '18 21:12




1 Answer

There are multiple problems here.

  1. As @EthanK mentioned in the comments, "start(.+?)end" is a string literal, so the regex it describes matches the literal text start, then some characters, then the literal text end. The variables start and end are never used. You probably meant to write start + "(.+?)" + end instead.
  2. By default, . does not match newline characters. re.MULTILINE has no effect on that; it only changes the behavior of ^ and $ (see docs). Use re.DOTALL instead (see docs).
  3. The values of start and end contain characters with special meaning in a regex (e.g. ( and [). You have to make sure they are not treated specially. You can either escape them manually with the right number of \ or simply delegate that work to re.escape, which produces a pattern that matches the string literally.
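
Each of these three points can be checked in isolation. The snippet below is only a minimal sketch: the sample string text and the variable names for the results are made up for illustration and are not part of the question.

```python
import re

start = "new Set(["
end = "]);"
# Made-up sample text that spans several lines, like the real page.
text = 'x = new Set([\n    "a",\n    "b",\n]);'

# 1. Names inside a string literal are plain text, not variables:
#    this pattern matches the words "start" and "end" themselves.
literal_hits = re.findall("start(.+?)end", "start middle end")

# 2. re.MULTILINE only affects ^ and $, so "." still stops at newlines.
#    re.DOTALL is the flag that lets "." cross line breaks.
multiline_hits = re.findall(r"\[(.+?)\]", text, re.MULTILINE)
dotall_hits = re.findall(r"\[(.+?)\]", text, re.DOTALL)

# 3. Used unescaped, "new Set([" is not even a valid pattern,
#    because "(" and "[" open a group and a character set.
try:
    re.compile(start)
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# re.escape produces a pattern that matches the string literally.
escaped = re.escape(start)

print(literal_hits, multiline_hits, dotall_hits, unescaped_ok)
```

Here the re.MULTILINE search finds nothing while the re.DOTALL search captures the whole bracketed block, which is the same difference that makes the original one-liner come back empty.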

Combining all that together:

import re
html_page = """
     unique_options_search = new Set([
            "oecd_updates_EO_LTB",
            "us_history",
            "us_legacy",

etc., etc., etc.

        ]);

      $input.typeahead({
        source: [...unique_options_search],
        autoSelect: false,
        afterSelect: function(value) 
"""

start = "new Set(["
end = "]);"
# r = re.findall("start(.+?)end",html_page,re.MULTILINE)  # Old version
r = re.findall(re.escape(start) + "(.+?)" + re.escape(end), html_page, re.DOTALL)  # New version
print(r)
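
If the goal is the individual option names rather than the raw block, the captured text can be split in a second step. This is only a sketch under assumptions: extract_options is a hypothetical helper name, and it assumes every entry is a double-quoted string without escaped quotes, which the question does not confirm.

```python
import re

def extract_options(html_page, start="new Set([", end="]);"):
    """Hypothetical helper: return the quoted entries between start and end."""
    pattern = re.escape(start) + "(.+?)" + re.escape(end)
    match = re.search(pattern, html_page, re.DOTALL)
    if match is None:
        # The start/end markers were not found in the text.
        return []
    # Assumption: each entry is a double-quoted string with no escaped quotes.
    return re.findall(r'"([^"]*)"', match.group(1))

# Made-up sample in the same shape as the page from the question.
sample = '''
unique_options_search = new Set([
    "us_history",
    "us_legacy",
]);
'''
options = extract_options(sample)
print(options)
```

Using re.search instead of re.findall returns only the first block, which seems to fit a page that defines the set once; re.findall would be the choice if several such blocks had to be collected.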

yeputons answered Oct 27 '22 01:10