
Python library to generate regular expressions

Tags: python, regex

Is there a library that can take a text (such as an HTML document) and a list of strings (such as the names of some products), find the pattern those strings share within the text, and generate a regular expression that extracts every string in the text matching that pattern?

For example, given the following HTML:

<table>
  <tr>
    <td>Product 1</td>
    <td>Product 2</td>
    <td>Product 3</td>
    <td>Product 4</td>
    <td>Product 5</td>
    <td>Product 6</td>
    <td>Product 7</td>
    <td>Product 8</td>
  </tr>
</table>

and the following list of strings:

['Product 1', 'Product 2', 'Product 3']

I'd like a function that would build a regex like the following:

'<td>(.*?)</td>'

and then extract everything in the HTML that matches the regex. In this case, the output would be:

['Product 1', 'Product 2', 'Product 3', 'Product 4', 'Product 5', 'Product 6', 'Product 7', 'Product 8']

CLARIFICATION:

I'd like the function to look at the text surrounding the samples, not at the samples themselves. So, for example, if the HTML was:

<tr>
  <td>Word</td>
  <td>More words</td>
  <td>101</td>
  <td>-1-0-1-</td>
</tr>

and the samples were ['Word', 'More words'], I'd like it to extract:

['Word', 'More words', '101', '-1-0-1-']
Ionut Hulub asked Jul 19 '13 15:07




1 Answer

Your requirement is both very specific and very general.

I doubt you will find an existing library for this purpose; you would probably have to write your own.
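If you do write your own, a minimal sketch could look like the following. It only looks at the text around each sample, as the clarification in the question asks. The function name `infer_pattern`, the `max_context` parameter, and the "keep only the nearest tag on each side" heuristic are assumptions made for illustration, not part of any existing library, and the approach will break on markup that is less regular than the example.

```python
import re
from os.path import commonprefix


def infer_pattern(text, samples, max_context=40):
    """Guess a regex from the text surrounding each sample (hypothetical helper)."""
    befores, afters = [], []
    for sample in samples:
        start = text.find(sample)  # only the first occurrence is considered
        if start == -1:
            raise ValueError(f"sample not found: {sample!r}")
        end = start + len(sample)
        befores.append(text[max(0, start - max_context):start])
        afters.append(text[end:end + max_context])

    # Longest shared suffix of the left contexts: reverse, take the common
    # prefix, reverse back. The right contexts need a plain common prefix.
    left = commonprefix([b[::-1] for b in befores])[::-1]
    right = commonprefix(afters)

    # Heuristic (an assumption of this sketch): keep only the nearest tag on
    # each side so the pattern does not swallow the neighbouring cell.
    lt, gt = left.rfind("<"), right.find(">")
    if lt == -1 or gt == -1:
        raise ValueError("no shared tag context around the samples")
    return re.escape(left[lt:]) + "(.*?)" + re.escape(right[:gt + 1])


html = """<tr>
  <td>Word</td>
  <td>More words</td>
  <td>101</td>
  <td>-1-0-1-</td>
</tr>"""

pattern = infer_pattern(html, ["Word", "More words"])
print(re.findall(pattern, html))
# ['Word', 'More words', '101', '-1-0-1-']
```

For this input the generated pattern is equivalent to '<td>(.*?)</td>'. Samples that occur more than once, or that sit inside differently structured markup, would need a smarter context search than `str.find`.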

On the other hand, if you simply spend too much time writing regexes by hand, a GUI tool can help you build them, for example: http://www.regular-expressions.info/regexmagic.html

However, if you only need to extract data from HTML documents, you should consider using an HTML parser instead; it should make things a lot easier.

I recommend BeautifulSoup for parsing HTML documents in Python: https://pypi.python.org/pypi/beautifulsoup4/4.2.1
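To illustrate the parser approach without installing anything, here is a rough sketch that uses the standard library's `html.parser` module rather than BeautifulSoup. The class name `CellCollector` is made up for this example. With BeautifulSoup the same idea would be roughly `[td.get_text() for td in soup.find_all("td")]`.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text content of every <td> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")  # start a new cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Text outside <td> (such as indentation between tags) is ignored.
        if self._in_td:
            self.cells[-1] += data


html = """<tr>
  <td>Word</td>
  <td>More words</td>
  <td>101</td>
  <td>-1-0-1-</td>
</tr>"""

collector = CellCollector()
collector.feed(html)
collector.close()
print(collector.cells)
# ['Word', 'More words', '101', '-1-0-1-']
```

Unlike a generated regex, this does not depend on the exact characters around each value, so attributes or extra whitespace inside the tags do not change the result.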

Benjamin Toueg answered Oct 09 '22 01:10