
Extracting specific src attributes from script tags

Tags: python, regex

I want to use a regular expression to extract, from the input content, the names of the JS files that contain jquery as a substring.

This is my code:

Step 1: Extract the JS file names from the content.

>>> data = """    <script type="text/javascript" src="js/jquery-1.9.1.min.js"/>
...     <script type="text/javascript" src="js/jquery-migrate-1.2.1.min.js"/>
...     <script type="text/javascript" src="js/jquery-ui.min.js"/>
...     <script type="text/javascript" src="js/abc_bsub.js"/>
...     <script type="text/javascript" src="js/abc_core.js"/>
...     <script type="text/javascript" src="js/abc_explore.js"/>
...     <script type="text/javascript" src="js/abc_qaa.js"/>"""
>>> import re
>>> re.findall('src="js/([^"]+)"', data)
['jquery-1.9.1.min.js', 'jquery-migrate-1.2.1.min.js', 'jquery-ui.min.js', 'abc_bsub.js', 'abc_core.js', 'abc_explore.js', 'abc_qaa.js']

Step 2: Keep only the JS file names that contain the substring jquery.

>>> [ii for ii in re.findall('src="js/([^"]+)"', data) if "jquery" in ii]
['jquery-1.9.1.min.js', 'jquery-migrate-1.2.1.min.js', 'jquery-ui.min.js']

Can I fold Step 2 into Step 1, i.e. write a single RE pattern that returns this result directly?

Vivek Sable asked Jun 10 '15 14:06



1 Answer

Sure you can. One way would be to use

re.findall('src="js/([^"]*jquery[^"]*)"', data)

This will match everything after "js/ until the nearest " if it contains jquery anywhere. If you know more about the position of jquery (for example, if it's always at the start) you can adjust the regex accordingly.
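As a rough sketch of that adjustment, assuming the file name always begins with jquery, the leading [^"]* can simply be dropped. The shortened sample data and the variable names below are made up for illustration and are not part of the question:

```python
import re

# Illustrative sample in the same shape as the question's input.
data = '''<script type="text/javascript" src="js/jquery-1.9.1.min.js"/>
<script type="text/javascript" src="js/jquery-ui.min.js"/>
<script type="text/javascript" src="js/abc_jquery_helper.js"/>
<script type="text/javascript" src="js/abc_core.js"/>'''

# jquery anywhere in the file name
anywhere = re.findall(r'src="js/([^"]*jquery[^"]*)"', data)

# jquery only at the very start of the file name
at_start = re.findall(r'src="js/(jquery[^"]*)"', data)

print(anywhere)
print(at_start)
```

With this sample, the first pattern also picks up abc_jquery_helper.js, while the second keeps only the names that start with jquery.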

If you want to make sure that jquery is not directly adjacent to other word characters (letters, digits, or underscores), use word boundary anchors:

re.findall(r'src="js/([^"]*\bjquery\b[^"]*)"', data)
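
To see what the word boundaries change, here is a small comparison on made-up file names (the sample data and variable names are hypothetical, chosen only to exercise the edge cases):

```python
import re

# Hypothetical file names chosen to show where \b does and does not match.
data = '''<script src="js/jquery-ui.min.js"/>
<script src="js/myjquery.js"/>
<script src="js/jquery_plugin.js"/>
<script src="js/lib.jquery.js"/>'''

# Plain substring match: every name containing jquery is returned.
plain = re.findall(r'src="js/([^"]*jquery[^"]*)"', data)

# Word-boundary match: jquery must not touch a letter, digit, or underscore.
bounded = re.findall(r'src="js/([^"]*\bjquery\b[^"]*)"', data)

print(plain)
print(bounded)
```

Here myjquery.js is rejected because a letter precedes jquery, and jquery_plugin.js is rejected because the underscore counts as a word character, so no boundary follows jquery.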
Tim Pietzcker answered Sep 29 '22 18:09
