Trying to figure out a Regular Expression gives me a brain cramp :)
I'm replacing thousands of individual href links with an individual shortcode in WordPress post content, using a plugin that lets me run regular expressions on content.
Rather than try to combine an SQL query with a RegEx, I'm doing it in two stages: first, SQL to find and replace each individual URL with its shortcode; second, a regex to remove the rest of the 'href' link markup.
These are some examples of what I have now after the first step; as you can see, the URL has been replaced with the [nggallery id=xxx] shortcode.
<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title"
src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"
alt="" width="685" height="456" /></span></a>
<a href="[nggallery id=xxxxx]">Click here!</a>
<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>
Now I need to delete all the href link markup (span, img, etc.) between the leading <a and the ending </a>, leaving just the shortcode [nggallery id=xxx].
I've got a start here: https://www.regex101.com/r/rL8wP1/2
But I don't know how to prevent the [nggallery id=xxx] shortcode from being captured in the RegEx.
Update 7/09/2015
@nhahtdh's answer appears to work perfectly, is not too greedy, and doesn't eat adjacent html links. Use ( and ) as delimiters and $1 as the replacement with a regex plugin in WordPress. (If using BBEdit, you will need to use \1 instead.)
( <a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a> )
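For anyone who wants to check the pattern outside the plugin, here is a minimal sketch in Python (not the WordPress plugin or BBEdit). The sample ids and the function name strip_link_markup are made up for illustration, and the DOTALL flag is an assumption so that .*? can cross the line breaks inside the first sample link. Python writes the backreference as \1, like BBEdit.

```python
import re

# Pattern from the update above, without the ( ) delimiters.
# re.DOTALL is an assumption: it lets ".*?" span the newline in sample 1.
LINK_RE = re.compile(r'<a\s[^>]*"(\[nggallery[^\]]*\])".*?</a>', re.DOTALL)


def strip_link_markup(html: str) -> str:
    """Replace each matching <a ...>...</a> with the shortcode it wraps."""
    return LINK_RE.sub(r"\1", html)


# Hypothetical inputs modelled on the three examples in the question.
samples = [
    '<a href="[nggallery id=12]"><span class="shutterset">\n'
    '<img class="alignnone" src="http://example.com/a.jpg" alt="" /></span></a>',
    '<a href="[nggallery id=345]">Click here!</a>',
    '<a title="title title" href="[nggallery id=6]" target="_blank">Title Link</a>',
]

for sample in samples:
    print(strip_link_markup(sample))
```

Each of the three samples should reduce to just its shortcode. If the real content keeps every link on one line, the DOTALL flag is probably unnecessary.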
Update 7/02/2015
Thanks to Fab Sa (answer below). His regex at https://www.regex101.com/r/rL8wP1/4
<a.*(\[nggallery[^\]+]*\]).*?<\/a>
works in the regex101 emulator, but when used in the BBEdit text editor or the WordPress plugin that runs regex, it deletes the [nggallery id=***] shortcode. So is it too greedy? Some other issue?
Update 7/01/2015:
I know, I know, re: RegEx match open tags except XHTML self-contained tags YOU CANNOT PARSE HTML WITH REGEX
You can use this regex
<a.*(\[nggallery[^\]+]*\]).*?<\/a>
globally (flag g). This regex will match a link and save the [nggallery ...] part. You can substitute the whole match with $1 to keep the saved [nggallery ...] part.
I've updated your regex online: https://www.regex101.com/r/rL8wP1/4
PS: In this solution, [nggallery ...] doesn't need to be in a specific attribute like href. If you want to force that, you can use <a.*href\="(\[nggallery[^\]+]*\])".*?<\/a>
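A rough way to see the difference between the two variants, sketched in Python rather than in the plugin. The two input strings are hypothetical, and the backslash before = is dropped here because = needs no escaping.

```python
import re

# Loose variant: the shortcode may appear anywhere inside the link.
loose = re.compile(r'<a.*(\[nggallery[^\]+]*\]).*?</a>')
# Strict variant: the shortcode must be the value of the href attribute.
strict = re.compile(r'<a.*href="(\[nggallery[^\]+]*\])".*?</a>')

# Hypothetical inputs: shortcode in the href vs. shortcode in the link text.
in_href = '<a href="[nggallery id=7]">Click here!</a>'
in_text = '<a href="http://example.com/">[nggallery id=7]</a>'

for name, pattern in (("loose", loose), ("strict", strict)):
    print(name, "| in_href ->", pattern.sub(r"\1", in_href))
    print(name, "| in_text ->", pattern.sub(r"\1", in_text))
```

The loose pattern rewrites both inputs to the shortcode, while the strict one leaves the second input untouched because its href holds a normal URL.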
Fab Sa's regex <a.*(\[nggallery[^\]+]*\]).*?<\/a> gobbles up everything when there are multiple <a> tags on a single line, due to the unrestricted .* at the beginning, which will match across different <a> tags.
By restricting the allowable characters, you can somewhat match what you want:
<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>
  ^^^^^^^
I forced at least one whitespace character after <a to make sure that it's not matching some other tag, plus some extra restrictions.
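To make the difference concrete, here is a small Python sketch that runs both patterns over one line containing two links. The input line is invented for illustration; the behaviour in BBEdit or the WordPress plugin may differ in details such as flags and backreference syntax.

```python
import re

# Hypothetical line with two links side by side.
line = ('<a href="[nggallery id=1]">One</a> some text '
        '<a href="[nggallery id=2]">Two</a>')

# Unrestricted ".*" after "<a": runs to the end of the line and backtracks
# only as far as the LAST shortcode, so the match spans both links.
greedy = re.compile(r'<a.*(\[nggallery[^\]+]*\]).*?</a>')

# Restricted version: "[^>]*" cannot leave the opening tag.
restricted = re.compile(r'<a\s[^>]*"(\[nggallery[^\]]*\])".*?</a>')

print(greedy.sub(r"\1", line))      # [nggallery id=2]
print(restricted.sub(r"\1", line))  # [nggallery id=1] some text [nggallery id=2]
```

With the greedy pattern the first shortcode and the text between the links are swallowed; the restricted pattern replaces each link separately.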
Anyway, you are on your own if you discover that it doesn't work in some corner case. It's generally a bad idea to manipulate HTML with regex.