
RegEx to remove all markup between <a and </a> tags except for within [ and ]

Trying to figure out a Regular Expression gives me a brain cramp :)

I'm replacing thousands of individual href links with individual shortcodes in WordPress post content, using a plugin that lets me run regular expressions on the content.

Rather than try to combine an SQL query with a RegEx, I'm doing it in two stages: first, the SQL to find and replace each individual URL with its shortcode; second, a RegEx to remove the rest of the href link markup.

These are some examples of what I have now from the first step; as you can see, the URL has been replaced with the [nggallery id=xxx] shortcode.

<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title" 
src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"
alt="" width="685" height="456" /></span></a>

<a href="[nggallery id=xxxxx]">Click here!</a>

<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>

Now, I need to delete all the href link markup - span, img, etc - in between the leading <a and ending </a>, leaving just the shortcode [nggallery id=xxx].

I've got a start here: https://www.regex101.com/r/rL8wP1/2

But I don't know how to prevent the [nggallery id=xxx] shortcode from being captured in the RegEx.

Update 7/09/2015

@nhahtdh's answer appears to work perfectly, is not too greedy, and doesn't eat adjacent html links. Use ( and ) as delimiters and $1 as a replacement with a regex plugin in WordPress. (If using BBEdit, you will need to use \1)

( <a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a> )
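
For anyone who wants to check the pattern outside the plugin, here is a minimal Python sketch that runs it over the three sample links from the question. It is only an illustration, not the plugin's behavior: the function name strip_link_markup is made up for this example, the ( ) delimiters are dropped because Python does not use them, and the re.DOTALL flag is an assumption added so that .*? can cross the line breaks inside the first sample.

```python
import re

# The pattern from the update above, without the plugin's ( ) delimiters.
# re.DOTALL is an assumption made for this sketch: it lets .*? span the
# newlines inside the multi-line sample link.
PATTERN = re.compile(r'<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>', re.DOTALL)

# The three sample links from the question.
SAMPLES = [
    '<a href="[nggallery id=xx]"><span class="shutterset">\n'
    '<img class="alignnone size-large wp-image-23067" title="Image Title" \n'
    'src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"\n'
    'alt="" width="685" height="456" /></span></a>',
    '<a href="[nggallery id=xxxxx]">Click here!</a>',
    '<a title="title title" href="[nggallery id=xxx]" target="_blank">'
    'Title Link Title Link</a>',
]


def strip_link_markup(html):
    """Replace each matching link with its captured shortcode.

    Hypothetical helper for this example. Python writes the first
    capture group as \\1 in the replacement (like BBEdit), not $1.
    """
    return PATTERN.sub(r'\1', html)


results = [strip_link_markup(sample) for sample in SAMPLES]
for result in results:
    print(result)
```

Each sample should come out as just its shortcode, with the surrounding link, span, and img markup removed.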

Update 7/02/2015

Thanks to Fab Sa (answer below), his regex at https://www.regex101.com/r/rL8wP1/4

<a.*(\[nggallery[^\]+]*\]).*?<\/a>

works in the regex101 emulator, but when used in the BBEdit text editor or the WordPress plugin that runs regex, his regex deletes the [nggallery id=***] shortcode. So is it too greedy? Some other issue?

Update 7/01/2015:

I know, I know, re: "RegEx match open tags except XHTML self-contained tags": YOU CANNOT PARSE HTML WITH REGEX.

asked Jun 30 '15 19:06 by markratledge



2 Answers

You can use this regex

<a.*(\[nggallery[^\]+]*\]).*?<\/a>

globally (flag g). This regex matches a link and captures the [nggallery ...] part. You can substitute the whole match with $1 to keep only the captured [nggallery ...] part.

I've updated your regex online: https://www.regex101.com/r/rL8wP1/4

PS: In this solution, [nggallery ...] doesn't need to be in a specific attribute like href. If you want to force that, you can use <a.*href\="(\[nggallery[^\]+]*\])".*?<\/a>

answered Oct 11 '22 14:10 by Fabien Sa


Fab Sa's regex <a.*(\[nggallery[^\]+]*\]).*?<\/a> gobbles up everything when there are multiple <a> tags on a single line, due to the unrestricted .* at the beginning, which will match across different <a> tags.
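
The effect is easy to reproduce. The following Python sketch is only an illustration: the test line with two links is invented for this example, and Python's \1 stands in for $1. The leading .* runs to the end of the line and backtracks only as far as the last shortcode, so a single match covers both links.

```python
import re

# Fab Sa's pattern, copied as-is.
GREEDY = re.compile(r'<a.*(\[nggallery[^\]+]*\]).*?<\/a>')

# Hypothetical test line (not from the question): two links on one line.
line = ('<a href="[nggallery id=1]">First</a> some text '
        '<a href="[nggallery id=2]">Second</a>')

# One match spans from the first <a to the last </a>; only the last
# shortcode is captured, so the first one and the text between are lost.
greedy_result = GREEDY.sub(r'\1', line)
print(greedy_result)
```

Only [nggallery id=2] survives; the first shortcode and the text between the links are swallowed by the match.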

By restricting the allowable characters, you can somewhat match what you want:

<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>
  ^^^^^^^

I forced at least one whitespace after a to make sure that it's not matching some other tags, plus some extra restrictions.
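
As a quick check of that restriction, here is a small Python sketch. The two-link test line is again invented for illustration, and \1 is Python's spelling of $1. Because [^>]* cannot run past the end of the opening tag, each link is matched separately.

```python
import re

# The restricted pattern from this answer.
RESTRICTED = re.compile(r'<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>')

# Hypothetical test line (not from the question): two links on one line.
line = ('<a href="[nggallery id=1]">First</a> some text '
        '<a href="[nggallery id=2]">Second</a>')

# Each link is matched on its own and replaced by its shortcode.
restricted_result = RESTRICTED.sub(r'\1', line)
print(restricted_result)
```

Both shortcodes and the text between them are kept.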

Anyway, you are on your own if you discover that it doesn't work in some corner case. It's generally a bad idea to manipulate HTML with regex.
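
Two such corner cases, sketched in Python with made-up inputs (neither appears in the question): a single-quoted attribute value, and a > character inside an earlier attribute. In both, the pattern finds no match and the link is left untouched.

```python
import re

# The restricted pattern from this answer.
RESTRICTED = re.compile(r'<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>')

# Hypothetical inputs chosen to show where the pattern gives up.
single_quoted = "<a href='[nggallery id=1]'>Click</a>"
gt_in_attribute = '<a title="a > b" href="[nggallery id=2]">Click</a>'

# The pattern requires double quotes around the shortcode.
single_quoted_result = RESTRICTED.sub(r'\1', single_quoted)

# [^>]* stops at the > inside the title value, before reaching href.
gt_in_attribute_result = RESTRICTED.sub(r'\1', gt_in_attribute)

print(single_quoted_result)
print(gt_in_attribute_result)
```

Both inputs come back unchanged, so links like these would need a separate pass or a proper HTML parser.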

answered Oct 11 '22 15:10 by nhahtdh