
Swift: Regex to remove all inline HTML attributes

I want to strip all attributes from an HTML string. I've found a lot of answers on how to do that, but the regexes they suggest only work when the style attribute is well formed. My situation is harder because the HTML I get from an API has malformed style attributes. It might look like this:

<p style="\"text-align:" justify;="" \"=""><span style="\"font-size:" 13px;="" font-family:="" arial;="" text-decoration-skip-ink:="" none;\"=""><b><span style="font-size: 18px;">Angkor Wat</span></b> is a temple complex in Cambodia and the largest religious monument in the world, on a site measuring 162.6 hectares (1,626,000 m2; 402 acres). It was originally constructed as a Hindu temple dedicated to the god Vishnu for the Khmer Empire, gradually transforming into a Buddhist temple towards the end of the 12th century. It was built by the Khmer King Suryavarman II in the early 12th century in Yaśodharapura, the capital of the Khmer Empire, as his state temple and eventual mausoleum. Breaking from the Shaiva tradition of previous kings, Angkor Wat was instead dedicated to Vishnu. As the best-preserved temple at the site, it is the only one to have remained a significant religious centre since its foundation. The temple is at the top of the high classical style of Khmer architecture. It has become a symbol of Cambodia, appearing on its national flag, and it is the country\'s prime attraction for visitors.</span></p><p style="\"text-align:" justify;="" \"=""><span style="\"font-size:" 13px;="" font-family:="" arial;="" text-decoration-skip-ink:="" none;\"="">Angkor Wat combines two basic plans of Khmer temple architecture: the temple-mountain and the later galleried temple. It is designed to represent Mount Meru, home of the devas in Hindu mythology: within a moat and an outer wall 3.6 kilometres (2.2 mi) long are three rectangular galleries, each raised above the next. At the centre of the temple stands a quincunx of towers. Unlike most Angkorian temples, Angkor Wat is oriented to the west; scholars are divided as to the significance of this. The temple is admired for the grandeur and harmony of the architecture, its extensive bas-reliefs, and for the numerous devatas adorning its walls.</span></p>

You can test this string by copying and pasting the whole text into this website. I want to find a regex that removes all CSS styles.

I want the regex to work like this Useful HTML Cleaner Website.

This is before cleaning the HTML:

Before cleaning HTML

And this is after cleaning the HTML:

After Cleaning HTML

This website can clean all HTML attributes, and it doesn't care if those attributes are malformed.

I found many regexes online that are supposed to clean HTML attributes, but none of them works in my situation. Here are some of them:

  • <[^>]+((style|class)="[^"]*")[^>]*>
  • <\s*([a-z][a-z0-9]*)\s.*?>
  • style=\"([^\"]*)\"
  • style="(.*?)"
  • <\\s*([a-z][a-z0-9]*)\\s.*?>

EDIT: here is a useful snippet from Tobi that removes the style attribute:

let regex = try! NSRegularExpression(pattern: "style=\"([^\"]*)\"", options: .caseInsensitive)
// NSRange counts UTF-16 code units, so build it from the String instead of from the character count
let range = NSRange(html.startIndex..., in: html)
let modString = regex.stringByReplacingMatches(in: html, options: [], range: range, withTemplate: "")

And the result of this regex is still like this:

<p text-align:" justify;="" \"=""><span font-size:" 13px;="" font-family:="" arial;="" text-decoration-skip-ink:="" none;\"=""><b><span >Angkor Wat</span></b> is a temple complex in Cambodia and the largest religious monument in the world, on a site measuring 162.6 hectares (1,626,000 m2; 402 acres). It was originally constructed as a Hindu temple dedicated to the god Vishnu for the Khmer Empire, gradually transforming into a Buddhist temple towards the end of the 12th century. It was built by the Khmer King Suryavarman II in the early 12th century in Yaśodharapura, the capital of the Khmer Empire, as his state temple and eventual mausoleum. Breaking from the Shaiva tradition of previous kings, Angkor Wat was instead dedicated to Vishnu. As the best-preserved temple at the site, it is the only one to have remained a significant religious centre since its foundation. The temple is at the top of the high classical style of Khmer architecture. It has become a symbol of Cambodia, appearing on its national flag, and it is the country\'s prime attraction for visitors.</span></p><p text-align:" justify;="" \"=""><span font-size:" 13px;="" font-family:="" arial;="" text-decoration-skip-ink:="" none;\"="">Angkor Wat combines two basic plans of Khmer temple architecture: the temple-mountain and the later galleried temple. It is designed to represent Mount Meru, home of the devas in Hindu mythology: within a moat and an outer wall 3.6 kilometres (2.2 mi) long are three rectangular galleries, each raised above the next. At the centre of the temple stands a quincunx of towers. Unlike most Angkorian temples, Angkor Wat is oriented to the west; scholars are divided as to the significance of this. The temple is admired for the grandeur and harmony of the architecture, its extensive bas-reliefs, and for the numerous devatas adorning its walls.</span></p>

Please use this website to test my string.

This regex only clears style attributes that are well formed, i.e. exactly in the format style="...". Everything after the first closing quote is left behind.

Anonymous-E asked Sep 21 '18 06:09


2 Answers

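You can use SwiftSoup to solve this problem. Here is my code:

    import SwiftSoup

    do {
        // `html` is the string received from the API
        let doc: Document = try SwiftSoup.parse(html)
        let elements = try doc.getAllElements()

        // Remove every attribute from every element, whatever its name
        try elements.forEach { (el) in
            let attr = el.getAttributes()

            try attr?.forEach({ (att) in
                try el.removeAttr(att.getKey())
            })
        }
        // Unwrap the optional body so the output is not printed as Optional("...")
        if let cleaned = try doc.body()?.html() {
            print(cleaned)
        }
    } catch Exception.Error(let type, let message) {
        print(type, message)
    } catch {
        print("error")
    }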

Here is the result:

<p><span><b><span>Angkor Wat</span></b> is a temple complex in Cambodia and the largest religious monument in the world, on a site measuring 162.6 hectares (1,626,000 m2; 402 acres). It was originally constructed as a Hindu temple dedicated to the god Vishnu for the Khmer Empire, gradually transforming into a Buddhist temple towards the end of the 12th century. It was built by the Khmer King Suryavarman II in the early 12th century in Yaśodharapura, the capital of the Khmer Empire, as his state temple and eventual mausoleum. Breaking from the Shaiva tradition of previous kings, Angkor Wat was instead dedicated to Vishnu. As the best-preserved temple at the site, it is the only one to have remained a significant religious centre since its foundation. The temple is at the top of the high classical style of Khmer architecture. It has become a symbol of Cambodia, appearing on its national flag, and it is the country\'s prime attraction for visitors.</span></p>\n<p><span>Angkor Wat combines two basic plans of Khmer temple architecture: the temple-mountain and the later galleried temple. It is designed to represent Mount Meru, home of the devas in Hindu mythology: within a moat and an outer wall 3.6 kilometres (2.2 mi) long are three rectangular galleries, each raised above the next. At the centre of the temple stands a quincunx of towers. Unlike most Angkorian temples, Angkor Wat is oriented to the west; scholars are divided as to the significance of this. The temple is admired for the grandeur and harmony of the architecture, its extensive bas-reliefs, and for the numerous devatas adorning its walls.</span></p>

Hope this helps :)

da vamp answered Sep 24 '22 12:09


It's not that easy, because your HTML is completely broken. I recommend asking your API designer why the API outputs this sort of broken HTML.

Anyway, if you need to process this sort of HTML-like text with a regex, you can detect each opening tag and remove everything other than the tag name:

import Foundation

let inputHTML = """
<p style="\\"text-align:" justify;="" \\"=""><span style="\\"font-size:" 13px;="" font-family:="" arial;="" text-decoration-skip-ink:="" none;\\"=""><b><span style="font-size: 18px;">Angkor Wat</span></b> is a temple complex in Cambodia and the largest religious monument in the world, on a site measuring 162.6 hectares (1,626,000 m2; 402 acres). It was originally constructed as a Hindu temple dedicated to the god Vishnu for the Khmer Empire, gradually transforming into a Buddhist temple towards the end of the 12th century. It was built by the Khmer King Suryavarman II in the early 12th century in Yaśodharapura, the capital of the Khmer Empire, as his state temple and eventual mausoleum. Breaking from the Shaiva tradition of previous kings, Angkor Wat was instead dedicated to Vishnu. As the best-preserved temple at the site, it is the only one to have remained a significant religious centre since its foundation. The temple is at the top of the high classical style of Khmer architecture. It has become a symbol of Cambodia, appearing on its national flag, and it is the country\\'s prime attraction for visitors.</span></p><p style="\\"text-align:" justify;="" \\"=""><span style="\\"font-size:" 13px;="" font-family:="" arial;="" text-decoration-skip-ink:="" none;\\"="">Angkor Wat combines two basic plans of Khmer temple architecture: the temple-mountain and the later galleried temple. It is designed to represent Mount Meru, home of the devas in Hindu mythology: within a moat and an outer wall 3.6 kilometres (2.2 mi) long are three rectangular galleries, each raised above the next. At the centre of the temple stands a quincunx of towers. Unlike most Angkorian temples, Angkor Wat is oriented to the west; scholars are divided as to the significance of this. The temple is admired for the grandeur and harmony of the architecture, its extensive bas-reliefs, and for the numerous devatas adorning its walls.</span></p>
"""
// Group 1: "<" plus the tag name, group 2: everything up to ">" (the attributes), group 3: the tag end.
let openingTagPattern = "(<[a-z0-9]+)\\s*([^>]*)(/?>)"
// Subclass that rebuilds every matched opening tag from groups 1 and 3 only.
class TagCleaningRegex: NSRegularExpression {
    override func replacementString(for result: NSTextCheckingResult, in string: String, offset: Int, template templ: String) -> String {
        if
            result.numberOfRanges >= 4,
            case let attrRng = result.range(at: 2),
            attrRng.location != NSNotFound,
            attrRng.length != 0
        {
            // The tag has attributes: keep only the tag start and the tag end.
            let tagStart = string[Range(result.range(at: 1), in: string)!]
            let tagEnd = string[Range(result.range(at: 3), in: string)!]
            return "\(tagStart)\(tagEnd)"
        } else {
            // No attributes: fall back to the default template handling ("$0" keeps the match as is).
            return super.replacementString(for: result, in: string, offset: offset, template: templ)
        }
    }
}
let regex = try! TagCleaningRegex(pattern: openingTagPattern, options: .caseInsensitive)
let output = regex.stringByReplacingMatches(in: inputHTML, range: NSRange(0..<inputHTML.utf16.count), withTemplate: "$0")
print(output)
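
The same idea (keep the tag name, drop everything else inside the opening tag) can probably also be sketched without subclassing, using replacingOccurrences(of:with:options:) with the .regularExpression option. This is only a minimal sketch: the helper name stripAttributes(from:) and the shortened sample string are made up for illustration, and the pattern assumes that no attribute value contains a literal ">" and that the text itself has no "<" directly followed by a letter.

```swift
import Foundation

// Hypothetical helper, for illustration only: rewrites every opening tag
// so that only the tag name (and an optional self-closing slash) remains.
// Assumption: no attribute value contains a literal ">".
func stripAttributes(from html: String) -> String {
    // $1 = tag name, $2 = optional "/" of a self-closing tag.
    // Closing tags such as </span> do not match, because "<" must be
    // followed directly by a letter.
    let pattern = #"<([a-z][a-z0-9]*)\b[^>]*?(/?)>"#
    return html.replacingOccurrences(
        of: pattern,
        with: "<$1$2>",
        options: [.regularExpression, .caseInsensitive]
    )
}

// Shortened sample in the same broken style as the API output above.
let sample = #"<p style="\"text-align:" justify;="" \"=""><span style="font-size: 18px;">Angkor Wat</span></p><br class="x"/>"#
print(stripAttributes(from: sample))
// prints: <p><span>Angkor Wat</span></p><br/>
```

Because the pattern never looks at attribute names or quotes, it does not matter how badly the attributes are formed, as long as the opening tag still ends at the first ">".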

That said, da vamp's answer seems far better.

OOPer answered Sep 21 '22 12:09