I'm trying to fetch some HTML from various blogs and have noticed that different providers use the same tag in different ways.
For example, here are two major providers that use the meta name generator tag differently:
<meta content='blogger' name='generator'/>
(content first, name later and, yes, single quotes!)
<meta name="generator" content="WordPress.com" />
(name first, content later)
Is there a way to extract the value of content for all cases (single/double quotes, first/last in the row)?
P.S. Although I'm using Java, the answer would probably help more people if it were for regular expressions in general.
The answer is: don't use regular expressions.
Seriously. Use an SGML parser, or an XML parser if you happen to know the document is valid XML (which is almost never true). With regular expressions you will absolutely screw up and waste tons of time trying to get it right. Just use what's already available.
Actually, you should probably use some sort of HTML parser where you can inspect each node (and therefore node attributes) in the DOM of the page. I've not used any of these for a while, so I don't know the pros and cons, but here's a list: http://java-source.net/open-source/html-parsers
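As a rough illustration of the parser approach, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html.parser). That choice is only an assumption made to keep the example free of external dependencies; it is a dated parser, and a library from the list above may cope better with modern pages. The class name GeneratorMeta and the method generators are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the content of every <meta name="generator"> tag.
class GeneratorMeta {

    static List<String> generators(String html) {
        List<String> found = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // The parser hands over attributes already parsed, so quote style
            // and attribute order no longer matter at this point.
            private void collect(HTML.Tag tag, MutableAttributeSet attrs) {
                if (tag != HTML.Tag.META) {
                    return;
                }
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (name != null && content != null
                        && "generator".equalsIgnoreCase(name.toString())) {
                    found.add(content.toString());
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(tag, attrs);
            }

            // Also listen for start tags, in case the parser reports <meta> that way.
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(tag, attrs);
            }
        };

        try {
            // The last argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta content='blogger' name='generator'/>"
                + "<meta name=\"generator\" content=\"WordPress.com\" />"
                + "</head><body></body></html>";
        System.out.println(generators(html));
    }
}
```

The point of the sketch is that the code never looks at quotes or attribute positions; it only asks the parser for the name and content attributes.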
Those differences are not really important according to the XHTML standard.
In other words, they are exactly the same thing.
Likewise, replacing the double quotes with single quotes makes no difference.
The typical way of 'normalizing' an XML document is to parse it using an API that treats the document as its Infoset representation. Both DOM and SAX style APIs work that way.
If you want to parse them by hand (or with a RegEx) you have to replicate all those things in your code and, in my opinion, that's not practical.
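To make the Infoset point concrete, here is a small sketch using the DOM API that ships with the JDK. It assumes the fragment is well-formed XML, which happens to be true of the two tags in the question but usually is not true of a whole real-world page. The class name MetaInfoset and the method contentOf are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: reads the content attribute of a single XML element.
class MetaInfoset {

    static String contentOf(String metaXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(metaXml)));
            // Attributes are looked up by name; their order and the quote
            // characters used in the source are not part of the parsed result.
            return doc.getDocumentElement().getAttribute("content");
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Single quotes, content first
        System.out.println(contentOf("<meta content='blogger' name='generator'/>"));
        // prints "blogger"

        // Double quotes, name first
        System.out.println(contentOf("<meta name=\"generator\" content=\"WordPress.com\" />"));
        // prints "WordPress.com"
    }
}
```

Both spellings of the tag go through the same lookup, which is the work a hand-written regular expression would have to replicate.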