
Best way to fetch a varying HTML tag

I'm trying to fetch some HTML from various blogs and have noticed that different providers use the same tag in different ways.

For example, here are two major providers that use the meta name generator tag differently:

  • Blogger: <meta content='blogger' name='generator'/> (content first, name later and, yes, single quotes!)
  • WordPress: <meta name="generator" content="WordPress.com" /> (name first, content later)

Is there a way to extract the value of content for all cases (single/double quotes, first/last in the row)?

P.S. Although I'm using Java, the answer would probably help more people if it were for regular expressions generally.

pek asked Aug 28 '08 02:08



3 Answers

The answer is: don't use regular expressions.

Seriously. Use an SGML parser, or an XML parser if you happen to know the input is valid XML (which is almost never true). With regular expressions you will absolutely screw up and waste tons of time trying to get it right. Just use what's already available.

Brad Wilson answered Nov 08 '22 01:11


Actually, you should probably use some sort of HTML parser that lets you inspect each node (and therefore each node's attributes) in the DOM of the page. I haven't used any of these for a while, so I don't know the pros and cons, but here's a list: http://java-source.net/open-source/html-parsers

martinatime answered Nov 08 '22 02:11


Those differences are not really important according to the XHTML standard.

In other words, they are exactly the same thing.

Also, if you replaced the double quotes with single quotes, the result would be the same.

The typical way of 'normalizing' an XML document is to parse it using some API that treats the document as its Infoset representation. Both DOM and SAX style APIs work that way.

If you want to parse them by hand (or with a RegEx) you have to replicate all those things in your code and, in my opinion, that's not practical.
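
As a rough illustration of that point, here is a minimal sketch using the DOM API that ships with the JDK. It assumes the input is well-formed XML, which real blog pages often are not, so for arbitrary HTML you would swap in a proper HTML parser instead. The class name `GeneratorMeta`, the method `generatorOf`, and the `<head>` wrapper around the two tags from the question are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class GeneratorMeta {

    /**
     * Returns the content attribute of the first meta element whose
     * name attribute is "generator", or null if there is none.
     * Assumes the input is well-formed XML.
     */
    static String generatorOf(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // The parser has already dealt with quote style and attribute
            // order, so each attribute can simply be looked up by name.
            NodeList metas = doc.getElementsByTagName("meta");
            for (int i = 0; i < metas.getLength(); i++) {
                Element meta = (Element) metas.item(i);
                if ("generator".equalsIgnoreCase(meta.getAttribute("name"))) {
                    return meta.getAttribute("content");
                }
            }
            return null;
        } catch (Exception e) {
            // Malformed input ends up here; a real HTML parser would be more forgiving.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // The two tags from the question, wrapped in a made-up <head> element.
        String blogger = "<head><meta content='blogger' name='generator'/></head>";
        String wordpress = "<head><meta name=\"generator\" content=\"WordPress.com\" /></head>";

        System.out.println(generatorOf(blogger));   // prints "blogger"
        System.out.println(generatorOf(wordpress)); // prints "WordPress.com"
    }
}
```

Both tags go through the same lookup, with no special cases for single versus double quotes or for which attribute comes first.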

Sergio Acosta answered Nov 08 '22 03:11