
Fetching metadata from a URL

I have used the Jsoup library to fetch metadata from a URL:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://www.google.com").get();
// Elements.attr() reads from the first match and returns "" when nothing matches
String keywords = doc.select("meta[name=keywords]").attr("content");
String description = doc.select("meta[name=description]").attr("content");
// Images whose src matches a common image extension (case-insensitive)
Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
String src = images.attr("src");

System.out.println("Meta keyword : " + keywords);
System.out.println("Meta description : " + description);
System.out.println("Meta image URL : " + src);

But I want to do this on the client side using JavaScript.

SR230 asked Mar 10 '16 08:03



1 Answer

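You can't do this purely on the client because of the browser's same-origin policy: a script can't read a response from another origin unless that server sends CORS headers allowing it. You need a server-side script to fetch the content of the page.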
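A server-side script along these lines could look like the sketch below. It assumes Node.js 18 or later (where fetch is available globally) and mirrors the three values the Jsoup code reads. The names getAttr, extractMeta, and fetchMeta are hypothetical, and the regex-based extraction is only a rough stand-in for a real HTML parser.

```javascript
// Hypothetical helper: reads one attribute value out of a single tag string.
function getAttr(tag, attr) {
  const re = new RegExp(`\\s${attr}\\s*=\\s*(?:"([^"]*)"|'([^']*)'|([^\\s>]+))`, 'i');
  const m = tag.match(re);
  return m ? (m[1] ?? m[2] ?? m[3]) : null;
}

// Hypothetical helper: pulls description, keywords, and the first image src
// from raw HTML. A regex scan is fragile; it is used here only to keep the
// sketch dependency-free.
function extractMeta(html) {
  const meta = { description: null, keywords: null, image: null };
  for (const tag of html.match(/<meta\b[^>]*>/gi) || []) {
    const name = (getAttr(tag, 'name') || '').toLowerCase();
    if ((name === 'description' || name === 'keywords') && meta[name] === null) {
      meta[name] = getAttr(tag, 'content');
    }
  }
  const img = html.match(/<img\b[^>]*>/i);
  if (img) meta.image = getAttr(img[0], 'src');
  return meta;
}

// Runs on the server, so the browser's same-origin policy does not apply.
// Assumes a global fetch (Node.js 18+).
async function fetchMeta(url) {
  const res = await fetch(url);
  if (!res.ok) throw new Error('Request failed with status ' + res.status);
  return extractMeta(await res.text());
}
```

The page's own JavaScript would then request this script from its own origin and receive the extracted values, for example as JSON.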

Alternatively, you can use YQL, which then acts as the proxy: https://policies.yahoo.com/us/en/yahoo/terms/product-atos/yql/index.htm

Or you can use https://cors-anywhere.herokuapp.com, in which case cors-anywhere acts as the proxy.

For example:

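$('button').click(function() {
  $.ajax({
    // Prefix the target URL with the proxy so the response can be read cross-origin
    url: 'https://cors-anywhere.herokuapp.com/' + $('input').val()
  }).then(function(data) {
    // Parse the returned markup into a jQuery set of its top-level elements
    var html = $(data);

    // .text() rather than .html(), so remote content is not injected as markup
    $('#kw').text(getMetaContent(html, 'keywords') || 'no keywords found');
    $('#des').text(getMetaContent(html, 'description') || 'no description found');
    $('#img').text(html.find('img').attr('src') || 'no image found');
  });
});

// Returns the content attribute of the first top-level <meta> whose name matches
function getMetaContent(html, name) {
  return html.filter(
  (index, tag) => tag && tag.name && tag.name == name).attr('content');
}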
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>

<input type="text" placeholder="Type URL here" value="http://www.html5rocks.com/en/tutorials/cors/" />
<button>Get Meta Data</button>

<pre>
  <div>Meta Keyword: <div id="kw"></div></div>
  <div>Description: <div id="des"></div></div>
  <div>image: <div id="img"></div></div>
</pre>
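The same proxy approach can also be sketched without jQuery, using the browser's fetch and DOMParser. This is a minimal sketch under assumptions: the names PROXY, proxied, readMeta, and fetchMetaViaProxy are hypothetical, and it presumes the proxy is reachable and permits your requests.

```javascript
// Assumption: a CORS proxy that you are allowed to use, taken from the answer above.
const PROXY = 'https://cors-anywhere.herokuapp.com/';

// Hypothetical helper: builds the proxied URL by prefixing the target URL.
function proxied(url, proxy = PROXY) {
  return proxy + url;
}

// Hypothetical helper: reads the metadata from a parsed document.
// It only relies on querySelector and getAttribute.
function readMeta(doc) {
  const content = (name) => {
    const el = doc.querySelector(`meta[name="${name}"]`);
    return el ? el.getAttribute('content') : null;
  };
  const img = doc.querySelector('img[src]');
  return {
    keywords: content('keywords'),
    description: content('description'),
    image: img ? img.getAttribute('src') : null
  };
}

// Browser-only: DOMParser is not available in plain Node.js.
async function fetchMetaViaProxy(url) {
  const res = await fetch(proxied(url));
  if (!res.ok) throw new Error('Request failed with status ' + res.status);
  // parseFromString builds an inert document from the response text
  const doc = new DOMParser().parseFromString(await res.text(), 'text/html');
  return readMeta(doc);
}
```

Because DOMParser produces a full document, querySelector finds the meta tags wherever they sit in the markup, and the values can be written to the page with textContent.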
Mosh Feu answered Sep 26 '22 02:09