
How do search engines find relevant content?

How does Google find relevant content when it's parsing the web?

Let's say, for instance, Google uses the PHP native DOM library to parse content. What methods would it use to find the most relevant content on a web page?

My thoughts would be that it would search for all paragraphs, order by the length of each paragraph and then from possible search strings and query params work out the percentage of relevance each paragraph is.

Let's say we had this URL:

http://domain.tld/posts/stackoverflow-dominates-the-world-wide-web.html 

From that URL I would work out that the HTML file name is highly relevant, so I would then compare that string against all the paragraphs in the page.
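A minimal sketch of that comparison, written in Python rather than PHP for brevity. The function names `slug_tokens` and `slug_overlap` are made up for illustration, and plain word overlap is only one assumed way to measure "how close" the file name is to a paragraph:

```python
import re
from urllib.parse import urlparse

def slug_tokens(url):
    """Split the last path segment of a URL into lowercase words."""
    path = urlparse(url).path
    last = path.rstrip("/").rsplit("/", 1)[-1]
    # Drop a file extension such as ".html" if there is one.
    stem = last.rsplit(".", 1)[0] if "." in last else last
    return [t for t in re.split(r"[^a-z0-9]+", stem.lower()) if t]

def slug_overlap(url, paragraph):
    """Fraction of slug words that also occur in the paragraph (0.0 to 1.0)."""
    tokens = slug_tokens(url)
    if not tokens:
        return 0.0
    words = set(re.findall(r"[a-z0-9]+", paragraph.lower()))
    return sum(t in words for t in tokens) / len(tokens)

url = "http://domain.tld/posts/stackoverflow-dominates-the-world-wide-web.html"
print(slug_tokens(url))
print(slug_overlap(url, "StackOverflow dominates the web"))
```

Each paragraph could be scored this way and the highest-scoring one kept. Common words such as "the" inflate the score, so a real version would probably filter stop words first.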

A really good example of this is Facebook share: when you share a page, Facebook quickly crawls the link and brings back images, content, and so on.

I was thinking that some sort of calculative method would be best, to work out the % of relevancy depending on surrounding elements and meta data.

Are there any books / information on the best practices of content parsing that covers how to get the best content from a site, any algorithms that may be talked about or any in-depth reply?


Some ideas that I have in mind are:

  • Find all paragraphs and order by plain text length
  • Somehow find the Width and Height of div containers and order by (W+H) - @Benoit
  • Check meta keywords, title, description and check relevancy within the paragraphs
  • Find all image tags and order by largest, and length of nodes away from main paragraph
  • Check for object data, such as videos and count the nodes from the largest paragraph / content div
  • Work out resemblances from previous pages parsed
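The first idea in the list above (find all paragraphs and order them by plain text length) could be sketched like this. It uses Python's standard `html.parser` instead of the PHP DOM library, and the names `ParagraphCollector` and `paragraphs_by_length` are hypothetical:

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the plain text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None

def paragraphs_by_length(html):
    """Return paragraph texts, longest first."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return sorted(parser.paragraphs, key=len, reverse=True)

html = "<div><p>Short.</p><p>A much <b>longer</b> paragraph of text.</p></div>"
print(paragraphs_by_length(html))
```

Length alone is a crude signal (a long cookie notice would win), so it is probably best combined with the other ideas in the list.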

The reason why I need this information:

I'm building a website where webmasters send us links and then we list their pages, but I want the webmaster to submit a link, then I go and crawl that page finding the following information.

  • An image (if applicable)
  • A paragraph of fewer than 255 characters taken from the best slice of text
  • Keywords that would be used for our search engine, (Stack Overflow style)
  • Meta data Keywords, Description, all images, change-log (for moderation and administration purposes)
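A rough sketch of collecting those fields from a submitted page, again using Python's standard `html.parser`. The class `PageSummary` and the helper `snippet` are illustrative names, and reading the 255 limit as a character count is an assumption:

```python
from html.parser import HTMLParser

class PageSummary(HTMLParser):
    """Collect the title, meta keywords/description and image URLs of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def snippet(text, limit=255):
    """Trim text to fewer than `limit` characters, cutting at a word boundary."""
    text = " ".join(text.split())
    if len(text) < limit:
        return text
    cut = text[:limit - 1]
    head = cut.rsplit(" ", 1)[0]  # drop the last, possibly partial, word
    return head or cut

page = PageSummary()
page.feed('<head><title>Demo</title><meta name="description" content="A demo page"></head>'
          '<body><img src="/logo.png"></body>')
print(page.title, page.meta, page.images)
```

The text passed to `snippet` would be whichever paragraph the relevance scoring picked as the best slice.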

Hope you guys can understand that this is not for a search engine, but the way search engines tackle content discovery is the same problem I need to solve.

I'm not asking for trade secrets, I'm asking what your personal approach to this would be.

RobertPitt asked Oct 19 '10 09:10



1 Answer

This is a very general question but a very nice topic! Definitely upvoted :) However I am not satisfied with the answers provided so far, so I decided to write a rather lengthy answer on this.

The reason I am not satisfied is that the answers are basically all true (I especially like the answer of kovshenin (+1), which is very graph theory related...), but they are all either too specific on certain factors or too general.

It's like asking how to bake a cake and you get the following answers:

  • You make a cake and you put it in the oven.
  • You definitely need sugar in it!
  • What is a cake?
  • The cake is a lie!

You won't be satisfied because you want to know what makes a good cake. And of course there are a lot of recipes.

Of course Google is the most important player, but, depending on the use case, a search engine might include very different factors or weight them differently.

For example, a search engine for discovering new independent music artists may penalize artists' websites that have a lot of inbound external links.

A mainstream search engine will probably do the exact opposite to provide you with "relevant results".

There are (as already said) over 200 factors that are published by Google, so webmasters know how to optimize their websites. There are very likely many more that the public is not aware of (in Google's case).

But under the very broad and abstract term SEO optimization, you can generally break the important ones apart into two groups:

  1. How well does the answer match the question? Or: How well does the page's content match the search terms?

  2. How popular/good is the answer? Or: What's the PageRank?

In both cases the important thing is that I am not talking about whole websites or domains, I am talking about single pages with a unique URL.

It's also important that PageRank doesn't represent all factors, only the ones that Google categorizes as popularity. And by good I mean other factors that have nothing to do with popularity.

In case of Google the official statement is that they want to give relevant results to the user. Meaning that all algorithms will be optimized towards what the user wants.

So after this long introduction (glad you are still with me...) I will give you a list of factors that I consider to be very important (at the moment):

Category 1 (how well does the answer match the question?)

You will notice that a lot comes down to the structure of the document!

  • The page primarily deals with the exact question.

Meaning: the question words appear in the page's title text or in headings and paragraphs. The position of these keywords matters too: the earlier in the page, the better. Repeating them often helps as well (as long as it isn't overdone, which is known as keyword stuffing).

  • The whole website deals with the topic (keywords appear in the domain/subdomain)

  • The words are an important topic in this page (internal links' anchor texts jump to positions of the keyword, or the anchor texts / link texts contain the keyword).

  • The same goes if external links use the keywords in link text to link to this page
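To make the first point of this category concrete, here is a toy scoring function in Python. It is only an illustration: the weights (3 for the title, 2 for headings, up to 1 for an early position in the body) and the name `placement_score` are invented, and nothing here reflects how Google actually weights these signals.

```python
import re

def _words(text):
    """Lowercase word list of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())

def placement_score(query, title, headings, body):
    """Average per-term score based on where each query term appears."""
    terms = set(_words(query))
    if not terms:
        return 0.0
    title_words = set(_words(title))
    heading_words = set(w for h in headings for w in _words(h))
    body_words = _words(body)
    score = 0.0
    for term in terms:
        if term in title_words:
            score += 3.0  # assumed weight for a title match
        if term in heading_words:
            score += 2.0  # assumed weight for a heading match
        if term in body_words:
            # The earlier the first occurrence, the closer this gets to 1.0.
            first = body_words.index(term)
            score += 1.0 - first / len(body_words)
    return score / len(terms)

print(placement_score("cake recipe", "Best cake recipe", ["Ingredients"], "This cake is easy"))
```

A check against keyword stuffing would need something extra, for example capping how much repeated occurrences of a term can add.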

Category 2 (how important/popular is the page?)

You will notice that not all factors point towards this exact goal. Some are included (especially by Google) just to give pages a boost, that... well... that just deserved/earned it.

  • Content is king!

The existence of unique content that can't be found, or can be found only rarely, elsewhere on the web gives a boost. This is mostly measured by unordered combinations of words on a website that are generally used very little (important words). But there are far more sophisticated methods as well.

  • Recency - newer is better

  • Historical change (how often the page has been updated in the past. Changing is good.)

  • External link popularity (how many links in?)

If a page links to another page, the link is worth more if the linking page itself has a high PageRank.

  • External link diversity

Basically links from different root domains, but other factors play a role too, even factors like how far apart the web servers of the linking sites are geographically (according to their IP addresses).

  • Trust Rank

For example, if big, trusted, established sites with editorial content link to you, you get a trust rank. That's why a link from The New York Times is worth much more than one from some strange new website, even if the new site's PageRank is higher!

  • Domain trust

Your whole website gives a boost to your content if your domain is trusted. Different factors count here. Of course links from trusted sites to your domain, but it will even help if you are in the same datacenter as important websites.

  • Topic specific links in.

If websites that can be resolved to a topic link to you and the query can be resolved to this topic as well, it's good.

  • Distribution of links in over time.

If you earn a lot of inbound links in a short period of time, this will do you good at that time and in the near future afterwards, but not so much later on. If you earn links slowly and steadily, it will do you good for content that is "timeless".

  • Links from restricted domains

A link from a .gov domain is worth a lot.

  • User click behaviour

What's the click-through rate of your search result?

  • Time spent on site

Google Analytics tracking, etc. It's also tracked whether the user clicks back or clicks another result after opening yours.

  • Collected user data

Votes, rating, etc., references in Gmail, etc.
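The "external link popularity" point (a link is worth more when the linking page itself ranks highly) is the core idea of PageRank. Below is a minimal power-iteration sketch in Python of the textbook formulation, not Google's production algorithm. The damping value 0.85 is the commonly quoted default, and the function name and toy graph are made up:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share regardless of its links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph "c" is linked from two pages and ends up on top, while "a" outranks "b" because its single inbound link comes from the highly ranked "c".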

Now I will introduce a third category, and one or two points from above would go into this category, but I haven't thought of that... The category is:

Category 3 (how important/good is your website in general?)

All your pages will be ranked up a bit depending on the quality of your website.

Factors include:

  • Good site architecture (easy to navigate, structured. Sitemaps, etc...)

  • How established (long existing domains are worth more).

  • Hoster information (what other websites are hosted near you?)

  • Search frequency of your exact name.

Last, but not least, I want to say that a lot of these factors can be enriched by semantic technology and new ones can be introduced.

For example, someone may search for Titanic and you have a website about icebergs ... the two can be correlated, which may be reflected in the results.

Newly introduced semantic identifiers, for example OWL tags, may have a huge impact in the future.

For example a blog about the movie Titanic could put a sign on this page that it's the same content as on the Wikipedia article about the same movie.

This kind of linking is currently under heavy development and establishment and nobody knows how it will be used.

Maybe duplicate content is filtered, and only the most important copy of the same content is displayed? Or maybe the other way round, so that you get presented with a lot of pages that match your query, even if they don't contain your keywords?

Google even applies factors in different relevance depending on the topic of your search query!

The Surrican answered Oct 02 '22 13:10
