
How do some sites with fake links show up in search engine results?

Lately I've come across several Google search results containing sites whose links exactly match my search words. How is it possible for these sites to dynamically change their content, or rather, how are they fooling Google into indexing their pages for my keywords? I've read about content farms, but that doesn't seem to be the right answer. Can someone tell me what this technique is called? I'd like to understand more about it.

asked Nov 03 '11 by Gopal




2 Answers

My understanding is that the only way to get on Google or any other indexing engine is to have the robot actually crawl your site and generate results. Obviously, Google can crawl dynamic sites:

  • http://googlewebmastercentral.blogspot.com/2008/09/dynamic-urls-vs-static-urls.html

However, I find this to be an evolutionary rather than a revolutionary change with regard to your question.

What I think is happening behind the scenes is the combination of these things:

  • Content index
  • Prepared index
  • User submitted content
  • Referrer search updates

I'll try to explain each of these using a fictional site that sells music - you have plenty of real examples to compare against. It will, of course, live on the example.com domain.

Content index

Obviously, as a site that wants to offer something, you actually have some content. Usually, you group this content somehow. Let's assume our music site can group content by different categories:

  • Author
  • Music genre
  • User submitted
  • Content ratings

Each of these can be represented abstractly as a tag. For example, our site could choose to have example.com/tags/eagles to represent Eagles or example.com/tags/rock to represent all rock bands. Google would be able to index these, so any potential search could yield a link to our site.
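Just to make this concrete, here's a rough Python sketch of how content groups could be turned into crawlable tag URLs. The slugify helper and the catalog data are made up for illustration:

    import re

    def slugify(name):
        """Turn a display name into a URL-safe tag slug, e.g. 'Folk Music' -> 'folk-music'."""
        return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

    # Hypothetical content grouped by category.
    catalog = {
        "author": ["Eagles", "Led Zeppelin"],
        "genre": ["Rock", "Folk Music"],
    }

    # Every group member becomes a crawlable tag URL under example.com/tags/.
    for category, names in catalog.items():
        for name in names:
            print("http://example.com/tags/" + slugify(name))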

Prepared index

A prepared index is similar, but it's a generic index built from outside sources instead of your real content. It can be prepared in several ways, such as:

  • Take a dictionary and add all its words
  • Crawl a few million pages from the Web (possibly using links provided by search engines!) and extract frequently repeated phrases from them
  • Grab content from free forums
  • Use Wikipedia
  • Get text from freely available books, such as those from Project Gutenberg

Our site would, for example, take any words from texts that are related to music in any way and make tags similar to the previous ones. E.g., just by crawling the Rock music page on Wikipedia, you can generate a lot of tags.
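A rough sketch of that harvesting (just an illustration - a real crawler would respect robots.txt and use a proper HTML parser; the URL and the frequency cutoff are arbitrary choices):

    import re
    from collections import Counter
    from urllib.request import Request, urlopen

    # Fetch one page and pull out candidate tag words.
    req = Request("https://en.wikipedia.org/wiki/Rock_music",
                  headers={"User-Agent": "tag-harvester-demo/0.1"})
    html = urlopen(req).read().decode("utf-8")

    # Crude HTML stripping; good enough for an illustration.
    text = re.sub(r"<[^>]+>", " ", html)
    words = re.findall(r"[a-z]{4,}", text.lower())

    # The most frequent words become candidate entries for example.com/tags/.
    for word, count in Counter(words).most_common(20):
        print(word, count)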

User submitted content

This is something that usually comes after your site is up and running. Let's say that we put a search box on our site and then users come in and type "rock music". Doh, we already knew that, so nothing new from that search. However, let's say we go through our Web server logs and see some searches for langeleik. Now, that would be something we might not have indexed before. Cool - we just generated another tag on our site.
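A tiny sketch of that log mining (the log format, the search URL, and the existing tag set are all made up for illustration):

    from urllib.parse import urlparse, parse_qs

    existing_tags = {"rock", "eagles"}

    # Two hypothetical on-site search hits from an access log.
    log_lines = [
        '1.2.3.4 - - [03/Nov/2011] "GET /search?q=rock HTTP/1.1" 200',
        '5.6.7.8 - - [03/Nov/2011] "GET /search?q=langeleik HTTP/1.1" 200',
    ]

    for line in log_lines:
        path = line.split('"')[1].split()[1]   # requested URL from the log line
        for term in parse_qs(urlparse(path).query).get("q", []):
            if term not in existing_tags:
                existing_tags.add(term)
                print("new tag discovered:", term)   # -> langeleik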

Obviously, Google doesn't know that yet - so we create an entry in our sitemap and it's there after another Googlebot crawl. When a user searches on Google for "langeleik", one of the links might be a link to example.com/tags/langeleik.

There are other, and possibly far more valuable, forms of user input - comments, forum posts, etc. Hence there are many generic forum sites whose sole purpose is hosting discussions: it's a great data source and you get new content for free.

In the end, all of this should go into your site's sitemap. You can have huge sitemaps - see this:

  • https://webmasters.stackexchange.com/questions/26964/google-sitemap-for-dynamic-url-structure
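Generating the sitemap itself is trivial once you have the tag list. A minimal sketch (the tag list is hypothetical; real sitemaps would be split into files of at most 50,000 URLs):

    tags = ["rock", "eagles", "langeleik"]

    entries = "\n".join(
        "  <url><loc>http://example.com/tags/%s</loc></url>" % tag
        for tag in tags
    )

    sitemap = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + entries + "\n</urlset>"
    )
    print(sitemap)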

Referrals

The last thing is referrals. Again after your site is up and running, some of the Google searches will come directly to you. That's when you can take advantage of the HTTP Referer header (yes, it's a misspelling - check it out on Wikipedia), see this:

  • Is it possible to capture search term from Google search?

Note that Google search is both:

  • Incomplete
  • Fuzzy

Thus, you can search for "langeleik" above, but some of the links will have a title such as "Langeleik and Harpe". Nothing unusual, but note also the reverse - if you search for "langeleik and harpe", it will not only find pages with both terms, but also pages with just one or the other. If our site knows about harpe, but not about langeleik, and somebody searches for "langeleik and harpe", we will receive through the HTTP Referer header a q parameter such as q=langeleik+harpe. Cool - we just got another word to add to our sitemap, if we want.
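Here's a minimal sketch of pulling that term out of the Referer (the header value and the known tag set are made up; note that parse_qs already turns + into spaces):

    from urllib.parse import urlparse, parse_qs

    # Hypothetical Referer sent by a visitor coming from a Google search.
    referer = "http://www.google.com/search?q=langeleik+harpe"

    known_tags = {"harpe"}
    query = parse_qs(urlparse(referer).query).get("q", [""])[0]

    for term in query.split():
        if term not in known_tags:
            print("candidate tag from Referer:", term)   # -> langeleik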

As for fuzziness, note that when you search for "eagles", you can get everything from birds through NFL teams to a rock band. Thus, even though we are a music site, we might expand our horizons (if desired) to the latest NFL news - something totally unrelated, yet very useful for some sites.

Conclusion - it's an illusion

I consider the combination of all these a very rich source for building sitemaps. You can easily generate millions of unique tags using the above techniques. Thus, "anything" you type will be found on example.com/tags.

However, you have to note that this is just an illusion. For example, if you search for "ertfghedctgb" (easily typed on a regular QWERTY keyboard - ert + fgh + edc + tgb), you will most likely not get anything from Google (I currently don't). It just wasn't common enough for anybody to put it in their sitemap (or not common enough for search engines to index it).

answered Oct 05 '22 by icyrock.com


All browsers and crawlers send a User-Agent string (exposed to server-side code as HTTP_USER_AGENT) to the web server with every request, unless the software deliberately omits it. This string identifies which browser is in use, its version, its rendering engine, and some other details. (See http://en.wikipedia.org/wiki/User_agent)

The web server can read the HTTP_USER_AGENT value and change the content it serves. For instance, it is used as part of detecting whether you are on a handheld device or a large screen, in which case you may want a different layout for the given web page.

People put a lot of money into driving traffic to their sites, especially through the large search engines like Google and Bing. SEO, which stands for Search Engine Optimization, is the practice of optimizing your content to make it easy for search engines to return relevant hits. If you have a complex site using lots of JavaScript and Ajax, you may want to serve a static page to the search engines so they can read your content.

Malicious sites sometimes serve auto-generated, SEO-optimized content to the search engines to rank high in searches, but deliver human users a simple page full of ads instead to drive revenue.
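A minimal sketch of that kind of User-Agent branching (the crawler tokens and page bodies are made up; real crawler detection is more involved, since the User-Agent string can be freely faked):

    CRAWLER_TOKENS = ("Googlebot", "bingbot")

    def render_page(environ):
        """Pick a response body based on the HTTP_USER_AGENT value."""
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(token in user_agent for token in CRAWLER_TOKENS):
            # Crawler detected: serve static, keyword-rich content.
            return "<html>SEO-optimized static content</html>"
        # Regular visitor: serve the normal page (or, on a malicious site, ads).
        return "<html>interactive page for humans</html>"

    print(render_page({"HTTP_USER_AGENT": "Mozilla/5.0 (compatible; Googlebot/2.1)"}))
    print(render_page({"HTTP_USER_AGENT": "Mozilla/5.0 (Windows NT 10.0) Firefox/118.0"}))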

This answer is provided as an alternative to the scenario where normal dynamic content, as already described by icyrock.com, is the cause of getting a different page than the one Google indicates.

answered Oct 05 '22 by jornare