
Stop search engines from indexing specific parts of a page

I have a PHP page that renders a book of, let's say, 100 pages. Each page has a specific URL (e.g. /my-book/page-one, /my-book/page-two, etc.).

When flipping the pages, I change the URL with the History API, via url.js.
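Roughly, the page flip does something like this (a simplified sketch using the plain History API instead of url.js; the function names are just illustrative):

<script>
// Show the requested page (same .current-page convention as the markup below).
function showPage(pageNumber) {
  document.querySelector(".current-page").classList.remove("current-page");
  document.querySelector('[data-page="' + pageNumber + '"]').classList.add("current-page");
}

// Called when the user flips a page: update the DOM, then the address bar.
function flipTo(pageNumber) {
  showPage(pageNumber);
  history.pushState({ page: pageNumber }, "", "/book/page/" + pageNumber);
}

// Keep the visible page in sync when the user presses back/forward.
window.addEventListener("popstate", function (event) {
  if (event.state && event.state.page) {
    showPage(event.state.page);
  }
});
</script>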

Since all of the book's content is rendered server-side, the problem is that the content gets indexed by search engines (I'm referring especially to Google), but under the wrong URLs: for example, Google finds a snippet from page-two but links it to page-one.

How can I stop search engines (at least Google) from indexing all the content on the page, so that only the visible book page is indexed?

Would it work if I rendered the content differently, for example as <div data-page-number="1" data-content="Lorem ipsum..."></div>, and then converted that into the needed format on the JavaScript side? That would make the page slower, and I'm not sure Google won't index the JavaScript-generated content anyway.
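For illustration, that alternative would look roughly like this (a sketch only; whether it actually keeps Google from indexing the hidden text is exactly what I'm unsure about):

<div id="book">
  <div data-page-number="1" data-content="Lorem ipsum..."></div>
  <div data-page-number="2" data-content="Dolor sit amet..."></div>
</div>
<script>
// Expand only the page matching the current URL (assuming URLs like /book/page/2),
// so the HTML served to crawlers contains no visible text for the other pages.
var current = window.location.pathname.split("/").pop();
document.querySelectorAll("#book [data-page-number]").forEach(function (el) {
  if (el.getAttribute("data-page-number") === current) {
    el.textContent = el.getAttribute("data-content");
  }
});
</script>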

The current code looks like this:

<div data-page="1">Page 1</div>
<div data-page="2">Page 2</div>
<div data-page="3" class="current-page">Page 3</div>
<div data-page="4">Page 4</div>
<div data-page="5">Page 5</div>

Only the .current-page div is visible. The same content is served at multiple URLs because that's needed so the user can flip between pages.

For example, /book/page/3 renders this piece of HTML, and /book/page/4 renders the same thing; the only difference is that the current-page class is added to the 4th element.

Google did index the different URLs, but it did it wrong: for example, the snippet Page 5 links to /book/page/2, which shows the user Page 2 (not Page 5).

How can I tell Google (and other search engines) that I only want the content inside .current-page to be indexed?

Ionică Bizău asked May 06 '16


People also ask

How do I stop search engines from indexing?

You can prevent a page or other resource from appearing in Google Search by including a noindex meta tag or header in the HTTP response. When Googlebot next crawls that page and sees the tag or header, Google will drop that page entirely from Google Search results, regardless of whether other sites link to it.
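For example, the directive can go in the page's HTML head, or be sent as an HTTP response header instead:

<!-- In the <head> of the page that should be dropped from search results: -->
<meta name="robots" content="noindex">

<!-- Or, equivalently, as an HTTP response header (shown here as a comment):
     X-Robots-Tag: noindex -->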

What does it mean for search engines to index a page?

Indexing is the process by which search engines organise information before a search so that they can respond to queries extremely quickly. Without an index, searching through individual pages for keywords and topics would be far too slow for search engines to identify relevant information.

How do I restrict a search engine?

If you do not remove the tag, your page will not be indexed or searchable via search engines. To block a single outgoing link, embed a rel attribute (for example rel="nofollow") within the <a href> </a> link tag. You may wish to use this to block links on other pages that lead to the specific page you want to block.
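For example, a single outgoing link can be marked so that crawlers do not follow it:

<!-- rel="nofollow" asks search engines not to follow this particular link -->
<a href="/some-private-page" rel="nofollow">A page I don't want crawled via this link</a>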

How do I stop content from showing up in search results?

You can prevent new content from appearing in results by adding a Disallow rule for its URL slug to a robots.txt file. Search engines use these files to understand how to crawl a website's content. If search engines have already indexed your content, you can add a "noindex" meta tag to the content's head HTML.
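A minimal robots.txt sketch, assuming for illustration that the content lives under /my-book/ (note that robots.txt blocks crawling rather than indexing, so pages that are already indexed also need the "noindex" tag mentioned above):

# robots.txt at the site root
User-agent: *
Disallow: /my-book/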


2 Answers

As I understand it, the issue is that you have the same content under many URLs, like:

www.my-awesome-domain.com/my-book/page/42

www.my-awesome-domain.com/my-book/page/7

The visible content of the page is then adjusted by JavaScript, which runs when the user clicks elements on your site.

In this case you need to do two things:

  1. Mark your URLs as canonical pages in any of the ways described in this Google document: https://support.google.com/webmasters/answer/139066?hl=en
  2. Add a feature so that each page loads to the same state after a full page refresh, for example by using a hash parameter when navigating (see the sketch after this list).
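A minimal sketch of both points, assuming for illustration that the first page is chosen as the canonical URL and a #page-N hash is used to restore state (the exact URL and hash format are up to you):

<!-- 1. On every sub-page, point search engines at one canonical URL: -->
<link rel="canonical" href="https://www.my-awesome-domain.com/my-book/page/1">

<!-- 2. After a full refresh, restore the page the user was reading from the hash: -->
<script>
window.addEventListener("load", function () {
  var match = window.location.hash.match(/^#page-(\d+)$/);
  if (match) {
    // Make the page from the hash visible (same class convention as the question's markup).
    var old = document.querySelector(".current-page");
    if (old) old.classList.remove("current-page");
    var next = document.querySelector('[data-page="' + match[1] + '"]');
    if (next) next.classList.add("current-page");
  }
});
</script>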

Googlebot now executes JavaScript, as announced on the official blog: https://webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html

So if you make the page behave properly on a refresh (F5) and specify the canonical page property, the pages will be crawled correctly, and following a search result link will lead to the right page.

If you need more guidance on how to do this with url.js, please post another question (so it's properly documented for others) and I will be glad to help.

OBender answered Sep 27 '22


The answer is really simple: you can't do it. There is no technical way to serve the same content under different URLs and ask search engines to index only part of it.

If you are OK with having only one page indexed, you can use canonical URLs, as suggested above. You place a canonical URL pointing to the main page on every sub-page.

You may find a "hack" that uses special tags supported by the Google Search Appliance: googleon and googleoff.

https://www.google.com/support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/admin_crawl/preparing.html
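The tags are HTML comments that bracket the content the appliance should skip; for the markup in the question, it would look like this:

<!--googleoff: index-->
<div data-page="1">Page 1</div>
<div data-page="2">Page 2</div>
<!--googleon: index-->
<div data-page="3" class="current-page">Page 3</div>
<!--googleoff: index-->
<div data-page="4">Page 4</div>
<div data-page="5">Page 5</div>
<!--googleon: index-->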

The only issue is that this will most likely not work with Googlebot (at least no one will guarantee that it will) or with any other search engine.

Aleksander Wons answered Sep 27 '22