For some mysterious reason, Google has indexed both these adresses, that lead to the same page:
/something/some-text-1055.html
and
/index.php?pg=something&id=1055
(short notice - the site has had friendly urls since its launch, I have no idea how google found the "index.php?" url - there are "unfriendly" urls only in the content management system, which is password-restricted)
What can I do to solve the situation? (I have around 1000 pages that are double-indexed.) Somebody told me to use "disallow: index.php?" in the robots.txt file. Right or wrong? Any other suggestions?
You'd be surprised as how pervasive and quick the google bots are at indexing site content. That, combined with lots of CMS systems creating unintended pages/links making it likely that at some point those links were exposed is the most likely culprit. It's also possible your administration area isn't as secure as you think, the google bot got through that way.
The well-behaved, and google recommended, things to do here are
If possible, create 301 redirects from you query string style URLs to your canonical style URLs. That's you saying "hey there, web bot/browser, the content that used to be at this URL is now at this other URL"
Block the query string content in your robots.txt. That's like asking the spiders or other automated programs "Hey, please don't look at this stuff. These aren't the URLs you're looking for"
Google apparently allows you to specify a canonical URL now via a <link /> tag in the top of your page. Consider adding these in.
As to whether doing the well behaved things is the the "right" thing to do re: Google rankings ... who knows. Only "Google" knows how their algorithms work now, and will work in the future, and by Google, I mean a bunch of engineers and executives with conflicting goals on how search should work.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With