Linking together >100K pages without getting SEO penalized

I'm making a site that will have reviews of the privacy policies of hundreds of thousands of other sites on the internet. Its initial content comes from running a script over the CommonCrawl 5-billion-page web dump and analyzing all the privacy policies to identify certain characteristics (e.g. "Sells your personal info").

According to the SEOmoz Beginner's Guide to SEO:

Search engines tend to only crawl about 100 links on any given page. This loose restriction is necessary to keep down on spam and conserve rankings.

I was wondering what would be a smart way to create a web of navigation that leaves no page orphaned, but would still avoid this SEO penalty they speak of. I have a few ideas:

  • Create alphabetical pages (or Google Sitemap .xml files), like "Sites beginning with Ado*", which would then link to "Adobe.com", for example (see the sketch after this list). This, or any other arbitrary split of the pages, seems contrived, and I wonder whether Google might not like it.
  • Use meta keywords or descriptions to categorize the pages.
  • Find some way to apply more meaningful categories, such as geographical or content-based ones. My concern here is that I'm not sure how I could apply such categories consistently across so many sites. If need be, I suppose I could write another classifier to analyze the content of the pages from the crawl, though that sounds like a big job in itself.
  • Use the DMOZ project to help categorize the pages.
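
The alphabetical-bucket idea from the first bullet is easy to prototype. Here is a minimal sketch in Python, assuming you already have the list of reviewed domains; the prefix length and names are just illustrative:

    from collections import defaultdict

    def bucket_domains(domains, prefix_len=3):
        """Group domains into alphabetical buckets, e.g. 'ado' -> ['adobe.com', ...]."""
        buckets = defaultdict(list)
        for domain in sorted(domains):
            buckets[domain[:prefix_len].lower()].append(domain)
        return buckets

    # Each bucket becomes one index page ("Sites beginning with Ado*") that
    # links to the individual review pages, so no review page is orphaned.
    for prefix, sites in bucket_domains(["adobe.com", "adorama.com", "amazon.com"]).items():
        print(prefix, sites)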

Wikipedia and StackOverflow have obviously solved this problem very well by allowing users to categorize or tag all of the pages. In my case I don't have that luxury, but I want to find the best option available.

At the core of this question is how Google responds to different navigation structures. Does it penalize those who create a web of pages in a programmatic/meaningless way? Or does it not care so long as everything is connected via links?

asked May 07 '12 by bgcode
1 Answer

Google does not penalize you for having more than 100 links on a page, but each link beyond a certain threshold passes less value/importance in the PageRank algorithm.

Quoting SEOmoz and Matt Cutts:

Could You Be Penalized?

Before we dig in too deep, I want to make it clear that the 100-link limit has never been a penalty situation. In an August 2007 interview, Rand quotes Matt Cutts as saying:

The "keep the number of links to under 100" is in the technical guideline section, not the quality guidelines section. That means we're not going to remove a page if you have 101 or 102 links on the page. Think of this more as a rule of thumb.

At the time, it's likely that Google started ignoring links after a certain point, but at worst this kept those post-100 links from passing PageRank. The page itself wasn't going to be de-indexed or penalized.

So the real question is how to get Google to take all of your links seriously. You accomplish this by generating an XML sitemap for Google to crawl (you can either have a static sitemap.xml file or generate its content dynamically). You will want to read the About Sitemaps section of the Google Webmaster Tools help documents.
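
A sitemap file itself is simple XML. Here is a minimal sketch of generating one dynamically in Python; the review-page URL pattern is hypothetical:

    from xml.sax.saxutils import escape

    def build_sitemap(urls):
        """Render a minimal sitemap.xml body for the given absolute URLs."""
        entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
        return (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n"
            "</urlset>\n"
        )

    # Hypothetical URL scheme for the privacy-policy review pages.
    print(build_sitemap(f"https://example.com/review/{d}" for d in ("adobe.com", "amazon.com")))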

Just as having too many links on a page is an issue, having too many links in an XML sitemap file is also an issue: a single sitemap is limited to 50,000 URLs, so you need to paginate your sitemaps and tie them together with a sitemap index file. Jeff Atwood describes how StackOverflow implements this in The Importance of Sitemaps, and he discusses the same issue on StackOverflow podcast #24.
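
To give a sense of the pagination, here is a sketch of a sitemap index that ties the paginated files together, assuming the individual files are named sitemap-1.xml, sitemap-2.xml, and so on (the naming scheme is up to you):

    def build_sitemap_index(base_url, total_urls, per_file=50000):
        """Render a sitemap index pointing at paginated sitemap files.

        A single sitemap is capped at 50,000 URLs, so a site with >100K
        pages needs several files listed in an index like this one.
        """
        pages = (total_urls + per_file - 1) // per_file
        entries = "\n".join(
            f"  <sitemap><loc>{base_url}/sitemap-{i}.xml</loc></sitemap>"
            for i in range(1, pages + 1)
        )
        return (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n"
            "</sitemapindex>\n"
        )

    print(build_sitemap_index("https://example.com", 120000))

You then submit the index file, rather than each individual sitemap, in Webmaster Tools.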

This concept applies to Bing as well.

answered by Jason