
Set up a GitHub mirror repository without duplicating search results

When I search for a file from my repository, I get a random mirror as the first result, while the official location (the old URL 301-redirects to it) and even the official GitHub mirror do not appear in the search results.

I know GitHub used to help with mirroring but I'm not sure they still do. Did we do something wrong with our repository browser, or with the mirror?

Does it matter that the official GitHub mirror doesn't have a "master" branch, and should the other mirror rename its master branch? Can we do more to "Syndicate carefully"? Our GitHub mirror links back to the official mirror, but only indirectly and only from the main repository page.

Nemo asked Aug 18 '16 13:08



2 Answers

This is an issue with Search Engine Optimisation.

That random copy of your repository comes out on top when you search for an arbitrary file because it has better ranking metrics than your main repository does. You need to gain more backlinks and visibility, not just for the main repository's page but for the individual files as well.

When searching for operations-puppet, I do indeed get the Wikimedia GitHub repository. The separate site you've set up (mediawiki.org) will need more backlinks and other ranking signals in order to increase its visibility. GitHub is simply a far more authoritative site.

If GitHub won't assist with canonical linking, then you'll have to gather backlinks and attention via other methods.
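For the pages you do control (your own repository browser), canonical linking can be sketched roughly as below. This is only an illustration: the base URL and the function names are hypothetical, not taken from the question. It cannot be applied to GitHub's pages or to third-party mirrors, and search engines treat rel="canonical" as a hint rather than a directive.

```python
from html import escape
from typing import Tuple
from urllib.parse import quote

# Hypothetical base URL of the official repository browser (an assumption,
# not the real address from the question).
OFFICIAL_BASE = "https://code.example.org/repo/operations-puppet"


def canonical_url(file_path: str, base: str = OFFICIAL_BASE) -> str:
    """Return the single official URL for a file in the repository."""
    # Drop empty and "." segments so equivalent paths map to one URL.
    parts = [p for p in file_path.split("/") if p and p != "."]
    return base.rstrip("/") + "/" + "/".join(quote(p) for p in parts)


def canonical_link_tag(file_path: str) -> str:
    """HTML element to place in the <head> of the official file page."""
    href = escape(canonical_url(file_path), quote=True)
    return '<link rel="canonical" href="%s">' % href


def canonical_link_header(file_path: str) -> Tuple[str, str]:
    """Equivalent HTTP response header, usable for raw (non-HTML) views."""
    return ("Link", '<%s>; rel="canonical"' % canonical_url(file_path))


if __name__ == "__main__":
    print(canonical_link_tag("modules//base/manifests/init.pp"))
    print(canonical_link_header("modules/base/manifests/init.pp"))
```

Emitting the same canonical URL from every variant of a file page (old paths, alternate branches, raw views) at least keeps your own site from competing with itself. The mirrors you don't control still have to be outranked the usual way.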

L Martin answered Oct 19 '22 22:10



I respectfully believe this is an expectations issue. You say that you want to "syndicate carefully", but open-source software is basically the antithesis of that: it allows anyone to syndicate your code anywhere, outside of your control, restricted only by the terms of the OSS license.

When you search for something on Google, they return what they believe to be the most authoritative, relevant result for your query, not necessarily the original source of it. Google isn't smart enough yet to know for sure what the "official" or "original" source of a piece of information is, short of using a lot of educated guesses (first-seen date, backlinks, site authority), which can sometimes result in the wrong answer.

Even if Google knew which repository or webpage was the "official" source for the info, it might have reasons to link to an alternate source that the algorithm perceives as more "usable" or "fresh" (e.g. a recently updated repo rather than an abandoned repo, a repo with fewer backlinks, a read-only archive, a repo on a less popular repo-hosting site, etc.).

If this were proprietary code, the solution would be to file DMCA takedown notices against the unofficial copies of your code, either at the source or with Google. But since this code's license presumably allows it to be copied freely, you have no control over syndication, and what Google perceives as the most useful result may not be the official source.

"Did we do something wrong with our repository browser, or with the mirror?"

There's no reason to believe that, afaik. This rankings issue is a classic foray into the strange world of SEO.

My advice is to not worry too much about where searches of random files in your project take you. Your GitHub mirror is already the top result for "wikimedia puppet", which is what I'd expect most users to search first if they needed to look at the up-to-date version of any files in your repo.

Maximillian Laumeister answered Oct 19 '22 22:10
