 

Do bots/spiders clone public git repositories?

I host a few public repositories on GitHub which occasionally receive clones according to traffic graphs. While I'd like to believe that many people are finding my code and downloading it, the nature of the code in some of them makes me suspect that most of these clones are coming from bots or search engine crawlers/spiders. I know myself that if I find a git repository via a search engine, I usually look at the code with my browser and decide if it's useful or not before cloning it.

Does anyone know if cloning git repositories is a standard technique for search engine crawlers, or if my code is just more popular than I think?

asked Nov 12 '16 by Sean

People also ask

Can you clone a public repository?

You can create a complete local copy of a Git repository from a public project by cloning it. Cloning a repo downloads all commits and branches in the repo and sets up a named relationship with the existing repo you cloned.

Can anyone clone a public repo GitHub?

We can clone or fork any public GitHub repo. This creates a local copy of the repo and its files. We can edit these files and push the changes back to GitHub. If you aren't the owner of the repo, you will need to open a 'pull request'.

Do you need a GitHub account to clone a public repo?

You can clone a public repo without a GitHub account, but only read-only via the HTTP method. A plain "no" would wrongly suggest that you can't git clone GitHub repos without having a user account and logging in.

Can private repos be cloned?

You also have the option to clone a private GitHub repository using SSH. To do this, you need to start by generating an SSH keypair on your local device. Then add a public key to your GitHub account.


1 Answer

The "Clone or download" button present in the Github page of a repository provides the URL of the repository. If you use that URL with a web browser you get the HTML page you can see in the browser. The same page is received by a web spider too.

However, if you give that URL to a Git client, it is able to operate on the repository (clone it, pull, push). This is because the Git client uses one of Git's two transfer protocols built on top of HTTP.

To use these protocols, the Git client builds URLs from the base URL of the repository and submits HTTP requests to them.

For example, if the Git URL is https://github.com/axiac/code-golf.git, a Git client tries one of the following two requests in order to find more information about the internal structure of the repository:

GET https://github.com/axiac/code-golf.git/info/refs HTTP/1.0

GET https://github.com/axiac/code-golf.git/info/refs?service=git-upload-pack HTTP/1.0

The first one is called the "dumb" protocol (no longer supported by GitHub), the second one the "smart" protocol. The "dumb" protocol works with plain text messages, the "smart" one with length-prefixed binary blocks (pkt-lines) and custom HTTP headers.
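To make this concrete, here is a minimal Python sketch (standard library only, assuming GitHub's current smart HTTP behavior) that issues the discovery request by hand and prints what a Git-aware server advertises:

import urllib.request

# Probe the smart-protocol discovery endpoint the way a Git client
# does for its first request. The repo URL is the example from above.
repo = "https://github.com/axiac/code-golf.git"
req = urllib.request.Request(
    repo + "/info/refs?service=git-upload-pack",
    headers={"User-Agent": "git/2.40.0"},  # assumed; GitHub also answers other agents
)
with urllib.request.urlopen(req) as resp:
    # A smart server answers with a dedicated content type
    # (application/x-git-upload-pack-advertisement)...
    print(resp.headers.get("Content-Type"))
    # ...and a pkt-line payload beginning "001e# service=git-upload-pack".
    print(resp.read(64))

A browser or generic crawler never constructs this URL on its own; it only follows links it finds in HTML.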

In order to operate on a Git repository, the Git client must parse the responses received from the server and use that information to create and submit the correct requests for the actions it intends to perform.
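What "parse the responses" means in practice is decoding the pkt-line framing the smart protocol uses. A sketch (protocol v0/v1; v2's delim-pkt is ignored here, and well-formed input is assumed):

def parse_pkt_lines(data: bytes) -> list[bytes]:
    # Each packet starts with a 4-hex-digit length that counts the
    # 4 header bytes themselves; "0000" is a flush-pkt (delimiter).
    lines, pos = [], 0
    while pos + 4 <= len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:      # flush-pkt: skip, it carries no payload
            pos += 4
            continue
        lines.append(data[pos + 4:pos + length])
        pos += length
    return lines

Fed the body from the previous sketch, this yields the service announcement followed by the advertised refs, which the client then uses to compose its actual fetch request.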

A browser is not able to operate on a Git repository because it doesn't speak these protocols. An all-purpose web crawler works, more or less, like a browser: it usually doesn't care much about styles, scripts, or the correctness of the HTML, but as far as HTTP is concerned it is very similar to a browser.

In order to clone your repo, a web crawler must be specifically programmed to understand the Git transport protocols. Or (better) it can run an external git clone command when it finds a URL that it thinks points to a Git repository, as sketched below. In both situations, the crawler must be programmed with this purpose in mind: to clone Git repositories.
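A sketch of the second route, shelling out to an external git; the function name and the shallow-clone flag are illustrative choices, not anything a real crawler is known to use:

import subprocess
import tempfile

def try_clone(url: str):
    # Clone a URL suspected to be a Git repository into a scratch dir.
    workdir = tempfile.mkdtemp(prefix="crawler-clone-")
    # --depth 1 keeps the clone shallow; a crawler rarely needs history.
    result = subprocess.run(
        ["git", "clone", "--depth", "1", url, workdir],
        capture_output=True,
        text=True,
    )
    return workdir if result.returncode == 0 else None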

All in all, there is no way a web crawler (or a user with a web browser) can clone a Git repository by mistake.

A web crawler does not even need to clone Git repositories from GitHub or from other web servers that serve them. It can get every version of every file in the repository by following the links the web server (GitHub's or another's) provides.
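On GitHub that works through raw file URLs, for example (the branch and file name here are assumptions; the pattern is raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>):

import urllib.request

# Fetch one version of one file with a plain HTTP GET, no Git protocol.
url = "https://raw.githubusercontent.com/axiac/code-golf/master/README.md"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8", errors="replace")[:200])

Substituting a commit hash for the branch name reaches any historical version of the file the same way.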

answered Sep 21 '22 by axiac