 

Do bots/spiders clone public git repositories?

I host a few public repositories on GitHub which occasionally receive clones according to traffic graphs. While I'd like to believe that many people are finding my code and downloading it, the nature of the code in some of them makes me suspect that most of these clones are coming from bots or search engine crawlers/spiders. I know myself that if I find a git repository via a search engine, I usually look at the code with my browser and decide if it's useful or not before cloning it.

Does anyone know if cloning git repositories is a standard technique for search engine crawlers, or if my code is just more popular than I think?

asked Nov 12 '16 by Sean

People also ask

Can you clone a public repository?

You can create a complete local copy of a Git repository from a public project by cloning it. Cloning a repo downloads all commits and branches in the repo and sets up a named relationship with the existing repo you cloned.

Can anyone clone a public repo GitHub?

We can clone or fork any public GitHub repo. This creates a local copy of the repo and its files. We can edit these files and push the changes back to GitHub. If you aren't the owner of the repo, you will need to open a 'pull request'.

Do you need a GitHub account to clone a public repo?

You can clone a public repo without a GitHub account, but only read-only via the HTTP method. A plain "no" would wrongly suggest that you can't git clone GitHub repos without having a user account and logging in.

Can private repos be cloned?

You also have the option to clone a private GitHub repository using SSH. To do this, you need to start by generating an SSH keypair on your local device. Then add a public key to your GitHub account.


1 Answer

The "Clone or download" button present in the Github page of a repository provides the URL of the repository. If you use that URL with a web browser you get the HTML page you can see in the browser. The same page is received by a web spider too.

However, if you give that URL to a Git client, it is able to operate on the repository (clone it, pull, push). This is because the Git client uses one of Git's two transfer protocols built on top of HTTP.

To use these protocols, the Git client builds URLs from the base URL of the repository and submits HTTP requests to them.

For example, if the Git URL is https://github.com/axiac/code-golf.git, a Git client tries one of the following two requests in order to find more information about the internal structure of the repository:

GET https://github.com/axiac/code-golf.git/info/refs HTTP/1.0

GET https://github.com/axiac/code-golf.git/info/refs?service=git-upload-pack HTTP/1.0

The first one is called the "dumb" protocol (no longer supported by GitHub), the second one the "smart" protocol. The "dumb" protocol works with plain text messages, the "smart" one with length-prefixed binary blocks (pkt-lines) and custom HTTP headers.
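To make this concrete, here is a minimal Python sketch (standard library only, assuming GitHub's current smart HTTP behavior) that issues the discovery request by hand and prints what a Git-aware server advertises:

import urllib.request

# Probe the smart-protocol discovery endpoint the way a Git client
# does for its first request. The repo URL is the example from above.
repo = "https://github.com/axiac/code-golf.git"
req = urllib.request.Request(
    repo + "/info/refs?service=git-upload-pack",
    headers={"User-Agent": "git/2.40.0"},  # assumed; GitHub also answers other agents
)
with urllib.request.urlopen(req) as resp:
    # A smart server answers with a dedicated content type
    # (application/x-git-upload-pack-advertisement)...
    print(resp.headers.get("Content-Type"))
    # ...and a pkt-line payload beginning "001e# service=git-upload-pack".
    print(resp.read(64))

A browser or generic crawler never constructs this URL on its own; it only follows links it finds in HTML.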

In order to operate on a Git repository, the Git client must parse the responses received from the server and use that information to create and submit the correct requests for the actions it intends to perform.
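What "parse the responses" means in practice is decoding the pkt-line framing the smart protocol uses. A sketch (protocol v0/v1; v2's delim-pkt is ignored here, and well-formed input is assumed):

def parse_pkt_lines(data: bytes) -> list[bytes]:
    # Each packet starts with a 4-hex-digit length that counts the
    # 4 header bytes themselves; "0000" is a flush-pkt (delimiter).
    lines, pos = [], 0
    while pos + 4 <= len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:      # flush-pkt: skip, it carries no payload
            pos += 4
            continue
        lines.append(data[pos + 4:pos + length])
        pos += length
    return lines

Fed the body from the previous sketch, this yields the service announcement followed by the advertised refs, which the client then uses to compose its actual fetch request.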

A browser is not able to operate on a Git repository because it doesn't speak these protocols. An all-purpose web crawler works, more or less, like a browser: it usually doesn't care much about styles, scripts, or the correctness of the HTML, but as far as HTTP is concerned it is very similar to a browser.

In order to clone your repo, a web crawler must be specifically programmed to understand the Git transport protocols. Or (better) it can run an external git clone command when it finds a URL that it thinks points to a Git repository, as sketched below. In both situations, the crawler must be programmed with this purpose in mind: to clone Git repositories.
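A sketch of the second route, shelling out to an external git; the function name and the shallow-clone flag are illustrative choices, not anything a real crawler is known to use:

import subprocess
import tempfile

def try_clone(url: str):
    # Clone a URL suspected to be a Git repository into a scratch dir.
    workdir = tempfile.mkdtemp(prefix="crawler-clone-")
    # --depth 1 keeps the clone shallow; a crawler rarely needs history.
    result = subprocess.run(
        ["git", "clone", "--depth", "1", url, workdir],
        capture_output=True,
        text=True,
    )
    return workdir if result.returncode == 0 else None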

All in all, there is no way a web crawler (or a user with a web browser) can clone a Git repository by mistake.

A web crawler does not even need to clone Git repositories from GitHub or from other web servers that serve them. It can get every version of every file in the repository by following the links the web server (GitHub's or another's) provides.
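On GitHub that works through raw file URLs, for example (the branch and file name here are assumptions; the pattern is raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>):

import urllib.request

# Fetch one version of one file with a plain HTTP GET, no Git protocol.
url = "https://raw.githubusercontent.com/axiac/code-golf/master/README.md"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8", errors="replace")[:200])

Substituting a commit hash for the branch name reaches any historical version of the file the same way.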

answered Sep 21 '22 by axiac