Is there a list of known web crawlers? [closed]

I'm trying to get accurate download numbers for some files on a web server. Looking at the user agents, some are clearly bots or web crawlers, but for many others I can't tell either way. They account for a large share of the downloads, so it's important for me to know.

Is there a list of known web crawlers somewhere, with documentation such as user agents, IPs, behavior, etc.?

I'm not interested in the official ones, like Google's, Yahoo's, or Microsoft's. Those are generally well behaved and self-identified.

pupeno asked Nov 14 '09 07:11



4 Answers

I usually use http://www.user-agents.org/ as a reference; hope this helps you out.

You can also try http://www.robotstxt.org/db.html or http://www.botsvsbrowsers.com.

Jaan J answered Nov 09 '22 10:11


I'm maintaining a list of crawler user-agent patterns at https://github.com/monperrus/crawler-user-agents/.

It's collaborative; you can contribute to it with pull requests.
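
As a rough illustration of how a list of user-agent patterns could be applied to download counting, here is a minimal Python sketch. The patterns in `SAMPLE_PATTERNS` are a small hand-written sample chosen for this example, not the contents of the repository above, and the helper names `is_known_crawler` and `count_human_downloads` are hypothetical. It assumes each pattern is a regular expression tested against the raw user-agent string.

```python
import re

# Illustrative sample only (an assumption for this sketch); a real deployment
# would load patterns from a maintained list instead of hard-coding them.
SAMPLE_PATTERNS = [
    r"Googlebot\/",
    r"bingbot",
    r"[wW]get",
    r"curl",
    r"python-requests",
]

# Join the patterns into one alternation so each user agent is scanned once.
CRAWLER_RE = re.compile("|".join(f"(?:{p})" for p in SAMPLE_PATTERNS))


def is_known_crawler(user_agent):
    """Return True if the user agent matches any pattern in the list."""
    return CRAWLER_RE.search(user_agent or "") is not None


def count_human_downloads(user_agents):
    """Count requests whose user agent matches no known crawler pattern."""
    return sum(1 for ua in user_agents if not is_known_crawler(ua))
```

A pattern list like this only catches crawlers that identify themselves, so anything it leaves unmatched is "not a known crawler" rather than confirmed human traffic.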

Martin Monperrus answered Nov 09 '22 11:11


http://www.robotstxt.org/db.html is a good place to start. They have an automatable raw feed if you need that too. http://www.botsvsbrowsers.com/ is also helpful.

Justin Grant answered Nov 09 '22 11:11


Unfortunately, we've found that bot activity is too numerous and varied to filter accurately. If you want accurate download counts, your best bet is to require JavaScript to trigger the download. That's basically the only thing that is going to reliably filter out the bots. It's also why all site traffic analytics engines these days are JavaScript-based.
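
One possible way to implement the server side of this idea is sketched below in Python. It is an assumption, not something described in this answer: a script on the download page requests a short-lived signed token and appends it to the download URL, and the server counts a download only when the token verifies. `SECRET_KEY`, `MAX_TOKEN_AGE`, and the function names are all hypothetical.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"change-me"  # hypothetical server-side secret
MAX_TOKEN_AGE = 300        # seconds a token stays valid (assumed value)

download_counts = {}       # file_id -> number of counted downloads


def _sign(file_id, issued_at):
    """Sign the file id together with the issue time."""
    message = f"{file_id}:{issued_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def issue_token(file_id, now=None):
    """Token handed to the page's script, which appends it to the download URL."""
    issued_at = int(time.time() if now is None else now)
    return f"{issued_at}.{_sign(file_id, issued_at)}"


def is_valid_token(token, file_id, now=None):
    """Check that the token is well formed, unexpired, and signed for this file."""
    current = int(time.time() if now is None else now)
    try:
        issued_str, signature = token.split(".", 1)
        issued_at = int(issued_str)
    except (AttributeError, ValueError):
        return False
    if not 0 <= current - issued_at <= MAX_TOKEN_AGE:
        return False
    expected = _sign(file_id, issued_at)
    return hmac.compare_digest(signature.encode(), expected.encode())


def record_download(file_id, token, now=None):
    """Count the download only when it carries a valid token."""
    if not is_valid_token(token, file_id, now):
        return False
    download_counts[file_id] = download_counts.get(file_id, 0) + 1
    return True
```

Requests that fetch the file URL directly, without running the page's script, arrive with no valid token and are left out of the count.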

jwanagel answered Nov 09 '22 10:11