Is there a list of known web crawlers? [closed]

I'm trying to get accurate download numbers for some files on a web server. I look at the user agents, and some are clearly bots or web crawlers, but for many I'm not sure. They may or may not be web crawlers, and they account for many downloads, so it's important for me to know.

Is there a list somewhere of known web crawlers, with some documentation such as user agent, IPs, behavior, etc.?

I'm not interested in the official ones, like Google's, Yahoo's, or Microsoft's. Those are generally well behaved and self-identified.

213

asked Nov 14 '09 07:11

pupeno

4 Answers

I usually use http://www.user-agents.org/ as a reference; hope this helps you out.

You can also try http://www.robotstxt.org/db.html or http://www.botsvsbrowsers.com.

answered Nov 09 '22 10:11

Jaan J

I'm maintaining a list of crawler user-agent patterns at https://github.com/monperrus/crawler-user-agents/.

It's collaborative; you can contribute to it with pull requests.
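As a rough illustration of how a pattern list like this might be used to clean up download counts, here is a minimal Python sketch. The handful of patterns below are hypothetical sample entries written inline for the example, and `is_crawler` and `count_human_downloads` are made-up helper names; a real run would load the full, maintained list instead.

```python
import re
from collections import Counter

# Hypothetical sample patterns for illustration only; a real run would
# load the complete pattern list rather than hard-coding a few entries.
SAMPLE_PATTERNS = [
    r"Googlebot/",
    r"bingbot",
    r"AhrefsBot",
    r"[wW]get",
    r"curl",
]

# Combine all patterns into one regex so each user agent is scanned once.
CRAWLER_RE = re.compile("|".join(f"(?:{p})" for p in SAMPLE_PATTERNS))


def is_crawler(user_agent: str) -> bool:
    """Return True if the user agent matches any known crawler pattern."""
    return bool(CRAWLER_RE.search(user_agent))


def count_human_downloads(requests):
    """Count downloads per path, skipping requests from known crawlers.

    `requests` is an iterable of (path, user_agent) pairs, e.g. parsed
    from an access log.
    """
    counts = Counter()
    for path, user_agent in requests:
        if not is_crawler(user_agent):
            counts[path] += 1
    return counts


if __name__ == "__main__":
    log = [
        ("/files/a.zip", "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"),
        ("/files/a.zip", "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"),
        ("/files/a.zip", "Wget/1.21"),
        ("/files/b.zip", "curl/7.68.0"),
    ]
    print(count_human_downloads(log))
```

Matching on the user agent only catches crawlers that identify themselves, so the resulting counts are best treated as an upper bound on human downloads.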

answered Nov 09 '22 11:11

Martin Monperrus

http://www.robotstxt.org/db.html is a good place to start. They have an automatable raw feed if you need that too. http://www.botsvsbrowsers.com/ is also helpful.

answered Nov 09 '22 11:11

Justin Grant

Unfortunately, we've found that bot activity is too numerous and varied to filter accurately. If you want accurate download counts, your best bet is to require JavaScript to trigger the download. That's basically the only thing that is going to reliably filter out the bots. It's also why all site traffic analytics engines these days are JavaScript-based.
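One possible way to sketch the server side of this idea, in Python: the page's script asks the server for a short-lived signed token and appends it to the download URL, and the server counts a download only when the token checks out. Everything here is an assumption made for illustration (the names `SECRET_KEY`, `issue_token`, `is_valid_token`, and the five-minute lifetime); it is not how any particular analytics engine works, and a client that fetches files without running the script simply never obtains a token.

```python
import hashlib
import hmac
import time
from typing import Optional

# Hypothetical values for illustration; use a real random secret in practice.
SECRET_KEY = b"replace-with-a-random-secret"
TOKEN_LIFETIME_SECONDS = 300


def _sign(file_id: str, issued: int) -> str:
    """Compute an HMAC-SHA256 signature over the file id and issue time."""
    message = f"{file_id}:{issued}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def issue_token(file_id: str, now: Optional[float] = None) -> str:
    """Create a token that the page's script would attach to the download URL."""
    issued = int(time.time() if now is None else now)
    return f"{issued}:{_sign(file_id, issued)}"


def is_valid_token(token: str, file_id: str, now: Optional[float] = None) -> bool:
    """Check that the token is well formed, unexpired, and signed for this file."""
    try:
        issued_text, signature = token.split(":", 1)
        issued = int(issued_text)
    except ValueError:
        return False
    current = time.time() if now is None else now
    fresh = 0 <= current - issued <= TOKEN_LIFETIME_SECONDS
    expected = _sign(file_id, issued)
    # Constant-time comparison avoids leaking signature details via timing.
    return fresh and hmac.compare_digest(signature.encode(), expected.encode())


if __name__ == "__main__":
    token = issue_token("report.pdf")
    print(is_valid_token(token, "report.pdf"))
    print(is_valid_token(token, "other.pdf"))
```

A download handler would then increment its counter only when `is_valid_token` returns True, while still serving the file either way if desired.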

answered Nov 09 '22 10:11

jwanagel

Tags: list, documentation, bots, web-crawler