Finding "all" domains of a country

I'm searching for a way to find "all" the sites ending with a given TLD. I had several ideas on how to do this, but I'm not sure what the best/most effective way is. I'm aware that pages that are linked nowhere can't be found by spiders etc., so for this example I won't care about isolated pages. I want to have a TLD as input for my program and a list of sites as output. For example:

# <program> .de
- spiegel.de
- deutsche-bank.de
...
- bild.de

So what is the best way to achieve this? Are there tools available to help me, or how would you program this?

asked Aug 23 '12 at 18:08 by user1620678


1 Answer

This answer might be a bit late but I've just found this.

You could try using Common Crawl's awesome data.

So, what is Common Crawl?

Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.

Using their URL search tool, query for .de, then download the results as a JSON file.

You will get a nice file of results, but you will need to do some work on it, since it lists all the crawled pages of each domain rather than one entry per domain (it is crawl data, after all).

Another drawback is that some sites use a restrictive robots.txt file, so crawlers won't include them. Still, it's the best result I could find so far.
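
The clean-up step could look roughly like the sketch below. It assumes the downloaded file has one JSON object per line and that each object carries the page address in a "url" field; check your own download, since that layout is an assumption here. The function name domains_from_results is made up for this example, and reducing a host to its last two labels is a naive shortcut that works for .de but not for suffixes like .co.uk.

```python
import json
from urllib.parse import urlsplit


def domains_from_results(lines, tld):
    """Collect the unique domains ending in `tld` from JSON-lines records.

    Assumes each line is a JSON object with a "url" field (an assumption
    about the downloaded results file, not a guaranteed format).
    """
    suffix = "." + tld.lstrip(".").lower()
    domains = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        host = (urlsplit(str(record.get("url", ""))).hostname or "").lower()
        if not host.endswith(suffix):
            continue
        # Naive reduction: keep only the last two labels (www.spiegel.de -> spiegel.de).
        domains.add(".".join(host.split(".")[-2:]))
    return sorted(domains)
```

With a results file saved as, say, results.json (a placeholder name), you could call domains_from_results(open("results.json", encoding="utf-8"), ".de") and print each entry to get a list like the one in the question.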

answered Jan 01 '23 at 13:01 by Nimir