I'm looking for a way to find "all" the sites ending with a given TLD. I had several ideas on how to realize this, but I'm not sure what the best/most effective approach is. I'm aware that pages that are linked nowhere can't be found by spiders etc., so for this example I won't care about isolated pages. What I want to do: I want to have a TLD as input for my program, and I wish to get a list of sites as output. For example:
# <program> .de
- spiegel.de
- deutsche-bank.de
...
- bild.de
So what is the best way to achieve this? Are there tools available to help me, or how would you program this?
This answer might be a bit late but I've just found this.
You could try using Common Crawl's awesome data.
So, what is Common Crawl?
Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.
Using their URL search tool, query for .de, then download the result as a JSON file.
You will get a nice file of results, but you will then need to do some work on it, since it includes every crawled URL of a domain (hence the full site map rather than one entry per site).
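That post-processing step can be sketched roughly like this. The snippet below assumes each line of the downloaded file is a JSON object with a `"url"` field (as in Common Crawl's CDX index output; the exact field name in your download may differ) and collapses the per-URL records down to unique domains:

```python
import json
from urllib.parse import urlsplit

def unique_domains(json_lines):
    """Collapse per-URL JSON records down to a sorted list of unique domains.

    Assumes each line is a JSON object with a "url" field; adjust the
    field name to match the actual file you downloaded.
    """
    domains = set()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        host = urlsplit(record["url"]).hostname
        if not host:
            continue
        # Naive registered-domain extraction: keep the last two labels.
        # This is fine for .de, but multi-part suffixes like .co.uk
        # would need a public suffix list instead.
        labels = host.split(".")
        domains.add(".".join(labels[-2:]))
    return sorted(domains)

# Hypothetical sample records standing in for the downloaded file:
sample = [
    '{"url": "https://www.spiegel.de/politik/"}',
    '{"url": "https://spiegel.de/"}',
    '{"url": "http://www.bild.de/news"}',
]
print(unique_domains(sample))  # ['bild.de', 'spiegel.de']
```

For a real run you would read the lines from the downloaded file instead of the inline sample.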
Another drawback is that some sites use an unwelcoming robots.txt file, so crawlers won't include them. Still, it's the best result I could find so far.