Block a site from search engine - DuckDuckGo

Tags: web-crawler, robots.txt, robot, duckduckgo

I have a development site, https://text-domain.com (not a real site). When I go to https://duckduckgo.com and search for text-domain.com, it returns results.

What I have tried so far:

Created a robots.txt file with the following content (placed in my root directory, i.e. at text-domain.com/robots.txt):

User-agent: *
Disallow: /

Then added a meta tag like this to my template file:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

Even after doing this, I searched on DuckDuckGo and it yielded the same result. Any suggestions would be welcome.

P.S

After waiting for a few days, there are two findings:

The search results are still returned.
But I see a message for that result saying: "We would like to show you a description here but the site won't allow us."

Is it possible to completely block the site from showing in the results?

asked Aug 06 '13 12:08

Vimalnath

2 Answers

DuckDuckGo is an odd duck when it comes to inclusion in their results. I've done a fair bit of research on this topic across a number of search engines and have had some email back and forth with DDG.

Here's the deal. They get their content from other search engines, as listed here. To my knowledge, their search results don't indicate which search engine a given result came from, so to get your content removed you basically need to go upstream to all of their sources and get it removed there. If that sounds onerous, don't worry: you'd want to do that anyway, right?

DDG does have its own crawler as well, aptly called DuckDuckBot. It honors robots.txt, but not the noindex HTML meta tag or the equivalent HTTP header. That doesn't seem to matter, though, because no new results are created by DuckDuckBot. To my knowledge this isn't documented anywhere, but I spoke with their staff, whom I quote below:

DDG says (2014-06-06):

We get our results from multiple sources and our own crawler wouldn't be the cause of your [problem]. Our crawler only does very specific tasks, like looking (and not actually crawling) parked domains, spam sites, etc.

If there are results from [your website] appearing on DuckDuckGo and shouldn't be, they're likely flowing from one of our upstream sources. If removed there, then they'll stop showing in our results.

I respond:

OK, so nothing gets put in your index via your crawlers, which indeed do not support noindex HTML or HTTP tags?

They confirm:

Yep! Sorry for the confusion and, if you see anything out of the ordinary, please feel free to let us know.

So the only remaining question is how to remove your content from the upstream providers. For that, I point you to my blog, since it differs by provider. The crux of it is:

Use the noindex HTML meta tag and the X-Robots-Tag HTTP header (for images and such) to tell search engines not to include something in their results.
List your entire website in your sitemap.xml file so that all search engines can find it there.
Use robots.txt to block the search engines that do not support the noindex meta tag or the X-Robots-Tag header.

And for bonus points:

Set up your sitemap.xml files so that they are served with noindex (and thus won't show up in search results).
Do likewise for your robots.txt file.
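As a rough illustration of the X-Robots-Tag part, here is a minimal sketch that assumes the site is served by a Python WSGI application. The names noindex_middleware, demo_app and response_headers are hypothetical and exist only for this example; on other stacks the same header would normally be set in the web server configuration instead.

```python
from wsgiref.util import setup_testing_defaults


def noindex_middleware(app):
    """Wrap a WSGI app so every response carries an X-Robots-Tag header."""

    def wrapped(environ, start_response):
        def start_with_noindex(status, headers, exc_info=None):
            # The header applies to any file type, so it also covers
            # images, sitemap.xml and robots.txt.
            headers = list(headers) + [("X-Robots-Tag", "noindex, nofollow")]
            return start_response(status, headers, exc_info)

        return app(environ, start_with_noindex)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real site: serves a plain-text body."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"User-agent: *\nDisallow: /\n"]


def response_headers(app, path="/"):
    """Call a WSGI app once, without a network socket, and return its headers."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["headers"] = headers

    b"".join(app(environ, start_response))  # consume the response body
    return captured["headers"]


print(response_headers(noindex_middleware(demo_app), "/robots.txt"))
```

Keep in mind that a crawler can only see this header (or the meta tag) on URLs it is allowed to fetch; Google's documentation, for example, notes that a page blocked in robots.txt never has its noindex rule read.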

It's a complicated world.

answered Sep 23 '22 08:09

mlissner

DuckDuckGo should honour your robots.txt. Their bot DuckDuckBot is documented at https://duckduckgo.com/duckduckbot.

But note: the DuckDuckGo bot isn’t crawling everything itself (as DuckDuckGo gets results from other sources), so your pages might still show up if you don’t block the bots of these other sources (like Bing). Refer to mlissner’s answer for more details.

With robots.txt, there are two things to consider:

It takes time until changes in your robots.txt are recognized. You have to wait until the relevant bot visits your site again.
Even if your URLs are blocked in the robots.txt, search engines may still list your URLs in their search results (without crawled metadata like title and description).
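To see how a given crawler would read the rules from the question, a minimal sketch using Python's standard urllib.robotparser might look like this. The bingbot token and the page path are assumptions for illustration, and the check only tells you what the rules say, not whether a bot has re-fetched them yet or whether a search engine still lists the bare URL.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `robots_txt` lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "bingbot" and the page path are illustrative assumptions.
for bot in ("DuckDuckBot", "bingbot"):
    print(bot, is_allowed(bot, "https://text-domain.com/some-page.html"))
```

With the rules above, both checks report False, i.e. neither bot is permitted to crawl the page.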

Using the robots-meta element with noindex would prevent even listing the URLs in search engines like Google, but DDG doesn’t seem to support it.

Note that you used the wrong quotation marks in your example (typographic quotes instead of straight ones). It should be

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

instead of

<META NAME=”ROBOTS” CONTENT=”NOINDEX, NOFOLLOW”>
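To illustrate why the quotation marks matter, here is a small sketch that uses Python's standard html.parser as a stand-in for a crawler's HTML parser. The class and function names are made up for this example, and real crawlers may tokenize malformed markup differently, so treat it as a demonstration rather than a statement about any particular search engine.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # HTMLParser lower-cases attribute names but not their values.
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives += [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]


def robots_directives(html):
    """Return the robots directives found in `html`, lower-cased."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


straight = '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
curly = "<META NAME=\u201dROBOTS\u201d CONTENT=\u201dNOINDEX, NOFOLLOW\u201d>"

print(robots_directives(straight))
print(robots_directives(curly))
```

With straight quotes the parser finds both directives. With typographic quotes the quote characters become part of the attribute value, so no tag named "robots" is recognized and the result is empty.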

answered Sep 23 '22 08:09

unor
