
How does "DHT search engine" work?

I'm interested in BTDigg (btdigg.org), which is described as a "DHT search engine". According to this article, it doesn't store any content and doesn't even have a database. Then how does it work? Doesn't it need to gather metadata and store it in a database like other search engines? After a user submits a query, does it scan the DHT network and return the results in "real time"? Is this possible?

asked Jan 30 '13 11:01 by user2025043



2 Answers

I don't have specific insight into BTDigg, but I believe the claim that there is no database (or something that acts like one) is false. The author of that article may have been referring to something more specific that you would find on a traditional torrent site, such as stored .torrent files.

This is how a BTDigg-like site works:

  1. Run a bunch of DHT nodes specifically for the purpose of "eavesdropping" on DHT traffic, so that you are introduced to the info-hashes people are talking about.
  2. Join those swarms and download the metadata (the info dictionary of the .torrent file) using the ut_metadata extension.
  3. Index the information you find in there and map it to the info-hash.
  4. Provide a front-end for that index.
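
As a rough illustration of steps 1, 3 and 4 above, here is a minimal Python sketch. It is not BTDigg's code. `bdecode`, `harvested_info_hash` and `TorrentIndex` are hypothetical names. It assumes the packets are bencoded KRPC queries as used by the mainline DHT, and that step 2 (the ut_metadata download, not shown) has already produced a decoded info dictionary with the standard `name` and `files` keys.

```python
import re
from collections import defaultdict


def bdecode(data: bytes, i: int = 0):
    """Decode one bencoded value at offset i; return (value, next_offset)."""
    c = data[i:i + 1]
    if c == b"i":  # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":  # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":  # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = bdecode(data, i)
            result[key], i = bdecode(data, i)
        return result, i + 1
    colon = data.index(b":", i)  # byte string: <length>:<bytes>
    start = colon + 1
    length = int(data[i:colon])
    return data[start:start + length], start + length


def harvested_info_hash(packet: bytes):
    """Step 1: return the info-hash of a get_peers/announce_peer query, else None."""
    try:
        msg, _ = bdecode(packet)
    except (ValueError, IndexError, TypeError):
        return None  # not valid bencoding
    if not isinstance(msg, dict) or msg.get(b"y") != b"q":
        return None  # only incoming queries are of interest
    if msg.get(b"q") not in (b"get_peers", b"announce_peer"):
        return None
    args = msg.get(b"a")
    info_hash = args.get(b"info_hash") if isinstance(args, dict) else None
    if isinstance(info_hash, bytes) and len(info_hash) == 20:
        return info_hash
    return None


class TorrentIndex:
    """Steps 3 and 4: an in-memory inverted index from words to info-hashes."""

    def __init__(self):
        self.by_word = defaultdict(set)
        self.names = {}

    def add(self, info_hash: bytes, info: dict) -> None:
        name = info[b"name"].decode("utf-8", "replace")
        self.names[info_hash] = name
        texts = [name]
        for entry in info.get(b"files", []):
            texts.extend(p.decode("utf-8", "replace") for p in entry[b"path"])
        for text in texts:
            for word in re.findall(r"\w+", text.lower()):
                self.by_word[word].add(info_hash)

    def search(self, query: str):
        """Return the names of torrents that contain every word of the query."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return []
        hits = set.intersection(*(self.by_word.get(w, set()) for w in words))
        return sorted(self.names[h] for h in hits)
```

A real service would replace the in-memory dictionaries with a persistent store, which is exactly the "something that acts like a database" mentioned above.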

If you want to go a step further, you can also periodically scrape the info-hashes you know about to gather stats over time, and perhaps work out when swarms have died out and should be removed from the index.
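
The bookkeeping for that could look something like the sketch below. The scrape itself is not shown. `SwarmStats` is a hypothetical helper, and treating a swarm as dead after a fixed number of consecutive scrapes with zero peers is my own assumption, not a rule from any specification.

```python
from collections import defaultdict, deque


class SwarmStats:
    """Keep the last few scrape results per info-hash and flag dead swarms."""

    def __init__(self, dead_after: int = 3):
        self.dead_after = dead_after
        # Each info-hash keeps only its most recent `dead_after` peer counts.
        self.history = defaultdict(lambda: deque(maxlen=dead_after))

    def record(self, info_hash: bytes, peers: int) -> None:
        """Store the peer count observed by one scrape."""
        self.history[info_hash].append(peers)

    def dead_swarms(self):
        """Info-hashes whose last `dead_after` scrapes all found zero peers."""
        return [
            h for h, counts in self.history.items()
            if len(counts) == self.dead_after and not any(counts)
        ]
```

The info-hashes returned by `dead_swarms()` are the candidates for removal from the index.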

So the claim that such a site stores neither .torrent files nor any content can be true, even though it does keep an index of the metadata.

It is not realistic to search the DHT in real time, because the DHT is not organized around keyword searches. You need to build and maintain the index continuously, "in the background".
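
To see why keyword search does not fit, here is a small Python sketch of how keys behave in a SHA-1 based, Kademlia-style DHT. `dht_key` and `xor_distance` are illustrative names. As a simplification, it hashes file names directly, whereas a real info-hash is the SHA-1 of the bencoded info dictionary.

```python
import hashlib


def dht_key(data: bytes) -> int:
    """Map arbitrary bytes to a 160-bit key using SHA-1."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia-style distance between two keys."""
    return a ^ b


# Simplification: names stand in for info dictionaries to keep this short.
names = [b"ubuntu-22.04-desktop.iso", b"ubuntu-22.04-server.iso"]
table = {dht_key(n): n for n in names}

# An exact-key lookup works ...
assert table[dht_key(names[0])] == names[0]
# ... but the keyword itself hashes to a key under which nothing is stored.
assert dht_key(b"ubuntu") not in table

# Similar names end up at unrelated positions in the key space.
for n in names:
    print(f"{dht_key(n):040x}  {n.decode()}")
```

Because near-identical inputs hash to unrelated keys, nodes that are "close" to a keyword know nothing about torrents whose names contain it, so the mapping from words to info-hashes has to be built outside the DHT.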

EDIT:

Since this answer was written, an optimization (BEP 51) has been implemented in some DHT clients that lets you ask a node for a sample of the info-hashes it is storing, significantly reducing the cost of indexing.
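
For illustration, a BEP 51 query can be built roughly as follows in Python. The `sample_infohashes` query name, its `id` and `target` arguments, and the `samples` response field come from BEP 51. The helpers `bencode`, `sample_infohashes_query` and `split_samples` are names made up for this sketch, and actually sending the packet over UDP is left out.

```python
def bencode(value) -> bytes:
    """Minimal bencoder for ints, bytes, str, lists and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode())
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Bencoded dictionaries must have their keys in sorted order.
        items = sorted(
            ((k.encode() if isinstance(k, str) else k, v) for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def sample_infohashes_query(node_id: bytes, target: bytes, tid: bytes = b"aa") -> bytes:
    """Build a KRPC sample_infohashes query (BEP 51)."""
    return bencode({
        "t": tid,                    # transaction id, echoed back in the reply
        "y": "q",                    # message type: query
        "q": "sample_infohashes",
        "a": {"id": node_id, "target": target},
    })


def split_samples(samples: bytes):
    """Split the concatenated `samples` field of a reply into 20-byte info-hashes."""
    usable = len(samples) - len(samples) % 20  # ignore a trailing partial hash
    return [samples[i:i + 20] for i in range(0, usable, 20)]
```

Each info-hash obtained this way still has to go through the metadata download and indexing steps described above; the query only replaces the passive eavesdropping part.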

answered Oct 12 '22 22:10 by Arvid


For a deeper understanding of the DHT and its applications, see Scott Wolchok's paper and presentation "Crawling BitTorrent DHTs for Fun and Profit". He presents the autonomous search engine idea as a side note to his study of the DHT's security.

PDF of his paper:

  • https://www.usenix.org/legacy/event/woot10/tech/full_papers/Wolchok.pdf

His presentation at DEFCON 18 (parts 1 & 2):

  • http://www.youtube.com/watch?v=v4Q_F4XmNEc
  • http://www.youtube.com/watch?v=mO3DfLtKPGs

answered Oct 12 '22 23:10 by martinwguy