
Tor Web Crawler

OK, here's what I need. I have a PHP-based web crawler, accessible here: http://rz7ocnxxu7ka6ncv.onion/ My problem is that the spider that actually crawls pages needs to connect through SOCKS port 9050. I have to tunnel its connections through Tor so that it can resolve .onion domains, which are what I'm indexing (only domains ending in .onion). I call the script from the command line using php crawl.php, adding the appropriate parameters to crawl a page. Here is what I'm wondering: is there any way to force it to use Tor? Or can I force my ENTIRE MACHINE to tunnel traffic through Tor (like forcing all traffic through 127.0.0.1:9050), and how? Perhaps if I set up global proxy settings, PHP would respect them?

If any of my solutions work, how would I do it? (Step by step instructions please, I am a noob.)

I just want to create my own Tor search engine. (Don't recommend P2P search engines to me; that's not what I want for this. I know they exist, I did my homework.) Here is the crawler source if you're interested in taking a look: http://pastebin.com/kscGJCc5 Perhaps someone with a kind heart can modify it to use 127.0.0.1:9050 for all crawling requests?

user1203301 asked Feb 11 '12 03:02



4 Answers

cURL also supports SOCKS connections; try this:

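<?php

// Example target; replace it with the .onion URL you want to crawl.
$ch = curl_init('http://google.com');
curl_setopt($ch, CURLOPT_HEADER, 1); // include response headers in the output

// SOCKS5 via Tor. The _HOSTNAME variant makes the proxy resolve the host
// name, which .onion addresses need because local DNS cannot resolve them.
curl_setopt($ch, CURLOPT_PROXY, 'localhost:9050');
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME);

curl_exec($ch);
curl_close($ch);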
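
For a crawler, the same idea could be wrapped in a small helper that returns the page body instead of printing it. This is only a sketch: the names torCurlOptions and fetchViaTor, the timeouts, and the default proxy address 127.0.0.1:9050 are assumptions for illustration, not part of the original crawler. It uses CURLPROXY_SOCKS5_HOSTNAME so that Tor, rather than the local resolver, looks up the .onion name.

```php
<?php
// Sketch only: function names and defaults are hypothetical.

/**
 * Build cURL options that route a request through Tor's SOCKS listener.
 */
function torCurlOptions(string $proxy = '127.0.0.1:9050', int $timeout = 60): array
{
    return [
        CURLOPT_PROXY          => $proxy,
        // Let the proxy resolve host names; local DNS cannot resolve .onion.
        CURLOPT_PROXYTYPE      => CURLPROXY_SOCKS5_HOSTNAME,
        CURLOPT_RETURNTRANSFER => true,         // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,         // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeout,     // hidden services can be slow
        CURLOPT_TIMEOUT        => $timeout * 2, // overall limit per request
    ];
}

/**
 * Fetch a URL through Tor. Returns the body, or null on failure.
 */
function fetchViaTor(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, torCurlOptions());
    $body = curl_exec($ch);
    if ($body === false) {
        error_log('cURL error: ' . curl_error($ch));
        return null;
    }
    return $body;
}
```

A call such as $html = fetchViaTor($url); could then stand in wherever the crawler currently downloads a page.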

Ben Bos answered Oct 06 '22 05:10


Unless I'm missing something, the answer is yes, and there is documentation on the Tor site. The instructions are pretty specific. I've not set Tor up as a proxy myself, but it's something I've considered, and this is where I would start.

EDIT: It is dead simple to set up Tor on Linux and use it as a proxy, as the documentation suggests.

sudo apt-get install tor
sudo /etc/init.d/tor start

netstat -ant | grep 9050 # verify Tor is running

Now, after looking through the OP's code, we see calls to file_get_contents. While it is the easiest method to use at first, file_get_contents becomes cumbersome when you want to start parametrizing the request, because you have to use stream contexts.

My first suggestion is to move to cURL, but more reading on how SOCKS works with HTTP is probably in order to truly answer this question. To answer the question technically, though (how to send an HTTP request to a Tor SOCKS proxy on localhost), that is easy too:

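<?php
// Deliberately treats Tor's SOCKS port as an HTTP proxy, to show what happens.
$ch = curl_init('http://google.com');
curl_setopt($ch, CURLOPT_HEADER, 1);          // include response headers in the output
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1); // tunnel through the proxy with HTTP CONNECT
curl_setopt($ch, CURLOPT_PROXY, 'http://127.0.0.1:9050/');
curl_exec($ch);
curl_close($ch);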

But what does Tor tell us?

HTTP/1.0 501 Tor is not an HTTP Proxy

Content-Type: text/html; charset=iso-8859-1

Basically, learn more about SOCKS & HTTP. Another option is to google around for PHP SOCKS clients. A quick inspection reveals a library that claims it can send HTTP requests over SOCKS.

EDIT:

Alright, one more edit! Seconds after finishing my last post, I found a way to do it. This article shows how to set up something called Privoxy, which accepts HTTP proxy requests and forwards them over SOCKS. Put that in front of Tor and blamo, we're sending proxied HTTP requests through Tor!
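
With Privoxy in front of Tor, the crawler's existing file_get_contents calls could stay and just gain a stream context. The sketch below rests on several assumptions: Privoxy listens on 127.0.0.1:8118, its configuration contains a forwarding line such as "forward-socks5 / 127.0.0.1:9050 ." (note the trailing dot), and the names privoxyContextOptions and fetchViaPrivoxy are made up for illustration.

```php
<?php
// Sketch only: function names, proxy address and timeout are assumptions.

/**
 * Stream context options that send HTTP requests through a local Privoxy.
 */
function privoxyContextOptions(string $proxy = 'tcp://127.0.0.1:8118'): array
{
    return [
        'http' => [
            'proxy'           => $proxy,
            'request_fulluri' => true, // send the full URL, as HTTP proxies expect
            'timeout'         => 120,  // seconds; hidden services can be slow
        ],
    ];
}

/**
 * Fetch a URL through Privoxy. Returns the body, or false on failure.
 */
function fetchViaPrivoxy(string $url)
{
    $context = stream_context_create(privoxyContextOptions());
    return file_get_contents($url, false, $context);
}
```

Whether this is preferable to cURL's SOCKS support is a judgment call; it mainly avoids rewriting code that already relies on file_get_contents.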

quickshiftin answered Oct 06 '22 05:10


You have to intercept the DNS lookups from the PHP script by configuring Tor with the DNSPort directive. Then you have to configure a TransPort for Tor and a VirtualAddrNetworkIPv4 range (VirtualAddrNetwork in older Tor releases), and enable AutomapHostsOnResolve. When your PHP script then does a DNS lookup through Tor, Tor sees a request for an .onion address and answers with an IP address from the VirtualAddrNetworkIPv4 range. You now have to redirect the traffic going to this address to the port defined with TransPort. Read the torrc manual on AutomapHostsOnResolve, VirtualAddrNetworkIPv4, DNSPort and TransPort.
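
As a rough illustration, the relevant torrc lines might look like the fragment below. The directive names come from the torrc manual; the port numbers and the address range are assumptions that follow commonly used examples and may need adjusting for your system.

```text
# torrc fragment (sketch; ports and address range are assumptions)
VirtualAddrNetworkIPv4 10.192.0.0/10   # pool of placeholder IPs handed out for .onion names
AutomapHostsOnResolve 1                # map .onion lookups to addresses from that pool
TransPort 9040                         # transparent proxy port
DNSPort 5353                           # Tor answers DNS queries on this port
```

On its own this does nothing for the crawler: the machine's DNS queries still have to reach the DNSPort, and TCP traffic to the placeholder range still has to be redirected to the TransPort, typically with firewall NAT rules.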

freeside answered Oct 06 '22 03:10


I think it is as simple as running your command-line request with the usewithtor or torify wrapper. For example:

$ usewithtor php crawl.php

The script will then be able to interact with .onion sites. Having built a crawler for Tor myself, I definitely would not go this route for production use; I use Python, PySocks, and other crawler libraries instead of cURL. Hopefully this answers your question and gives you some ideas for other implementation strategies moving forward.

Thanks

jamescampbell answered Oct 06 '22 05:10