
How do craigslist mashups get data? [closed]

I'm doing some research work into content aggregators, and I'm curious how some of the current craigslist aggregators get data into their mashups.

For example, www.housingmaps.com and the now-closed www.chicagocrime.org.

If there is a URL that can be used for reference, that would be perfect!

pearcewg asked Oct 25 '08 22:10


7 Answers

For AdRavage.com I use a combination of Magpie RSS (to extract the data returned from searches) and a custom screen scraping class to properly populate the city/category information used when building searches.

For example, to extract the categories you could:

// Scrape category data using the custom screen-scraping class
$h = new http();
$h->dir = "../cache/";
$url = "http://craigslist.org/";

if (!$h->fetch($url, 300)) {
  echo "<h2>There is a problem with the http request!</h2>";
  exit();
}

// Each category looks like: <option value="ccc">community
// Capture group 1 is the abbreviation (ccc), group 2 is the name (community)
preg_match_all("/<option value=\"([^\"]*)\">([^`]*?)\n/", $h->body, $categoryTemp);

$catNames = $categoryTemp[2];

// Return the array of category names (empty if nothing matched)
if (count($catNames) > 0)
    return $catNames;
else
    return array();
cfay answered Oct 04 '22 17:10


An alternative to scraping (and getting blocked), using frames, or Google search is to use a data broker or data exchange service.

3taps is a beta service which provides a developer API to many services, including Craigslist. Their team also built Craiggers to demonstrate a use case of this API. Founder Greg Kidd told me that 3taps harvests Craigslist data from non-Craigslist sources where it is already indexed and cached so that it doesn't put any strain on Craigslist. Other 3taps data sources are also listed, but these stats make it unclear whether they're currently supported. Their goal is to Democratize the Exchange of Data.

80legs is a crawling service which provides a less real-time but potentially more comprehensive option. Their data dump-style service includes crawl packages for hundreds of sites, including Amazon, Facebook, and Zillow (though not, I believe, Craigslist currently). Their newer effort, Datafiniti, provides a search engine over this type of data.

John Lehmann answered Oct 04 '22 17:10


Another option would be to use YQL or Yahoo Pipes to gather the results.

Craiglook and HousingMaps use them to gather results.

Rory answered Oct 04 '22 17:10


The problem with any scraping solution of craigslist is that they automatically block any IP address that accesses them 'too much' - which usually means more than a few hundred times a day. So as soon as your tool got any kind of popularity, it would be shut down.
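
To make the arithmetic concrete: if the scraping runs on the tool's own server, every user's search is charged to the same IP address and so to the same daily allowance. Below is a minimal sketch of a client-side counter for that allowance. The `DailyBudget` class and the limit of 300 are hypothetical; 300 is only a stand-in for "a few hundred", not a documented craigslist threshold.

```python
import datetime


class DailyBudget:
    """Allow at most `limit` requests per calendar day (hypothetical helper)."""

    def __init__(self, limit, today=datetime.date.today):
        self.limit = limit
        self._today = today  # injectable clock, which makes the class easy to test
        self._day = today()
        self._used = 0

    def try_acquire(self):
        """Return True and count the request if budget remains, else False."""
        current = self._today()
        if current != self._day:
            # A new day has started, so the counter starts over.
            self._day = current
            self._used = 0
        if self._used >= self.limit:
            return False
        self._used += 1
        return True


# Assumed limit: a placeholder for "a few hundred" requests per day.
budget = DailyBudget(limit=300)
allowed = sum(1 for _ in range(1000) if budget.try_acquire())
```

With 1000 searches in one day and a budget of 300, only the first 300 go through, which is why a tool that becomes popular runs out of requests long before the day is over.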

That's why the only craigslist search sites that have lasted either use frames (like searchtempest.com and crazedlist.org) or Google (like allofcraigs.com).

What 3taps does is gather craigslist listings from third-party sources 'in the wild', such as the Google and Bing caches.

Edit: this answer is no longer up to date. Most classifieds search engines that include results from craigslist now use Google Custom Search or similar solutions from Yahoo or Bing. SearchTempest uses both. Allofcraigs is now adhuntr and uses Google. Crazedlist has shut down.

Nathan Stretch answered Oct 04 '22 17:10


I've done a lot of data aggregation from sites like eBay, Craigslist, and Zillow. Each source requires a different method to aggregate the data.

For Craigslist, I got the data using RSS feeds. I only wanted specific data in specific categories in specific cities, and the RSS feeds worked fine for me. If you're trying to get all the data, and you overuse the RSS feeds, Craigslist will likely ban you. Also, you won't be able to get all the data from Craigslist feeds, because the feeds show most of the data but not all. If your reliability doesn't need to be 100%, then RSS is the easiest way to do it.
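
As a rough illustration of the RSS approach, here is a minimal sketch that pulls titles and links out of a feed using only the Python standard library. The sample feed, its URLs, and the `parse_items` helper are all made up for the example; the sample is written in the RDF-style RSS 1.0 layout, and a real feed may be structured differently.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed (assumption: RDF-style RSS 1.0 layout; real feeds may differ).
SAMPLE_FEED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/feed">
    <title>sample housing feed</title>
  </channel>
  <item rdf:about="http://example.org/apa/1.html">
    <title>2br apartment near park</title>
    <link>http://example.org/apa/1.html</link>
  </item>
  <item rdf:about="http://example.org/apa/2.html">
    <title>sunny studio downtown</title>
    <link>http://example.org/apa/2.html</link>
  </item>
</rdf:RDF>
"""


def local_name(tag):
    """Strip an XML namespace, e.g. '{http://...}item' -> 'item'."""
    return tag.rsplit("}", 1)[-1]


def parse_items(feed_xml):
    """Return a list of {'title', 'link'} dicts, one per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for element in root.iter():
        if local_name(element.tag) != "item":
            continue
        # Map each child's namespace-free tag name to its text content.
        fields = {local_name(child.tag): (child.text or "").strip()
                  for child in element}
        items.append({"title": fields.get("title", ""),
                      "link": fields.get("link", "")})
    return items


for item in parse_items(SAMPLE_FEED):
    print(item["title"], item["link"])
```

Matching on the namespace-free tag name keeps the sketch working whether or not the feed declares a default namespace. In practice you would download one feed per city and category you care about, and poll it sparingly.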

Brandon answered Oct 04 '22 16:10


I am guessing screen scraping.

I do not think there is a Craigslist API yet, and I do not think they will release one.

So the only way to go is to scrape the data. You could use the cURL library and heavy regex to scrape the data you want off a page.

If you see a link, access the page, scrape the new page, get the data, and show it or store it.

And so on.
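
The fetch, follow links, and store loop described above could be sketched roughly like this. Note that it swaps in Python's standard `html.parser` for the cURL-plus-regex combination, and that `LinkCollector`, `crawl`, and the in-memory `FAKE_SITE` are hypothetical names invented for the example so that it runs without touching the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch(url)` must return HTML text or None."""
    seen, queue, pages = {start_url}, [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:
            continue  # page missing or request failed
        pages[url] = html  # "store it": keep the raw page for later extraction
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Hypothetical in-memory "site" standing in for real HTTP requests.
FAKE_SITE = {
    "http://example.org/": '<a href="/apa/1.html">one</a> <a href="/apa/2.html">two</a>',
    "http://example.org/apa/1.html": "<h1>2br apartment</h1>",
    "http://example.org/apa/2.html": '<h1>studio</h1> <a href="/">home</a>',
}

pages = crawl("http://example.org/", FAKE_SITE.get)
```

Passing `fetch` in as a function keeps the crawl logic separate from the HTTP layer, so a real fetcher can be substituted later. The `seen` set stops the crawler from looping when pages link back to each other, and `max_pages` caps how many pages are stored.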

Wael Awada answered Oct 04 '22 17:10


I just made one:

http://cdn.javascriptmvc.com/videos/jobs/craigslist.js

That produces:

http://cdn.javascriptmvc.com/videos/jobs/craigslist.html

It must be run in Rhino.

Justin Meyer answered Oct 04 '22 16:10