Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does a site like kayak.com aggregate content? [closed]

Greetings, I've been toying with an idea for a new project and was wondering if anyone has any idea on how a service like Kayak.com is able to aggregate data from so many sources so quickly and accurately. More specifically, do you think Kayak.com is interacting with APIs or are they crawling/scraping airline and hotel websites in order to fulfill user requests? I know there isn't one right answer for this sort of thing but I'm curious to know what others think would be a good way to go about this. If it helps, pretend you are going to create kayak.com tomorrow ... where is your data coming from?

like image 540
Jeff Avatar asked Jan 05 '11 17:01

Jeff


4 Answers

I'm working in travel industry as a software architect / project lead on the precisely kind of project you describe - in our region we work with suppliers directly, but for outgoing we connect to several aggregators.

To answer your question... some data you have, some you get in various ways, and some you have to torture and twist until it confesses.

What's your angle?

The questions you have to ask are... Do you want to sell advertising like Kayak or do you take a cut like Expedia? Are you into search or into selling travel services? Do you target niche (for example, just air travel) or everything (accommodation, airlines, rent-a-car, additional services like transport/sightseeing/conferences etc)? Do you target region (US or part of US) or the world? How deep do you go - do you just show several sites on a single screen, or do you bundle different services together and package them dynamically?

Getting the data

If you're going with Kayak business model, you technically don't need site's permission... but a lot of sites have affiliate programs with IFrames or other simple ways to direct the customer to their site. On the plus side, you don't have to deal with payments/complaints and travelers themselves. As for the cons... if you want to compare prices yourself and present the cheapest option to the user, you'll have to integrate on a deeper level, and that means APIs and web scraping.

As for web scraping... avoid it. It sucks. Really. Just don't do it. Trust me on this one. For example, some things like lowcosters you can't get without web scraping. Low cost airlines live from value added services. If the user doesn't see their website, they don't sell extra stuff, and they don't earn anything. Therefore, they don't have affiliates, they don't offer APIs, and they change their site layout almost constantly. However, there are companies which earn a living by web scraping lowcoster's sites and wrapping them into nice APIs. If you can afford them, you can give your users cost-comparison of low cost flights and that's huge.

On the other hand, there are "normal" carriers which offer APIs. It's not that big of a problem to get to airlines since they're all united under IATA; basically, you buy from IATA, and IATA distributes the money to carriers. However, you probably don't want to connect directly to carrier network. They have web services and SOAP these days, but believe me when I say that there are SOAP protocols which are just an insanely thin wrappers around a text prompt through which you can interact with a mainframe with an 80es-style protocol (think of a Unix prompt where you're billed per command; and it takes about 20 commands to do one search). That's why you probably want to connect to somebody a bit more down the food chain, with a better API.

Airlines are thus on both extremes of Gaussian curve; on one side are individual suppliers, and on the other highly centralized systems where you implement one API and you're able to fly anywhere in the world. Accommodation and the rest of travel products are in between. There are several big players which aggregate hotels, and a ton of small suppliers with a lot of aggregators which cover only part of a spectrum. For example, you can rent a lighthouse and it's even not that expensive - but you won't be able to compare the prices of different lighthouses in one place.

If you're into Kayak business model, you'll probably end up scraping websites. If you're into integrating different providers, you'll often work with APIs, some of which are pretty good, and most of which are tolerable. I haven't worked with RSS but there's not a lot of difference between RSS and web scraping. There is also a fourth option not mentioned in Jeff's answer... the one where you get your data nightly, for example .CSV files through FTP and similar.

Life sucks (mini-rant)

And then there's complexity. The more value you want to add, the more complexity you'll have to handle. Can you search accommodations which allow pets? For a hostel which is located less than 5 km from the town center? Are you combining flights, and are you able to guarantee that the traveler will have enough time to get from one airport to another... can you sell the transport in advance? A famous cellist doesn't want to part from his precious 18th century cello; can you sell him another seat for the cello (yep, not making this one up)?

Want to compare prices? Sure, the room is EUR 30 per night. But you can either get one double for 30 and one single for 20, or you can get one extra bed in a double and get 70% off for third person. But only if it's a child under 12 years of age; our extra beds are not for adults. And you don't get the price for extra bed in search results - only when you calculate the final price.

And don't even get me started on dynamic packaging. Want to sell accommodation + rent-a-car? No problem; integrate with two different providers, and off you go... manually updating list of locations in the city (from rent-a-car provider) to match with hotels (from accommodation provider, who gives you only the city for each hotel). Of course, provided that you've already matched the list of cities from the two, since there is no international standard for city codes.

Unlike a lot of other industries which have many products, travel industry has many very complex products. Amazon has it easy; selling books and selling potatoes, it's the same thing; you can even ship them in the same box. They combine easily and aren't assembled from many parts. :)

P.S. Linking to an interesting recent thread on Hacker News with some insider info regarding flights. P.P.S. Recently stumbled on a great albeit rather old blogpost on IATA's NDC protocol with overview of how travel industry is connected and a history lesson how this came to be.

like image 50
Domchi Avatar answered Nov 19 '22 22:11

Domchi


They use a software package like ITA Software, which is one of the companies Google is in the process of picking up.

like image 9
mavrck Avatar answered Nov 19 '22 22:11

mavrck


Only 3 ways I know of to get data from websites.

RSS Feeds - We use rss feeds a lot at my company to integrate existing site's data with our apps. It's fast and most sites already have an RSS feed available. The problem with this is not all sites implement the RSS standard properly so if you're pulling data from many RSS feeds across many sites then make sure you write your code so that you can add exceptions and filters easily.

APIs - These are nice if they are designed well and have all the information you need, however that's not always the case, plus if the sites are not using a standard api format then you'll have to support multiple API's.

Web Scraping - This method would be the most unreliable as well as the most expensive to maintain. But if you're left with nothing else it can be done.

like image 8
Jeff Busby Avatar answered Nov 19 '22 22:11

Jeff Busby


This article says that Kayak was asked to stop scrapping a certain airlines page. That leads me to believe that they probably do scraping on sites that they don't have a relationship with (and a data feed that comes with that relationship).

like image 3
Jake Wilson Avatar answered Nov 19 '22 22:11

Jake Wilson