In online discussion groups and blogs, I have seen a lot of interview questions related to handling large-scale datasets. I am wondering whether there is a systematic approach to analyzing this type of question. More specifically, are there particular data structures or algorithms that can be used to deal with this? Any suggestions are really appreciated.
"Large-scale" data sets fall into several categories that I've seen, each of which presents different challenges for you to think about.
- Data that's too big to fit in memory. Here, some key techniques are:
- Caching frequently used data to improve performance
- Working on data from a file a chunk at a time instead of trying to read the whole file into memory at once (if you're not working sequentially through the file, this can be particularly challenging!); see the chunked-read sketch after this list
- Distributing data across the memory of several machines.
- Data that's too big to fit into a single file because of file system or hardware architecture limits. This is pretty easy to solve – split the file – but there's a practical question in many cases of what a sensible split would be.
- Data that's too big to fit on a single hard disk. Here, the main techniques are to buy bigger disks :-) or to distribute the data across several machines.
- Distributing data across multiple machines poses interesting challenges when you need to do analysis or transformations on the data. This is a deep topic with many different approaches. Map/reduce frameworks such as Hadoop (and the map/reduce views in CouchDB) have recently become popular tools for research and application in this area; a toy map/reduce sketch follows this list.
- Data that's too big for a single database instance. This can be a problem of disk size (ran out of space) or performance (memory cache keeps getting blown, indices have simply gotten too big). Maintaining robustness and performance of data split across multiple DB instances, potentially in multiple data centers, is an area of perennial interest to large enterprises. Here, the choices are:
- Vertical splits (different tables to different DBs)
- Horizontal splits (the same table on different DBs, each holding a different slice of the data); a shard-routing sketch follows this list
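As a concrete illustration of the chunk-at-a-time approach mentioned above, here's a minimal Python sketch; the 64 KB chunk size and the line-counting task are just assumptions for the example:

```python
def count_lines_in_chunks(path, chunk_size=64 * 1024):
    """Process a file that may be much larger than memory by reading
    fixed-size chunks instead of calling read() on the whole file."""
    line_count = 0
    leftover = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # only chunk_size bytes in memory at a time
            if not chunk:
                break
            data = leftover + chunk
            lines = data.split(b"\n")
            leftover = lines.pop()      # last piece may be a partial line
            line_count += len(lines)
    return line_count + (1 if leftover else 0)
```

If you need random rather than sequential access, you'd typically seek() to known offsets or build an index of offsets first, which is where it gets harder.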
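To give a flavour of the map/reduce idea, here's a toy, single-process sketch; real frameworks like Hadoop handle the distribution, shuffling, and fault tolerance for you:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # "map": emit (key, value) pairs from one piece of the input
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # "reduce": combine all the values that share a key
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# On a real cluster, each document (or file split) would be mapped on a
# different machine and the pairs shuffled by key before reducing.
documents = ["big data big problems", "big disks"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
print(reduce_phase(mapped))  # {'big': 3, 'data': 1, 'problems': 1, 'disks': 1}
```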
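And for horizontal splits, the core idea is a deterministic rule that maps each row's key to one of the database instances. A hash-based sketch, with the shard names and shard count made up for the example (real systems also have to deal with resharding when the number of instances changes):

```python
import hashlib

SHARDS = ["db0.example.com", "db1.example.com", "db2.example.com", "db3.example.com"]

def shard_for(user_id: str) -> str:
    """Pick the DB instance that holds this user's rows.
    Every client must apply the same rule, or lookups will miss."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All reads and writes for a given user go to the same instance.
print(shard_for("alice"))  # e.g. 'db2.example.com'
```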
Other problems often associated with large-scale data sets, though they aren't problems of size per se, are:
- Data that is coming in fast. Think of systems that need to scale to millions or even billions of transactions per minute.
- Data that is constantly changing. How do you deal with stale data or data that's being modified while you're working on it? One common approach is sketched below.
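For that last point, one common answer (certainly not the only one) is optimistic concurrency: tag each record with a version and reject a write if the version changed underneath you. A minimal in-memory sketch, with all names invented for illustration:

```python
class StaleWriteError(Exception):
    pass

class VersionedStore:
    """Toy key-value store where each value carries a version number."""
    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        return self._data.get(key, (None, 0))  # (value, version)

    def write(self, key, new_value, expected_version):
        _, current_version = self._data.get(key, (None, 0))
        if current_version != expected_version:
            # Someone modified the record while we were working on it.
            raise StaleWriteError(f"{key}: expected v{expected_version}, found v{current_version}")
        self._data[key] = (new_value, current_version + 1)

store = VersionedStore()
value, version = store.read("balance")
store.write("balance", 100, version)  # succeeds, record is now at version 1
# A second writer still holding the old version would get StaleWriteError here.
```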