What are best practices for collecting, maintaining and ensuring accuracy of a huge data set?

I am posing this question looking for practical advice on how to design a system.

Sites like Amazon and Pandora build and maintain huge data sets to run their core business. For example, Amazon (and every other major e-commerce site) has millions of products for sale, with images of those products, pricing, specifications, and so on.

Ignoring the data coming in from third-party sellers and the user-generated content, all that "stuff" had to come from somewhere and is maintained by someone. It's also incredibly detailed and accurate. How do they do it? Is there just an army of data-entry clerks, or have they devised systems to handle the grunt work?

My company is in a similar situation. We maintain a huge (tens of millions of records) catalog of automotive parts and the cars they fit. We've been at it for a while now and have come up with a number of programs and processes to keep our catalog growing and accurate; however, it seems that to grow the catalog by x items we need to grow the team by y people.

I need to figure out ways to increase the efficiency of the data team, and hopefully I can learn from the work of others. Any suggestions are appreciated; even better would be links to content I could spend some serious time reading.

asked Dec 22 '10 by Kyle West

3 Answers

Use visitors.

  1. Even if you have one person per item, there will be wrong records, and customers will find them. So let visitors mark items as "inappropriate" and leave a short comment. But remember that they're not your employees; don't ask too much of them. Look at Facebook's "like" button: it's easy to use and demands almost no effort from the user, so the performance/price ratio is good. If Facebook made a "why do you like it?" field mandatory, no one would use the feature.

  2. Visitors also help you in an implicit way: they visit item pages and use search (both your internal search engine and external ones, like Google). You can mine that activity; for example, rank items by visit count, then concentrate more human effort on the top of the list and less on the "long tail". (A minimal sketch of both ideas follows this list.)
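
Here is a sketch of both ideas in Python, using the built-in sqlite3 module. The schema, table names, and queries are hypothetical illustrations of the approach, not anything the answer prescribes:

    import sqlite3

    # Hypothetical schema: a catalog table plus a lightweight flag table
    # and a visit log.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE items (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE item_flags (
        item_id    INTEGER NOT NULL REFERENCES items(id),
        comment    TEXT,               -- optional on purpose: never force typing
        flagged_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TABLE item_visits (
        item_id    INTEGER NOT NULL REFERENCES items(id),
        visited_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    """)

    def flag_item(item_id, comment=None):
        """One-click 'inappropriate' report; the comment stays optional."""
        conn.execute("INSERT INTO item_flags (item_id, comment) VALUES (?, ?)",
                     (item_id, comment))

    def record_visit(item_id):
        conn.execute("INSERT INTO item_visits (item_id) VALUES (?)", (item_id,))

    def review_queue(limit=100):
        """Flagged items ordered by traffic, so the data team works the
        head of the curve first and the long tail last."""
        return conn.execute("""
            SELECT i.id, i.name,
                   COUNT(DISTINCT v.rowid) AS visits,
                   COUNT(DISTINCT f.rowid) AS flags
            FROM items i
            JOIN item_flags f       ON f.item_id = i.id
            LEFT JOIN item_visits v ON v.item_id = i.id
            GROUP BY i.id, i.name
            ORDER BY visits DESC
            LIMIT ?""", (limit,)).fetchall()

    conn.execute("INSERT INTO items VALUES (1, 'example item')")
    record_visit(1)
    flag_item(1)           # no comment required
    print(review_queue())  # -> [(1, 'example item', 1, 1)]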

answered by ern0


Since this is more about managing the team/code/data than about implementation, and since you mentioned Amazon, I think you'll find this useful: http://highscalability.com/amazon-architecture.

In particular, follow the link to the Werner Vogels interview.

answered by slebetman


Build it right in the first place. Ensure that you use every integrity checking method available in the database you're using, as appropriate to what you're storing. Better that an upload fail than bad data get silently introduced.
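
For example, a minimal sketch assuming SQLite through Python's sqlite3 (the parts/fitments schema here is made up; the same NOT NULL, CHECK, UNIQUE, and foreign-key constraints exist in every mainstream engine):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

    # Hypothetical parts/fitment schema using every constraint the engine
    # offers, so a bad upload fails loudly instead of slipping in silently.
    conn.executescript("""
    CREATE TABLE parts (
        part_number TEXT PRIMARY KEY,
        description TEXT NOT NULL,
        price_cents INTEGER NOT NULL CHECK (price_cents >= 0)
    );
    CREATE TABLE fitments (
        part_number TEXT NOT NULL REFERENCES parts(part_number),
        make        TEXT NOT NULL,
        model       TEXT NOT NULL,
        year        INTEGER NOT NULL CHECK (year BETWEEN 1900 AND 2100),
        UNIQUE (part_number, make, model, year)  -- no duplicate fitment rows
    );
    """)

    try:
        # Fails: references a part that was never loaded.
        conn.execute("INSERT INTO fitments VALUES ('XX-1', 'Honda', 'Civic', 2009)")
    except sqlite3.IntegrityError as exc:
        print("rejected:", exc)  # the upload fails; no silent garbage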

Then, figure out what you're going to do in terms of your own integrity checking. DB integrity checks are a good start, but rarely are all you need. That will also force you to think, from the beginning, about what type of data you're working with, how you need to store it, and how to recognize and flag or reject bad or questionable data.
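
As a sketch of what that second layer might look like, in the same vein (the field names and the outlier threshold are invented for illustration):

    def validate_part(row):
        """Classify an incoming row: 'reject' on hard errors, 'review' on
        suspicious-but-possible values, 'ok' otherwise."""
        errors, warnings = [], []
        if not str(row.get("part_number", "")).strip():
            errors.append("missing part_number")
        price = row.get("price_cents")
        if not isinstance(price, int) or price < 0:
            errors.append("missing or negative price")
        elif price > 1_000_000:          # over $10,000 is probably a typo
            warnings.append("price outlier")
        if errors:
            return "reject", errors
        return ("review", warnings) if warnings else ("ok", [])

    # Route rows instead of loading everything blindly.
    for row in ({"part_number": "AB-123", "price_cents": 1999},
                {"part_number": "AB-124", "price_cents": 9_999_999},
                {"part_number": "", "price_cents": -5}):
        status, reasons = validate_part(row)
        print(status, repr(row.get("part_number")), reasons)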

I can't tell you the amount of pain I've seen from trying to rework (or just do day-to-day work with) old systems full of garbage data. Doing it right and testing it thoroughly up front may seem like a pain, and it can be, but the reward is a system that mostly hums along and needs little to no intervention.

As for a link: if there's anyone who's had to think about and design for scalability, it's Google. You might find this instructive; it has some good things to keep in mind: http://highscalability.com/google-architecture

answered by Todd Allen