
'schema' design for a social network

I'm working on a proof-of-concept app for a Twitter-style social network with about 500k users. I'm unsure of how best to design the 'schema'.

Should I embed a user's subscriptions, or have a separate 'subscriptions' collection and use DB references? If I embed, I still have to perform a query to get all of a user's followers, e.g.:

Given the following user:

{
 "username" : "alan",
 "photo": "123.jpg",
 "subscriptions" : [
    {"username" : "john", "status" : "accepted"},
    {"username" : "paul", "status" : "pending"}
  ]
}

To find all of alan's subscribers, I'd have to run something like this:

db.users.find({'subscriptions.username' : 'alan'});

From a performance point of view, is that any worse or better than having a separate subscriptions collection?

Also, when displaying a list of subscriptions/subscribers, I am currently running into the n+1 query problem, because the subscription document tells me the username of the target user but not the other attributes I may need, such as the profile photo. Are there any recommended practices for such situations?

Thanks, Alan

Alan B asked May 15 '10 10:05



2 Answers

First off, you should know the tradeoffs you are going to get with MongoDB and any other NoSQL database (but realize that I am a fan of it). If you are trying to normalize your data completely, you are making a big mistake. Even in relational databases, the larger your app gets, the more your data gets denormalized (see this post by Hot Potato). I've seen this time and time again. You should not go nuts and make a huge mess, but don't worry about repeating information in two places. One of the major points (in my opinion) of NoSQL is that your schema moves into your code and not solely into the database.

Now, to answer your question, I think your initial strategy is what I would do. MongoDB can place indexes on elements which are arrays, so that will make things a lot faster if you are looking for how many friendships a user has. But in reality, the only way to really be sure is to run some sort of test program that generates a database full of names and relationships.

You can script up some input in Python or Perl or whatever you like, and use a file of names to generate some relationships. Check out the Census website, which has a list of last names. Download the file dist.all.last and write some program like:

#!/usr/bin/env python
import random

# Each line of the Census file starts with a surname; keep that first field.
with open('dist.all.last') as f:
    names = [line.split()[0] for line in f if line.strip()]

rels = {}
for name in names:
    num_friends = random.randint(0, 1000)
    rels[name] = []
    for _ in range(num_friends):
        new_friend = random.choice(names)
        # cannot be friends with yourself, and skip names already picked
        if new_friend != name and new_friend not in rels[name]:
            rels[name].append(new_friend)

# take relationships (i.e. rels) and write them to MongoDB
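
For that last step, here is a rough sketch of how rels could be turned into documents shaped like the one in the question. The function name to_user_documents and the choice to mark every generated relationship as "accepted" are my own assumptions for the test data, not anything MongoDB requires.

```python
def to_user_documents(rels):
    """Turn {name: [friend, ...]} into one user document per name."""
    docs = []
    for name, friends in rels.items():
        # sorted(set(...)) guards against repeated picks and gives a stable order
        subscriptions = [
            {"username": friend, "status": "accepted"}  # assumed status for test data
            for friend in sorted(set(friends))
        ]
        docs.append({"username": name, "subscriptions": subscriptions})
    return docs


# Tiny hand-written sample instead of the full Census file
rels = {"SMITH": ["JONES", "BROWN", "JONES"], "JONES": []}
docs = to_user_documents(rels)

# Assuming a running server and PyMongo, the documents could then be
# written in one call (test_db is a hypothetical database name):
#   from pymongo import MongoClient
#   MongoClient().test_db.users.insert_many(docs)
```

From there you can time the queries you care about against a realistically sized collection.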

Also, as a general note, your field names seem kind of long. Remember that the field names are stored again in every document in the collection, because MongoDB has no fixed schema and cannot assume that any two documents share the same fields. To save space, a general strategy is to use shorter field names like "unam" instead of "username", but that's a small thing. See the great advice in these two posts.

EDIT:

Actually, in pondering your problem a little more, I would make one more suggestion: break up the subscription types into different fields to make the indexes more efficient. For example, instead of:

{
 "username" : "alan",
 "photo": "123.jpg",
 "subscriptions" : [
    {"username" : "john", "status" : "accepted"},
    {"username" : "paul", "status" : "pending"}
  ]
}

as you have above, I would do this:

{
 "username" : "alan",
 "photo": "123.jpg",
 "acc_subs" : [ "john" ],
 "pnd_subs" : [ "paul" ]
}

That way you could have an index for each type of subscription, making queries like "How many people have Paul as pending?" and "How many people subscribe to Paul?" super fast either way. Mongo's indexing over array'd values is truly an epic win.
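
To show why one index per array field helps, here is a small in-memory illustration in plain Python. It is only a sketch of the idea (each array element points back at the documents containing it), not MongoDB's actual implementation, and build_array_index is a hypothetical helper name.

```python
from collections import defaultdict


def build_array_index(users, field):
    """Map each array element to the usernames whose `field` contains it."""
    index = defaultdict(set)
    for user in users:
        for value in user.get(field, []):
            index[value].add(user["username"])
    return index


users = [
    {"username": "alan", "acc_subs": ["john"], "pnd_subs": ["paul"]},
    {"username": "mike", "acc_subs": ["paul", "john"], "pnd_subs": []},
    {"username": "judy", "acc_subs": [], "pnd_subs": ["paul"]},
]

pending = build_array_index(users, "pnd_subs")
accepted = build_array_index(users, "acc_subs")

# "How many people have paul as pending?" becomes a single lookup
pending_count = len(pending["paul"])
# "How many people subscribe to paul?" uses the other index
subscriber_count = len(accepted["paul"])
```

If you are on PyMongo, the real equivalent would be something like db.users.create_index("pnd_subs") followed by db.users.count_documents({"pnd_subs": "paul"}), and the same again for acc_subs.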

daveslab answered Sep 24 '22 00:09


@Alan B: I think that you're totally getting MongoDB. I agree with @daveslab's version of the data, but you'll probably want to add "followers" too.

{
 "username" : "alan",
 "photo": "123.jpg",
 "acc_subs" : [ "john" ],
 "pnd_subs" : [ "paul" ],
 "acc_fol" : [ "mike", "ray" ],
 "pnd_fol" : [ "judy" ]
}

Yes, it's duplicate information. It's up to the "business layer" to ensure that this data is correctly updated in both spots. Unfortunately there are no transactions in Mongo (as of this writing). Fortunately, you have the $addToSet operation, which only adds a value if it is not already in the array, so repeating an update is harmless and you're pretty safe.
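
As a rough sketch of what the business layer might do for a new follow request: build one $addToSet update per side and apply both. The helper names follow_request_updates and apply_add_to_set are hypothetical, and apply_add_to_set is just an in-memory stand-in so the example runs without a server.

```python
def follow_request_updates(follower, target):
    """Build the two (filter, update) pairs for a pending follow request."""
    return [
        ({"username": follower}, {"$addToSet": {"pnd_subs": target}}),
        ({"username": target}, {"$addToSet": {"pnd_fol": follower}}),
    ]


def apply_add_to_set(doc, update):
    """In-memory stand-in for $addToSet: append the value only if absent."""
    for field, value in update["$addToSet"].items():
        values = doc.setdefault(field, [])
        if value not in values:
            values.append(value)


users = {"alan": {"username": "alan"}, "paul": {"username": "paul"}}

# Run the whole thing twice to mimic a retry after a partial failure;
# the second pass leaves both documents unchanged.
for _ in range(2):
    for flt, update in follow_request_updates("alan", "paul"):
        apply_add_to_set(users[flt["username"]], update)
```

With PyMongo, each pair could be passed straight to db.users.update_one(flt, update). Since a retry cannot create duplicates, the worst case after a crash between the two updates is that one side is missing until the operation is re-run.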

Gates VP answered Sep 22 '22 00:09
