 

Graph Data Structures with millions of nodes (Social network)

In the context of designing a social network using a graph data structure, where you can perform a BFS to find a connection from one person to another, I have some questions.

If there are a million users, the topology would be much more complicated and interconnected than the graphs we normally design, and I am trying to understand how these problems could be solved.

  1. In the real world, servers fail. How does this affect you?

  2. How could you take advantage of caching?

  3. Do you search until the end of the graph (infinite)? How do you decide when to give up?

  4. In real life, some people have more friends of friends than others, and are therefore more likely to lie on a path between you and someone else. How could you use this data to choose where to start the traversal?
Legolas asked Oct 23 '11 03:10



1 Answer

Your questions are interesting ones :)

1) Well... of course, data is stored on disks, not in RAM, and disks have mechanisms to survive failure, RAID-5 for example. Redundancy is the key: if one system fails, another is ready to take its place. Redundancy is also combined with workload sharing: two machines work in parallel and split the load, but if one stops, the survivor takes the full workload alone.

In places like Google or Facebook the redundancy factor is not 2, it is more like 1,200,000,000 :) Consider also that the data does not live in a single server farm; Google runs several datacenters connected together, so if one building goes down, another takes its place.
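A minimal sketch of the failover idea in Python; the replica names, the `DOWN` set, and the `fetch_profile` helper are all invented for illustration, not how any real service is built:

```python
# Hedged sketch: simulate reading from replicas, skipping failed nodes.
DOWN = {"db-1"}  # pretend this replica has failed

def fetch_profile(server: str, user_id: int) -> dict:
    """Simulate querying one replica; unreachable nodes raise an error."""
    if server in DOWN:
        raise ConnectionError(f"{server} is unreachable")
    return {"user_id": user_id, "served_by": server}

def read_with_failover(replicas, user_id):
    """Try each replica in turn; a single node failure stays invisible."""
    last_error = None
    for server in replicas:
        try:
            return fetch_profile(server, user_id)
        except ConnectionError as err:
            last_error = err  # this copy is down, move on to the next one
    raise RuntimeError("all replicas are down") from last_error

print(read_with_failover(["db-1", "db-2", "db-3"], user_id=42))
# -> {'user_id': 42, 'served_by': 'db-2'}
```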

2) Not an easy question at all, but usually these systems put large caches in front of the disk arrays too, so reading and writing data there is faster than on our laptops :) Data can also be processed in parallel by many concurrent machines, and that is the key to the speed of services like Facebook.
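To see the caching idea at a small scale, here is a hedged sketch using Python's standard `functools.lru_cache`; the `FRIENDS_ON_DISK` dictionary is a stand-in for a real, slow storage layer:

```python
from functools import lru_cache

# Fake "disk" data standing in for a real database; names are illustrative.
FRIENDS_ON_DISK = {1: frozenset({2, 3}), 2: frozenset({1}), 3: frozenset({1})}

@lru_cache(maxsize=100_000)          # keep the hottest adjacency lists in RAM
def friends_of(user_id: int) -> frozenset:
    # The first call pays the slow storage read; repeats hit the cache.
    return FRIENDS_ON_DISK.get(user_id, frozenset())

friends_of(1)   # slow path: reads from "disk"
friends_of(1)   # fast path: served from the in-memory cache
```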

3) The graph is not infinite: it has a finite number of nodes and edges, and a search that marks nodes as visited never revisits them, so it is guaranteed to terminate. A full traversal is therefore possible with current technology.

The computational complexity of exploring all connections and all nodes of a graph with BFS is O(n + m), where n is the number of vertices and m the number of edges. In other words, it is linear in the number of registered users and in the number of connections between them. And RAM these days is very cheap.
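As a sketch of both points, here is a plain BFS in Python with a visited set (so it terminates on any finite graph) and a depth cutoff (one possible "when to give up" policy); the tiny example graph is, of course, made up:

```python
from collections import deque

def bfs_path(graph, start, goal, max_depth=6):
    """BFS with a visited set (terminates on any finite graph) and a depth
    cutoff (one way to decide when to give up). Worst case is O(n + m)."""
    visited = {start}
    queue = deque([(start, [start])])
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path                     # shortest connection found
        if len(path) - 1 >= max_depth:
            continue                        # max_depth hops out: give up here
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:     # each node is enqueued at most once
                visited.add(neighbor)
                queue.append((neighbor, path + [neighbor]))
    return None                             # no connection within max_depth hops

# A made-up adjacency-list graph keyed by user id.
graph = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4]}
print(bfs_path(graph, 1, 5))   # -> [1, 2, 4, 5]
```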

Because the growth is linear, it is easy to add resources when needed: just add more machines as the service grows :)

Consider also that no one really searches the whole graph for every query; everything on Facebook is quite "local". You can view the direct friends of a person, not the friends of friends of friends... that would not be useful.

Getting the vertices directly connected to a given vertex is very easy and fast if the data structure is well designed. In SQL it would be a simple SELECT, and if the tables are well indexed it will be very fast, and also not very dependent on the total number of users (see the concept of hash tables).
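A small illustration of that "simple select" using Python's built-in sqlite3; the friendships table, the index, and the rows are hypothetical:

```python
import sqlite3

# Hedged sketch: table, index, and data are invented for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE friendships (user_id INTEGER, friend_id INTEGER)")
db.execute("CREATE INDEX idx_by_user ON friendships (user_id)")
db.executemany("INSERT INTO friendships VALUES (?, ?)",
               [(1, 2), (1, 3), (2, 1), (3, 1)])

# With the index, the lookup touches only the matching rows, so its cost
# tracks the user's friend count rather than the total number of users.
rows = db.execute("SELECT friend_id FROM friendships WHERE user_id = ?", (1,))
print([friend for (friend,) in rows])   # -> [2, 3]
```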

Salvatore Previti answered Sep 28 '22 05:09