Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Good graph traversal algorithm

Abstract problem : I have a graph of about 250,000 nodes and the average connectivity is around 10. Finding a node's connections is a long process (10 seconds lets say). Saving a node to the database also takes about 10 seconds. I can check if a node is already present in the db very quickly. Allowing concurrency, but not having more than 10 long requests at a time, how would you traverse the graph to gain the highest coverage the quickest.

Concrete problem : I'm trying to scrape a website user pages. To discover new users I'm fetching the friend list from already known users. I've already imported about 10% of the graph but I keep getting stuck in cycles or using too much memory remembering too many nodes.

My current implementation :

def run() :
    import_pool = ThreadPool(10)
    user_pool = ThreadPool(1)
    do_user("arcaneCoder", import_pool, user_pool)

def do_user(user, import_pool, user_pool) :
    id = user
    alias = models.Alias.get(id)

    # if its been updates in the last 7 days
    if alias and alias.modified + datetime.timedelta(days=7) > datetime.datetime.now() :
        sys.stderr.write("Skipping: %s\n" % user)
    else :
        sys.stderr.write("Importing: %s\n" % user)
        while import_pool.num_jobs() > 20 :
            print "Too many queued jobs, sleeping"
            time.sleep(15)

        import_pool.add_job(alias_view.import_id, [id], lambda rv : sys.stderr.write("Done Importing %s\n" % user))

    sys.stderr.write("Crawling: %s\n" % user)
    users = crawl(id, 5)
    if len(users) >= 2 :
        for user in random.sample(users, 2) :
            if (user_pool.num_jobs() < 100) :
                user_pool.add_job(do_user, [user, import_pool, user_pool])

def crawl(id, limit=50) :
    '''returns the first 'limit' friends of a user'''
    *not relevant*

Problems of current implementation :

  • Gets stuck in cliques that I've already imported, thereby wasting time and the importing threads are idle.
  • Will add more as they get pointed out.

So, marginal improvments are welcome, as well as full rewrites. Thanks!

like image 787
Paul Tarjan Avatar asked Aug 24 '09 06:08

Paul Tarjan


People also ask

What is a graph traversal algorithm?

In computer science, graph traversal (also known as graph search) refers to the process of visiting (checking and/or updating) each vertex in a graph. Such traversals are classified by the order in which the vertices are visited. Tree traversal is a special case of graph traversal.

Which is the best graph traversal algorithm?

The Depth First Search (DFS) is a graph traversal algorithm. In this algorithm one starting vertex is given, and when an adjacent vertex is found, it moves to that adjacent vertex first and try to traverse in the same manner.

Which is best BFS and DFS?

DFS is faster than BFS. Time Complexity of BFS = O(V+E) where V is vertices and E is edges. Time Complexity of DFS is also O(V+E) where V is vertices and E is edges.

What is the best graph search algorithm?

Dijkstra's algorithm is efficient because it only works with a smaller subset of the possible paths through a graph (i.e., it doesn't have to read through all of your data). After each node is solved, the shortest path from the start node is known and all subsequent paths build upon that knowledge.


2 Answers

To remember IDs of the users you've already visited, you need a map of a length of 250,000 integers. That's far from "too much". Just maintain such a map and only traverse through the edges that lead to the already undiscovered users, adding them to that map at the point of finding such edge.

As far I can see, you're close to implement Breadth-first search (BFS). Check google about the details of this algorithm. And, of course, do not forget about mutexes -- you'll need them.

like image 131
P Shved Avatar answered Oct 04 '22 04:10

P Shved


I am really confused as to why it takes 10 seconds to add a node to the DB. That sounds like a problem. What database are you using? Do you have severe platform restrictions?

With modern systems, and their oodles of memory, I would recommend a nice simple cache of some kind. You should be able to create a very quick cache of user information that would allow you to avoid repeated work. When you have encountered a node already, stop processing. This will avoid cycling forever in cliques.

If you need to allow for rehashing existing nodes after a while, you can use a last_visit_number which would be a global value in the dB. If the node has that number, then this crawl is the one that encountered it. If you want to automatically revisit any nodes, you just need to bump the last_visit_number before starting the crawl.

By your description, I am not quite sure how you are getting stuck.

Edit ------ I just noticed you had a concrete question. In order to increase how quickly you pull in new data, I would keep track of the number of times a given user was linked to in your data (imported or not yet imported). When choosing a user to crawl, I would pick users that have a low number of links. I would specifically go for either the lowest number of links or a random choice among the users with the lowest number of links.

Jacob

like image 28
TheJacobTaylor Avatar answered Oct 04 '22 03:10

TheJacobTaylor