Abstract problem : I have a graph of about 250,000 nodes and the average connectivity is around 10. Finding a node's connections is a long process (let's say 10 seconds). Saving a node to the database also takes about 10 seconds. I can check very quickly whether a node is already present in the DB. Allowing for concurrency, but never having more than 10 long requests in flight at a time, how would you traverse the graph to gain the highest coverage the quickest?
Concrete problem : I'm trying to scrape a website's user pages. To discover new users I fetch the friend lists of already-known users. I've already imported about 10% of the graph, but I keep getting stuck in cycles or using too much memory remembering too many nodes.
My current implementation :
import datetime
import random
import sys
import time

# ThreadPool, models, and alias_view come from elsewhere in the project.

def run():
    import_pool = ThreadPool(10)
    user_pool = ThreadPool(1)
    do_user("arcaneCoder", import_pool, user_pool)

def do_user(user, import_pool, user_pool):
    id = user
    alias = models.Alias.get(id)

    # if it's been updated in the last 7 days
    if alias and alias.modified + datetime.timedelta(days=7) > datetime.datetime.now():
        sys.stderr.write("Skipping: %s\n" % user)
    else:
        sys.stderr.write("Importing: %s\n" % user)
        while import_pool.num_jobs() > 20:
            print "Too many queued jobs, sleeping"
            time.sleep(15)
        import_pool.add_job(alias_view.import_id, [id],
                            lambda rv: sys.stderr.write("Done Importing %s\n" % user))

    sys.stderr.write("Crawling: %s\n" % user)
    users = crawl(id, 5)
    if len(users) >= 2:
        for user in random.sample(users, 2):
            if user_pool.num_jobs() < 100:
                user_pool.add_job(do_user, [user, import_pool, user_pool])

def crawl(id, limit=50):
    '''returns the first 'limit' friends of a user'''
    *not relevant*
Problems of current implementation :
So, marginal improvements are welcome, as well as full rewrites. Thanks!
In computer science, graph traversal (also known as graph search) refers to the process of visiting (checking and/or updating) each vertex in a graph. Such traversals are classified by the order in which the vertices are visited. Tree traversal is a special case of graph traversal.
Depth-first search (DFS) is a graph traversal algorithm. Given a starting vertex, it moves to an adjacent vertex as soon as one is found and continues traversing from that vertex in the same manner, backtracking when no unvisited neighbours remain.
Neither traversal is asymptotically faster than the other: the time complexity of BFS is O(V+E), where V is the number of vertices and E the number of edges, and the time complexity of DFS is likewise O(V+E).
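For illustration (not from the original post), both traversals can be written over a plain adjacency-list dict; only the frontier data structure differs, a FIFO queue for BFS and a LIFO stack for DFS:

from collections import deque

def bfs(graph, start):
    """Visit every vertex reachable from start, closest-first (FIFO frontier)."""
    visited = {start}
    frontier = deque([start])
    order = []
    while frontier:
        node = frontier.popleft()
        order.append(node)
        for neighbour in graph.get(node, ()):
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append(neighbour)
    return order

def dfs(graph, start):
    """Visit every vertex reachable from start, deepest-first (LIFO frontier)."""
    visited = set()
    frontier = [start]
    order = []
    while frontier:
        node = frontier.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # push unvisited neighbours; the last one pushed is explored first
        frontier.extend(n for n in graph.get(node, ()) if n not in visited)
    return order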
Dijkstra's algorithm is efficient because it only works with a small subset of the possible paths through a graph (i.e., it doesn't have to enumerate all of them). After each node is solved, the shortest path to it from the start node is known, and all subsequent paths build upon that knowledge.
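A minimal sketch of that idea, assuming the graph is given as a dict of weighted adjacency lists (the names are illustrative, not from the question); once a node is popped from the heap its distance from the start is final, which is the "solved" property described above:

import heapq

def dijkstra(graph, start):
    """graph: {node: [(neighbour, edge_weight), ...]}. Returns shortest distances."""
    dist = {start: 0}
    heap = [(0, start)]            # (distance from start, node)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue               # stale entry; node was already solved via a shorter path
        for neighbour, weight in graph.get(node, ()):
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist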
To remember the IDs of the users you've already visited, you need a map of about 250,000 integers. That's far from "too much". Just maintain such a map and only traverse edges that lead to not-yet-discovered users, adding them to the map at the moment you find such an edge.
As far as I can see, you're close to implementing breadth-first search (BFS). Check Google for the details of this algorithm. And, of course, don't forget about mutexes -- you'll need them.
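A minimal sketch of that idea, assuming a fetch_friends(user) helper that wraps the slow (~10 s) friend-list request, with the 10-request limit from the question:

import queue
import threading

def bfs_crawl(seed, fetch_friends, num_workers=10):
    """Breadth-first crawl: a shared visited set guarded by a mutex, plus a work queue."""
    visited = {seed}                  # ~250k user IDs is a small in-memory set
    lock = threading.Lock()
    work = queue.Queue()
    work.put(seed)

    def worker():
        while True:
            user = work.get()
            if user is None:          # poison pill: shut down this worker
                work.task_done()
                return
            for friend in fetch_friends(user):     # the slow (~10 s) request
                with lock:
                    if friend not in visited:      # only follow edges to undiscovered users
                        visited.add(friend)
                        work.put(friend)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    work.join()                       # block until every queued user has been expanded
    for _ in threads:
        work.put(None)                # release the workers
    for t in threads:
        t.join()
    return visited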
I am really confused as to why it takes 10 seconds to add a node to the DB. That sounds like a problem. What database are you using? Do you have severe platform restrictions?
With modern systems, and their oodles of memory, I would recommend a nice simple cache of some kind. You should be able to create a very quick cache of user information that would allow you to avoid repeated work. When you have encountered a node already, stop processing. This will avoid cycling forever in cliques.
If you need to allow for re-crawling existing nodes after a while, you can use a last_visit_number, which would be a global value in the DB. If the node has that number, then this crawl is the one that encountered it. If you want to automatically revisit any nodes, you just need to bump last_visit_number before starting the crawl.
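A rough sketch of that bookkeeping; get_global_visit_number, set_global_visit_number, and node.visit_number are placeholder names for illustration, not an existing API:

# One global counter stored alongside the graph; bump it to force a full re-crawl.
CURRENT_CRAWL = get_global_visit_number() + 1  # placeholder helper
set_global_visit_number(CURRENT_CRAWL)         # placeholder helper

def should_process(node):
    """Process a node only if this crawl has not touched it yet."""
    if node.visit_number == CURRENT_CRAWL:
        return False                  # already encountered during this crawl: stop, avoids cycles
    node.visit_number = CURRENT_CRAWL  # placeholder field on the stored node
    node.save()
    return True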
By your description, I am not quite sure how you are getting stuck.
Edit: I just noticed you had a concrete question. In order to increase how quickly you pull in new data, I would keep track of the number of times a given user is linked to in your data (imported or not yet imported). When choosing a user to crawl, I would pick users with a low number of links. I would specifically go for either the lowest number of links or a random choice among the users with the lowest number of links.
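A small sketch of that selection policy; link_counts is an assumed in-memory dict mapping each known-but-not-yet-crawled user to how many times they have been linked to so far:

import random

def record_links(link_counts, friends, already_imported):
    """After crawling one user, bump the link count of every friend not yet imported."""
    for friend in friends:
        if friend not in already_imported:
            link_counts[friend] = link_counts.get(friend, 0) + 1

def pick_next_user(link_counts):
    """Prefer users that have been linked to the fewest times so far (dict must be non-empty)."""
    fewest = min(link_counts.values())
    candidates = [user for user, count in link_counts.items() if count == fewest]
    return random.choice(candidates)   # random choice among the least-linked users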
Jacob