How to check if a task is already in a Python Queue?

I'm writing a simple crawler in Python using the threading and Queue modules. I fetch a page, extract its links, and put them into a queue; when a thread has finished processing a page, it grabs the next one from the queue. I keep a list of the pages I've already visited and use it to filter the links I add to the queue. But when more than one thread is running and they find the same link on different pages, they put duplicate links into the queue. How can I find out whether a URL is already in the queue, so that I can avoid putting it there again?

Fluffy, asked Oct 17 '09 10:10



2 Answers

If you don't care about the order in which items are processed, I'd try a subclass of Queue that uses a set internally:

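from queue import Queue  # Python 2: from Queue import Queue

class SetQueue(Queue):

    def _init(self, maxsize):
        self.maxsize = maxsize
        # Store pending items in a set instead of the default deque
        self.queue = set()

    def _put(self, item):
        # Adding an item that is already pending has no effect
        self.queue.add(item)

    def _get(self):
        # set.pop() removes and returns an arbitrary item
        return self.queue.pop()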
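
A minimal sketch of how this set-backed queue behaves, assuming Python 3 (where the module is named queue) and a made-up example.com URL. It only filters items that are still pending, so an item that has already been taken out is accepted again:

```python
from queue import Queue


class SetQueue(Queue):
    """Set-backed queue from the answer above (order is not preserved)."""

    def _init(self, maxsize):
        self.maxsize = maxsize
        self.queue = set()

    def _put(self, item):
        self.queue.add(item)

    def _get(self):
        return self.queue.pop()


q = SetQueue()
url = "http://example.com/a"  # hypothetical URL, for illustration only

q.put(url)
q.put(url)                     # still pending, so the set absorbs it
pending = q.qsize()            # 1

item = q.get()                 # the queue is now empty
q.put(item)                    # no longer pending, so it is accepted again
after_readd = q.qsize()        # 1
```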

As Paul McGuire pointed out, this version would allow a duplicate item to be added after it has been removed from the "to-be-processed" set but before it has been added to the "processed" set. To solve this, you can store both sets in the Queue instance. But since the larger set is the one you check to see whether an item has already been seen, the pending items no longer need to be a set, so you can go back to the regular queue, which also keeps requests in order:

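from queue import Queue  # Python 2: from Queue import Queue

class SetQueue(Queue):

    def _init(self, maxsize):
        Queue._init(self, maxsize)
        # Every item ever accepted, whether pending or already processed
        self.all_items = set()

    def _put(self, item):
        # Silently drop anything that has been queued before
        if item not in self.all_items:
            Queue._put(self, item)
            self.all_items.add(item)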
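
A short usage sketch for this version, assuming Python 3 and hypothetical example.com URLs. It shows that a duplicate is dropped both while the original is still pending and after it has been taken out of the queue:

```python
from queue import Queue


class SetQueue(Queue):
    """FIFO queue that remembers every item it has ever accepted."""

    def _init(self, maxsize):
        Queue._init(self, maxsize)
        self.all_items = set()

    def _put(self, item):
        if item not in self.all_items:
            Queue._put(self, item)
            self.all_items.add(item)


q = SetQueue()
q.put("http://example.com/a")  # hypothetical URLs, for illustration only
q.put("http://example.com/b")
q.put("http://example.com/a")  # duplicate of a pending item, dropped
pending = q.qsize()            # 2

first = q.get()                # FIFO order: "http://example.com/a"
q.put(first)                   # already seen, dropped again
remaining = q.qsize()          # 1
```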

The advantage of this subclass over keeping a separate set is that Queue calls _put while holding its internal lock, so the membership check and the insertion happen as one step and you don't need additional locking around all_items.
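
To illustrate the locking point, here is a small sketch in which several threads push the same links at once. It assumes Python 3; the thread count and the example.com URLs are made up for the demonstration:

```python
import threading
from queue import Queue


class SetQueue(Queue):
    """FIFO queue that remembers every item it has ever accepted."""

    def _init(self, maxsize):
        Queue._init(self, maxsize)
        self.all_items = set()

    def _put(self, item):
        # Runs with the queue's lock held, so check-and-add cannot interleave
        if item not in self.all_items:
            Queue._put(self, item)
            self.all_items.add(item)


q = SetQueue()
urls = ["http://example.com/page%d" % i for i in range(50)]  # hypothetical


def producer():
    # Every thread "discovers" the same 50 links
    for url in urls:
        q.put(url)


threads = [threading.Thread(target=producer) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

unique_count = q.qsize()  # 50, not 400
```

No explicit lock appears in the producer code; the deduplication relies entirely on the lock that Queue.put already takes.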

Lukáš Lalinský, answered Oct 08 '22 15:10


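The put method also needs to be overridden; otherwise a join call will block forever, because Queue.put increments the unfinished-task counter even when the item was a duplicate and nothing new was queued: https://github.com/python/cpython/blob/master/Lib/queue.py#L147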

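from queue import Queue  # Python 2: from Queue import Queue

class UniqueQueue(Queue):

    def put(self, item, block=True, timeout=None):
        # Skip duplicates before Queue.put() counts them as unfinished
        # tasks, so that join() can return. Note that this check runs
        # without the queue's lock held.
        if item not in self.queue:
            Queue.put(self, item, block, timeout)

    def _init(self, maxsize):
        self.queue = set()

    def _put(self, item):
        self.queue.add(item)

    def _get(self):
        return self.queue.pop()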
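
The difference can be seen by looking at the counter that join waits on. The sketch below assumes Python 3 and CPython's queue implementation, where that counter is the instance attribute unfinished_tasks; the URLs and the worker function are hypothetical:

```python
import threading
from queue import Queue, Empty


class SetQueue(Queue):
    """Variant that filters duplicates only inside _put."""

    def _init(self, maxsize):
        Queue._init(self, maxsize)
        self.all_items = set()

    def _put(self, item):
        if item not in self.all_items:
            Queue._put(self, item)
            self.all_items.add(item)


class UniqueQueue(Queue):
    """Variant that also filters duplicates in put()."""

    def put(self, item, block=True, timeout=None):
        if item not in self.queue:
            Queue.put(self, item, block, timeout)

    def _init(self, maxsize):
        self.queue = set()

    def _put(self, item):
        self.queue.add(item)

    def _get(self):
        return self.queue.pop()


urls = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]

leaky = SetQueue()
for url in urls:
    leaky.put(url)
leaky_size = leaky.qsize()                  # 2 items actually queued
leaky_unfinished = leaky.unfinished_tasks   # 3 counted, so join() would hang

q = UniqueQueue()
for url in urls:
    q.put(url)
counted = q.unfinished_tasks                # 2, matches the queue size

processed = []


def worker():
    # Drain the queue, then exit once it stays empty for a moment
    while True:
        try:
            item = q.get(timeout=0.1)
        except Empty:
            return
        processed.append(item)
        q.task_done()


t = threading.Thread(target=worker)
t.start()
q.join()   # returns once task_done() has been called for both items
t.join()
```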

Andrei Tofan, answered Oct 08 '22 16:10