
Multithreaded rake task

I'm writing a rake task that would be called every minute (possibly every 30 seconds in the future) by Whenever, and it contacts a polling API endpoint (per user in our database). Obviously, this is not efficient run as a single thread, but is it possible to multithread? If not, is there a good event-based HTTP library that would be able to get the job done?

sleepy_keita asked Oct 07 '12 10:10





2 Answers

I'm writing a rake task that would be called every minute (possibly every 30 seconds in the future) by Whenever

Beware of Rails startup times: booting the application every minute is expensive, so it might be better to use a long-running background job system such as Resque or Sidekiq. Resque has https://github.com/bvandenbos/resque-scheduler, which should be able to do what you need. I can't speak for Sidekiq, but I'm sure it has something similar available (Sidekiq is much newer than Resque).

Obviously, this is not efficient run as a single thread, but is it possible to multithread? If not, is there a good event-based HTTP library that would be able to get the job done?

I'd suggest you look at ActiveRecord's batch finders (find_each and find_in_batches) for ways to make your finder more efficient. Once you have your batches, you can easily do something with threads, such as:

#
# find_in_batches yields 1000 records at a time by default;
# pass :batch_size to tune that for larger (or smaller)
# batches depending on your available RAM
#
User.find_in_batches(batch_size: 50) do |batch_of_users|
  #
  # Each batch is an Array of users, always smaller than
  # or equal to the batch size chosen above
  #
  #
  # We collect a bunch of new threads, one for each user
  #
  batch_threads = batch_of_users.collect do |user|
    #
    # We pass the user to the thread, this is good
    # habit for shared variables, in this case
    # it doesn't make much difference
    #
    Thread.new(user) do |u|
      #
      # Do the API call here, using `u` (not `user`)
      # to access the user instance
      #
      # We shouldn't need an evented HTTP library:
      # Ruby threads give up control while they wait on IO
      # and get it back when the scheduler decides, so
      # 99% of the time HTTP and other network IO is the
      # kind of work that benefits most from threads in Ruby.
      #
    end
  end
  #
  # Joining threads means waiting for them to finish
  # before moving on to the next batch.
  #
  batch_threads.each(&:join)
end

This will start no more than batch_size threads at a time, waiting for every thread in a batch to finish before starting the next batch.
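
To try the batch-and-join pattern outside Rails, here is a minimal sketch. It assumes a plain array of IDs in place of the user records and a hypothetical `poll_user` method standing in for the real API call; `each_slice` plays the role of the batch finder.

```ruby
# Hypothetical stand-in for the real per-user API call.
def poll_user(user_id)
  sleep 0.01 # simulate network latency
  "polled #{user_id}"
end

# Runs poll_user for every ID, with at most batch_size threads alive at once.
def poll_in_batches(user_ids, batch_size)
  results = []
  user_ids.each_slice(batch_size) do |batch|
    threads = batch.map do |user_id|
      Thread.new(user_id) { |id| poll_user(id) }
    end
    # Thread#value joins the thread and returns the block's result,
    # so the next batch only starts once this one has finished.
    results.concat(threads.map(&:value))
  end
  results
end

puts poll_in_batches((1..10).to_a, 4).size
```

Collecting results through `Thread#value` keeps them in the same order as the input, which avoids sharing a mutable array between threads.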

It would be possible to do something like this without batching, but then you would have an uncontrollable number of threads. There's an alternative you might benefit from here. It gets a lot more complicated, including a ThreadPool and a shared list of work to do. I've posted it on GitHub so as not to spam Stack Overflow: https://gist.github.com/6767fbad1f0a66fa90ac
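
For a rough idea of what a thread pool with a shared list of work looks like, here is a minimal sketch. It is not the code from the gist; it assumes Ruby 2.3 or newer (for `Queue#close`), plain IDs instead of user records, and a hypothetical `poll_user` method in place of the real API call.

```ruby
# Hypothetical stand-in for the real per-user API call.
def poll_user(user_id)
  sleep 0.01 # simulate network latency
  "polled #{user_id}"
end

# Processes every ID with a fixed number of worker threads that pull
# work from a shared, thread-safe queue.
def poll_with_pool(user_ids, pool_size)
  work = Queue.new
  user_ids.each { |id| work << id }
  work.close # once closed and empty, pop returns nil instead of blocking

  results = Queue.new
  workers = Array.new(pool_size) do
    Thread.new do
      # Each worker keeps taking IDs until the queue is drained.
      while (id = work.pop)
        results << [id, poll_user(id)]
      end
    end
  end
  workers.each(&:join)

  # Drain the results queue into a hash keyed by user ID.
  Array.new(results.size) { results.pop }.to_h
end

puts poll_with_pool((1..10).to_a, 3).size
```

Unlike the batch approach, a slow request here only holds up one worker rather than the whole batch, and the number of threads stays fixed at `pool_size` however many users there are.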

Lee Hambley answered Sep 30 '22 05:09



I would suggest using Sidekiq, which is great at multithreading. You can then enqueue a separate job per user for polling the API. Clockwork can be used to make the jobs you enqueue recurring.

axsuul answered Sep 30 '22 07:09
