Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to write a flexible importer module

A user can import his data from other websites. All he needs to do is type in his username on the foreign website and we'll grab all pictures and save it into his own gallery. Some of the pictures needs to be transformed with rMagick (rotating,watermarking), that depends on the importer (depends on which website the user chooses to import data from)

We are discussing the sexiest and most flexible way to do so. We are using carrierwave, but we will change to paperclip in case it fits us more.

Importer Structure

The current structure does looks like (its roughly pseudocode)

module Importer
  class Website1
    def grab_pictures
    end
  end

  class Website2
    def grab_pictures
    end
  end
end

class ImporterJob        
  def perform(user, type, foreign_username)
    pictures = Importer::type.grab_pictures(foreign_username)

    pictures.each do |picture|
      user.pictures.create picture 
    end
  end
end

We struggle with the decision, whats the best return of the importer.

Solution1:

The Importer is returning an array of strings with URLs ["http://...", "http://...", "http://..."]. That array we can easily loop and tell carrierwave/paperclip to remote_download the images. After that, we'll run a processor to transform the pictures, if we need to.

 def get_picture_urls username
  pictures = []
  page = get_html(username)
  page.scan(/\/p\/\d{4}-\d{2}\/#{username}\/[\w\d]{32}-thumb.jpg/).each do |path|
    pictures << path
  end
  pictures.uniq.collect{|x| "http://www.somewebsite.com/#{x.gsub(/medium|thumb/, "big")}"}
end

this actually returns an array ["url_to_image", "url_to_image", "url_to_image"]

Then in the Picture.after_create, we call something to remove the Watermark on that Image.

Solution2:

grab_pictures is downloading each picture to an tempfile and transform it. it will return an array of tempfiles [tempfile, tempfile, tempfile]

code for that is:

def read_pictures username
  pictures = []
  page = get_html(username)
  page.scan(/\/p\/\d{4}-\d{2}\/#{username}\/[a-z0-9]{32}-thumb.jpg/).each do |path|
    pictures << path
  end
  pictures.uniq.map { |pic_url| remove_logo(pic_url) }
end

def remove_logo pic_url
    big = Magick::Image.from_blob(@agent.get(pic_url.gsub(/medium.jpg|thumb.jpg/, 'big.jpg')).body).first
    # ... do some transformation and watermarking
    file = Tempfile.new(['tempfile', '.jpg'])
    result.write(file.path)
    file
  end

This actually returns an array of [Tempfile, Tempfile, Tempfile]

Summary

The result will be the same for the user - but internally we are discovering 2 different ways of data handling.

We want to keep logic where it belongs and work as generic as possible.

Can you guys help us with choosing the right way? Longterm we want to have around 15 differnt Importers.

like image 289
Tim Kretschmer Avatar asked Aug 12 '15 22:08

Tim Kretschmer


People also ask

Which is the correct way to import modules?

To use the module, you have to import it using the import keyword. The function or variables present inside the file can be used in another file by importing the module.

What are the two ways of importing module?

The 4 ways to import a moduleImport the whole module using its original name: pycon import random. Import specific things from the module: pycon from random import choice, randint. Import the whole module and rename it, usually using a shorter variable name: pycon import pandas as pd.

How do you write an import statement in Python?

You need to use the import keyword along with the desired module name. When interpreter comes across an import statement, it imports the module to your current program. You can use the functions inside a module by using a dot(.) operator along with the module name.

Why do we create __ init __ py?

The __init__.py files are required to make Python treat directories containing the file as packages. This prevents directories with a common name, such as string , unintentionally hiding valid modules that occur later on the module search path.


2 Answers

I've had a similar situation to this recently - I recommend an array of strings for several reasons:

  1. Familiarity: How often are you working with tempfiles? What about the other developers on your team? How easy is it to manipulate strings vs manipulating tempfiles?

  2. Flexibility: Now you want to just process the picture, but maybe in the future you'll need to keep track of the picture id for each picture from the external site. That's trivial with an array of strings. With an array of tempfiles, it's more difficult (just how much depends, but the fact is it will be more difficult). That of course goes for other as-yet-unknown objectives as well.

  3. Speed: It's faster and uses less disk space to process an array of strings than a group of files. That's perhaps a small issue, but if you get flooded with a lot of photos at the same time, it could be a consideration depending on your environment.

Ultimately, the best thing I can say is start with strings, make a few importers, and then see how it looks and feels. Pretend you're a project manager or a client - start making strange, potentially unreasonable demands of the data you've collected. How easy will it be for you to meet those demands with your current implementation? Would it be easier if you were using tempfiles?

like image 64
dax Avatar answered Oct 14 '22 23:10

dax


I am doing this for a similar project, where I have to browse and get information on different websites. On each of those websites I have to reach for same goal by performing roughly the same actions, and they are off-course all structured differently.

The solution is inspired from the basic principles of OOP:

Main class: handle the high level operations, handle database operations, handle images operation, manage errors

class MainClass

  def import
    # Main method, prepare the download and loop through each images
    log_in
    go_to_images_page
    images = get_list_of_images
    images.each do |url|
      begin
        image_record = download_image url
        transform_image image_record
      rescue
        manage_error
      end
    end
    display_logs
    send_emails
  end

  def download_image(url)
    # Once the specific class returned the images url, this common method
    # Is responsible for downloading and creating database record
    record = Image.new picture: url 
    record.save!
    record
  end  

  def transform_image(record)
    # Transformation is common so this method sits in the main class
    record.watermark!
  end   

  # ... the same for all commom methods (manage_error, display_logs, ...)

end

Specific classes (one per targeted website) : handle low lovel operations and return data to the main class. The only interraction this class must have is with the website, meaning no database access and no error management as much as possible (don't get stuck by your design ;))

Note: In my design I simply inherit from the MainClass, but you can use module inclusion if you prefer.

class Target1Site < MainClass
  def log_in
    # Perform specific action in website to log the use in
    visit '/log_in'
    fill_in :user_name, with: ENV['user_name']
    ...
  end

  def go_to_images_page
    # Go to specific url
    visit '/account/gallery'
  end

  def get_list_of_images
    # Use specific css paths
    images = all :css, 'div#image-listing img'
    images.collect{|i| i['src']}
  end

  # ...

end
like image 43
Benj Avatar answered Oct 14 '22 21:10

Benj