A user can import his data from other websites. All he needs to do is type in his username on the foreign website and we'll grab all pictures and save it into his own gallery. Some of the pictures needs to be transformed with rMagick (rotating,watermarking), that depends on the importer (depends on which website the user chooses to import data from)
We are discussing the sexiest and most flexible way to do so. We are using carrierwave, but we will change to paperclip in case it fits us more.
The current structure does looks like (its roughly pseudocode)
module Importer
class Website1
def grab_pictures
end
end
class Website2
def grab_pictures
end
end
end
class ImporterJob
def perform(user, type, foreign_username)
pictures = Importer::type.grab_pictures(foreign_username)
pictures.each do |picture|
user.pictures.create picture
end
end
end
We struggle with the decision, whats the best return of the importer.
The Importer is returning an array of strings with URLs ["http://...", "http://...", "http://..."]. That array we can easily loop and tell carrierwave/paperclip to remote_download the images. After that, we'll run a processor to transform the pictures, if we need to.
def get_picture_urls username
pictures = []
page = get_html(username)
page.scan(/\/p\/\d{4}-\d{2}\/#{username}\/[\w\d]{32}-thumb.jpg/).each do |path|
pictures << path
end
pictures.uniq.collect{|x| "http://www.somewebsite.com/#{x.gsub(/medium|thumb/, "big")}"}
end
this actually returns an array ["url_to_image", "url_to_image", "url_to_image"]
Then in the Picture.after_create, we call something to remove the Watermark on that Image.
grab_pictures is downloading each picture to an tempfile and transform it. it will return an array of tempfiles [tempfile, tempfile, tempfile]
code for that is:
def read_pictures username
pictures = []
page = get_html(username)
page.scan(/\/p\/\d{4}-\d{2}\/#{username}\/[a-z0-9]{32}-thumb.jpg/).each do |path|
pictures << path
end
pictures.uniq.map { |pic_url| remove_logo(pic_url) }
end
def remove_logo pic_url
big = Magick::Image.from_blob(@agent.get(pic_url.gsub(/medium.jpg|thumb.jpg/, 'big.jpg')).body).first
# ... do some transformation and watermarking
file = Tempfile.new(['tempfile', '.jpg'])
result.write(file.path)
file
end
This actually returns an array of [Tempfile, Tempfile, Tempfile]
The result will be the same for the user - but internally we are discovering 2 different ways of data handling.
We want to keep logic where it belongs and work as generic as possible.
Can you guys help us with choosing the right way? Longterm we want to have around 15 differnt Importers.
To use the module, you have to import it using the import keyword. The function or variables present inside the file can be used in another file by importing the module.
The 4 ways to import a moduleImport the whole module using its original name: pycon import random. Import specific things from the module: pycon from random import choice, randint. Import the whole module and rename it, usually using a shorter variable name: pycon import pandas as pd.
You need to use the import keyword along with the desired module name. When interpreter comes across an import statement, it imports the module to your current program. You can use the functions inside a module by using a dot(.) operator along with the module name.
The __init__.py files are required to make Python treat directories containing the file as packages. This prevents directories with a common name, such as string , unintentionally hiding valid modules that occur later on the module search path.
I've had a similar situation to this recently - I recommend an array of strings for several reasons:
Familiarity: How often are you working with tempfiles? What about the other developers on your team? How easy is it to manipulate strings vs manipulating tempfiles?
Flexibility: Now you want to just process the picture, but maybe in the future you'll need to keep track of the picture id for each picture from the external site. That's trivial with an array of strings. With an array of tempfiles, it's more difficult (just how much depends, but the fact is it will be more difficult). That of course goes for other as-yet-unknown objectives as well.
Speed: It's faster and uses less disk space to process an array of strings than a group of files. That's perhaps a small issue, but if you get flooded with a lot of photos at the same time, it could be a consideration depending on your environment.
Ultimately, the best thing I can say is start with strings, make a few importers, and then see how it looks and feels. Pretend you're a project manager or a client - start making strange, potentially unreasonable demands of the data you've collected. How easy will it be for you to meet those demands with your current implementation? Would it be easier if you were using tempfiles?
I am doing this for a similar project, where I have to browse and get information on different websites. On each of those websites I have to reach for same goal by performing roughly the same actions, and they are off-course all structured differently.
The solution is inspired from the basic principles of OOP:
Main class: handle the high level operations, handle database operations, handle images operation, manage errors
class MainClass
def import
# Main method, prepare the download and loop through each images
log_in
go_to_images_page
images = get_list_of_images
images.each do |url|
begin
image_record = download_image url
transform_image image_record
rescue
manage_error
end
end
display_logs
send_emails
end
def download_image(url)
# Once the specific class returned the images url, this common method
# Is responsible for downloading and creating database record
record = Image.new picture: url
record.save!
record
end
def transform_image(record)
# Transformation is common so this method sits in the main class
record.watermark!
end
# ... the same for all commom methods (manage_error, display_logs, ...)
end
Specific classes (one per targeted website) : handle low lovel operations and return data to the main class. The only interraction this class must have is with the website, meaning no database access and no error management as much as possible (don't get stuck by your design ;))
Note: In my design I simply inherit from the MainClass, but you can use module inclusion if you prefer.
class Target1Site < MainClass
def log_in
# Perform specific action in website to log the use in
visit '/log_in'
fill_in :user_name, with: ENV['user_name']
...
end
def go_to_images_page
# Go to specific url
visit '/account/gallery'
end
def get_list_of_images
# Use specific css paths
images = all :css, 'div#image-listing img'
images.collect{|i| i['src']}
end
# ...
end
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With