Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to structure Machine Learning projects using Object Oriented programming in Python? [closed]

Tags:

I have observed that staticians and machine learning scientist generally doesnt follow OOPS for ML/data science projects when using Python (or other languages).

Mostly it should be due to lack of understanding of best software engineering practises in oops while developing ML code for production. Because they mostly come from math & statistics education background than computer science.

Days when ML scientist develop ad hoc protype code and another software team make it production ready are over in the industry.

enter image description here

Questions

  1. How do we structure code using OOP for ML project?
  2. Should every major task (from picture above) like data cleaning, feature transformation, grid search, model validation etc. be a individual class? What are the recommended code design practises for ML?
  3. Any good github links with well strcutured code for reference (may be a well written kaggle solution)
  4. Should every class like data cleaning have fit(), transform(), fit_transform() function for every process like remove_missing(), outlier_removal()? When this is done why is scikit-learn BaseEstimator be usually inherited?
  5. What should be the structure of typical config file for ML projects in production?
like image 989
GeorgeOfTheRF Avatar asked Oct 28 '17 13:10

GeorgeOfTheRF


People also ask

Do I need object-oriented programming for machine learning?

The use of OOP is entirely optional in Machine Learning as we already have libraries like Scikit-learn and TensorFlow from where we can easily use algorithms. So learning Object-Oriented Programming for Machine Learning is not that necessary, but as a programmer, you should not limit yourself.

Is it possible for Python to be used for OOP object-oriented programming )?

OOP in Python. Python is a great programming language that supports OOP. You will use it to define a class with attributes and methods, which you will then call. Python offers a number of benefits compared to other programming languages like Java, C++ or R.


1 Answers

You are right about one thing being special about ML: data scientists are generally clever people, so they have no problem in presenting their ideas in code. The problem is that they tend to create fire&forget code, because they lack software development craftsmanship - but ideally this shouldn't be the case.

When writing code it shouldn't make any difference what the code is for1. ML is just another domain like anything else, and should follow clean code principles.

The most important aspect always should be SOLID. Many important aspects directly follow: maintainability, readability, flexibility, testability, reliability etc. What you can add to this mix of features is risk of change. It doesn't matter whether a piece of code is pure ML, or banking business logic, or audiological algorithm for a hearing instrument. All the same - the implementation will be read by other developers, will contain bugs to fix, will be tested (hopefully) and possibly refactored and extended.

Let me try to explain this in more detail while addressing some of your questions:

1,2) You shouldn't think that OOP is the goal in itself. If there is a concept that can be modeled as a class and this will make its usage easy for other developers, it will be readable, easy to extend, easy to test, easy to avoid bugs then of course - make it a class. But unless it's needed, you shouldn't follow the BDUF antipattern. Start with free functions and evolve into a better interface if needed.

4) Such complex inheritance hierarchies are typically created to allow implementation to be extensible (see "O" from SOLID). In this case, library users can inherit BaseEstimator and it's easy to see what methods can they override and how this will fit into scikit's existing structure.

5) Almost the same principles as for code, but with people who will create/edit these config files in mind. What will be the easiest format for them? How to choose parameter names so it will be obvious what do they mean, even for a beginner, who is just starting to use your product?

All these things should be combined with guessing how likely is it that this piece of code will be changed/extended in the future? If you are sure something should be written in stone, don't worry about all aspects too much (e.g. focus only on readability), and direct your efforts to more critical parts of the implementation.

To sum up: think about people who will interact with what you create in the future. In case of products/config files/user interfaces it should be always "user first". In case of code, try to put yourself in the shoes of a future developer who will want to fix/extend/understand your code.


1 There are of course some special cases, like code that needs to be formally proven correct or extensively documented because of formal regulations and this main goal imposes some particular constructs/practices.

like image 130
BartoszKP Avatar answered Sep 22 '22 12:09

BartoszKP