How to approach Machine Learning problems with dynamically sized input collection?

I'm working on a problem where I try to classify data samples as good or bad quality using machine learning.

The data samples are stored in a relational database. A sample has attributes such as id, name, number of up-votes (an indicator of good/bad quality), number of comments, etc. There is also a second table of items, each with a foreign key pointing to a data sample id. Each item has a weight and a name. Taken together, the items pointing to a data sample characterize that sample and should help classify it. The problem is that the number of items pointing to one data sample differs from sample to sample.
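
For illustration, here is a minimal sketch of how such data might look once loaded into memory (the table and column names are hypothetical, not the ones in my actual database):

```python
import pandas as pd

# Hypothetical "samples" table: one row per data sample
samples = pd.DataFrame({
    "id": [1, 2],
    "name": ["sample_a", "sample_b"],
    "upvotes": [12, 3],
    "comments": [4, 0],
})

# Hypothetical "items" table: a variable number of rows per sample,
# linked to the samples table via the foreign key "sample_id"
items = pd.DataFrame({
    "sample_id": [1, 1, 1, 2],   # sample 1 has 3 items, sample 2 has only 1
    "name": ["foo", "bar", "baz", "qux"],
    "weight": [0.5, 1.2, 0.3, 2.0],
})
```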

I want to feed the items that point to a specific data sample into a machine learning model, e.g. a neural network. The problem is that the number of items varies, so I don't know how many input nodes I need.

Q1) Is it possible to use neural networks when the input dimension is dynamic? If so, how?

Q2) Are there any best practices for feeding a network with a list of tuples, when the length of the list is unknown?

Q3) Are there any best practices for applying machine learning to relational databases?

asked Dec 06 '12 by user822448
People also ask

Which neural network layer helps in handling variable size inputs?

A fully convolutional neural network can do that. The parameters of convolutional layers are the convolutional kernels, and a kernel largely does not care about the input size (though there are some constraints related to stride, padding, and kernel size).
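
A minimal sketch of this idea, assuming PyTorch: convolutional layers followed by a global (adaptive) pooling layer produce a fixed-size representation regardless of the input length.

```python
import torch
import torch.nn as nn

class VariableLengthClassifier(nn.Module):
    """Fully convolutional 1-D classifier: accepts sequences of any length."""
    def __init__(self, in_channels: int = 2, num_classes: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Global pooling collapses the variable-length dimension to size 1
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.head = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length) -- length may differ between calls
        h = self.pool(self.conv(x)).squeeze(-1)
        return self.head(h)

model = VariableLengthClassifier()
print(model(torch.randn(1, 2, 5)).shape)   # torch.Size([1, 2])
print(model(torch.randn(1, 2, 50)).shape)  # same output shape for a longer input
```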

How do you know you've collected enough samples to train your ML model?

This depends a lot on the nature of the data and the prediction you are trying to make, but as a simple rule to start with, your training data should be roughly 10X the number of your model parameters. For instance, while training a logistic regression with N features, try to start with 10N training instances.
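
As a rough worked example of that rule of thumb (a heuristic, not a guarantee):

```python
n_features = 20               # a logistic regression with N = 20 features
rule_of_thumb = 10 * n_features
print(rule_of_thumb)          # start with roughly 200 labelled training instances
```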

Which is the better technique where training data set is not available?

Synthetic data is used mostly when there is not enough real data, or there is not enough real data for specific patterns you know about. Its usage is mostly the same for training and testing datasets. Synthetic Minority Over-sampling Technique (SMOTE) and Modified-SMOTE are two techniques which generate synthetic data.
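
A minimal sketch of oversampling with SMOTE, assuming the imbalanced-learn package is available:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset: roughly 5% positive class
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE synthesises new minority-class points by interpolating between neighbours
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```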


1 Answer

There's a field of machine learning called Inductive Logic Programming that deals exclusively with relational data. In your case, if you wish to use a neural network, you would want to transform your relational data set to a propositional data set (single table) - i.e., a table with a fixed number of attributes that can be fed into a neural network or any other propositional learner. These techniques usually construct so-called first-order features, which capture the data from secondary tables. Further, you need to do this only for inducing your learner - once you have the features and the learner, you can evaluate these features for new data points on-the-fly.
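
A minimal sketch of the propositionalisation step, using pandas and scikit-learn with hypothetical table and column names: the variable-length item lists are collapsed into a fixed set of aggregate features per sample, which can then be fed to any propositional learner.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical tables (in practice loaded from the relational database)
samples = pd.DataFrame({
    "id": [1, 2, 3],
    "upvotes": [12, 3, 7],
    "comments": [4, 0, 2],
    "good_quality": [1, 0, 1],          # label to predict
})
items = pd.DataFrame({
    "sample_id": [1, 1, 1, 2, 3, 3],    # variable number of items per sample
    "weight": [0.5, 1.2, 0.3, 2.0, 0.7, 0.9],
})

# Collapse the variable-length item lists into fixed-size aggregate features
item_feats = items.groupby("sample_id")["weight"].agg(
    item_count="count", weight_sum="sum", weight_mean="mean", weight_max="max"
)

# Join the aggregates back onto the sample table (samples with no items get 0)
data = samples.join(item_feats, on="id").fillna(0)

X = data.drop(columns=["id", "good_quality"])
y = data["good_quality"]

clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict(X))
```

The same aggregation code can be re-run on new samples at prediction time, so the fixed-size feature vector is always available on-the-fly.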

Here's an overview paper of some techniques that can be used for such a problem. If you have any further questions, ask away.

answered Sep 30 '22 by tempi