
Pre-process data with multiple instances against 1 label for neural network tensorflow

I'm training a neural network to predict the fan growth of a Facebook page based on the number of posts, the category of each post (video, link, status, etc.), and the number of shares, likes, and comments for each post. So there is a single label against multiple instances, since the label (fan_growth) is calculated for each day, not for each post:

[image: raw data, one row per post, with the same fan_growth value repeated for every post made on the same day]

So if I use one-hot encoding for the categorical data:

[image: the same data with the category column one-hot encoded into separate link, video, and status columns]

Here date, day, link, video, status, reactions, comments, and shares are features, while fan_growth is the label. How can I use a single label against more than one instance? Repeating '100' for all of the first three instances would not be correct.
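For reference, this kind of one-hot encoding can be done with pandas; a minimal sketch, with made-up values and assumed column names:

import pandas as pd

# Hypothetical raw data: one row per post, fan_growth repeated for each
# post made on the same day (values are made up for illustration)
df = pd.DataFrame({
    'date':       ['5/1/2017', '5/1/2017', '5/1/2017', '5/2/2017'],
    'category':   ['video', 'link', 'status', 'video'],
    'reactions':  [250, 114, 69, 192],
    'comments':   [51, 12, 7, 33],
    'shares':     [94, 31, 2, 45],
    'fan_growth': [100, 100, 100, 87],
})

# Expand category into separate link/status/video indicator columns
df = pd.get_dummies(df, columns=['category'], prefix='', prefix_sep='')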

Nargis asked May 29 '17


1 Answer

If I understand correctly, basically you have a variable number of events that can occur in a given day (posted a video, link, or status zero or more times each), and for each of these you have the associated reactions, comments, and shares. Then you want to predict the fan growth per day based on this variable number of actions taken in a single day. Please correct me if I am wrong.

What you can do is train a recurrent neural network on variable-length sequences of data. You would structure your input data as:

x_ij = [category, reactions, comments, shares]_i for day j
i = 1, 2, ..., n_j (number of posts in day "j")
j = 1, 2, ..., N (number of days in dataset)

You can think of each x_ij as a time step in day j. Then the full input sequence for a single day would look like:

X_j = [x_1j, x_2j, ..., x_nj]

And your output vector would be Y = [y_1, y_2, ..., y_N], where each y_j is the fan growth for day j. The training process then basically involves setting up your recurrent neural network with tf.nn.dynamic_rnn, using its sequence_length argument to specify how long each input sequence actually is. It would look something like this (there are a lot of implementation details that I will skip here):

data = tf.placeholder(tf.float32, [None, None, num_features])  # padded input batches
sequence_length = tf.placeholder(tf.int32, [None])             # true posts per day
targets = tf.placeholder(tf.float32, [None, 1])                # fan growth per day
cell = tf.contrib.rnn.GRUCell(num_hidden)
# Any additional things like tf.contrib.rnn.DropoutWrapper you want here
cell = tf.contrib.rnn.OutputProjectionWrapper(cell, 1)  # only one output number, right?
output, _ = tf.nn.dynamic_rnn(cell, data, sequence_length=sequence_length, dtype=tf.float32)

Note that I use GRU cells here (TF docs) instead of LSTM (TF docs). This is partly preference, but basically a GRU can do everything an LSTM can while being a bit cheaper to compute. You will then run your training process, passing batches of data of shape [batch_size, max_steps_per_day, num_features] and a sequence_length tensor of shape [batch_size] that gives the number of steps (posts) in each day. Something like:

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for epoch in range(num_epochs):
    shuffle_training_set()
    for batch in range(num_batches):
      d = get_next_batch()         # [batch_size, max_steps_per_day, num_features]
      t = get_next_target_batch()  # fan growth for each day in the batch
      s = get_batch_lengths()      # true number of posts for each day in the batch
      sess.run(optimize, feed_dict={data: d, targets: t, sequence_length: s})
      # periodically validate and stop when you stop improving
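The batches d and s above have to come from somewhere. Since each day contains a different number of posts, one option is to zero-pad every day's sequence of posts up to the longest day and record the true lengths; a minimal NumPy sketch (pad_days is an assumed helper name, not a TensorFlow function):

import numpy as np

def pad_days(days, num_features):
    """Zero-pad a list of [n_j, num_features] per-day arrays to equal length.

    Returns a [N, max_steps, num_features] array for the data placeholder and
    an [N] vector of true lengths for the sequence_length placeholder.
    """
    max_steps = max(len(day) for day in days)
    padded = np.zeros((len(days), max_steps, num_features), dtype=np.float32)
    lengths = np.zeros(len(days), dtype=np.int32)
    for j, day in enumerate(days):
        padded[j, :len(day), :] = day
        lengths[j] = len(day)
    return padded, lengths

Helpers like get_next_batch() and get_batch_lengths() in the loop above would then just slice rows out of padded and lengths.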

Here, optimize might be defined something like:

cost = ...  # your cost function here; one possibility is sketched below
optimizer = tf.train.AdamOptimizer()  # I usually have luck with this optimizer
optimize = optimizer.minimize(cost)
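One detail worth calling out: because there is a single fan-growth value per day, the cost should compare each target against the RNN output at the last valid time step of its sequence, not at the zero-padded tail. A sketch of one way to do that with tf.gather_nd (assuming output has shape [batch_size, max_steps, 1] as set up above):

batch_size = tf.shape(output)[0]
# Pair each batch row with the index of its last valid time step
last_step = tf.stack([tf.range(batch_size), sequence_length - 1], axis=1)
last_output = tf.gather_nd(output, last_step)             # [batch_size, 1]
cost = tf.reduce_mean(tf.square(last_output - targets))   # e.g. mean squared error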

Check out this excellent example (not my content) to get you started on some of the implementation details. That example shows sequence labeling, but it should be fairly simple to modify it to predict fan growth instead.

Engineero answered Sep 24 '22