Is this the right approach to the problem? multilabel classification / multiple output layers

Suppose I work for a national pizza company. We have a large chain of 24/7 pizza restaurants. I want to predict when any given customer is most likely to buy a pizza. These are all people that have ordered from us before. They can order pizzas multiple times a day.

  1. I divide the 24-hour day into twelve 2-hour time slots.

  2. For the training data I create one input record per customer. Each line looks like this:
    1,0,4,0,0,1,0,0,0,3,0,0
    which is the total number of pizzas the customer has ever bought in each time slot: 1 pizza in time slot 0, 4 pizzas in time slot 2, etc.

  3. The training labels look like this:
    1,0,1,0,0,1,0,0,0,1,0,0
    i.e. a 0/1 flag per time slot indicating whether that customer has ever ordered a pizza in that slot. (A sketch of how the features and labels are packaged for training follows the model configuration below.)

  4. The model looks like this: a simple classifier with 12 output layers, one per time slot (a customer can, after all, order pizzas in more than one slot on the same day). Each output layer is binary, just a yes/no on whether they would buy a pizza in that time slot.

    // Imports (at the top of the class file)
    import org.deeplearning4j.nn.conf.ComputationGraphConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.inputs.InputType;
    import org.deeplearning4j.nn.conf.layers.DenseLayer;
    import org.deeplearning4j.nn.conf.layers.OutputLayer;
    import org.deeplearning4j.nn.weights.WeightInit;
    import org.nd4j.linalg.activations.Activation;
    import org.nd4j.linalg.learning.config.Sgd;
    import org.nd4j.linalg.lossfunctions.LossFunctions;

    // Shared hidden layer feeding 12 binary output layers, one per 2-hour time slot
    ComputationGraphConfiguration.GraphBuilder builder = new NeuralNetConfiguration.Builder()
        .updater(new Sgd(0.01))
        .graphBuilder()
        .addInputs("input")
        .addLayer("L1", new DenseLayer.Builder()
            .nIn(columnsInput)
            .nOut(layer1Size)
            .weightInit(WeightInit.XAVIER) // all-zero weights would keep every hidden unit identical
            .activation(Activation.SIGMOID)
            .build(), "input");

    // 12 identical output layers: sigmoid + binary cross-entropy, one unit each,
    // matching the single 0/1 label per time slot
    String[] outputNames = new String[12];
    for (int slot = 0; slot < 12; slot++) {
        outputNames[slot] = "out" + (slot + 1);
        builder.addLayer(outputNames[slot], new OutputLayer.Builder()
            .activation(Activation.SIGMOID)
            .lossFunction(LossFunctions.LossFunction.XENT)
            .nIn(layer1Size).nOut(1)
            .build(), "L1");
    }

    ComputationGraphConfiguration conf = builder
        .setOutputs(outputNames)
        .setInputTypes(InputType.feedForward(columnsInput)) // feature count, not batch size
        .build();
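
For completeness: a ComputationGraph with one input and twelve outputs is trained on MultiDataSet objects, one feature array plus one label array per output layer. Below is a minimal sketch of building a single training example; names such as toTrainingExample and orderCounts are placeholders, not part of the snippet above.

    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.dataset.MultiDataSet;
    import org.nd4j.linalg.factory.Nd4j;

    // Builds one training example for the 1-input / 12-output graph above
    // from a customer's lifetime order counts per 2-hour slot.
    static MultiDataSet toTrainingExample(double[] orderCounts) {
        // Feature array of shape [1, 12]: pizzas ever bought in each slot
        INDArray features = Nd4j.create(new double[][]{orderCounts});

        // One binary label array per output layer, shape [1, 1]:
        // 1 if the customer has ever ordered in that slot, 0 otherwise
        INDArray[] labels = new INDArray[orderCounts.length];
        for (int slot = 0; slot < orderCounts.length; slot++) {
            labels[slot] = Nd4j.create(new double[][]{{orderCounts[slot] > 0 ? 1.0 : 0.0}});
        }

        return new MultiDataSet(new INDArray[]{features}, labels);
    }

ComputationGraph.fit accepts these MultiDataSet examples directly (for instance via a MultiDataSetIterator).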

After training, the model will take any new customer's pizza-ordering record (in the same format as the training records) and return a probability for each time slot (one per output layer).
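
Concretely, prediction with the trained graph would look roughly like this (a sketch, assuming the usual ComputationGraph and ND4J imports; model and newCustomerCounts are placeholder names):

    ComputationGraph model = new ComputationGraph(conf); // conf from the configuration above
    model.init();
    // ... model.fit(...) over the per-customer MultiDataSet examples ...

    // New customer's per-slot purchase counts, same 12-column format as the training features
    INDArray newCustomerCounts = Nd4j.create(new double[][]{{1, 0, 4, 0, 0, 1, 0, 0, 0, 3, 0, 0}});

    // output() returns one INDArray per output layer: 12 arrays, each holding the
    // probability that this customer orders in that time slot
    INDArray[] slotProbabilities = model.output(newCustomerCounts);
    for (int slot = 0; slot < slotProbabilities.length; slot++) {
        System.out.printf("slot %d: p(order) = %.3f%n", slot, slotProbabilities[slot].getDouble(0));
    }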

Does this model make sense, or should I have configured it differently? Imagine a scenario where a customer is presented with a coupon in any time slot where they are predicted to buy a pizza. Obviously this model mostly just encodes their purchase history, but it can then be built on by adding further features such as geography, age, etc.

And to be clear, the problem I am working on is simplified here due to confidentiality. It is not actually about real-life pizzas, but the issues are otherwise identical.

One of the definitions of a neural network:

If y = f(x), a neural network is able to approximate the function f, given examples of x and y.
With this said, the model you have and the approach you are trying make little sense to me.

For example: your only features are time slots. Okay, so we are assuming that all days of the week are equal? That all months/seasons are equal as well?

Then, your input has no information about the user themselves. All the information you provide is the number of orders previously made in each time slot. Okay, let's accept that feature vector. But how is the neural network going to know what the current time is? That is an important question, since your hypothesis says the only differentiator is the current time slot…

On the other hand, if all you want is a prediction of whether somebody has already ordered something in a given time slot, a neural network is pointless, because your features explicitly contain that information. All you need is a simple IF in your SQL query rather than a neural network; say, return 1 if count(slot) > 0 else 0 in Python.
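
Put differently, the mapping from your proposed features to your proposed labels is already a deterministic rule; a few lines (placeholder names) reproduce it exactly, with no learning involved:

    // The label for each slot is just "order count > 0"; nothing has to be learned.
    static int[] predictSlots(int[] orderCounts) {
        int[] labels = new int[orderCounts.length];
        for (int slot = 0; slot < orderCounts.length; slot++) {
            labels[slot] = orderCounts[slot] > 0 ? 1 : 0;
        }
        return labels;
    }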

P.S. You don't need 12 independent output layers. You could use a single output layer with a 12-unit vector, one unit per time slot. That wouldn't change the problems explained above, but it would make the model simpler and more human-friendly.
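
For illustration, that single-output variant could be configured roughly like this (a sketch reusing columnsInput and layer1Size from your snippet; twelve sigmoid units with cross-entropy loss give one independent probability per slot):

    // One output layer with 12 sigmoid units (multi-label) instead of 12 output layers
    ComputationGraphConfiguration singleOutputConf = new NeuralNetConfiguration.Builder()
        .updater(new Sgd(0.01))
        .graphBuilder()
        .addInputs("input")
        .addLayer("L1", new DenseLayer.Builder()
            .nIn(columnsInput).nOut(layer1Size)
            .weightInit(WeightInit.XAVIER)
            .activation(Activation.SIGMOID)
            .build(), "input")
        .addLayer("out", new OutputLayer.Builder()
            .activation(Activation.SIGMOID)
            .lossFunction(LossFunctions.LossFunction.XENT)
            .nIn(layer1Size).nOut(12) // one probability per time slot
            .build(), "L1")
        .setOutputs("out")
        .build();

The labels then stay exactly as you already have them: a single 12-column 0/1 vector per customer.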

Well, that is a significant change in the problem formulation, but not significant enough: the information about the potential buyer is still very sparse.

It might make sense to reformulate this problem as volume prediction per restaurant. But that still depends on how accurately you built the analogy; in a field like model design, every bit of information matters.

Also, it might be a better idea to start with something that is inherently explainable (e.g. decision trees) to get a proper baseline first.

With this in mind, your proposed input cannot be used to reach a satisfying solution. Since the answer is already contained in your input, the network will learn to use just that information and ignore everything else. When you have a new customer, the history will be empty, so at best the model will tell you that the customer is unlikely to order anything, even if their profile otherwise indicates that they are very similar to customers you already have.

Machine learning lacks common sense. You have to apply it yourself, because any model will try to cheat the system and use the answer you give it as part of the question.