Text Classification, need advice

Hi all,

I am trying to classify text into 4 categories. For the sake of illustration, the categories (labels) are as follows:
-booking
-weather
-timer
-bestSelling

Now, for each category, I have 120 documents for model training. For example, for the “booking” category:

        "I want to secure a flight for my journey.",
        "Can you help me reserve a table for dinner this evening?",
        "Please arrange a hotel room for the weekend.",
        "I need to purchase a ticket for the upcoming festival.",

…and so on

-Those training documents generate a vocabulary of 850+ distinct words in total.
-Each category contributes roughly the same number of distinct words.
-Here is the configuration I am using:

      MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
              .seed(123)
              .updater(new Adam(0.01))
              .list()
              // input layer: one unit per vocabulary term (bag-of-words vector)
              .layer(new DenseLayer.Builder()
                      .nIn(vocabulary.size())
                      .nOut(20)
                      .activation(Activation.RELU)
                      .build())
              .layer(new DenseLayer.Builder()
                      .nIn(20)
                      .nOut(15)
                      .activation(Activation.RELU)
                      .build())
              .layer(new DenseLayer.Builder()
                      .nIn(15)
                      .nOut(12)
                      .activation(Activation.RELU)
                      .build())
              .layer(new DenseLayer.Builder()
                      .nIn(12)
                      .nOut(10)
                      .activation(Activation.RELU)
                      .build())
              // output layer: softmax over the 4 categories
              .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                      .activation(Activation.SOFTMAX)
                      .nIn(10)
                      .nOut(4)
                      .build())
              .build();
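
For reference, a minimal training/evaluation loop around this configuration would look roughly like the sketch below. trainIter and testIter are assumed DataSetIterators over the featurized documents and their one-hot labels (which I have not shown); import paths are for recent DL4J versions:

      import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
      import org.deeplearning4j.optimize.listeners.ScoreIterationListener;
      import org.nd4j.evaluation.classification.Evaluation;
      import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

      // build and initialize the network from the configuration above
      MultiLayerNetwork model = new MultiLayerNetwork(conf);
      model.init();
      model.setListeners(new ScoreIterationListener(10)); // log the training score every 10 iterations

      // train for a fixed number of epochs over the bag-of-words training data
      int numEpochs = 50;
      for (int epoch = 0; epoch < numEpochs; epoch++) {
          model.fit(trainIter);
      }

      // evaluate on held-out documents: accuracy, precision/recall and the confusion matrix
      Evaluation eval = model.evaluate(testIter);
      System.out.println(eval.stats());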


The problem:
Regardless of the configuration used (number of dense layers, number of nodes, etc.), the model's performance is very poor, and the predicted probabilities across the categories are almost always nearly identical, for example:

Predicted class probabilities: [0.2515209913253784, 0.2594231963157654, 0.24817872047424316, 0.24087709188461304]

The questions:
-Are 120 documents per category enough for model training?
-What rules should I follow when choosing a configuration?
-Do you see any problem in what I have described that is persistently eluding me?

@Java_Developer what’s the class distribution of each label? You might need to apply regularization or some sort of resampling if your model isn’t generalizing well.
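
In DL4J, regularization goes directly on the configuration builder. Below is a rough, untuned sketch of where L2 weight decay and dropout would sit, trimmed to a single hidden layer to keep it short (it reuses the imports from your configuration plus org.deeplearning4j.nn.weights.WeightInit; 1e-4 and 0.5 are just illustrative starting values, not tuned for your data):

      MultiLayerConfiguration regConf = new NeuralNetConfiguration.Builder()
              .seed(123)
              .updater(new Adam(0.01))
              .weightInit(WeightInit.XAVIER) // explicit weight initialization
              .l2(1e-4)                      // L2 weight decay applied to all layers
              .list()
              .layer(new DenseLayer.Builder()
                      .nIn(vocabulary.size())
                      .nOut(20)
                      .activation(Activation.RELU)
                      .dropOut(0.5)          // in DL4J this is the probability of *retaining* an activation
                      .build())
              .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                      .activation(Activation.SOFTMAX)
                      .nIn(20)
                      .nOut(4)
                      .build())
              .build();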

I take it you’re using bag of words for the classification?
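
If so, a bag-of-words feature vector for one document would look roughly like this sketch (the vocabulary map, the tokenization, and the toy word list are assumptions on my part, since the featurization code isn't shown):

      import java.util.HashMap;
      import java.util.Map;

      import org.nd4j.linalg.api.ndarray.INDArray;
      import org.nd4j.linalg.factory.Nd4j;

      // word -> column index; in practice built once from all 4 x 120 training documents
      Map<String, Integer> vocabulary = new HashMap<>();
      for (String w : new String[]{"i", "want", "to", "secure", "a", "flight", "for", "my", "journey"}) {
          vocabulary.putIfAbsent(w, vocabulary.size());
      }

      // turn one document into a fixed-length term-count vector
      String document = "I want to secure a flight for my journey.";
      double[] counts = new double[vocabulary.size()];
      for (String token : document.toLowerCase().split("\\W+")) {
          Integer idx = vocabulary.get(token);
          if (idx != null) {
              counts[idx] += 1.0; // raw count; binary or TF-IDF weighting are common alternatives
          }
      }

      // shape [1, vocabulary.size()] matches nIn(vocabulary.size()) in the first dense layer
      INDArray features = Nd4j.create(counts).reshape(1, counts.length);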