Reading Audio files into a CNN network

Hi everyone,
Sorry if this is a stupid question. I have read a great deal of literature on ANN architectures, but I am only now implementing something concrete. (I did go through the DL4J examples, but found little that I could reuse for my use case.)

I am trying to build a sound recognition tool in pure Java and DL4J, and I am having a lot of difficulty finding the right libraries and tools to use. On this GitHub post, I was directed to the 1.0.0-beta7 version of DL4J. I have been able to read the data with a WavFileRecordReader, but I am at a loss as to how to convert that into an NDArray, DataSet, or DataSetIterator (to be fed into a network). For info, the ANN implementations I am considering are LSTM and CNN.

I did manage to build an independent MFCC extractor that got my data into a 2D NDArray. However, the LSTM builder didn’t like it, so I added a dimension (not sure if that’s the way to go). The LSTM then complained that I did not have any labels. So my second question is about loading data: how exactly do I load labels into a WavFileRecordReader?

Again sorry if these seem like noob questions.

@arvindangelo could you post more of your code? Your 2D dataset won’t work; it would need to be 3D. I can’t go off of “it didn’t like it”. I can’t read your screen or your mind.
Please post code and all stack traces so I can provide more specific help.

Hi @agibsonccc ,

Thanks for your reply. It was complaining that there were no labels. I added an NDArray with 14 labels, but now the error is different:

Exception in thread "main" java.lang.IllegalArgumentException: Labels and preOutput must have equal shapes: got shapes [14] vs [2527, 14]
at org.nd4j.common.base.Preconditions.throwEx(

Here is my code:

The class SphinxUtil uses some code I modified from CMUSphinx to extract MFCCs:

private static void testWithSphinxUtil() {
        File audioFile = new File("H:\\01.PhD\\Voice Recordings\\Uccarana Sutra.wav");
        SanskritPronunciationScorer scorer = new SanskritPronunciationScorer();
        ArrayList<DoubleData> mfccList = SphinxUtil.getMfccs(audioFile.getAbsolutePath());

        double score = scorer.scorePronunciationData(mfccList, mfccList);
        System.out.println("The score for '" + audioFile.getName() + "' is " + score);
}

The function scorePronunciationData would ideally compare two sets of MFCCs and give the difference. (Instead of the real audio labels, I have used 1-14 to test.)

    public double scorePronunciationData(ArrayList<DoubleData> mfccExp, ArrayList<DoubleData> mfccStudent) {
        // Extract the MFCC values from the audio files
        double[][][] mfccs1 = MfccTo3D(mfccExp);
        double[][][] mfccs2 = MfccTo3D(mfccStudent);

        // Feed the MFCC values to the model

        DataSet training = new DataSet();

        INDArray train = Nd4j.create(mfccs1);
        // Set the labels
        String[] labels = new String[14];
        //labels[0] = "UccaranaSukta";
        labels = new String[]{"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14"};
        INDArray ndLabels = Nd4j.create(labels);

        INDArray predict = Nd4j.create(mfccs2);
        int[] scores = model.predict(predict);

        // The score for the input is the difference between the scores for the input and expert pronunciation
        double score = scores[0] - scores[1];

        return score;
    }

I understand the problem is with the labels, but I have not been able to give it the labels in the right format.

Grateful for some guidance here.

Kind Regards.

@arvindangelo your dataset creation looks wrong. You have a minibatch size x number of labels there. That looks like you’re only creating one example. Try to update your code to ensure that a proper batch is returned.
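To illustrate the point about batches: the error above compares a label array of shape [14] with a network output of shape [2527, 14], i.e. one 14-way prediction per row. A 2D (non-time-series) label array therefore needs one one-hot row per example. The following is a minimal sketch in plain Java arrays, not the poster's code; the class and method names are hypothetical, and it assumes class indices start at 0. The resulting `double[][]` could then be wrapped with `Nd4j.create(...)`.

```java
final class OneHotLabels2D {

    /**
     * Builds a [numExamples, numClasses] one-hot matrix.
     * classIndices[i] is the (0-based) class of example i.
     */
    static double[][] oneHot(int[] classIndices, int numClasses) {
        double[][] labels = new double[classIndices.length][numClasses];
        for (int i = 0; i < classIndices.length; i++) {
            int c = classIndices[i];
            if (c < 0 || c >= numClasses) {
                throw new IllegalArgumentException("Class index out of range: " + c);
            }
            labels[i][c] = 1.0; // every other entry in the row stays 0.0
        }
        return labels;
    }

    public static void main(String[] args) {
        // Three examples belonging to classes 0, 3 and 13, with 14 classes in total.
        double[][] labels = oneHot(new int[] {0, 3, 13}, 14);
        System.out.println(labels.length + " x " + labels[0].length);
    }
}
```

The first dimension is the minibatch (number of examples) and the second is the number of classes, which is the "minibatch size x number of labels" layout mentioned above.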

Hi @agibsonccc, yes, I was thinking I could get it running with the simplest case first: reading only one audio file. Is the issue that my dataset has only one audio file? Or is it that the labels must also be a 3D array? What is the role of the minibatch vs. the labels?

Again, sorry if my questions seem stupid. I’m still exploring DL4J.

Kind Regards.

@arvindangelo labels must also be 3d. You’re building a time series. You need a label per time step.
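As a sketch of what "a label per time step" looks like in practice: DL4J's recurrent layers use the layout [miniBatch, size, timeSteps] by default, so per-step labels for a classifier are one one-hot vector of length numClasses for every time step. The example below uses plain Java arrays; the class and method names are hypothetical, and the all-zeros class sequence is made up for illustration.

```java
final class SequenceLabels {

    /**
     * Builds a [miniBatch, numClasses, timeSteps] one-hot label array.
     * classPerStep[b][t] is the (0-based) class of example b at time step t.
     */
    static double[][][] oneHotPerStep(int[][] classPerStep, int numClasses) {
        int miniBatch = classPerStep.length;
        int timeSteps = classPerStep[0].length;
        double[][][] labels = new double[miniBatch][numClasses][timeSteps];
        for (int b = 0; b < miniBatch; b++) {
            if (classPerStep[b].length != timeSteps) {
                throw new IllegalArgumentException("All sequences must have the same length");
            }
            for (int t = 0; t < timeSteps; t++) {
                int c = classPerStep[b][t];
                if (c < 0 || c >= numClasses) {
                    throw new IllegalArgumentException("Class index out of range: " + c);
                }
                labels[b][c][t] = 1.0;
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // One recording, 976 time steps, every step labelled as class 0 (illustrative only).
        int[][] classes = new int[1][976];
        double[][][] labels = oneHotPerStep(classes, 14);
        System.out.println(labels.length + " x " + labels[0].length + " x " + labels[0][0].length);
    }
}
```

With T time steps and N classes, this gives T label vectors of length N per recording, not one label per feature value.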

Yes, I get your point. I’ve managed to make a 3D array for the feature vector, and the network now asks for the correct number of inputs, so there is progress.
If I get you right, in a 3D array [30,1,5000], I have 30 inputs (rows) of height 1, over 5000 time steps?
Will I need 5000 labels or 15000 labels?

Hello @agibsonccc,

I made a 3D NDArray that has 30 rows (one per MFCC value) as features and 976 time steps (matching the time steps extracted from my test file). The shape is [1,30,976].

I also created another 3d NDArray for the labels. Since I have 14 categories, I have a shape of [14,30,976] for the labels.

I also set the values of column 0 in the labels to 1, to indicate that the features match the first category.

However, when fitting the model, I got the below exception:

java.lang.IllegalArgumentException: Labels and preOutput must have equal shapes: got shapes [14, 30, 976] vs [1, 14]

	at org.nd4j.common.base.Preconditions.throwEx(
	at org.nd4j.linalg.lossfunctions.impl.LossMCXENT.computeGradient(
	at org.deeplearning4j.nn.layers.BaseOutputLayer.getGradientsAndDelta(
	at org.deeplearning4j.nn.layers.BaseOutputLayer.backpropGradient(
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.calcBackpropGradients(
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.computeGradientAndScore(
	at org.deeplearning4j.optimize.solvers.BaseOptimizer.gradientAndScore(
	at org.deeplearning4j.optimize.solvers.StochasticGradientDescent.optimize(
	at org.deeplearning4j.optimize.Solver.optimize(
	at org.deeplearning4j.nn.multilayer.MultiLayerNetwork.fitHelper(

I believe it might be my configuration that is wrong. This is my code to build the LSTM:

 private MultiLayerConfiguration buildLSTM() {
        return new NeuralNetConfiguration.Builder().seed(123)

                .weightInit(WeightInit.XAVIER).updater(new Adam(1e-3)).list()
                .layer(0, new LastTimeStep(new LSTM.Builder()

                .layer(1, new OutputLayer.Builder().nIn(N_HIDDEN)

Any help is appreciated. Note that I didn’t implement a DataSetIterator, but instead pulled the MFCCs as a float array and then called Nd4j.create.
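For the "float array, then Nd4j.create" route described above, the main thing to get right is the axis order. MFCC extractors usually yield one coefficient vector per frame, i.e. frames[t][c], while the recurrent layout used in this thread is [miniBatch, features, timeSteps]. A minimal sketch of that rearrangement in plain Java follows; the class and method names are hypothetical, and it assumes every frame has the same number of coefficients.

```java
final class MfccLayout {

    /**
     * Rearranges frames[t][c] (one coefficient vector per frame)
     * into a single-example tensor of shape [1, numCoeffs, timeSteps].
     */
    static double[][][] toSequenceTensor(double[][] frames) {
        int timeSteps = frames.length;
        int numCoeffs = frames[0].length;
        double[][][] tensor = new double[1][numCoeffs][timeSteps];
        for (int t = 0; t < timeSteps; t++) {
            if (frames[t].length != numCoeffs) {
                throw new IllegalArgumentException("Frame " + t + " has a different length");
            }
            for (int c = 0; c < numCoeffs; c++) {
                tensor[0][c][t] = frames[t][c]; // swap the frame and coefficient axes
            }
        }
        return tensor;
    }

    public static void main(String[] args) {
        double[][] frames = new double[976][30]; // 976 frames of 30 coefficients (zeros here)
        double[][][] tensor = toSequenceTensor(frames);
        System.out.println(tensor.length + " x " + tensor[0].length + " x " + tensor[0][0].length);
    }
}
```

The explicit copy matters because simply reshaping a [976, 30] array to [1, 30, 976] would keep the element count but scramble which value belongs to which time step.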

@arvindangelo that tells you exactly what the problem is. You’ll need to ensure that the shapes and the output are the same. If the labels are correct, then fix your output sizes.

I hope I’m not sounding stupid here, but I’m having a lot of issues when trying to change the shapes. Any modification to the shape throws an error, either saying the RNN expects a 3D shape or that the inputs don’t match the number of inputs on the network.

So basically, I think I’m dealing with three variables: the input shape of the dataset, the input shape of the network, and the shape of the labels, which must all match for the program to run.

Is my understanding correct that the labels are expected in shape [1,14]? If that is the case, I would need some help making the labels match the ANN shape, which I believe is [1,30,976], unless I’m mistaken.

Is there some kind of debugging library that can show me the shapes in real time? Else does DL4J have methods that I can use to compare the shapes while debugging or while running?
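On comparing shapes while running: ND4J arrays expose their dimensions through `INDArray.shape()`, which returns a `long[]`, so a plain `Arrays.toString(...)` print or comparison before calling `fit` is usually enough. The sketch below shows the same idea on plain Java arrays so it runs without any dependencies; the class and method names are hypothetical, and it assumes non-empty rectangular arrays.

```java
import java.util.Arrays;

final class ShapeCheck {

    /** Returns the shape of a non-empty, rectangular 3D array. */
    static long[] shapeOf(double[][][] a) {
        return new long[] {a.length, a[0].length, a[0][0].length};
    }

    /** Describes whether the label shape matches the shape the network will output. */
    static String describe(long[] labelShape, long[] outputShape) {
        if (Arrays.equals(labelShape, outputShape)) {
            return "shapes match: " + Arrays.toString(labelShape);
        }
        return "labels " + Arrays.toString(labelShape)
                + " vs output " + Arrays.toString(outputShape);
    }

    public static void main(String[] args) {
        long[] labels = shapeOf(new double[14][30][976]);
        long[] output = {1, 14};
        System.out.println(describe(labels, output));
    }
}
```

Logging the feature, label, and expected output shapes side by side before training makes a mismatch visible earlier than the exception from the loss function.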

Due to lack of time, I had to implement my model in Python. I noticed the Python code just extracts the MFCCs to CSV and then uses the CSV files as inputs to the model. Is that the approach favoured for DL4J? If so, would it be easier to use CSVSequenceRecordReader?

@arvindangelo actually you’re using a 2D output layer. If you want 3D output, you can’t do that. You need to use an RnnOutputLayer.
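To make the two options concrete: with the configuration posted earlier (an LSTM wrapped in LastTimeStep, followed by an OutputLayer), the network emits one prediction per sequence, which is why the exception reports an output of [1, 14]. With an RnnOutputLayer it emits one prediction per time step. The sketch below only computes the label shape each design expects from a feature shape of [miniBatch, features, timeSteps]; the class and method names are hypothetical, and no DL4J calls are involved.

```java
import java.util.Arrays;

final class LabelShapes {

    /**
     * Returns the label shape expected for features of shape [miniBatch, features, timeSteps].
     * labelPerTimeStep = false: one label per sequence   -> [miniBatch, numClasses]
     * labelPerTimeStep = true:  one label per time step  -> [miniBatch, numClasses, timeSteps]
     */
    static long[] expectedLabelShape(long[] featureShape, int numClasses, boolean labelPerTimeStep) {
        if (featureShape.length != 3) {
            throw new IllegalArgumentException("Expected 3D features, got rank " + featureShape.length);
        }
        long miniBatch = featureShape[0];
        long timeSteps = featureShape[2];
        return labelPerTimeStep
                ? new long[] {miniBatch, numClasses, timeSteps}
                : new long[] {miniBatch, numClasses};
    }

    public static void main(String[] args) {
        long[] features = {1, 30, 976};
        System.out.println(Arrays.toString(expectedLabelShape(features, 14, false)));
        System.out.println(Arrays.toString(expectedLabelShape(features, 14, true)));
    }
}
```

In neither design does the number of classes go in the first dimension, and the 30 input features never appear in the label shape.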

Regarding the rest, yes, you can inspect any and all shapes in the debugger. I can tell you that this isn’t going to be your bottleneck, though. It’s what I keep mentioning: you don’t seem to know what you want here. If you want time series, then do time series (3D) and understand how to set that up.

If you want 2d, then go for 2d and use the appropriate layers.

To save yourself some time, I ask that you spend some time clearly understanding your own objective and stating that here rather than guessing.

Regardless of what you end up with here…googling MFCC java gave me this:

This actually does render 2d. You could just use a library for that.

Regarding the rest…I don’t know why you are doing this but neural networks don’t work the way you’re describing. You don’t let the network tell you what the labels are and you try to guess. You break the problem down 1 step at a time and you figure out how to specify the problem and the network that matches that problem. You don’t let the network tell you what to do.

You need to figure out for yourself what you want and then figure out how to describe it. If your problem doesn’t require time series don’t do that. If it does, then yes you can use the csv approach.
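If the CSV route is chosen, the usual convention for DataVec's CSVSequenceRecordReader is one file per sequence with one time step per line. A minimal sketch of writing MFCC frames in that layout with plain JDK classes follows; the class name, method name, and temporary file are hypothetical, and the frame values are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class MfccCsv {

    /** Converts frames[t][c] into CSV lines: one line per time step, one column per coefficient. */
    static List<String> toCsvLines(double[][] frames) {
        List<String> lines = new ArrayList<>();
        for (double[] frame : frames) {
            StringBuilder sb = new StringBuilder();
            for (int c = 0; c < frame.length; c++) {
                if (c > 0) {
                    sb.append(',');
                }
                sb.append(frame[c]);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        double[][] frames = {{1.0, 2.5}, {3.0, 4.0}}; // two time steps, two coefficients
        Path file = Files.createTempFile("mfcc_sequence_", ".csv"); // one file per recording
        Files.write(file, toCsvLines(frames));
        System.out.println("Wrote " + frames.length + " time steps to " + file);
    }
}
```

Labels would go into a parallel set of files in the same one-line-per-time-step layout, or as a single class index per recording, depending on which of the two problem formulations is chosen.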

I just ask that you consider learning how to formulate your problem, understand what it is you want and try to reduce your guess work.

Going back to the origin of your question on whether to use CNNs or not, you would want cnn1d for that, which would use 3D. Your problem should work as long as you have a time series length of 1. Consider our RNN tutorial if you want an overview of time series data, if you want to work with that:

However, if you do want to go 2d then just go with dense layers and see how well that does first.

Whatever it is you decide, please clearly explain to yourself (and then to me) what your problem requires and what it is you want to do.

Hi @agibsonccc, apologies for the late reply; I have been pondering how to explain my problem. My basic task is to score the pronunciation of recordings in a specific language (Sanskrit) using deep neural networks. From my reading, research tends to point towards LSTM for this, though CNN seems to work better in my own Python implementations of LSTM and CNN.
Since this is a research project, one aspect is to have the same implementations in Python and Java, to be able to give some metrics on the performance.

Right now it seems I am having a problem loading the labels and matching them to the data.

Some questions I have:

Must the labels be an NDArray that exactly matches the shape of the data? E.g., if I have an NDArray of shape [1,30,976] (30 inputs per step, 976 total time steps) and 14 different labels, must my labels be an NDArray of shape [14,30,976] or [1,14,976]?

Does reshaping an NDArray allow any shape as long as the total number of data elements remains the same?

I’m curious to know whether there are visualisation libraries available that could map the labels or model data in 3D space.
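On the reshaping question: a reshape is valid whenever the product of the dimensions is unchanged, but it only reinterprets the same elements in storage order (row-major, which is ND4J's default 'c' order); it does not swap axes. The sketch below shows the difference between reshaping and transposing on plain Java arrays; the class and method names are hypothetical.

```java
import java.util.Arrays;

final class ReshapeDemo {

    /** Total number of elements implied by a shape. */
    static long elementCount(long[] shape) {
        long n = 1;
        for (long d : shape) {
            n *= d;
        }
        return n;
    }

    /** A reshape is allowed only if both shapes hold the same number of elements. */
    static boolean canReshape(long[] from, long[] to) {
        return elementCount(from) == elementCount(to);
    }

    /** Row-major reshape of a flat buffer into [rows, cols]. */
    static double[][] reshape2d(double[] flat, int rows, int cols) {
        if ((long) rows * cols != flat.length) {
            throw new IllegalArgumentException("Element count mismatch");
        }
        double[][] out = new double[rows][cols];
        for (int i = 0; i < flat.length; i++) {
            out[i / cols][i % cols] = flat[i];
        }
        return out;
    }

    /** Swaps the two axes of a rectangular matrix. */
    static double[][] transpose(double[][] m) {
        double[][] out = new double[m[0].length][m.length];
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[0].length; c++) {
                out[c][r] = m[r][c];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] flat = {1, 2, 3, 4, 5, 6};
        double[][] original = reshape2d(flat, 2, 3);
        System.out.println(canReshape(new long[] {976, 30}, new long[] {1, 30, 976}));
        System.out.println(Arrays.deepToString(reshape2d(flat, 3, 2)));
        System.out.println(Arrays.deepToString(transpose(original)));
    }
}
```

So reshaping [976, 30] to [1, 30, 976] is permitted, but the values end up on the wrong axes; moving the time axis requires a transpose or permute, or an explicit copy.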

Kind Regards

Thank you. I did some reading on that and went through your answer again, and suddenly it seems much simpler. I will be back with the results of my tests. :+1:

@arvindangelo if you have Python code, I’m happy to take a look at it with you. If another framework you’re using does something you expect that we don’t, please do let me know. We try to be compatible wherever possible because of our model import features.

That would be nice. I’m hoping to implement in Java exactly what I’ve done in Python. It isn’t anything very complicated, just perceptron, LSTM, and CNN implementations of phoneme recognition.

It is a research project, so once everything is working, I will be looking to measure execution speeds on identical data and to get some visualisations for inclusion in reports and papers. I’ve seen some interesting visualisation tools in Python and would be interested to know whether such tools are available in DL4J.

One option I believe would be feasible is to export the model from DL4J into Python-compatible data, to be used directly with the Python visualisation tools.

Kind Regards,

@agibsonccc I’d be grateful if you could let me know how we should proceed. Can I email you the Python code?

@arvindangelo please DM and I’ll look.

Apologies for the delay. OK, I’ll DM you right away.