Modifying UCI Example

Hi
I am looking at the UCISequenceClassificationExample (UCI Machine Learning Repository: Synthetic Control Chart Time Series data set).

I have tried to modify the code to input data from 2 files held on my computer.
Unfortunately I end up with the features .csv files being row vectors rather than column vectors.

Could somebody assist in performing these simple changes to the original code?

Thank you

Bob M

Do I understand you correctly that you are trying to reuse the example with your own data?

In that case all you have to do is ensure that your own data follows the same format as the data used in the example.

You’ll have to provide more details about what exactly you’ve tried for us to help you better.

The features file I am using for input should reflect the original file
The label file is very simple and is loading correctly
This is my modified code:

//This method downloads the data, and converts the "one time series per line" format into a suitable
    //CSV sequence format that DataVec (CsvSequenceRecordReader) and DL4J can read.
    private static void downloadUCIData() throws Exception {
        if (baseDir.exists()) return;    //Data already exists, don't download it again

        String url = "https://archive.ics.uci.edu/ml/machine-learning-databases/synthetic_control-mld/synthetic_control.data";
        String data = IOUtils.toString(new URL(url), (Charset) null);

        String[] lines = data.split("\n");

        //Create directories
        baseDir.mkdir();
        baseTrainDir.mkdir();
        featuresDirTrain.mkdir();
        labelsDirTrain.mkdir();
        baseTestDir.mkdir();
        featuresDirTest.mkdir();
        labelsDirTest.mkdir();

        int lineCount = 0;
        List<Pair<String, Integer>> contentAndLabels = new ArrayList<>();
        for (String line : lines) {
            String transposed = line.replaceAll(" +", "\n");

            //Labels: first 100 examples (lines) are label 0, second 100 examples are label 1, and so on
            contentAndLabels.add(new Pair<>(transposed, lineCount++ / 100));
        }
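        // ... (the original method continues: shuffle the pairs, then write the train/test files)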

//This method loads the data, and converts the "one time series per line" format into a suitable
    //CSV sequence format that DataVec (CsvSequenceRecordReader) and DL4J can read.
    private static void downloadUCIData() throws Exception {
        if (baseDir.exists()) return;    //Data already exists, don't download it again

        // data
        String url1 = "file:///C:/MATLAB Script Files/SyntheticControl/data.csv";
        String data = IOUtils.toString(new URL(url1), (Charset) null);

        String[] lines1 = data.split("\n");

        // categories
        String url2 = "file:///C:/MATLAB Script Files/SyntheticControl/cats.csv";
        String cats = IOUtils.toString(new URL(url2), (Charset) null);

        String[] lines2 = cats.split("\n");

        //Create directories
        baseDir.mkdir();
        baseTrainDir.mkdir();
        featuresDirTrain.mkdir();
        labelsDirTrain.mkdir();
        baseTestDir.mkdir();
        featuresDirTest.mkdir();
        labelsDirTest.mkdir();

        int lineCount = 0;
        List<Pair<String, String>> DataAndCats = new ArrayList<>();
        for (String line : lines1) {
            String transposed = line.replaceAll(" +", "\n");

            DataAndCats.add(new Pair<>(transposed, lines2[lineCount++]));
        }

and the remainder of the code is as follows:-

//Do a train/test split:
        int nTrain = 450;   //75% train, 25% test
        int trainCount = 0;
        int testCount = 0;
        for (Pair<String, String> p : DataAndCats) {
            //Write output in a format we can read, in the appropriate locations
            File outPathFeatures;
            File outPathLabels;
            if (trainCount < nTrain) {
                outPathFeatures = new File(featuresDirTrain, trainCount + ".csv");
                outPathLabels = new File(labelsDirTrain, trainCount + ".csv");
                trainCount++;
            } else {
                outPathFeatures = new File(featuresDirTest, testCount + ".csv");
                outPathLabels = new File(labelsDirTest, testCount + ".csv");
                testCount++;
            }

            String trans = String.join("\n",p.getFirst().split(" "));
            FileUtils.writeStringToFile(outPathFeatures, trans, (Charset) null);
            FileUtils.writeStringToFile(outPathLabels, p.getSecond().toString(), (Charset) null);
        }
    }
}

It appears that you are putting in a lot of effort so that you don’t have to understand how data loading works.

Instead of trying to find a way to adapt the UCI data preparation process, let’s try to actually use things as they are supposed to be used.

What kind of data do you have? And what exactly are you trying to do with it?

The data I have is the same as the data in the example, with the following exceptions:

I have shuffled the data so that the training set is a good mixture of labels
I have created a .csv file for the labels, and they are loading OK

It is just the features .csv file that is a row vector rather than a column vector

Bob M

The UCI data needs to be in the order that it is, or else it isn’t going to produce the correct sequences.

The example “transposes” the data to create correct sequences.

If you are creating your data anyway, it would be a lot easier if you just created it in the format the CSVSequenceRecordReader expects, instead of going through the transformation that is needed for the UCI data.
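For illustration, here is a minimal sketch of that approach (this is not code from the example; the file paths and the featuresDir/labelsDir variables are placeholders, it assumes one comma- or space-separated sequence per line in the data file and one label per line in the labels file, and it uses java.nio.file.Files/Paths and java.nio.charset.StandardCharsets on top of the example's imports):

List<String> featureLines = Files.readAllLines(Paths.get("C:/Synthetic Control/data.csv"));
List<String> labelLines = Files.readAllLines(Paths.get("C:/Synthetic Control/cats.csv"));

for (int i = 0; i < featureLines.size(); i++) {
    // Split on commas or whitespace, then rejoin with newlines:
    // one value per line = one time step per line, i.e. a column vector
    String columnVector = String.join("\n", featureLines.get(i).trim().split("[,\\s]+"));
    FileUtils.writeStringToFile(new File(featuresDir, i + ".csv"), columnVector, StandardCharsets.UTF_8);

    // A single label on a single line for the whole sequence
    FileUtils.writeStringToFile(new File(labelsDir, i + ".csv"), labelLines.get(i).trim(), StandardCharsets.UTF_8);
}

Written this way there is no transposing step at all; each features file already is the column vector that CSVSequenceRecordReader expects.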

In the original example the data is shuffled
I am trying to create the original data in the format that the CSVSequenceRecordReader expects

To reiterate:-

I have reproduced the original data in a shuffled state
The original sequence of data had the first 100 records as label ‘0’, the next 100 as label ‘1’ etc.
I have created the labels .csv file to pair with the original data after it is shuffled.

Bob M

CSVSequenceRecordReader expects each CSV file to represent a single sequence.

The UCI example uses 2 CSVSequenceRecordReader instances to merge features and labels.

When it converts the data which it downloads, it therefore creates two files that look like this:

features0.csv:

feature1,feature2,feature3,...,featureN
feature1,feature2,feature3,...,featureN
feature1,feature2,feature3,...,featureN
feature1,feature2,feature3,...,featureN
feature1,feature2,feature3,...,featureN
feature1,feature2,feature3,...,featureN

labels0.csv:

0
0
0
0
0
0

Your code however looks like it creates the labels file as:

0,0,0,0,0,0

All you have to do is format the label file you produce correctly, i.e. provide a sequence of labels just as CSVSequenceRecordReader expects them.

My input files:-
data2.csv is a 600 * 60 file
cats2.csv is a 600 * 1 file

My output files:-
features → 0.csv is a 1 * 60 file [which is incorrect!]
labels → 0.csv is a 1 * 1 file [which is correct]

There are 600 .csv files in each of the features and labels directories [which is correct]

Bob M

P.S. It is the features .csv files that are wrong!

Because my input file for features is the same as in the original example (I think), I am at a loss as to why the resulting features .csv files end up as row vectors instead of column vectors.

Bob M

The UCI example series is a univariate time series, i.e. only one feature. It looks like your data has more than one feature? You can’t use that code directly then, and that is why your feature csv has only 1 column (in a features CSV, rows are time steps and columns are features).
If you can describe exactly what format your data is in, that would help.

I have described my data - see above
When one looks at the original data, we see 600 records, each with 60 values
I take it that means 60 features

When you look at the resulting features .csv files, they have 60 values per file as a column vector

Bob M

The 60 values are 60 time steps.

OK
But ‘my data’ is my effort at reproducing the original data
All I am trying to do is replace sourcing the data from the web with sourcing the data from my computer
As well as the fact that I am shuffling the data at the beginning

Please explain why my adjusted code doesn’t work
Bob M

Can the 60 different values at the 60 time steps not be thought of as 60 features?

Bob M

In this context? No. This is an RNN example which is expecting a time series. If you wanted to reframe your problem to treat time steps as features, you should be looking at feedforward nets. I would suggest starting with those if this is your first look at DL.
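For reference, a rough sketch of what that reframing could look like (layer sizes here are arbitrary placeholders, and the input would then come from a plain CSVRecordReader / RecordReaderDataSetIterator rather than the two sequence readers):

// Sketch only: the 60 time steps become 60 flat input features
MultiLayerConfiguration ffConf = new NeuralNetConfiguration.Builder()
    .seed(123)
    .weightInit(WeightInit.XAVIER)
    .updater(new Nadam())
    .list()
    .layer(new DenseLayer.Builder().nIn(60).nOut(20).activation(Activation.RELU).build())
    .layer(new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
        .activation(Activation.SOFTMAX).nIn(20).nOut(6).build())   // 6 label classes
    .build();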

OK - well, let's put all that to one side

Can you advise why I cannot achieve my simple requirement of reading in the data from my computer rather than from the web?

Bob M

We only have the bandwidth to support questions related to the software in the DL4J ecosystem and to some extent general purpose DL questions.

I think the big misunderstanding comes from the way the original data looks:
The original data is formatted as 600 rows.

At UCISequenceClassificationExample line 180 those 600 rows are split.

The for loop in lines 193 to 198 then does two things:

  1. it transposes each row into a column by replacing the whitespace between the numbers with newlines
  2. it uses the fact that integer division truncates (lineCount / 100 maps lines 0-99 to label 0, lines 100-199 to label 1, and so on), and that way creates a pair of column and label

Then the shuffle on line 201 shuffles those pairs, because the data is going to be read in linear order later on, and we want shuffled batches.

Finally, the for loop in lines 206 to 222 writes the data to output files; once enough training data has been written, it writes the test data. For each pair it writes two files:

  • train/features/#.csv contains several lines of numbers, with each line being a single timestep in the sequence with just a single feature
  • train/labels/#.csv contains just a single number on a single line, the label for the whole sequence

This is then later on read by CSVSequenceRecordReader and joined in such a way that the label aligns with the end of the sequence.

As far as I can tell, it works exactly as it should, and you have already achieved that goal.

Semantics - as to whether we have 1 feature or 60 features :)

The problem with the features .csv files remains: they are row vectors instead of column vectors

The following is my current code (excluding imports):-

@SuppressWarnings("ResultOfMethodCallIgnored")
public class UCI600ClassificationExample {
private static final Logger log = LoggerFactory.getLogger(UCI600ClassificationExample.class);

//'baseDir': Base directory for the data. Change this if you want to save the data somewhere else
private static File baseDir = new File("src/main/resources/forex/");
private static File baseTrainDir = new File(baseDir, "train");
private static File featuresDirTrain = new File(baseTrainDir, "features");
private static File labelsDirTrain = new File(baseTrainDir, "labels");
private static File baseTestDir = new File(baseDir, "test");
private static File featuresDirTest = new File(baseTestDir, "features");
private static File labelsDirTest = new File(baseTestDir, "labels");

public static void main(String[] args) throws Exception {
    downloadUCIData();

    // ----- Load the training data -----
    //Note that we have 450 training files for features: train/features/0.csv through train/features/449.csv
    SequenceRecordReader trainFeatures = new CSVSequenceRecordReader();
    trainFeatures.initialize(new NumberedFileInputSplit(featuresDirTrain.getAbsolutePath() + "/%d.csv", 0, 449));
    SequenceRecordReader trainLabels = new CSVSequenceRecordReader();
    trainLabels.initialize(new NumberedFileInputSplit(labelsDirTrain.getAbsolutePath() + "/%d.csv", 0, 449));

    int miniBatchSize = 10;
    int numLabelClasses = 6;
    DataSetIterator trainData = new SequenceRecordReaderDataSetIterator(trainFeatures, trainLabels, miniBatchSize, numLabelClasses,
        false, SequenceRecordReaderDataSetIterator.AlignmentMode.ALIGN_END);

    //Normalize the training data
    DataNormalization normalizer = new NormalizerStandardize();
    normalizer.fit(trainData);              //Collect training data statistics
    trainData.reset();

    //Use previously collected statistics to normalize on-the-fly. Each DataSet returned by 'trainData' iterator will be normalized
    trainData.setPreProcessor(normalizer);


    // ----- Load the test data -----
    //Same process as for the training data.
    SequenceRecordReader testFeatures = new CSVSequenceRecordReader();
    testFeatures.initialize(new NumberedFileInputSplit(featuresDirTest.getAbsolutePath() + "/%d.csv", 0, 149));
    SequenceRecordReader testLabels = new CSVSequenceRecordReader();
    testLabels.initialize(new NumberedFileInputSplit(labelsDirTest.getAbsolutePath() + "/%d.csv", 0, 149));

    DataSetIterator testData = new SequenceRecordReaderDataSetIterator(testFeatures, testLabels, miniBatchSize, numLabelClasses,
        false, SequenceRecordReaderDataSetIterator.AlignmentMode.ALIGN_END);

    testData.setPreProcessor(normalizer);   //Note that we are using the exact same normalization process as the training data


    // ----- Configure the network -----
    MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(123)    //Random number generator seed for improved repeatability. Optional.
        .weightInit(WeightInit.XAVIER)
        .updater(new Nadam())
        .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)  //Not always required, but helps with this data set
        .gradientNormalizationThreshold(0.5)
        .list()
        .layer(new LSTM.Builder().activation(Activation.TANH).nIn(1).nOut(10).build())
        .layer(new LSTM.Builder().activation(Activation.TANH).nIn(10).nOut(10).build())
        .layer(new LSTM.Builder().activation(Activation.TANH).nIn(10).nOut(10).build())
        .layer(new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
            .activation(Activation.SOFTMAX).nIn(10).nOut(numLabelClasses).build())
        .build();

    MultiLayerNetwork net = new MultiLayerNetwork(conf);
    net.init();

    log.info("Starting training...");
    net.setListeners(new ScoreIterationListener(20), new EvaluativeListener(testData, 1, InvocationType.EPOCH_END));   //Print the score (loss function value) every 20 iterations

    int nEpochs = 100;
    net.fit(trainData, nEpochs);

    log.info("Evaluating...");
    Evaluation eval = net.evaluate(testData);
    log.info(eval.stats());

    log.info("----- Example Complete -----");
}


//This method loads the data, and converts the "one time series per line" format into a suitable
//CSV sequence format that DataVec (CsvSequenceRecordReader) and DL4J can read.
private static void downloadUCIData() throws Exception {
    if (baseDir.exists()) return;    //Data already exists, don't download it again

    // synthetic data
    String url1 = "file:///C:/Synthetic Control/data2.data";
    String data = IOUtils.toString(new URL(url1), (Charset) null);

    String[] lines1 = data.split("\n");

    // synthetic categories
    String url2 = "file:///C:/Synthetic Control/cats2.csv";
    String cats = IOUtils.toString(new URL(url2), (Charset) null);

    String[] lines2 = cats.split("\n");

    //Create directories
    baseDir.mkdir();
    baseTrainDir.mkdir();
    featuresDirTrain.mkdir();
    labelsDirTrain.mkdir();
    baseTestDir.mkdir();
    featuresDirTest.mkdir();
    labelsDirTest.mkdir();

    int lineCount = 0;
    List<Pair<String, String>> DataAndCats = new ArrayList<>();
    for (String line : lines1) {
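        // Note: replaceAll(" +", "\n") only matches runs of spaces. If the input line is
        // comma-separated (a .csv with no spaces), nothing is replaced and the whole line
        // is written out unchanged as a single row.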
        String transposed = line.replaceAll(" +", "\n");

        DataAndCats.add(new Pair<>(transposed, lines2[lineCount]));
        lineCount++;
    }

    //Do a train/test split:
    int nTrain = 450;   //75% train, 25% test
    int trainCount = 0;
    int testCount = 0;
    for (Pair<String, String> p : DataAndCats) {
        //Write output in a format we can read, in the appropriate locations
        File outPathFeatures;
        File outPathLabels;
        if (trainCount < nTrain) {
            outPathFeatures = new File(featuresDirTrain, trainCount + ".csv");
            outPathLabels = new File(labelsDirTrain, trainCount + ".csv");
            trainCount++;
        } else {
            outPathFeatures = new File(featuresDirTest, testCount + ".csv");
            outPathLabels = new File(labelsDirTest, testCount + ".csv");
            testCount++;
        }

        String trans = String.join("\n",p.getFirst().split(" "));
        FileUtils.writeStringToFile(outPathFeatures, trans, (Charset) null);
        FileUtils.writeStringToFile(outPathLabels, p.getSecond().toString(), (Charset) null);
    }
}

}