Documentation Requests

Hello everyone,

We are working on improving the documentation, and we’d like to collect questions that you want us to address.

So if you have any questions that the documentation isn’t addressing currently, or that you find aren’t adequately addressed, please list them here, so we can take them into consideration when creating the new docs.

Here are some questions that I had trouble figuring out just by reading the DL4J docs and that may need some clarification (and that maybe I don’t fully get because I just started learning DL4J):

DL4J IN GENERAL

  1. If my data are not in CSV files, what is the best way to build record readers (and sequence record readers)? Suppose they come from somewhere else (a database, an HTTP service…); in the end I would have data that are lists of numbers (Integer, Double, BigDecimal…). For classification I need to build datasets where each element is a list [feature; label], and for regression I need to build datasets from “ordered lists” where each element is itself a list that contains [nth element; (n+1)th element]. Using InMemoryRecordReader could be a solution for small datasets, but for large datasets I can’t load all the data in memory (a minimal in-memory sketch is shown after this list).
  2. If I can’t load all the data in memory (and usually I can’t if I use a big dataset), I can create a class that implements RecordReader or SequenceRecordReader and implement their methods List<Writable> next() and List<List<Writable>> sequenceRecord() respectively. In a “naive” approach, I can implement them to produce lists of [feature; label] (or lists of lists of [feature; label], where in regression “feature” is the nth element and “label” is the (n+1)th element) and pass my reader as a parameter to a record reader dataset iterator (for example using SequenceRecordReaderDataSetIterator(SequenceRecordReader reader, int miniBatchSize, int numPossibleLabels, int labelIndex, boolean regression)), but in this way I lose some “benefits” that the already implemented record readers support (batch processing, distributed learning, alignment of variable-length time series…). Maybe there could be more examples of building custom record readers.
  3. If my “elements” are complex types (e.g. classes that contain multiple fields), what is the best way to use DataVec ETL to transform a list (or multiple lists) of these elements into a usable DataSet (see the DataVec sketch after this list)? Especially for classification, what is the way to choose the parameters (for instance if I don’t know the number of classes a priori, so I cannot estimate them, or would have to run a clustering algorithm like K-means to get a rough starting point)?
  4. Once my model is trained and saved (maybe to a file), what is the best way to just use it as a black box, i.e. pass an input to the network and get the predicted output in response (e.g. I trained an image classifier, I download an image from the internet, and I expect the network to classify it, outputting a probability vector that I can use to print a single string, or maybe the top 3 predicted classes, and so on)? Should I “reinitialize” the network with some training data each time and then pass my new input? Especially for regression with RNNs some more examples could be very helpful (an inference sketch is shown after this list).
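To illustrate what I mean in questions 1 and 2, here is a minimal in-memory sketch of the small-dataset case, assuming DataVec’s CollectionSequenceRecordReader and made-up data (a single short series turned into [nth; (n+1)th] pairs). For data that does not fit in memory a custom SequenceRecordReader would still be needed, which is exactly where more examples would help:

```java
import org.datavec.api.records.reader.SequenceRecordReader;
import org.datavec.api.records.reader.impl.collection.CollectionSequenceRecordReader;
import org.datavec.api.writable.DoubleWritable;
import org.datavec.api.writable.Writable;
import org.deeplearning4j.datasets.datavec.SequenceRecordReaderDataSetIterator;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InMemorySequenceSketch {

    public static void main(String[] args) {
        // Made-up data, e.g. fetched from a database or an HTTP service:
        // one sequence, where each time step is [nth element, (n+1)th element].
        double[] series = {1.0, 2.0, 3.0, 4.0, 5.0};

        List<List<Writable>> oneSequence = new ArrayList<>();
        for (int i = 0; i < series.length - 1; i++) {
            oneSequence.add(Arrays.<Writable>asList(
                    new DoubleWritable(series[i]),        // feature: nth element
                    new DoubleWritable(series[i + 1])));  // label: (n+1)th element
        }

        List<List<List<Writable>>> sequences = new ArrayList<>();
        sequences.add(oneSequence);

        SequenceRecordReader reader = new CollectionSequenceRecordReader(sequences);

        // Regression setup: label is in column 1; numPossibleLabels is not used when regression == true.
        DataSetIterator iter = new SequenceRecordReaderDataSetIterator(reader, 32, -1, 1, true);

        while (iter.hasNext()) {
            System.out.println(iter.next());
        }
    }
}
```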
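For question 3, this is the kind of starting point I mean: a hypothetical DataVec Schema and TransformProcess for an element with a few fields (the column names and categorical states are made up). What I’d like the docs to show is the step from here to a DataSet, e.g. whether wrapping a record reader in something like TransformProcessRecordReader is the intended way:

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;

public class TransformSketch {

    public static void main(String[] args) {
        // Hypothetical schema describing one "complex" element with three fields.
        Schema schema = new Schema.Builder()
                .addColumnDouble("loadAverage")
                .addColumnInteger("connectedUsers")
                .addColumnCategorical("status", "overused", "normal", "underused")
                .build();

        // Turn the categorical field into an integer index so it can be used as the label column.
        TransformProcess tp = new TransformProcess.Builder(schema)
                .categoricalToInteger("status")
                .build();

        System.out.println(tp.getFinalSchema());
    }
}
```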
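For question 4, my current understanding is that something like the sketch below should be enough (the file name and the input shape are made up), with no re-initialization with training data; the only thing to remember is that any normalizer fitted at training time also has to be applied to the new input. It would be nice if the docs confirmed this, especially for RNN regression:

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.util.ModelSerializer;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

import java.io.File;

public class BlackBoxInferenceSketch {

    public static void main(String[] args) throws Exception {
        // Restore the trained network from disk: no further training data is needed.
        MultiLayerNetwork net = ModelSerializer.restoreMultiLayerNetwork(new File("trained-model.zip"));

        // Made-up new input: one example with 4 features, shape [1, 4].
        INDArray input = Nd4j.create(new double[]{0.1, 0.2, 0.3, 0.4}, new int[]{1, 4});

        // For a classifier with a softmax output this is the probability vector;
        // the arg-max (or the top 3 entries) gives the predicted class(es).
        INDArray probabilities = net.output(input);
        System.out.println(probabilities);
    }
}
```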

RNNs

  1. In the case of a SequenceRecordReaderDataSetIterator that produces multiple time series of variable length, what is the best way to normalize them? I tried to use two record readers, one for the features and one for the labels, and used SequenceRecordReaderDataSetIterator(SequenceRecordReader featuresReader, SequenceRecordReader labels, int miniBatchSize, int numPossibleLabels, boolean regression, AlignmentMode alignmentMode), passing AlignmentMode.ALIGN_END. Is this the right way (see the normalization sketch after this list)? Especially in the case of custom record readers, some more examples would be great.
  2. Based on the SingleTimestepRegressionExample and MultiTimestepRegressionExample examples, when I train an RNN for regression I should, at the end of each epoch, use a RegressionEvaluation to measure the differences between the predicted data and the “real” test data, and use the result to decide whether to train the network further. This is fine, but I can’t see a way to “automate” this process using an EarlyStoppingTrainer. If this is possible, maybe there could be an example for it (a sketch of what I mean is shown after this list)?
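To make point 1 concrete, this is roughly what I’m doing now (featureReader, labelReader and the batch size are placeholders); an example in the docs confirming whether fitting the normalizer on the ALIGN_END iterator like this is correct for variable-length series would help a lot:

```java
import org.datavec.api.records.reader.SequenceRecordReader;
import org.deeplearning4j.datasets.datavec.SequenceRecordReaderDataSetIterator;
import org.deeplearning4j.datasets.datavec.SequenceRecordReaderDataSetIterator.AlignmentMode;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerStandardize;

public class SequenceNormalizationSketch {

    // featureReader and labelReader are assumed to be already-initialized SequenceRecordReaders.
    public static DataSetIterator buildNormalizedIterator(SequenceRecordReader featureReader,
                                                          SequenceRecordReader labelReader,
                                                          int miniBatchSize) {
        DataSetIterator iter = new SequenceRecordReaderDataSetIterator(
                featureReader, labelReader, miniBatchSize,
                -1,    // numPossibleLabels: not used for regression
                true,  // regression
                AlignmentMode.ALIGN_END);

        // Collect mean/std over the training data, then apply the normalizer on the fly.
        NormalizerStandardize normalizer = new NormalizerStandardize();
        normalizer.fit(iter);
        iter.reset();
        iter.setPreProcessor(normalizer);
        return iter;
    }
}
```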
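For point 2, the closest I could get is the sketch below, which scores each epoch with the loss on a held-out iterator (DataSetLossCalculator) instead of a full RegressionEvaluation; the variable names and the save directory are made up. An official example showing whether a regression metric can be plugged into the EarlyStoppingTrainer would be great:

```java
import org.deeplearning4j.earlystopping.EarlyStoppingConfiguration;
import org.deeplearning4j.earlystopping.EarlyStoppingResult;
import org.deeplearning4j.earlystopping.saver.LocalFileModelSaver;
import org.deeplearning4j.earlystopping.scorecalc.DataSetLossCalculator;
import org.deeplearning4j.earlystopping.termination.MaxEpochsTerminationCondition;
import org.deeplearning4j.earlystopping.termination.ScoreImprovementEpochTerminationCondition;
import org.deeplearning4j.earlystopping.trainer.EarlyStoppingTrainer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

public class EarlyStoppingSketch {

    // net, trainIter and testIter are assumed to be built elsewhere (e.g. with the iterator above).
    public static MultiLayerNetwork trainWithEarlyStopping(MultiLayerNetwork net,
                                                           DataSetIterator trainIter,
                                                           DataSetIterator testIter) {
        EarlyStoppingConfiguration esConf = new EarlyStoppingConfiguration.Builder()
                .epochTerminationConditions(
                        new MaxEpochsTerminationCondition(100),
                        // stop if the score has not improved for 5 consecutive epochs
                        new ScoreImprovementEpochTerminationCondition(5))
                // score after each epoch = average loss on the held-out iterator
                .scoreCalculator(new DataSetLossCalculator(testIter, true))
                // directory must exist; the best model found so far is saved here
                .modelSaver(new LocalFileModelSaver("earlyStoppingModels"))
                .build();

        EarlyStoppingTrainer trainer = new EarlyStoppingTrainer(esConf, net, trainIter);
        EarlyStoppingResult result = trainer.fit();
        return (MultiLayerNetwork) result.getBestModel();
    }
}
```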

Lastly (but this is a matter of personal preference), to me the tutorials could be organized more clearly to introduce the user to the “DL4J workflow” (“point” at the dataset → extract and transform the fields → normalize the inputs → choose what kind of problem I would like to solve → build the corresponding network architecture and configure it properly → train the network → evaluate the results). Maybe instead of using the historical “built-in” dataset iterators (MNIST, Iris and so on), using the custom record readers from above could be more helpful for understanding the DL4J-specific concepts. Then, after the tutorials, there could be the sections that go in depth (no pun intended) with more technical details: common types of layers (FF, CNN and RNN) and their “subtypes”, the different activation and loss functions with a description of their purposes, advantages and disadvantages; after this, present the ComputationGraph, which to me is more of an “advanced” concept, and then go full blown into the details of ND4J, DataVec and SameDiff, which new users are usually not interested in.
For some of these things it may be sufficient to provide links to external sources (e.g. Wikipedia for the activation functions).
Just as a suggestion, the tutorials could have a “common thread”: for example, a data center that has S servers, each with C CPUs, which every hour produce a report with the load average per CPU (L) and the number of connected users (U); from that, the tutorials could introduce gradually more complex problems:

  1. Classify servers for which I have no history (I do have it, but I use it as test data) into classes like “overused”, “underused” or “normal load”, based on the full history of the other servers
  2. Predict whether a server is going to be “overused” in the next T hours, based on its last few reports
  3. Predict how many servers are going to be “overused” in the next T hours, based on the last few reports from those servers
  4. Do step 3, but assuming that the servers were commissioned at different times, so their report sequences have variable length
  5. Predict the load of a server (the network output is not a class but the actual number) for the next T hours, based on the history of the last N hours
  6. And so on…

The data center is just an example; any scenario could be used (a shipping company, stock prices, diseases in a country…).

Sorry for the long post; these are just my thoughts, and I hope they can be useful for discussion.

RL4J

  1. More docs about RL; some introduction to RL could be added to the tutorials
  2. An example of how to define a custom RL problem using RL4J (including defining the MDP, the action space, …)