Quickstart using GPS trajectories file from UCI

Thanks Adam. You did not need to do this.

As I think I repeated in every message, I did update the versions.
Clearly you do not believe I did. That’s OK.

The problem lies in the fact that I followed the documentation to the
letter, and did not notice the classifiers.

Thank you

@adonnini apologies. Sorry I’d just appreciate if you’d post exact versions instead of the exact docs.
Sorry it’s just I keep asking for exact code and I haven’t seen your neural networks or any of your build files directly. I get frustrated needing to guess things.

I don’t know why posting exact code
isn’t possible. If you have some sort of NDA then feel free to DM me.

At the end of the day I’m just trying to help you out. I’m not sure what you do/don’t know, hence me trying to provide as many details as possible. I’ll add all this to the documentation later.

Best of luck and let me know when you get everything up and running.

With the new set of dependencies, the size of the apk is now a (more)
manageable 471.9 MB. Installation and launch on the device was successful.

It looks like I am back in business.

Thanks again for your help and the suggestion regarding the multiple apks.

Alex

@adonnini thanks again for confirming. Hopefully we can reduce that further with the nd4j-minimal backend next release. I’ll ensure the docs get updated so we can avoid this in the future. Sorry for the confusion. That’s definitely in need of some work.

Please do ask me if you’re blocked on anything else. In the future I’d still appreciate exact code so as to avoid ambiguity.

Thanks. Please let me know about the nd4j-minimal backend next release.

Hi Adam,

My attempt to run the network in my application is failing with the
following error:

“java.lang.IllegalStateException: 3D input expected to RNN layer
expected, got 2”

I did some searching for this error and found some relevant links. One
of them is a stackoverflow post,

-deep learning - How to use LSTM with a 2d array with DeepLearning4j - Stack Overflow

, which you answered:

"You have to convert the time series to be 3d using reshape. If a time
series is of length 1, then just make sure the time series reflects
that. See: -deeplearning4j.konduit.ai/models/recurrent
https://deeplearning4j.konduit.ai/models/recurrent- "
The link in your response is dead

Is your answer above still valid? When you say that the time series
should be converted to 3d using reshape, do you mean that the input
INDArray (e.g. arrData in the following call) should be the one reshaped?

INDArray results = restored.output(arrData);

Thanks,

Alex

P.S. I added characters at the start and end of the links above because
the rules of the forum prevent me from including more than one link in a message.

@adonnini ensure you compare your training pipeline to the way you’re running inference. If you’re doing time series and using those iterators that should be 3d.

In machine learning, and with neural networks in particular, it’s crucial that your inference pipeline be exactly the same as your training pipeline. (Apologies again: you haven’t shown me any code, so I don’t know what you do/don’t know and I’m going off of what you’ve said.)

That includes normalization of your data, the shape of your input, and everything else.

RNNs as a matter of course only work with 3d input. So no matter how you get there you should always be doing 3d.
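
To make the 3d requirement concrete: by default DL4J's recurrent layers take input laid out as [miniBatchSize, numFeatures, timeSeriesLength]. The following is a minimal plain-Java sketch of that layout only. It makes no ND4J calls, and the class and method names are hypothetical. It turns one sequence stored as rows of time steps into a batch holding a single example.

```java
import java.util.Arrays;

// Hypothetical helper, plain Java only: it illustrates the layout and is not DL4J API.
final class RnnShapeSketch {

    // Converts one sequence stored as [timeSteps][features] into
    // [1][features][timeSteps], i.e. a mini-batch holding a single example.
    static double[][][] toSingleExampleBatch(double[][] sequence) {
        int timeSteps = sequence.length;
        int features = sequence[0].length;
        double[][][] batch = new double[1][features][timeSteps];
        for (int t = 0; t < timeSteps; t++) {
            for (int f = 0; f < features; f++) {
                batch[0][f][t] = sequence[t][f];
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        // Two time steps with three features each (made-up numbers).
        double[][] sequence = {
            {1.0, 2.0, 3.0},
            {4.0, 5.0, 6.0}
        };
        double[][][] batch = toSingleExampleBatch(sequence);
        System.out.println(batch.length + " x " + batch[0].length + " x " + batch[0][0].length);
        System.out.println(Arrays.deepToString(batch));
    }
}
```

With ND4J the same layout is normally reached by reshaping or permuting the INDArray that is passed to output(...), so arrData would be the array to change.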

I made a mistake in the construction of the input dataset in my application.

I realized that I need to take this approach (from your documentation):

     double[] flat = ArrayUtil.flattenDoubleArray(tmpData);
     int[] shape = new int[] {tmpData.length, 6};    //Array shape here
     arrData = Nd4j.create(flat,shape,'c');

I think that the shape definition needs to be changed to include a
timeseries length parameter, and my tmpData array (where each record is
a location point) needs to be three-dimensional.

I am a little embarrassed. I am not sure about the meaning of

timeseries length

and

numExamples

My dataset has about 35,000 records. Each record has the following 6
columns:

id,latitude,longitude,time,track_id,geohash

so, the shape should be:

[35000,6,??]

or

[35000,6,35000]

I looked for examples/definitions of the terms timeseries length and
numExamples in the documentation. I could not find any.

Could you please clarify?

Thanks,

Alex

@adonnini thanks for giving me a detailed description of your problem. Could you clarify how you trained your model? We have a default RNN format. Take a look at this overview for concepts first:

The basics of time series are that your data is 3d, as I mentioned before. Models that take 3d input work on a fixed time series length, number of features, and number of examples at a time.

Your time series length is something you need to set. If you want to do 1 you can. Usually you would encode that as some fixed interval.

A simpler example would be in words. You would have n words at a time as your time series length.
In your dataset this could translate to one time step per move, if you want to forecast n moves ahead, or one per minute or some similar interval.

That’s up to you to transform your data first though.
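
One way to picture a fixed time series length is a sliding window over a track. The sketch below is plain Java with hypothetical names and made-up coordinates, and it is only one of several reasonable ways to cut the data. Each window of n consecutive points becomes one example, so the number of examples follows from the window length rather than from the raw record count.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cuts a track into fixed-length windows.
final class WindowSketch {

    // Cuts one track, stored as [points][features], into overlapping windows
    // of `steps` consecutive points. Each window becomes one example.
    static List<double[][]> slidingWindows(double[][] track, int steps) {
        List<double[][]> windows = new ArrayList<>();
        for (int start = 0; start + steps <= track.length; start++) {
            double[][] window = new double[steps][];
            for (int t = 0; t < steps; t++) {
                window[t] = track[start + t].clone();
            }
            windows.add(window);
        }
        return windows;
    }

    public static void main(String[] args) {
        // Five made-up (lat, lon) points from a single track.
        double[][] track = {
            {39.00, -77.00}, {39.01, -77.01}, {39.02, -77.02},
            {39.03, -77.03}, {39.04, -77.04}
        };
        List<double[][]> windows = slidingWindows(track, 2);
        // A track of p points yields p - steps + 1 windows.
        System.out.println("examples: " + windows.size());
    }
}
```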

My code follows pretty much to the letter this:

which you pointed me to

I have read the documentation you pointed me to below several times.

I have also been trying to follow the instructions in

Based on what the document you pointed to below says, in my case:

The number of inputs is 6, i.e. the number of columns
(id,latitude,longitude,time,track_id,geohash).

The number of time series (examples) is 35,000 (the number of location
point records)

The number of time steps can be any number I choose, if I understand
what you wrote below?

In the example in the document you pointed to below, it states:

“For example if you have five features in time series, each with 120
observations, and a training & test set of size 53 then there will be
106 input csv files(53 input, 53 labels). The 53 input csv files will
each have five columns and 120 rows. The label csv files will have one
column (the label) and one row.”

The five features are equivalent to the 6 columns in each record in my
dataset, right?

Are the 120 observations the equivalent of my 35,000 records?

Where does the number of training and test set size come from?

Thanks,

Alex

@adonnini yes, the number of time steps is encoded in your csv files for training. It should be a conscious decision and be known when you do inference.

Note that if you do regression during training, then one of those columns might actually be the variable you’re trying to predict. In that case you might have only 5 input variables. Consider that when you’re setting up your dataset.

If your label isn’t known yet then you need to set that up in the data set as well.
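
The file layout described in the quoted documentation (one feature CSV and one label CSV per sequence) can be sketched as follows. This is plain Java under stated assumptions: each track is an array of (lat, lon, time) rows, the label is taken to be the lat,lon of the track's last point, and files are numbered from 0. The class and method names are hypothetical, and which columns count as features or label is a choice for your own dataset.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: writes numbered feature/label CSV pairs, one pair per track.
final class SequenceCsvSketch {

    // features/<n>.csv : one row per time step (all points except the last)
    // labels/<n>.csv   : a single row holding the last point's lat,lon
    // Returns the number of file pairs written.
    static int writeSequences(List<double[][]> tracks, Path featuresDir, Path labelsDir)
            throws IOException {
        Files.createDirectories(featuresDir);
        Files.createDirectories(labelsDir);
        int index = 0;
        for (double[][] track : tracks) {
            if (track.length < 2) {
                continue; // need at least one input step and one target
            }
            List<String> rows = new ArrayList<>();
            for (int t = 0; t < track.length - 1; t++) {
                rows.add(track[t][0] + "," + track[t][1] + "," + track[t][2]);
            }
            double[] last = track[track.length - 1];
            Files.write(featuresDir.resolve(index + ".csv"), rows);
            Files.write(labelsDir.resolve(index + ".csv"),
                    Collections.singletonList(last[0] + "," + last[1]));
            index++;
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        // Two made-up tracks; each row is lat, lon, time.
        List<double[][]> tracks = new ArrayList<>();
        tracks.add(new double[][] {{39.00, -77.00, 0}, {39.01, -77.01, 60}, {39.02, -77.02, 120}});
        tracks.add(new double[][] {{40.00, -75.00, 0}, {40.01, -75.01, 60}});
        Path root = Files.createTempDirectory("seq-sketch");
        int written = writeSequences(tracks, root.resolve("features"), root.resolve("labels"));
        System.out.println("wrote " + written + " feature/label pairs under " + root);
    }
}
```

In this layout the number of files is the number of examples, and the number of rows in a feature file is that example's time series length.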

Thanks Adam. This helps.

When you say “If your label isn’t known yet then you need to set that up
in the data set as well.”

what do you mean? Do you mean that the value of the record in the
one-column one-row label files should have an arbitrary “dummy” value?
Or, should be?

Until now, since I do not know the value of the output a priori, each
feature file (with X track points all with the same Y geohash) has a
corresponding label file whose one record is the Y geohash.

What do you think?

Thanks,

Alex

@adonnini for regression you’ll need 1 or more predictor columns. For classification you’ll need an index. That is all per row in each CSV.

You’ll need to figure out something or create a training set to do that unfortunately. What you’re doing is supervised learning. Supervised learning needs labels and a loss function.

Unsupervised learning (clustering you’ve mentioned) doesn’t need that.

Thanks.

Let’s forget about classification. I am definitely not going down that
road given the application.

I am aware of supervised vs. unsupervised learning.

A basic question. Does the fact that I do not know the value of the
labels a priori mean that I necessarily need to switch to using
unsupervised learning?

If I switch to unsupervised learning, does it mean that I need to
abandon regression? Could you point me to dl4j documentation on
unsupervised learning?

Thanks,

Alex

@adonnini regression is by definition supervised learning and supervised learning only.

In general you should be building a training set with targets that capture the direction of travel given your current coordinates and other things. That seems doable.

This goes back to me saying to you that you need to determine your objective again and ensure you understand the problem you’re trying to do.

You can’t just toss some columns in to a neural net and hit “run” like you can with clustering. Clustering only works off of similarities in other data points.

However, as you’re probably aware even k nearest neighbors has a fit(…) function with labels where you have to give labels to each data point in order for it to know what the label for your data point is.

I’m going to recommend again to step back a bit and ask you to clearly state your problem.

If you want to predict the upcoming location of a data point and have prior techniques for that maybe use that to generate a labeled dataset you can put in to a neural network?

Either way you need to have a clear target. Regression has a target variable you’re trying to predict. That variable is always known. That’s how you fit a curve and determine a loss. The loss being the difference between the algorithm’s output and the real data point.
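
For next-point forecasting, one common way to obtain a known target is to take it from the track itself: the label for the point at step t is the point at step t + 1. A minimal sketch, assuming a track is an array of (lat, lon) rows and using hypothetical names:

```java
// Hypothetical helper: derives regression targets from the track itself.
final class NextPointLabels {

    // For a track stored as [points][2] (lat, lon), pairs each point t with
    // the point at t + 1 as its regression target. The last point has no
    // successor, so a track of n points yields n - 1 labeled pairs.
    static double[][][] toPairs(double[][] track) {
        int n = Math.max(track.length - 1, 0);
        double[][][] pairs = new double[n][2][];
        for (int t = 0; t < n; t++) {
            pairs[t][0] = track[t].clone();     // input: current point
            pairs[t][1] = track[t + 1].clone(); // target: next point
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Three made-up points from one track.
        double[][] track = {{39.00, -77.00}, {39.01, -77.01}, {39.02, -77.02}};
        double[][][] pairs = toPairs(track);
        System.out.println("labeled pairs: " + pairs.length);
    }
}
```

Every label here is known at training time, which is what supervised regression requires. Only at inference time is the next point unknown.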

If you want to do unsupervised learning, all that will do is enable you to learn what are called embeddings. An embedding is just a vector that represents a data point. It is learned with SGD from the similarities in a dataset, similar to clustering.

That still won’t help you predict a concrete target though.

In a way, we are going around in circles. And, probably it’s my fault.

Respectfully, the issue is not one of problem definition, as I think I
have stated many times. My messages probably are too long or too
convoluted. As I have said many times, the problem definition is
straightforward, and one which I have already resolved. The current
solution lacks a learning function which, instead, is an integral part
of a neural network based solution. To repeat myself:
Given a sequence of track points, I want to find the next track point
and ultimately the destination of a trajectory (which in my terminology
means a sequence of track points).

The key issue and major source of uncertainty on my part is how to
define features and labels for input into a neural network, and which
neural network is best suited to help me achieve my goal.

A recurrent regression neural network seems to be a natural fit with
clearly defined dependent (next track point) and independent (lat, lon,
time) variables. For your reference I would point you to the following:

In this paper, the author describes the implementation of an application
similar to the one I am attempting to implement, unless I am mistaken.

At the same time, I would like to understand whether a neural network
using clustering might work although I still do not understand how
clustering is used by a neural network. I still have not found an
example in the dl4j documentation.

Please note that clustering in and of itself is not a solution to the
next location problem. It is a tool to help get there. at least, that’s
how I currently use it in my application. Please also note that a
clustering method like DBSCAN is probably better than K-Means, and
clustering should also have a temporal component.

FYI, I removed the dependent variable (label) from the input dataset,
and I corrected the 3d error. Now, execution of the neural network I
have implemented at least mechanically runs successfully. Here is a
sample output which I am having a hard time interpreting:

03-13 07:38:45.953: I/ClusterProcessing(5332): -
neuralNetworkloadAndRun - restored.summary() -
03-13 07:38:45.953: I/ClusterProcessing(5332):

@adonnini so let’s agree you need labels.

Pardon me for assuming, but based on your description you don’t know what those are and need help with that. Let’s reword your problem to that.

I’ve said numerous times to pick a column to use as a label and I haven’t seen you say “this one?”.

Regarding this:
FYI, I removed the dependent variable (label) from the input dataset,
and I corrected the 3d error

Put the dependent variable back. You need that as your label.

Regarding:
next location problem. It is a tool to help get there. at least, that’s
how I currently use it in my application. Please also note that a
clustering method like DBSCAN is probably better than K-Means, and
clustering should also have a temporal component.

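K-means only requires you to define the number of clusters you want. DBSCAN is similar, except that you set a neighborhood radius and a minimum number of points instead of a cluster count. Those are both unsupervised methods only working off the data.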

Neural networks (and any machine learning algorithm that does regression/classification) aka supervised learning needs those dependent variables.

A neural network will learn the pattern based on the variables that you tell it to. It’ll say “map these input columns to correlate with this output column”.

Based on the paper:
Table 3. Sample Input
Input
[0.5, 0.8660254, 0.43388374, -0.90096887],
[0.5, 0.8660254, 0.43388374, -0.90096887,]
Output
[39., -77.00935364]

In this case you’re forecasting coordinates. Therefore you’ll want the final destination as the output variable for regression. Those are the variables you want to predict.

For the given application any row of your CSV would thus contain:
[0.5, 0.8660254, 0.43388374, -0.90096887, 39.0, -77.00523376],
[0.5, 0.8660254, 0.43388374, -0.90096887, 39.0, -77.00798035]

From the looks of it these would be 2 timesteps.

Pardon me but I’ve only skimmed the paper looking for a proper problem definition. This gave me what I was looking for.

Note the loss function they used: RMSE. In dl4j that will be RMSE_XENT.
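
For reference, RMSE is the square root of the mean squared difference between predicted and actual values. The sketch below is the textbook definition in plain Java with hypothetical names. It is not a claim about how DL4J computes its loss internally.

```java
// Hypothetical helper: textbook root mean squared error over all outputs.
final class RmseSketch {

    // predicted and actual are [examples][outputs], e.g. one (lat, lon) pair per example.
    static double rmse(double[][] predicted, double[][] actual) {
        double sumSquares = 0.0;
        int count = 0;
        for (int i = 0; i < predicted.length; i++) {
            for (int j = 0; j < predicted[i].length; j++) {
                double diff = predicted[i][j] - actual[i][j];
                sumSquares += diff * diff;
                count++;
            }
        }
        return Math.sqrt(sumSquares / count);
    }

    public static void main(String[] args) {
        // Made-up prediction and target for a single example.
        double[][] predicted = {{39.01, -77.02}};
        double[][] actual = {{39.00, -77.00}};
        System.out.println("rmse = " + rmse(predicted, actual));
    }
}
```

On raw latitude and longitude the result is in degrees, so it is worth deciding up front whether the coordinates are normalized before the loss is applied.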

Sorry I know this was frustrating but literally all I was looking for was:
I’m using LSTMs to predict an x and y coordinate/ or lat/long and a paper I’m referencing appears to use a timestep of 2. How do I set that up?

In your case, clustering actually should work for that at least somewhat well since the problem is actually very much a nearest neighbors problem. I could see where you’d want to use LSTMs for that though. Just note that an LSTM in order to learn the pattern will need some sort of labels.
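
The nearest-neighbors view of the problem can be sketched without any neural network: find the historical point closest to the current position and return the point that followed it. The code below is an illustrative plain-Java baseline with hypothetical names. It uses squared Euclidean distance on raw lat/lon, which is a simplification, and it ignores time entirely.

```java
// Hypothetical baseline: predicts the next point by nearest-neighbor lookup.
final class NearestNeighborSketch {

    // Returns the successor of the historical point closest to `query`.
    // The last point of the history is skipped because it has no successor.
    static double[] predictNext(double[][] history, double[] query) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < history.length - 1; i++) {
            double dLat = history[i][0] - query[0];
            double dLon = history[i][1] - query[1];
            double dist = dLat * dLat + dLon * dLon;
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        if (best < 0) {
            throw new IllegalArgumentException("history needs at least two points");
        }
        return history[best + 1].clone();
    }

    public static void main(String[] args) {
        // Made-up track and query position.
        double[][] history = {{39.00, -77.00}, {39.01, -77.01}, {39.02, -77.02}};
        double[] next = predictNext(history, new double[] {39.011, -77.009});
        System.out.println("predicted next point: " + next[0] + ", " + next[1]);
    }
}
```

A baseline like this also gives a reference error that an LSTM would have to beat.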

I use lat,lon as the label (dependent variable). Because the regression
model requires a label file with one column and one row, I use the
geohash of lat,lon in long format.

If my understanding is correct, recurrent regression requires separate
label and feature files.

After I calculate the geohash of the lat/lon in a feature record, I
create the label record corresponding to the feature record containing
the geohash in long format. This is why I don’t think I need to have the
geohash in both the feature and label files (I think that would be a
mistake). So, in a sense, I did not remove the label (lat,lon) from the
feature files. I removed the geohash of the lat,lon

With regards to
“In this case you’re forecasting coordinates. Therefore you’ll want the
final destination as the output variable for regression. Those are the
variables you want to predict.”
This is what I am trying to do, initially the next track point, not
necessarily the “final” destination.

With regards to
“From the looks of it these would be 2 timesteps.”
Right, in my case the input dataset has more than 35,000 records. Am I
confusing the number of records in the input dataset with the number of
time steps? If I am, could you please clarify?

With regards to
“In your case, clustering actually should work for that at least
somewhat well since the problem is actually very much a nearest
neighbors problem. I could see where you’d want to use LSTMs for that
though. Just note that an LSTM in order to learn the pattern will need
some sort of labels.”

Is there an example of dl4j using clustering I could take a look at? I
am not sure how to start an experiment using clustering in dl4j.

Thanks,

Alex

With regards to
“In this case you’re forecasting coordinates. Therefore you’ll want the
final destination as the output variable for regression. Those are the
variables you want to predict.”
This is what I am trying to do, initially the next track point, not
necessarily the “final” destination.

With regards to
“From the looks of it these would be 2 timesteps.”
Right, in my case the input dataset has more than 35,000 records. Am I
confusing the number of records in the input dataset with the number of
time steps? If I am, could you please clarify?

The number of time steps isn’t your full dataset. The number of timesteps in your case would be the number of timesteps you want to forecast at any given point, like 1 or 2. In this case it’s 2 in the dataset. What the paper is doing is: use 2 timesteps to forecast the coordinates at the specified lat/long.

With regards to
“In your case, clustering actually should work for that at least
somewhat well since the problem is actually very much a nearest
neighbors problem. I could see where you’d want to use LSTMs for that
though. Just note that an LSTM in order to learn the pattern will need
some sort of labels.”

Is there an example of dl4j using clustering I could take a look at? I
am not sure how to start an experiment using clustering in dl4j.

Apologies, we did use to have a lot more examples but removed them because they were more confusing than helpful. We do have some lying around in the tests though.

Here’s the basic example on how to use a sequence iterator:

    int miniBatchSize = 10;
    int numLabelClasses = 6;

    // Copy the bundled UCI sequence data from the classpath into temp directories
    File featuresDirTrain = Files.createTempDir();
    File labelsDirTrain = Files.createTempDir();
    new ClassPathResource("dl4j-integration-tests/data/uci_seq/train/features/").copyDirectory(featuresDirTrain);
    new ClassPathResource("dl4j-integration-tests/data/uci_seq/train/labels/").copyDirectory(labelsDirTrain);

    // One CSV file per sequence, numbered 0.csv through 449.csv
    SequenceRecordReader trainFeatures = new CSVSequenceRecordReader();
    trainFeatures.initialize(new NumberedFileInputSplit(featuresDirTrain.getAbsolutePath() + "/%d.csv", 0, 449));
    SequenceRecordReader trainLabels = new CSVSequenceRecordReader();
    trainLabels.initialize(new NumberedFileInputSplit(labelsDirTrain.getAbsolutePath() + "/%d.csv", 0, 449));

    // regression = false: labels are class indices in the range 0..numLabelClasses-1
    DataSetIterator trainData = new SequenceRecordReaderDataSetIterator(trainFeatures, trainLabels, miniBatchSize, numLabelClasses,
            false, SequenceRecordReaderDataSetIterator.AlignmentMode.ALIGN_END);

    MultiDataSetIterator iter = new MultiDataSetIteratorAdapter(trainData);

Found here:

Train test data for associated test here: