Best Bet for Clustering

Hey, I’ve been working on a few different networks and practicing with the examples and I think I am getting the overall flow.

However, I’m having trouble finding what type of network/example is the best fit for what I’m hoping to do.

I have a CSV file that I am hoping to cluster using an ANN. My problem is that what I am trying to do doesn't fit typical clustering algorithms like K-means well, since I don’t have a predefined number of clusters. I also don’t have a predetermined “label” for each row.

Is this going to essentially nullify any benefit of deep learning, since it removes the network’s ability to properly test against the data set afterwards? Is my best bet just to use something like K-means and then experiment with different numbers of clusters?

If your data already clusters on its own, you could try something like DBSCAN (not implemented in DL4J).
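To give a feel for why DBSCAN fits the "no predefined number of clusters" case, here is a minimal sketch in plain Python (standard library only, Python 3.8+ for math.dist). It is not DL4J code and not a tuned implementation; the names dbscan and region_query, the toy points, and the eps / min_pts values are all assumptions made up for illustration.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, 2, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is core too: keep expanding
    return labels

# Toy data (made up): two tight groups and one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is never passed in. It falls out of eps and min_pts, so those two values are what you would end up experimenting with instead of k.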

If your data doesn’t appear to be particularly clustered in its original domain, you should first try PCA on it.
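As a rough illustration of what PCA does to the data, the sketch below centers the rows, builds the covariance matrix, and finds the first principal component by power iteration, all in plain Python. In practice you would use a library routine; the helper names, the toy rows, and the fixed iteration count are assumptions for this example, and power iteration as written only recovers the single strongest direction (and assumes the data is not constant).

```python
import math

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    return [[r[k] - means[k] for k in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_component(cov, iters=200):
    """Power iteration: repeatedly multiply by cov and renormalize."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy data (made up): two columns that move together almost perfectly.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
centered = center(rows)
cov = covariance(centered)
component = top_component(cov)

# One coordinate per row: the projection onto the first component.
scores = [sum(x * c for x, c in zip(r, component)) for r in centered]

# Share of the total variance captured by that single component.
d = len(cov)
eigenvalue = sum(component[a] * cov[a][b] * component[b]
                 for a in range(d) for b in range(d))
explained = eigenvalue / sum(cov[a][a] for a in range(d))
print(component, explained)
```

If a handful of components explains most of the variance, clustering on the projected scores instead of the raw columns is often easier to inspect.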

If you want to use deep learning for clustering, you are probably looking for some kind of dimensionality reduction. This means that you should look into using autoencoders to transform your high-dimensional data into something lower-dimensional.

But then you are still faced with having to use an actual clustering algorithm on it. K-Means has become somewhat of a default baseline, because it is easy to explain and teach, but there are other approaches too.
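On the original question of experimenting with different numbers of clusters: a common way to do that is to run K-Means for several values of k and compare the within-cluster sum of squares (often called inertia). The sketch below is plain Python with a basic Lloyd's-algorithm loop; the function name, the toy data, the seed, and the range of k are assumptions for illustration, and a real run would normally use a library implementation with several random restarts.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic Lloyd's algorithm. Returns (centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # start from k distinct data points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        new_centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if new_centroids == centroids:
            break                          # assignments stopped changing
        centroids = new_centroids
    inertia = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, inertia

# Toy data (made up): three loose groups.
data = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
        (8.0, 8.0), (8.5, 9.0), (9.0, 8.5),
        (1.0, 9.0), (1.5, 8.0), (2.0, 8.5)]

inertias = {}
for k in range(1, 6):
    _, inertias[k] = kmeans(data, k)
    print(k, round(inertias[k], 3))
```

Inertia generally keeps shrinking as k grows, so the usual heuristic is to look for the k after which the improvement flattens out rather than for the minimum.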

For more on this, take a look at the following two articles, which were the top two results when googling for cluster algorithms:
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68