ND4J Error: class path resource [] cannot be resolved to URL because it does not exist

(Moving the issue from GitHub to here. Please find the original post here.)

Issue Description

I am trying to use .npy files in my Kotlin code, but when I try to load one with the following code (based on the tests and example code):

import org.nd4j.linalg.factory.Nd4j
import org.nd4j.linalg.indexing.NDArrayIndex
import org.nd4j.linalg.io.ClassPathResource
import java.io.File

fun main() {
    val homePath = System.getProperty("user.dir")
    println("Working Directory = $homePath")

    val fileName = "src/assets/Feedback.npy"
    val file = File("/$homePath/$fileName")
    println("File size = " + file.length())

    val validFile = ClassPathResource("/$homePath/$fileName").file
    println(validFile)
}

I am getting this error:

Working Directory = /Users/vidhey/coding/Kotlin/intellij-sandbox
File size = 11691
Exception in thread "main" java.io.FileNotFoundException: class path resource [Users/vidhey/coding/Kotlin/intellij-sandbox/src/assets/Feedback.npy] cannot be resolved to URL because it does not exist
	at org.nd4j.linalg.io.ClassPathResource.getURL(ClassPathResource.java:269)
	at org.nd4j.linalg.io.AbstractResource.getURI(AbstractResource.java:57)
	at TestKt.main(Test.kt:14)
	at TestKt.main(Test.kt)

Process finished with exit code 1

I am not sure whether I'm using the code wrong or whether this is a bug.
One thing I noticed: the path in the error starts with "Users/vidhey/…" instead of "/Users/vidhey/…" as an absolute path should, even though I set the right working directory and passed the proper path.

Version Information

  • ND4J-native 1.0.0-beta6
  • macOS Catalina on MacBook Air 2015
  • Working in IntelliJ IDEA 2020.2.2 Ultimate

I tried both beta6 and beta7, both with ClassPathResource and with a plain File object.
With ClassPathResource, I get the error shown above. With the File object, the app crashes with some kind of segmentation fault:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000012eaf7614, pid=3819, tid=0x0000000000000c03
#
# JRE version: Java(TM) SE Runtime Environment (8.0_162-b12) (build 1.8.0_162-b12)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.162-b12 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [libnd4jcpu.dylib+0x32614]  lengthForShapeBufferPointer+0x4
#
# Core dump written. Default location: /cores/core or core.3819
#
# An error report file with more information is saved as:
# /Users/vidhey/coding/Kotlin/intellij-sandbox/hs_err_pid3819.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

I have posted the entire core dump log on the issue I referenced above if you want to check it out.

I have been using the same files in Python without any issue, so the files are not corrupted. But I am setting allow_pickle=True in Python when making these files, since they are quite big. Can that be the issue?

Here's the sample code I'm using to build my .npy files in Python. For context, df is a DataFrame I load from a CSV:

import numpy as np
# df is a pandas DataFrame loaded from a CSV (see above)
a = {(df["col1"][i], df["col2"][i], df["col3"][i]): (df["col4"][i], df["col5"][i]) for i in range(123)}
np.save("path/to/dir/my_file.npy", a)

I say files because I use 4-5 different NPY files in my code.

@vidheyoza you mentioned in your GitHub issue that you were using pickle. Could you make sure that's not the case for any of your np.save calls and let me know how it goes?

We use the binary numpy format, not pickle, for loading ndarrays.
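
For reference, here's a minimal sketch of loading a plain (non-pickled) .npy file in Kotlin via Nd4j.createFromNpyFile; the path is just a placeholder, and the file is assumed to hold a single numeric array:

import org.nd4j.linalg.factory.Nd4j
import java.io.File

fun main() {
    // Placeholder path to an array saved with np.save(..., allow_pickle=False)
    val npyFile = File("src/assets/plain_array.npy")

    // ND4J reads the binary numpy format directly; a pickled file would fail here
    val array = Nd4j.createFromNpyFile(npyFile)
    println("Shape: ${array.shape().contentToString()}")
}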

So I don't use allow_pickle=True while saving any of the files, but I still need it when I load them elsewhere in other Python code. Is allow_pickle=True a default setting?

@vidheyoza could you clarify why you need allow_pickle=True? You say you need it but don't really say why. Your np.load/np.save calls just need to be consistent with each other. I'm not sure whether it's the default; I'd just look at the docs for np.save to see: numpy.save — NumPy v1.23 Manual
It appears the default is True, yes.
I'd suggest shrinking your tests down.

For any debugging you do, I would advise always narrowing the problem down, especially so others can reproduce it. When working with interop between languages (e.g. Python to Java), always try to understand what each side expects and what the serialization defaults are.

Right, I seem to be getting closer to the issue. numpy raises an error saying ValueError: Object arrays cannot be saved when allow_pickle=False. Looks like I cannot save my dict-based arrays as plain binary files, and I cannot use pickled files in ND4J.

Thanks for helping me narrow down the problem! I'm new to ND4J, so I didn't know the library doesn't support pickled files. Do you have an idea of how I could save dict-based arrays so that they can be read by ND4J?

Sure, no problem! If you need to use dicts, then I would advise separating that out. In ND4J this would be a:

Map<String,INDArray>

You could use a zip file instead. I know it's not the easiest thing to work with, but it at least allows you to load the arrays.
If it helps, under the covers we use GitHub - rogersce/cnpy: library to read/write .npy and .npz files in C/C++ for loading numpy arrays. The problem with loading a dictionary in ND4J is that an INDArray is just one array, not a combination of arrays.

Each array in the zip file would just be a single array with a specific name; you then infer the name of the array from the file name.

Each file would be named name.npy.
Then, when you load it, you can just use:

File f = new File("/path/to/your/file.npy");
String name = f.getName();

From here, just strip the extension from the name, following an example like this:
https://www.technicalkeeda.com/java-tutorials/get-filename-without-extension-using-java

That would get you a Map<String,INDArray>.

Either way, make sure allow_pickle=False when saving, and save one array at a time.
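
Putting that together, a rough Kotlin sketch of building the map from a directory of plain name.npy files might look like this (the directory path is a placeholder and loadNpyDirectory is just a name I made up):

import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.factory.Nd4j
import java.io.File

fun loadNpyDirectory(dir: File): Map<String, INDArray> {
    // One plain .npy file per array; the file name minus the extension becomes the map key
    return dir.listFiles { f -> f.extension == "npy" }
        .orEmpty()
        .associate { f -> f.nameWithoutExtension to Nd4j.createFromNpyFile(f) }
}

fun main() {
    val arrays = loadNpyDirectory(File("src/assets/arrays"))  // placeholder directory
    arrays.forEach { (name, arr) -> println("$name -> ${arr.shape().contentToString()}") }
}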

@agibsonccc you’re a god. I have been stuck on this for an entire day with no progress and suddenly I’m progressing!

If you need to use dicts, then I would advise separating that out. In ND4J this would be a: Map<String,INDArray>

I’m a bit confused as to what you mean. Do I use zip to combine all my NPY arrays?

Also, I use multi-key dicts, so I'm a bit stumped about how to create a multi-key Map object in Java/Kotlin. Should I use a custom data class as a key in the Map?

@vidheyoza pickling appears to allow numpy to save a set of named ndarrays as a dict.
We can’t do that. We can only load the individual numpy arrays.

Yes, you can use a zip to combine the numpy arrays if you still want to save one file; you would then just separate them out when loading (see the sketch at the end of this reply).

For the multi-key map, you can use a Java library for that. I would personally stay away from it and keep the names unique. If you really want a map for that, then look at: collections - MultiValueMap in java - Stack Overflow
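
If you go the zip route, a rough sketch of reading it back in Kotlin could look like the following; it assumes every entry is a plain name.npy saved with allow_pickle=False, and the temp-file extraction is just one way to hand each entry to ND4J:

import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.factory.Nd4j
import java.io.File
import java.util.zip.ZipFile

fun loadNpyZip(zipPath: String): Map<String, INDArray> {
    val result = mutableMapOf<String, INDArray>()
    ZipFile(zipPath).use { zip ->
        for (entry in zip.entries()) {
            if (entry.isDirectory || !entry.name.endsWith(".npy")) continue
            // Copy the entry to a temp file so ND4J can read the binary numpy format from disk
            val tmp = File.createTempFile("npy-", ".npy").apply { deleteOnExit() }
            zip.getInputStream(entry).use { input ->
                tmp.outputStream().use { output -> input.copyTo(output) }
            }
            result[entry.name.removeSuffix(".npy")] = Nd4j.createFromNpyFile(tmp)
        }
    }
    return result
}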

So my dict is created from a single array, so in a way my file is a dict of subarrays, but I guess numpy sees it as a dict of separate arrays.

I'm thinking of converting my dict into a single np array so that it is readable by ND4J, but in a way that I don't lose the ability to access elements by key as directly as possible. I'll keep this post updated with my results.

Update: I dug deeper into the problem. It looks like the issue is less about it being a dict and more about it not being a pure np array: I'm using strings inside the array, so technically it's an object array. Searching for a way around this.

Update 2: Any type other than int and float (and I think complex) is treated as an object by numpy, especially strings, which are encoded as unicode_. @agibsonccc, or does anyone else here know how to work around this? I'm also planning to post this question to the numpy community.

@agibsonccc

Update 3: I've tried many different ways to build the file that I need. I even switched to other formats like CSV and TXT, but I keep getting stuck on the fact that INDArray does not support strings, at least not natively. Is there a way to do that in ND4J, or in any other library for that matter?

@vidheyoza what is your data exactly? It's kind of hard to tell what you want without data and a code snippet that reproduces what you want.

What is the best way to show a data snippet here? I'm a bit reluctant to show the data itself because it is part of an organization's IP and I'm not 100% sure I should put it out on a forum. But I can definitely describe the data here.

In crude terms, one of my data files is a 100x5 spreadsheet that I convert into a dict with a 3-column key and a 2-column value. Most of the columns in this file (and in the others too) contain string values, so I am having difficulty exporting them to a binary .npy file from Python.

With this in place, I wanted a way either to load a dict-based data file in Kotlin, or to have a spreadsheet-based file that I can convert to a Map-like object in Kotlin.

P.S. Sorry for such a vague answer, but I wanted to be a bit careful about this.

You could DM it to me, or just make up data of the same shape. We're not looking for your data here, just to help you solve your problem.

Got it. So as I said, there are 4 files, and the structure is similar for all of them in that we build a "composite key – composite value" dict out of each. For example, one of the files looks like this:

key1 key2 key3 value1 value2
key1_sample1 key2_sample1 0 value1_sample1 value2_sample1.mp3
key1_sample2 key2_sample2 2 value1_sample2 value2_sample2.mp3
key1_sample3 key2_sample3 1 value1_sample3 value2_sample3.mp3
key1_sample4 key2_sample4 2 value1_sample1 value2_sample1.mp3
key1_sample5 key2_sample5 3 value1_sample2 value2_sample2.mp3

Important things to note:

  1. This is the original spreadsheet format we have, that we convert to a key-value dict in Python and store in .npy.
  2. key1 and key2 are strings, key3 is a number (an int, to be specific), value1 is a string, and value2 is the path to an audio file of the same string spoken by a person or a machine (also a string).
  3. The values may be duplicated between rows, but the keys will always be unique (as they should be)
  4. As far as the Python code is concerned, it is just converting a spreadsheet into a dict before saving it into a .npy file. And for the Kotlin code, we want to get value1 and its mp3 audio file name as outputs when we give it a particular composite key as described above.

So I feel like what you should try is to set up a directory structure that reflects the keys/values and indices. Make the directory structure mirror the dictionary structure and put individual numpy arrays that are pure floats underneath it. The main thing is to extract the numpy arrays, save them as .npy, and make sure they contain numerical data only. From there, you can just write normal Kotlin that understands that directory structure, along the lines of the sketch below.
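
A rough Kotlin-side sketch under that assumption (the root directory and the key1/key2/key3/name.npy layout are hypothetical):

import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.factory.Nd4j
import java.io.File

// Walk a tree laid out like root/key1/key2/key3/name.npy and map each relative
// path (minus the .npy extension) to the numeric array stored at that leaf.
fun loadKeyedArrays(root: File): Map<String, INDArray> =
    root.walkTopDown()
        .filter { it.isFile && it.extension == "npy" }
        .associate { f ->
            f.relativeTo(root).path.removeSuffix(".npy") to Nd4j.createFromNpyFile(f)
        }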

Making a separate .npy file for each row of the spreadsheet sounds like overkill, no? I could just make a CSV and have Kotlin read it (or a 3rd-party lib if needed), something like the rough sketch below. It would take longer than the single-.npy-file idea, but it would still be faster than having 100 .npy files for 100 rows.
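
For illustration, here's what I have in mind on the Kotlin side (the comma delimiter, column order, and the RowKey data class are my own assumptions, not a fixed format):

import java.io.File

// Composite key for the three key columns; value1/value2 are kept as a Pair
data class RowKey(val key1: String, val key2: String, val key3: Int)

fun loadCsv(file: File): Map<RowKey, Pair<String, String>> =
    file.readLines()
        .drop(1)  // skip the header row
        .map { it.split(",") }
        .associate { cols ->
            RowKey(cols[0], cols[1], cols[2].toInt()) to Pair(cols[3], cols[4])
        }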

You could, but the main problem is that we can only load .npy files. You could do a zip file structure if you want and just load entries from that. However you want to organize the arrays is up to you, but overall the main thing is to preserve the numerical numpy arrays.

Got it. I found a solution in another 3rd-party library; I only have to figure out how to create the map now. Thanks a lot for your help!