So you already have preloading implemented, and a queue of single inputs, right?
Take as many of those as your hardware can handle and stack them on top of each other, so you get a single input that looks like a mini-batch during training, and let your model work on that (see the sketch below).
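A minimal sketch of that idea, assuming your single inputs are row vectors of shape [1, nFeatures] and you are running a `MultiLayerNetwork` (the class, method name and list type here are just for illustration):

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

import java.util.List;

public class BatchedInference {

    /**
     * Stacks single-example inputs (each of shape [1, nFeatures]) into one
     * mini-batch and runs a single forward pass over all of them.
     * Row i of the returned array corresponds to inputs.get(i).
     */
    public static INDArray classifyBatch(MultiLayerNetwork model, List<INDArray> inputs) {
        INDArray batch = Nd4j.vstack(inputs.toArray(new INDArray[0])); // [batchSize, nFeatures]
        return model.output(batch, false); // train = false -> inference mode
    }
}
```

Afterwards you just split the rows of the result back out to the requests they came from.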
As you can see in DL4J Classification Speed - #16 by ethiel, @torstenbm was able to achieve over 200k classifications per second by batching his inputs.
If you meant the output(DataSetIterator) signature: at the moment this more or less just saves you the loop of manually iterating through your iterator.
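Roughly, what `output(DataSetIterator)` saves you is something like this (a simplified sketch, not the actual implementation):

```java
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
import org.nd4j.linalg.factory.Nd4j;

import java.util.ArrayList;
import java.util.List;

public class ManualIteration {

    /** Manual equivalent of model.output(iterator): run each mini-batch and stack the outputs. */
    public static INDArray outputAll(MultiLayerNetwork model, DataSetIterator iterator) {
        List<INDArray> perBatch = new ArrayList<>();
        while (iterator.hasNext()) {
            DataSet ds = iterator.next();
            perBatch.add(model.output(ds.getFeatures(), false));
        }
        return Nd4j.vstack(perBatch.toArray(new INDArray[0]));
    }
}
```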
But you can still get something close to what you were talking about with Workspaces.
Take a look at https://deeplearning4j.konduit.ai/config/config-memory/config-workspaces#iterators and https://deeplearning4j.konduit.ai/config/config-memory, and the example for them: https://github.com/eclipse/deeplearning4j-examples/blob/master/nd4j-examples/src/main/java/org/nd4j/examples/Nd4jEx15_Workspaces.java
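To give you an idea, here is a minimal workspace sketch along the lines of that example (the sizes, policies and workspace id are just placeholders):

```java
import org.nd4j.linalg.api.memory.MemoryWorkspace;
import org.nd4j.linalg.api.memory.conf.WorkspaceConfiguration;
import org.nd4j.linalg.api.memory.enums.AllocationPolicy;
import org.nd4j.linalg.api.memory.enums.LearningPolicy;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class WorkspaceSketch {
    public static void main(String[] args) {
        // Reusable off-heap memory region; the values here are only illustrative.
        WorkspaceConfiguration conf = WorkspaceConfiguration.builder()
                .initialSize(10 * 1024L * 1024L)            // start with 10 MB
                .policyAllocation(AllocationPolicy.STRICT)  // allocate only inside the workspace
                .policyLearning(LearningPolicy.FIRST_LOOP)  // learn the required size in the first pass
                .build();

        for (int i = 0; i < 100; i++) {
            // Every INDArray created inside this block lives in the workspace; when the
            // block closes, its memory is reused on the next iteration instead of being GC'd.
            try (MemoryWorkspace ws = Nd4j.getWorkspaceManager()
                    .getAndActivateWorkspace(conf, "INFERENCE_LOOP")) {
                INDArray batch = Nd4j.rand(32, 100);   // stand-in for a real mini-batch
                INDArray result = batch.mul(2.0);      // stand-in for model.output(batch)
            }
        }
    }
}
```

If a result has to outlive the block, you detach it from the workspace first; the linked example shows how that works.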