Fine-tuning BERT in DL4J

We are interested in using BERT in DL4J. A couple of questions have come up:

What options currently exist for using BERT in DL4J? Ideally, DL4J would offer a BERT model with pretrained weights, ready to be fine-tuned. Will this be a feature at some point? I’m assuming that for BERT to work natively in DL4J, it will have to support multi-head attention layers.

The example here uses a fine-tuned model that is imported with SameDiff.

Is there a way to convert a SameDiff-imported model to a DL4J model? If one were to import and fine-tune a BERT model in DL4J today, how would one go about it?

When you import a model with SameDiff, you get a computation graph and this graph can be both modified and trained.

So if you want to run some additional fine-tuning, you can splice in your fine-tuning graph and continue training from there. SameDiffMNISTTrainingExample shows how to train a SameDiff graph.
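To make the idea of a spliced-in fine-tuning graph a bit more concrete, here is a minimal plain-Java sketch of what such an added classification head computes. It deliberately uses no DL4J or SameDiff calls: the class `FineTuneHeadSketch`, its methods, and the feature values are all hypothetical, and the fixed `features` array merely stands in for the output of the imported graph. In SameDiff, the same pieces would be expressed as graph variables and ops instead of hand-written loops.

```java
/** Plain-Java illustration (no DL4J calls) of a softmax classification head on fixed features. */
class FineTuneHeadSketch {

    // Parameters of the hypothetical head: w is numClasses x featureSize, b is numClasses.
    final double[][] w;
    final double[] b;

    FineTuneHeadSketch(int numClasses, int featureSize) {
        w = new double[numClasses][featureSize];
        b = new double[numClasses];
    }

    /** Class probabilities: softmax(W * features + b). */
    double[] predict(double[] features) {
        double[] out = new double[b.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < b.length; c++) {
            out[c] = b[c];
            for (int i = 0; i < features.length; i++) {
                out[c] += w[c][i] * features[i];
            }
            max = Math.max(max, out[c]);
        }
        double sum = 0.0;
        for (int c = 0; c < out.length; c++) {
            out[c] = Math.exp(out[c] - max);   // subtract the max for numerical stability
            sum += out[c];
        }
        for (int c = 0; c < out.length; c++) {
            out[c] /= sum;
        }
        return out;
    }

    /** Cross-entropy loss for one example with an integer class label. */
    double loss(double[] features, int label) {
        return -Math.log(predict(features)[label]);
    }

    /** One SGD step on the head only; the features (the imported graph's output) stay fixed. */
    void sgdStep(double[] features, int label, double learningRate) {
        double[] p = predict(features);
        for (int c = 0; c < b.length; c++) {
            double grad = p[c] - (c == label ? 1.0 : 0.0);   // dLoss/dLogit for class c
            b[c] -= learningRate * grad;
            for (int i = 0; i < features.length; i++) {
                w[c][i] -= learningRate * grad * features[i];
            }
        }
    }

    public static void main(String[] args) {
        double[] features = {0.5, -1.0, 2.0};   // hypothetical pooled output of the imported model
        FineTuneHeadSketch head = new FineTuneHeadSketch(2, features.length);
        System.out.println("loss before: " + head.loss(features, 1));
        head.sgdStep(features, 1, 0.1);
        System.out.println("loss after:  " + head.loss(features, 1));
    }
}
```

Here only the head's parameters are updated. Whether the imported weights are trained as well depends on whether they are constants or variables in the graph.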

Unfortunately, SameDiff isn’t as well documented as it deserves at the moment, so if you have trouble getting that suggestion to work on your own, feel free to show us what you’ve tried and we will try to help you figure it out.

Thank you for the response. I have some further questions about fine-tuning BERT:

1.) Is there another way to remove the hard-coded dropout layers? Otherwise, it is hard to evaluate on the test set between training epochs, since I would always have to remove and re-add these layers.

2.) I imported a German BERT model from Hugging Face but ran into the following problem:

Exception in thread "main" java.lang.UnsupportedOperationException: Please extend DynamicCustomOp.doDiff to support SameDiff backprop operations. Op: org.nd4j.linalg.api.ops.impl.controlflow.compat.Merge
at org.nd4j.linalg.api.ops.DynamicCustomOp.doDiff(
at org.nd4j.autodiff.functions.DifferentialFunction.diff(
at org.nd4j.autodiff.samediff.SameDiff$1.define(
at org.nd4j.autodiff.samediff.SameDiff.defineFunction(
at org.nd4j.autodiff.samediff.SameDiff.defineFunction(
at org.nd4j.autodiff.samediff.SameDiff.createGradFunction(
at org.nd4j.autodiff.samediff.SameDiff.createGradFunction(
at org.nd4j.autodiff.samediff.SameDiff.fitHelper(
at org.nd4j.autodiff.samediff.config.FitConfig.exec(

Is there a way around this?

3.) org.nd4j.imports.TFGraphs.BERTGraphTest contains an example that trains BERT on the NSP task (BERTGraphTest::testBertTraining). I have trouble understanding the following part of the code in detail:

    Set<String> floatConstants = new HashSet<>(Arrays.asList(
            "bert/embeddings/LayerNorm/batchnorm/add/y",    //Scalar - Eps Constant?
            "bert/embeddings/dropout/random_uniform/min",   //Dropout scalar values
            ...

    //For training, convert weights and biases from constants to variables:
    for(SDVariable v : sd.variables()){
        if(v.isConstant() && v.dataType().isFPType() && !v.getArr().isScalar() && !floatConstants.contains(...)){    //Skip scalars - trainable params
            ...("Converting to variable: {} - dtype: {} - shape: {}", ..., v.dataType(), Arrays.toString(v.getArr().shape()));
            ...

  • Why is the block commented out? What is it for?
  • What is “bert/encoder/ones” for in the original BERT?
  • Is it bad to convert them to variables? I ask because in the German BERT model I imported, the variables and operations have completely different names, and I wouldn’t know what the equivalent variables would be there.
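For readers following along, the filter in the loop above can be read as a simple rule: convert a constant to a variable only if it is floating point, not a scalar, and not on the exclusion list. Below is a minimal plain-Java sketch of that rule. It is not ND4J code: `GraphConstant`, `selectTrainable`, the `hypothetical/...` names, and all shapes are made up for illustration. Treating `bert/encoder/ones` as an excluded non-scalar float constant is likewise an assumption of this sketch, not something confirmed by the test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java illustration of the filter; none of these types are ND4J classes. */
class ConstantSelectionSketch {

    /** Hypothetical stand-in for the properties the loop checks on each constant. */
    static final class GraphConstant {
        final String name;
        final boolean floatingPoint;
        final long[] shape;

        GraphConstant(String name, boolean floatingPoint, long... shape) {
            this.name = name;
            this.floatingPoint = floatingPoint;
            this.shape = shape;
        }

        /** Assumption for this sketch: any single-element array counts as a scalar. */
        boolean isScalar() {
            long elements = 1;
            for (long dim : shape) {
                elements *= dim;
            }
            return elements == 1;
        }
    }

    /** Returns the names of constants that would be converted to trainable variables. */
    static List<String> selectTrainable(List<GraphConstant> constants, Set<String> excluded) {
        List<String> selected = new ArrayList<>();
        for (GraphConstant c : constants) {
            if (c.floatingPoint && !c.isScalar() && !excluded.contains(c.name)) {
                selected.add(c.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<GraphConstant> constants = Arrays.asList(
                new GraphConstant("hypothetical/dense/kernel", true, 768, 2),          // weight matrix
                new GraphConstant("bert/embeddings/LayerNorm/batchnorm/add/y", true),  // scalar
                new GraphConstant("bert/encoder/ones", true, 1, 128, 1),               // shape is made up
                new GraphConstant("hypothetical/reshape/shape", false, 2));            // integer constant
        Set<String> excluded = new HashSet<>(Arrays.asList("bert/encoder/ones"));
        System.out.println(selectTrainable(constants, excluded));
    }
}
```

With these inputs only the weight matrix is selected. The point of an exclusion list is that a float, non-scalar constant is not necessarily a weight, which is why a model with different names would need its own list.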

Thank you very much in advance for your answer.


@AlexBlack can you comment here?

As an addition to 2.), I have realized that this might be a general issue for BERT models that are exported from Keras-Bert. I imported the multilingual model from google-research into Keras and exported it as a “.pb” file afterward. For this model, I ran into the same problem.

@AlexBlack do you have any thoughts about the three points?

Thank you very much

Apologies for not replying sooner.
We don’t have pretrained BERT models in DL4J, but yes, it’s possible to import TensorFlow versions into SameDiff.

The BERTGraphTest will probably not be useful for other BERT implementations, due to different naming structure, ops, etc.

As for the German BERT model from Hugging Face - do you have code to export that as a TF frozen model?
Same for the Keras-BERT version.
Or better yet, an actual model file.

Happy to take a look at those if I can get some model files. Without model files, it’s hard to say much unfortunately.

Thank you for your response.

Yes, I’ve run this Python script in Jupyter to generate the frozen graph as a “.pb” file.

Generally, it would also be fine to load the original TensorFlow multilingual base model into DL4J with SameDiff, either from the original source or from TensorFlow Hub.

The German model might just be a further improvement, since we are aiming to classify German texts.

How did you originally convert the BERT model in the DL4J test resources to “.pb”?

In you can see the ways how it is modified.

I also have some questions about using a pretrained BERT model in DL4J/SameDiff (in short, I’d like to train a sequence classification/sequence labeling model by importing BERT into DL4J). Should I create another post or just describe it here in detail?

Sorry, I was a little imprecise. I meant: how was the file bert_mrpc_frozen_v1.pb originally generated? How were the publicly available TensorFlow BERT models (checkpoints) transformed into frozen graphs in “.pb” format?

If I knew that, I could generate a frozen graph in “.pb” format from the multilingual base model (original paper) on my own. Then I could load it into SameDiff and adapt it to my problem in a similar fashion as in the BERTGraphTest.

Take a look at

In particular Step 5: Freeze Graph

Thank you for the link. This should help me move forward.