Fine-tuning Bert in DL4J

lln · April 3, 2020, 4:17pm

We are interested in using BERT in DL4J. A couple of questions have come up:

What options currently exist for using BERT in DL4J? The ideal case would be a BERT model with pretrained weights in DL4J, ready to be fine-tuned. Will this be a feature at some point? For BERT to work natively in DL4J, I’m assuming DL4J will have to support multi-head attention layers.

The example here uses a fine-tuned model that is imported with SameDiff.

Is there a way to convert a SameDiff-imported model to a DL4J model? If one were to import and fine-tune a BERT model in DL4J today, how would one go about it?

treo · April 5, 2020, 10:24am

When you import a model with SameDiff, you get a computation graph and this graph can be both modified and trained.

So if you want to run some additional fine tuning, you can splice in your fine tuning graph, and continue training in that direction. SameDiffMNISTTrainingExample shows how to train a SameDiff graph.

Unfortunately, SameDiff isn’t as well documented as it deserves at the moment, so if you have trouble getting that suggestion to work on your own, feel free to show us what you’ve tried and we will try to help you figure it out.

weinino · April 9, 2020, 10:06am

Thank you for the response. I have some further questions about fine-tuning BERT:

1.) Is there another way to remove the hard-coded dropout layers? Otherwise, it is hard to run a test evaluation between training epochs, since I would have to remove and re-add these layers every time.

2.) I imported a German BERT model from Hugging Face but ran into the following problem:

Exception in thread "main" java.lang.UnsupportedOperationException: Please extend DynamicCustomOp.doDiff to support SameDiff backprop operations. Op: org.nd4j.linalg.api.ops.impl.controlflow.compat.Merge
at org.nd4j.linalg.api.ops.DynamicCustomOp.doDiff(DynamicCustomOp.java:560)
at org.nd4j.autodiff.functions.DifferentialFunction.diff(DifferentialFunction.java:560)
at org.nd4j.autodiff.samediff.SameDiff$1.define(SameDiff.java:4443)
at org.nd4j.autodiff.samediff.SameDiff.defineFunction(SameDiff.java:3987)
at org.nd4j.autodiff.samediff.SameDiff.defineFunction(SameDiff.java:3972)
at org.nd4j.autodiff.samediff.SameDiff.createGradFunction(SameDiff.java:4183)
at org.nd4j.autodiff.samediff.SameDiff.createGradFunction(SameDiff.java:4090)
at org.nd4j.autodiff.samediff.SameDiff.fitHelper(SameDiff.java:1669)
at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1591)
at org.nd4j.autodiff.samediff.SameDiff.fit(SameDiff.java:1531)
at org.nd4j.autodiff.samediff.config.FitConfig.exec(FitConfig.java:173)

Is there a way around this?

3.) org.nd4j.imports.TFGraphs.BERTGraphTest contains an example that trains BERT on the NSP task (BERTGraphTest::testBertTraining). I have trouble understanding the following part of the code in detail:

     /*
    Set<String> floatConstants = new HashSet<>(Arrays.asList(
            "bert/embeddings/one_hot/on_value",
            "bert/embeddings/one_hot/off_value",
            "bert/embeddings/LayerNorm/batchnorm/add/y",    //Scalar - Eps Constant?
            "bert/embeddings/dropout/keep_prob",
            "bert/encoder/ones",
            "bert/embeddings/dropout/random_uniform/min",   //Dropout scalar values
            "bert/embeddings/dropout/random_uniform/max"
    ));*/

    Set<String> floatConstants = new HashSet<>(Arrays.asList(
            "bert/encoder/ones"
    ));

    //For training, convert weights and biases from constants to variables:
    for(SDVariable v : sd.variables()){
        if(v.isConstant() && v.dataType().isFPType() && !v.getArr().isScalar() && !floatConstants.contains(v.name())){    //Skip scalars - trainable params
            log.info("Converting to variable: {} - dtype: {} - shape: {}", v.name(), v.dataType(), Arrays.toString(v.getArr().shape()));
            v.convertToVariable();
        }
    }

Why is the block commented out? What is it for?
What are the “bert/encoder/ones” for in the original BERT?
Is it bad to convert them to variables? I ask because in the German BERT model I imported, the variables and operations have completely different names, and I wouldn’t know what the equivalent variables would be there.
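
For readers following along, the selection rule in the loop above can be written as a plain predicate. This is only a minimal sketch in plain Java, without ND4J: `VarInfo` is a hypothetical stand-in for the `SDVariable` properties the loop reads, the scalar check is a simplification of `INDArray.isScalar()`, and all shapes as well as the kernel and Reshape names are illustrative assumptions rather than values taken from a real imported graph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TrainableConstantFilter {

    /** Hypothetical stand-in for the SDVariable properties the loop inspects. */
    static final class VarInfo {
        final String name;
        final boolean constant;       // mirrors v.isConstant()
        final boolean floatingPoint;  // mirrors v.dataType().isFPType()
        final long[] shape;           // mirrors v.getArr().shape()

        VarInfo(String name, boolean constant, boolean floatingPoint, long... shape) {
            this.name = name;
            this.constant = constant;
            this.floatingPoint = floatingPoint;
            this.shape = shape;
        }

        /** Simplified scalar check: rank 0, or every dimension equal to 1. */
        boolean isScalar() {
            for (long d : shape) {
                if (d != 1) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Names of constants that the loop would convert to trainable variables. */
    static List<String> selectTrainable(List<VarInfo> vars, Set<String> keepConstant) {
        List<String> selected = new ArrayList<>();
        for (VarInfo v : vars) {
            if (v.constant && v.floatingPoint && !v.isScalar()
                    && !keepConstant.contains(v.name)) {
                selected.add(v.name);
            }
        }
        return selected;
    }

    /** Small demo; shapes are made up for illustration. */
    static List<String> demo() {
        List<VarInfo> vars = Arrays.asList(
                // Floating-point, non-scalar constant: treated as a weight.
                new VarInfo("bert/encoder/layer_0/attention/self/query/kernel", true, true, 768, 768),
                // Non-scalar float constant that is explicitly excluded by name.
                new VarInfo("bert/encoder/ones", true, true, 1, 128, 1),
                // Scalar float constant: skipped by the scalar check.
                new VarInfo("bert/embeddings/dropout/keep_prob", true, true),
                // Integer constant: skipped by the data type check.
                new VarInfo("bert/embeddings/Reshape/shape", true, false, 2));
        Set<String> keepConstant = new HashSet<>(Arrays.asList("bert/encoder/ones"));
        return selectTrainable(vars, keepConstant);
    }

    public static void main(String[] args) {
        // prints [bert/encoder/layer_0/attention/self/query/kernel]
        System.out.println(demo());
    }
}
```

Under these assumptions, the only name-dependent part is the exclusion set. The scalar entries from the commented-out block would already be skipped by the scalar check, which may be why only "bert/encoder/ones" remains in the active set, but that is a guess and not something confirmed in this thread.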

Thank you very much in advance for your answer.

Cheers
Nino

treo · April 9, 2020, 10:30am

@AlexBlack can you comment here?

weinino · April 14, 2020, 3:53pm

As an addition to 2.), I have realized that this might be a general issue for BERT models that are exported from Keras-Bert. I imported the multilingual model from google-research into Keras and exported it as a “.pb” file afterward. For this model I ran into the same problem.

@AlexBlack do you have any thoughts about the three points?

Thank you very much

AlexBlack · April 17, 2020, 1:14am

Apologies for not replying sooner.
We don’t have pretrained BERT models in DL4J, but yes, it’s possible to import TensorFlow versions into SameDiff.

The BERTGraphTest will probably not be useful for other BERT implementations, due to different naming structure, ops, etc.

As for the German BERT model from Hugging Face - do you have code to export that as a TF frozen model?
Same thing for the Keras-BERT version.
Or better yet, an actual model file.

Happy to take a look at those if I can get some model files. Without model files, it’s hard to say much unfortunately.

weinino · April 17, 2020, 8:25am

Thank you for your response.

Yes, I ran this Python script in Jupyter to generate the frozen graph as a “.pb” file.

https://gist.github.com/weinino/916baafc2a064b7962358b5583314ce0

import os, argparse
import tensorflow as tf

from keras_bert.bert import get_model
from keras_bert.loader import load_trained_model_from_checkpoint
from keras.optimizers import Adam
from keras import Model

BERT_PRETRAINED_DIR = 'C:/workspaces/bert/data/multi_cased_L-12_H-768_A-12'
# BERT_PRETRAINED_DIR = 'C:/workspaces/bert/data/bert_base_german_cased'

(The gist preview is truncated here; the full script is at the gist link above.)

In general, it would also be fine to load the original TensorFlow multilingual base model in DL4J with SameDiff, either from the original source or from TensorFlow Hub.

The German model might just be a further improvement, since we are aiming to classify German texts.

How did you originally convert the BERT model in the DL4J test resources to “.pb”?

treo · April 17, 2020, 8:30am

In https://github.com/eclipse/deeplearning4j/blob/master/nd4j/nd4j-backends/nd4j-tests/src/test/java/org/nd4j/imports/TFGraphs/BERTGraphTest.java you can see how it is modified.

AllenWGX · April 17, 2020, 11:00am

I also have some questions about using a pretrained BERT model in DL4J/SameDiff (in short, I’d like to train a sequence classification or sequence labeling model by importing BERT into DL4J). Should I create another post or just describe it here in detail?

weinino · April 17, 2020, 12:35pm

Sorry, I was a little imprecise. I meant: how was the file bert_mrpc_frozen_v1.pb originally generated? How were the publicly available TensorFlow BERT models (checkpoints) converted into frozen graphs in “.pb” format?

If I knew that, I could generate a frozen graph in “.pb” format from the multilingual base model (original paper) on my own. Then I could load it into SameDiff and adapt it to my problem in a similar fashion as in the BERTGraphTest.

treo · April 17, 2020, 12:47pm

Take a look at dl4j-dev-tools/import-tests/model_zoo/bert at master · KonduitAI/dl4j-dev-tools · GitHub

In particular Step 5: Freeze Graph

weinino · April 21, 2020, 7:40am

Thank you for the link. This should help me move forward.

kgoderis · March 4, 2024, 7:38am

Reading this thread, do I understand correctly that it is currently not possible to import a TensorFlow model and fine-tune it in Java?

agibsonccc · March 8, 2024, 2:11pm

@kgoderis Yes, you can. Use the new API: Model Import Framework - Deeplearning4j

kgoderis · March 9, 2024, 1:06pm

@agibsonccc Adam, is reading a model and altering it before fine-tuning easy to do with DL4J? For example, removing some layers or adding new heads? The idea is to read a Hugging Face model and alter it with new layers.

agibsonccc · March 11, 2024, 10:37am

@kgoderis The SameDiff API allows you to remove variables and gives access to the internal op dictionary. Unfortunately, the older DL4J API has a fine-tuning API like that, but the newer SameDiff API does not.
The upcoming release improves this quite a bit, but I’m still working on cleaning up the tests.

In the current M2.1 it’s not the best experience. I’m happy to elaborate further. Please open a new post if you decide to move forward. Thanks!
