Best technique for multi-label document classification

A customer of mine manually processes Word documents that contain configuration data. These are legal contracts between an insurer and doctors (PCP = primary care provider) in which the PCP agrees to deliver a certain type of care, starting on a certain date, under certain financial conditions. These documents must be converted into a structured definition in our claims processing system. For example, the contract may state that the PCP will deliver the services for Commercial PPO, but not for Medicare PPO. So basically the problem is to map a document to a set of tables L1, L2 … LN.
This is currently a time-consuming and error-prone manual process. I’m working on a PoC to improve it using DL.

What is the best technique to use for this problem? Which DL4J example is the best starting point?
Complicating factor: a label L may actually be a record. Not only is the code relevant, but possibly also a start date and some amounts.

Topic should be in DL4J category I guess.

I think you should start out with something less flashy. Deep learning is not always the best choice, and especially in cases like the one you’ve presented, it is likely to be unnecessarily hard to get to a working solution.

For this reason I suggest that you start out with just manually coded rules. In situations like yours you will likely have a limited number of input templates, i.e. most of your inputs will be based on just a few templates and differ only in the values filled into them. This should at least give you a baseline that you can use to compare how well other solutions work.
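
To make the idea of a rule-based baseline concrete, here is a minimal Python sketch. The label names, the keyword patterns and the negation heuristic are all assumptions for illustration; real rules would have to be derived from your actual contract templates.

```python
import re

# Hypothetical rules: each label maps to a pattern that signals it.
# Label names and patterns are illustrative, not taken from real contracts.
RULES = {
    "COMMERCIAL_PPO": re.compile(r"\bcommercial\s+ppo\b", re.IGNORECASE),
    "MEDICARE_PPO": re.compile(r"\bmedicare\s+ppo\b", re.IGNORECASE),
}

# Hypothetical negation cue: "not", "except" or "excluding" within the
# 40 characters before the product name, inside the same sentence.
NEGATION = re.compile(r"\b(?:not|except|excluding)\b[^.]{0,40}$", re.IGNORECASE)


def label_document(text):
    """Return the labels whose pattern occurs without a preceding negation."""
    labels = set()
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for label, pattern in RULES.items():
            for match in pattern.finditer(sentence):
                prefix = sentence[:match.start()]
                if not NEGATION.search(prefix):
                    labels.add(label)
    return labels
```

Calling `label_document` on the sentence "The PCP will deliver services for Commercial PPO, but not for Medicare PPO." yields only the hypothetical `COMMERCIAL_PPO` label. Every decision can be traced back to one pattern, which is what makes this kind of baseline easy to explain and to fix.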

If you insist on using deep learning, the sentence classification example may be a good start. But instead of having a single class output (one-hot encoding) you would use a multi-hot encoding for multiple classes with a sigmoid activation function, so that each label gets its own independent yes/no output.
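
The difference between one-hot and multi-hot targets can be sketched without any framework. The following plain Python example is only an illustration, not DL4J code: the label list is hypothetical, and the loss shown is the binary cross-entropy that is usually paired with sigmoid outputs.

```python
import math

# Hypothetical label vocabulary; the order fixes the output positions.
LABELS = ["COMMERCIAL_PPO", "MEDICARE_PPO", "MEDICAID_HMO"]


def multi_hot(active, labels=LABELS):
    """Encode a set of active labels as a 0/1 vector (several ones allowed)."""
    return [1.0 if label in active else 0.0 for label in labels]


def sigmoid(x):
    """Squash a raw score into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(logits, targets):
    """Mean per-label binary cross-entropy between sigmoid(logits) and targets."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for z, t in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)


def predict(logits, labels=LABELS, threshold=0.5):
    """Turn raw network outputs into a label set by thresholding each sigmoid."""
    return {label for label, z in zip(labels, logits) if sigmoid(z) >= threshold}
```

Unlike softmax, the sigmoid outputs do not have to sum to one, so a document can receive zero, one or several labels. The 0.5 threshold is an assumption and is often tuned per label on validation data.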

Alternatively you could look into modelling this as a translation task, where the input is the document, and the “translation” output is a string that contains your labels along with the additional information. However, in order to train a model like that successfully, you have to know exactly what you are doing and you will need a lot of compute power.
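
For the translation framing, the target side has to be a string that can be turned back into records. Here is one possible serialization as a rough Python sketch; the `LabelRecord` type, the field names and the `|` and `;` separators are assumptions made for this example, and it assumes that codes never contain those separator characters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LabelRecord:
    """Hypothetical record: a code plus optional start date and amount."""
    code: str
    start_date: Optional[str] = None  # ISO date string, e.g. "2020-01-01"
    amount: Optional[str] = None      # kept as text to avoid rounding


def to_target(records: List[LabelRecord]) -> str:
    """Serialize records into one target string, sorted by code for stability."""
    parts = []
    for record in sorted(records, key=lambda r: r.code):
        fields = [record.code, record.start_date or "", record.amount or ""]
        parts.append("|".join(fields))
    return " ; ".join(parts)


def from_target(target: str) -> List[LabelRecord]:
    """Parse a target string back into records; empty fields become None."""
    records = []
    for part in target.split(";"):
        part = part.strip()
        if not part:
            continue
        # Pad with empty fields so that a bare code still parses.
        code, start_date, amount = (part.split("|") + ["", ""])[:3]
        records.append(LabelRecord(code, start_date or None, amount or None))
    return records
```

Sorting the records gives every document exactly one valid target string, which makes training targets consistent. The parser is also what you would run on the model output, so malformed output has to be handled somewhere in a real system.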

Thanks for your honest answer. I will certainly keep this in mind. However, we identified a number of use cases where AI/ML could be used. So apart from solving a problem, this is also a learning experience. Basically we have the tool (AI/ML/DL) and searched for a problem ;). (I understand this is not the optimal order of working.)

I do think, however, that ML can add value. The documents can be of any format. A certain keyword occurring in one paragraph may mean "add this option", while in another paragraph it is just a side note that can be ignored.

Another example: the input is basically free format, so the customer may write “all options except A” or “Options A-Z, excluding D”.

If I model this as an RNN with proper encoding, the model would automatically learn to distinguish all those variations. Manually coded rules would need too much maintenance and become tied to a particular document format.
My thought was to prevent that by having a proper word encoding + RNN.

What do you think?
There does not seem to be a way to attach a sample document to my message?

This sounds a lot easier than it actually is. Natural Language Processing (NLP) is one of the hardest problems that you can start out with these days. Not only do you have to understand deep learning, but you’ll also have to understand all of the pre-processing, post-processing and ways of scoring.

Manually coded rules may need maintenance, but at least you get a very explainable solution. If you use deep learning for this problem, you will lose that property quickly if you don’t understand exactly what is happening, and even then it may be hard to explain why it worked the way it did. Fixing problems when they are noticed isn’t a simple “well, we’ll add another rule”; it involves re-training, re-tuning and re-validating, often multiple times.

I suggest that you start with something that takes you through a progression of increasingly harder problems, so you can deliberately learn which tool is the best for the job. There are many options, from online courses to books. If you really want to stay in the text domain, I like to recommend the NLTK book, as it takes you through all the steps and shows you the different options you’ve got for non-deep-learning ML.

Or if you want something heavier, Pattern Recognition and Machine Learning by Christopher M. Bishop is also a good reference and contains some exercises for you. But you should be comfortable with some heavy math notation.


At first I was a bit surprised by your answer. The problems you are hinting at are generic to any ML/DL approach: having to retrain, and difficulty understanding why exactly it worked that way. It feels like you are making the case for your own tool/approach less attractive. But I appreciate your honesty.

On the theoretical part, I’ve done my preparations: I completed the Machine Learning and Deep Learning courses at Coursera and worked through 3-5 books thoroughly. Some books were about the theory, others about the practical application. So I’m pretty convinced that I understand the problems ahead.

Your remarks make me change my approach, however. Rationale: I do not have sufficient time for a full DL approach, and management and customers may not be ready for it. DL also has the disadvantage of lacking traceability and explainability, which are important in my domain (insurance, where regulations and policies are strict). So I am now taking a more classical, non-DL approach.

Oracle Text (I’m an Oracle guy) can convert Word to plain text. Next step is using the Named Entity Recognition, present in Oracle Text. One can easily define new entity types to be extracted using Extract Rules. After the extraction, a simple Random Forest approach may be sufficient to add learning capabilities.
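
As a rough illustration of the "extract first, classify later" idea, here is a plain Python stand-in. It is not Oracle Text and does not use its Extract Rules; the patterns, the US-style date format and the dollar amounts are all assumptions for this sketch.

```python
import re
from datetime import datetime

# Hypothetical patterns: US-style or ISO dates, and dollar amounts.
DATE_PATTERN = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")
AMOUNT_PATTERN = re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d{2})?)")


def parse_date(raw):
    """Normalize a matched date to ISO format, or return None if invalid."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def extract_entities(text):
    """Collect candidate start dates and amounts found in plain text."""
    parsed = (parse_date(raw) for raw in DATE_PATTERN.findall(text))
    dates = [d for d in parsed if d is not None]
    amounts = [float(raw.replace(",", "")) for raw in AMOUNT_PATTERN.findall(text)]
    return {"dates": dates, "amounts": amounts}
```

The extracted values could then serve as fields of a record or as input features for a classifier such as a Random Forest.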

Going to buy the NLP textbook that you suggested as well.

Thanks, Jaco Verheul

Deep learning is inherently more difficult to explain than other approaches like decision trees. I want people to be aware of that when they jump into it.

And that is exactly why I suggested going with something else first. Besides the benefits already stated, this gives you three additional things:

  1. Something that establishes a baseline that you can use for comparison with later efforts
  2. Something that you can actually use to continue the project
  3. Experience in the domain

If you find that the approach you’re using isn’t satisfying enough, you can always get out the big guns (deep learning).
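
To compare a later model against the baseline, both need to be scored the same way on the same documents. Here is a minimal Python sketch of micro-averaged precision, recall and F1 over label sets; the choice of micro-averaging is an assumption, and per-label scores may be more informative if some labels are rare.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 for multi-label predictions.

    gold and predicted are equally long lists of label sets, one per document.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for gold_labels, predicted_labels in zip(gold, predicted):
        tp += len(gold_labels & predicted_labels)  # correctly predicted labels
        fp += len(predicted_labels - gold_labels)  # predicted but not in gold
        fn += len(gold_labels - predicted_labels)  # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Run this once with the rule-based predictions and once with the model predictions on the same held-out documents, and you can see directly whether the extra complexity pays off.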