Ontology design patterns – effects on Description Logic reasoner performance

This post was triggered by a presentation at the April 2013 IHTSDO Technical Committee meeting in Copenhagen. The presentation described experiences of trying to use ‘novel’ ontology constructs in a large OWL ontology and how they affected the performance of various Description Logic (DL) reasoners. The problem can be briefly stated as:

Addition of a relatively small number of ‘universal restrictions’ and ‘disjoint axioms’ disproportionately increased the time required by the DL reasoner to classify the hierarchy (the classification time). The classification times also varied unpredictably between the various reasoners!

I am not yet sure if the metrics presented can be shared, but I’ll update this post with them if I can. However, returning to the main theme of this post, let me explain why this matters to SNOMED CT. The performance of DL reasoners is known to be worst-case intractable (ExpTime-hard or worse) for some constructs and levels of expressiveness. The tractability of an OWL ontology depends on the size and the complexity of the axioms in the knowledge base (the ontology). The current expressiveness of SNOMED CT falls within the EL++ profile, which, among other things, excludes ‘disjunctions’ and ‘negations’. Given this limited expressivity, there are many DL reasoners that can currently classify all of SNOMED CT in a few seconds. Do not get me wrong: for whatever SNOMED CT does currently, the EL++ profile is sufficient. However, as SNOMED CT is increasingly used to support newer use cases or to extend existing functionality, it is likely that some of these use cases will take us beyond the EL++ profile.
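
To make this a bit more concrete, here is a small OWL API (Java) sketch contrasting the kind of axiom SNOMED CT definitions use today with the constructs mentioned above. The class and property names are invented purely for illustration (they are not actual SNOMED CT content), and I’m assuming a recent OWL API release on the classpath:

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

public class ProfileBoundaryExample {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = manager.getOWLDataFactory();
        OWLOntology ont = manager.createOntology(IRI.create("http://example.org/el-boundary"));

        String ns = "http://example.org/el-boundary#";
        OWLClass product = df.getOWLClass(IRI.create(ns + "Product"));
        OWLClass amoxicillin = df.getOWLClass(IRI.create(ns + "Amoxicillin"));
        OWLClass clavulanate = df.getOWLClass(IRI.create(ns + "Clavulanate"));
        OWLObjectProperty hasIngredient = df.getOWLObjectProperty(IRI.create(ns + "hasActiveIngredient"));

        // 1. EL-style definition (intersection + existential restriction) --
        //    the general shape that SNOMED CT definitions currently take.
        manager.addAxiom(ont, df.getOWLSubClassOfAxiom(
                df.getOWLClass(IRI.create(ns + "AmoxicillinProduct")),
                df.getOWLObjectIntersectionOf(product,
                        df.getOWLObjectSomeValuesFrom(hasIngredient, amoxicillin))));

        // 2. Universal restriction (ObjectAllValuesFrom) -- outside the OWL 2 EL profile,
        //    and one of the constructs mentioned in the problem statement above.
        manager.addAxiom(ont, df.getOWLSubClassOfAxiom(
                df.getOWLClass(IRI.create(ns + "AmoxicillinOnlyProduct")),
                df.getOWLObjectAllValuesFrom(hasIngredient, amoxicillin)));

        // 3. Disjunction (ObjectUnionOf) -- also outside the EL profile.
        manager.addAxiom(ont, df.getOWLSubClassOfAxiom(
                df.getOWLClass(IRI.create(ns + "PenicillinLikeIngredient")),
                df.getOWLObjectUnionOf(amoxicillin, clavulanate)));

        // 4. Disjointness -- allowed in OWL 2 EL, but the other construct reported
        //    above as adding to classification time.
        manager.addAxiom(ont, df.getOWLDisjointClassesAxiom(amoxicillin, clavulanate));

        System.out.println("Axiom count: " + ont.getAxiomCount());
    }
}
```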

The ‘problem’ as presented above illustrates the experience of such a foray beyond the comfort of the known, into OWL constructs like universal restrictions and disjoint axioms. The ‘inconsistent and disproportionate’ increase in classification time is a symptom of something that knowledge engineers in the OWL (and DL) world have recognised for a while. From experience, they can tell you that certain modelling patterns, when used in an ontology, make it ‘intractable’. Please note that the definition of ‘intractable’ can be use-case dependent: for some use cases it might mean classification times greater than a few seconds, and in others it might mean greater than a few hours (e.g. overnight). Some of these modelling patterns lead to intractability only after the same pattern has been used repeatedly in the ontology. Small ontologies with such patterns might perform adequately during testing. There is therefore a risk that when small ontologies are scaled up, their performance will suddenly and unexpectedly deteriorate. Currently there exists no theoretical way to predict whether a certain modelling pattern will go on to make an ontology intractable as it grows in size. Of course, the logicians will happily point out the big pitfalls that you need to stay away from, e.g. using disjunctions. But what about the problems that only surface after I’ve added a few thousand axioms? To quote one of my gurus, it’s the elephant traps that you can’t see that you need to watch out for! So let me rephrase the problem as:

Will my ontology still classify in a reasonable time after I have added a few thousand concepts, if I am using design pattern X or logical construct/axiom Y?

The answer to this question is often not straightforward. It does, however, highlight two things that ‘naive’ authors of OWL ontologies or implementers might miss:

1. Though every DL reasoner broadly states its supported expressivity profile, the actual implementations differ. No two DL reasoners are likely to have identical performance, simply because they are software programs created by different teams/authors, and we all know there is often more than one way to write code that does the same thing (a small timing sketch follows this list).
2. The algorithms that DL reasoners use to classify ontologies contain many intricacies, including various optimisations that aren’t necessarily obvious to the ‘user’, who tends to treat the DL reasoner as a black box. For example, a DL reasoner might eliminate the need to look down a given hierarchy by applying a particular optimisation that is specific to that implementation.
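
For what it’s worth, the OWL API makes the first point easy to probe: any reasoner that ships an OWLReasonerFactory implementation (ELK and HermiT do, for instance) can be swapped in behind the same call. Here is a minimal sketch of timing classification; it is a crude wall-clock measurement only, not a proper benchmark (no JVM warm-up, no repeated runs):

```java
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.reasoner.InferenceType;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;

public class ReasonerTimer {
    /**
     * Classify the given ontology with the given reasoner implementation and
     * return the wall-clock classification time in milliseconds.
     */
    public static long timeClassification(OWLReasonerFactory factory, OWLOntology ontology) {
        OWLReasoner reasoner = factory.createReasoner(ontology);
        long start = System.nanoTime();
        // Ask the reasoner to compute the inferred class hierarchy.
        reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        reasoner.dispose();
        return elapsedMs;
    }
}
```

Running the same ontology through two different factories with this method is usually enough to see that ‘supports the same profile’ does not mean ‘performs the same’.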

In a way, the tendency for implementers is to use reasoners in anger, assuming it all just works; and of course, when things don’t, there is often a tendency to ‘patch’ whatever doesn’t seem to be working well. As with any new technology, a ‘naive’ implementation often works well initially, then things go from bad to worse and more and more energy is spent ‘patching’ issues (in this case, classification times). I think a better approach is to spend time understanding the technology and its limitations. In the case of implementations that support DL reasoners, adding an ‘OWL API’ interface to plug in various DL reasoners is an elegant engineering design, but it does not mean that the underlying DL reasoner can be used as a black box. I think that if you rely on DL classifiers for your core product functionality, you might find it useful in the longer run to spend more time understanding how DL reasoners work. An analogy that comes to mind is how ORM technologies (e.g. Hibernate) are portrayed as ‘bad’ or ‘broken’ by naive developers who treat the ORM as an interface only, and who do not spend enough time understanding the strengths of the ORM technology and, more importantly, the relational database that the ORM ultimately uses to store the data. If someone told me they were using Hibernate as an interface to store their data in both a MySQL backend and a PostgreSQL backend, so they could get better performance by querying MySQL for one use case and PostgreSQL for others, I could only say that sounds like a horrible fudge and wish them luck! So, to avoid going down the unhelpful path of:

‘working okay’ → ‘broken’ → ‘patch’ → ‘working okay’ → ‘broken again’ → ‘patched again’ → ‘works okay for now’…

it might be useful to spend time understanding the technology.

Here are my thoughts on how this could be done:

  •  Avoid using ontology design patterns or OWL constructs or features that are already known to cause issues, e.g. disjunctions, disjoints, reciprocal relationships (e.g. Arm has_part some Hand and Hand is_part_of some Arm).
  •  Before attempting to adopt a new modelling pattern, run a few scalability tests to gather metrics.
  •  If possible, identify more than one way to model the ontology to satisfy a given use case, and measure the same metrics for each option.
  •  Ideally, build a ‘test harness’ that can generate ontologies of incremental size based on a given pattern. The resulting ontologies can then be classified using various DL classifiers to identify scalability issues (a minimal sketch of such a harness follows this list). Of course, the key here is the ‘pattern’, not the ‘size’ of the ontology.
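
To illustrate that last point, here is a minimal sketch of the kind of harness I have in mind, again using the OWL API. The generated ‘pattern’ (an EL-style existential restriction, optionally paired with a universal restriction) is only a placeholder; the whole point is to substitute whichever pattern you are evaluating and plot pattern count against classification time for each reasoner:

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.InferenceType;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;

public class PatternScalingHarness {

    /** Build a synthetic ontology that repeats one modelling pattern n times. */
    static OWLOntology buildOntology(int n, boolean addUniversalRestriction)
            throws OWLOntologyCreationException {
        OWLOntologyManager mgr = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = mgr.getOWLDataFactory();
        OWLOntology ont = mgr.createOntology();
        String ns = "http://example.org/harness#";
        OWLObjectProperty hasPart = df.getOWLObjectProperty(IRI.create(ns + "hasPart"));
        OWLClass root = df.getOWLClass(IRI.create(ns + "Thing0"));

        for (int i = 1; i <= n; i++) {
            OWLClass c = df.getOWLClass(IRI.create(ns + "Thing" + i));
            OWLClass part = df.getOWLClass(IRI.create(ns + "Part" + i));
            // Baseline pattern: EL-style existential restriction.
            mgr.addAxiom(ont, df.getOWLSubClassOfAxiom(c,
                    df.getOWLObjectIntersectionOf(root,
                            df.getOWLObjectSomeValuesFrom(hasPart, part))));
            if (addUniversalRestriction) {
                // Variant pattern under test: add a universal restriction as well.
                mgr.addAxiom(ont, df.getOWLSubClassOfAxiom(c,
                        df.getOWLObjectAllValuesFrom(hasPart, part)));
            }
        }
        return ont;
    }

    /** Classify ontologies of increasing size and print size vs. time. */
    public static void run(OWLReasonerFactory factory, boolean variant)
            throws OWLOntologyCreationException {
        // Adjust the sizes to taste; the interesting part is the trend, not the numbers.
        for (int n = 1_000; n <= 100_000; n *= 10) {
            OWLOntology ont = buildOntology(n, variant);
            OWLReasoner reasoner = factory.createReasoner(ont);
            long start = System.nanoTime();
            reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY);
            long ms = (System.nanoTime() - start) / 1_000_000;
            reasoner.dispose();
            System.out.println(n + " pattern instances -> " + ms + " ms");
        }
    }
}
```

Running the baseline and the variant across a couple of reasoners and plotting the output is exactly the kind of graph I would like to see published.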

So how does this relate to SNOMED CT? In my previous post, I mentioned the tendency of naive ontology engineers to over-engineer their models by adding ‘cool’ OWL features. I referred to a proposal to use universal restrictions alone in the SNOMED CT Substance hierarchy to satisfy a use case, and why we might want to be careful when introducing such new constructs. I noted that one side effect might be that the ontology becomes hard to debug. Now imagine a case where an ontology takes 10 hours to classify and you notice some ‘inconsistencies’ that need debugging; the only way you’ll know whether you’ve fixed the issue is to save the changes and reclassify, which will take 10 more hours! On a more encouraging note, I found out that we finally have some initial metrics on classification times in Snorocket (the DL reasoner that IHTSDO predominantly uses) when a large number of universal restrictions are added to the ontology. And the good news is that the universal restrictions do not seem to add much overhead to the classification times! Now, what I’d love to see is the pattern used to generate the data and the corresponding metrics, plotted on a graph and published somewhere for everyone to see! Even better if someone gets around to quantifying the complexity in big-O terms!

Of course, all this might sound like wishful thinking, and the more academic world might be unconvinced that there is a real research question in here. I realise this sort of work falls somewhere between what the logicians consider of interest and what the implementers consider of value. I mean, most DL reasoner comparisons focus on the expressiveness of the ontology and its size, not on whether there was one negation axiom or 1,000 negation axioms in an ontology of 1,100 axioms! After all, the addition of a single negation axiom is enough to push the ontology into a more expressive DL profile. For the implementer, it perhaps makes more sense to just use the DL reasoner and patch issues as they arise. However, I know that in the past (many years ago) I created a test harness to do this kind of testing and identify scalability metrics based on ontology design patterns. More recently I came across work by a research group at the University of Graz doing something similar, so I guess that means I am not entirely barking mad. I haven’t had a chance to look at this work yet, because I’ve been far too busy recently. I’ll also post some of my past work in this area when I get around to digging it up. I think an open question here is whether IHTSDO should try to perform such investigations before adopting new modelling patterns and, if so, whether there is value in investing in or collaborating on the creation of such a test harness!

End of rant… and it’s time to shut the laptop now, since the flight is about to land in London!