Why are SNOMED CT codes (Identifiers) meaningless numbers?
As a team of terminology and informatics experts supporting legacy systems (and their migration), we often come across a few common themes:
- We love ICD, Read, [insert your favourite coding standard here] cos we know what the code means! Why are SNOMED CT identifiers just meaningless numbers?
- Read or xxxx was such a great system from the past, me and my users/devs could enter a code directly — cos we knew the code for asthma, pneumonia, etc. Can’t you just keep Read/PBCL going? If only NHS Digital did not discontinue the Read Codes!!
This is the third part of our SNOMED CT: An Introduction series (check out part 1 and part 2 here). This article explains why SNOMED CT uses seemingly meaningless codes (identifiers) that are just numbers. To understand why this happened, you need to understand the history of SNOMED CT and one of its indirect precursors, the Read Codes.
> TLDR (or TL;DR)
The elegance of the Read codes was their simplicity and, like all elegantly simple systems, their downfall is also this simplicity. Read codes carry meaning in codes, but the issue with carrying meaning in `codes` is that soon you run out of `space (computational)` to fit the right thing in the right place. For those of you who are mildly technical, this is similar to the transition the Internet is going through at the moment with IP addresses. We can no longer fit the number of available machines (those pesky IoT devices and phones) into the IPv4 address space, so we need to move to IPv6, no matter how painful this is.
Background
For those of you who aren’t from the UK and wonder what on earth Read Codes are, they are one of the precursor coding systems that eventually became SNOMED CT. For those who are too busy to read the whole story of coding systems and their evolution from back in the 1990s to SNOMED CT, here is a potted summary:
When the idea of coding was kicked off, James Read, who popularised the notion, created what are called the Read Codes in the UK. The original Read Codes were 4 bytes long, so those were technically Version 1. When the Read Codes were nationally adopted, to fit more `codes`, they were turned into the 5 byte version, which became Read Codes V2. A few years after national adoption, the Read Codes were enhanced/improved to create their successor, called `NHS Clinical Terms` (Version 3). These are commonly referred to as `CTV3`, also known as version 3 of Read Codes. Anyway, long story short, it is `Clinical Terms` that were merged with SNOMED RT (yes, RT stands for Reference Terminology) and to recognise the contribution of CTV3, the newly merged standard was called `SNOMED CT` as we currently know it. If you want a more elaborate history, then check out SNOMED International’s education material.
It is just a code aka an identifier!
It is useful to remember that most `coding systems` have a notion of:
- Identifier — the alphanumeric or numeric identifier that is used to identify the thing (similar to database ids)
- Rubric/Label — the human readable phrase (name) given to the thing
When we say `Read Codes`, people could be referring to either the `rubric` or the actual `identifier` depending on the context. This is because there is some `meaning` hidden/packaged into the identifiers in Read Codes. The Read V2 hierarchy is embedded in the code, e.g. A1... is a descendant of A...., A11.. is a descendant of A1... and so on. Here is a sample hierarchy:
Read Codes are five characters in length, composed of combinations of upper and lowercase English alphabet characters and single digit numbers (i.e. A-Z,a-z, 0–9).
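To make the "hierarchy lives inside the code" idea concrete, here is a minimal Python sketch. It assumes the shape described above (five characters, significant characters first, padded with `.`). The function names are ours, purely for illustration, and are not part of any Read Codes tooling.

```python
import re

# Assumption: a Read V2 code is exactly 5 characters, made of significant
# characters (A-Z, a-z, 0-9) followed by '.' padding, e.g. "G20..".
READ_V2_PATTERN = re.compile(r"^[A-Za-z0-9]{1,5}\.{0,4}$")


def is_read_v2_code(code: str) -> bool:
    """Check the 5-character, dot-padded shape described above."""
    return len(code) == 5 and READ_V2_PATTERN.match(code) is not None


def level(code: str) -> int:
    """Depth in the hierarchy = number of non-padding characters."""
    return len(code.rstrip("."))


def ancestors(code: str) -> list[str]:
    """Derive every ancestor purely from the code, nearest parent first."""
    stem = code.rstrip(".")
    return [stem[:n].ljust(5, ".") for n in range(len(stem) - 1, 0, -1)]


print(ancestors("A11.."))  # ['A1...', 'A....']
```

Notice that no lookup table is needed: the parent of a code is just the code with its last significant character replaced by a dot.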
This simple system has some advantages, for example:
- You can look at the starting letter or a few letters and guess what it might be about
- You can also guess the position of the concept in its hierarchy, based on the number of significant (non-padding) characters in its `code`. For comparison, here are two codes, with their corresponding SNOMED CT equivalents. By looking at the two Read Codes you can tell they are related, but you couldn’t draw the same inference by looking at the corresponding SNOMED CT codes (identifiers).
- This meant that for the naive SQL based systems, one could write a query like:
`select * from PATIENT_EPISODE_TABLE where ENTRY_CODE like 'G2%'`
- To retrieve all patients with Hypertension, you could write a query over a code list, shown here in shorthand rather than strict SQL:
`select * from PATIENT_EPISODE_TABLE where ENTRY_CODE like G2..., G20..%, G24.. to G2z`
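The prefix trick above can be sketched end to end with Python's built-in `sqlite3` module. The table and column names come from the query above; the patient ids and sample rows are made up for illustration. One assumption worth flagging: Read codes are case sensitive, and SQLite's `LIKE` is case insensitive for ASCII by default, so this sketch compares a `substr()` prefix instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PATIENT_EPISODE_TABLE (PATIENT_ID INTEGER, ENTRY_CODE TEXT)")
# Hypothetical sample rows, including a lowercase code that must NOT match.
conn.executemany(
    "INSERT INTO PATIENT_EPISODE_TABLE VALUES (?, ?)",
    [(1, "G20.."), (2, "G2..."), (3, "H33.."), (4, "g2...")],
)


def episodes_under(conn, prefix):
    # substr() with the default BINARY collation compares case-sensitively,
    # which matters because Read codes distinguish upper and lower case.
    rows = conn.execute(
        "SELECT PATIENT_ID, ENTRY_CODE FROM PATIENT_EPISODE_TABLE "
        "WHERE substr(ENTRY_CODE, 1, ?) = ? ORDER BY PATIENT_ID",
        (len(prefix), prefix),
    )
    return rows.fetchall()


hypertension_rows = episodes_under(conn, "G2")
print(hypertension_rows)  # [(1, 'G20..'), (2, 'G2...')]
```

This only works for as long as every relevant code really does sit under the same prefix, which is exactly the assumption the rest of this article picks apart.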
Limitations
However, this simplistic system has limitations. These limitations are computational limits and not something that humans have enforced. Following these rules means:
- No more than 62 children (26 uppercase letters + 26 lowercase letters + 10 digits) can be modelled under any node.
- The maximum number of levels within any hierarchy is five.
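The two limits above are simple arithmetic, sketched below in Python. The constant names are ours. The interesting part is that the total number of possible codes is large; what bites in practice is the cap of 62 slots under any single parent.

```python
import string

# The 62 symbols allowed in each position of a Read V2 code.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits
MAX_LEVELS = 5  # one hierarchy level per character position

max_children_per_node = len(ALPHABET)
# Total codes available if every node at every level were completely full.
total_capacity = sum(len(ALPHABET) ** depth for depth in range(1, MAX_LEVELS + 1))

print(max_children_per_node)  # 62
print(total_capacity)         # 931151402
```

So the scheme could in theory hold over 900 million codes, but only if content happened to spread evenly across the tree. Clinical content does not: busy parents fill their 62 slots while most of the tree stays empty.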
With these known limitations, the Read codes were adopted nationally across all primary care systems in the 1980s. While you might laugh now, remember these were days when computer hardware specifications were counted in `bits` and `bytes`, and `kilobytes` were the equivalent of today’s `gigabytes`. So being able to parsimoniously store and transfer coded clinical data was a pressing need of the day! Also, the original Read Codes were 4 bytes (aka 4 characters) long, so in anticipation of national needs, they pushed the limit to 5 bytes!
Fast forward to the 2000s and early 2010s, new Read V2 codes that were being created did not fit into this nice ruleset for some hierarchies. You can imagine things like `drugs` and `diseases` could clearly explode through these limitations, given how many of them are around.
It was also at this time that NHS Digital, to their credit, were developing SNOMED CT (at that time in collaboration with the College of American Pathologists) and started adding more content into SNOMED CT as the primary reference terminology. To ensure backwards compatibility with existing primary care systems, each time a new concept was added to SNOMED CT, it was also added to Read Codes. However, with the `limits` on the number of concepts that could legitimately fit into an `identifier`, newly created Read codes were increasingly added (i.e. placed) in inappropriate places in the hierarchies.
You can clearly imagine this causing issues: degrading findability and browsability, and complicating reporting. So yes, for 15 years Read based analysis could rely on a simple rule:
- To retrieve all patients with Hypertension, you could write a query over a code list, shown here in shorthand rather than strict SQL:
`select * from PATIENT_EPISODE_TABLE where ENTRY_CODE like G2..., G20..%, G24.. to G2z`
But new cancers/diseases and variants are being discovered every month. To fit these new entries in the right part of the hierarchy, we need to have `space` to fit the new concepts under `B`. Given the limitations above, there was no more `space` to fit newly created concepts, so we would have to place them in other parts of the hierarchy. The moment you do that, you break the `meaningful code` principle, since your `relevant` codes could be in a totally different hierarchy.
Here is a snippet of the current number of child concepts under selected but popular Read hierarchies. It reports the Read V2 codes with the maximum possible number of child nodes (n=62), plus those Read Codes approaching that maximum. Notice how several clinically relevant concepts related to monitoring of major diseases like diabetes, cardiac disease, biochemical tests/screening, referral and nervous system symptoms are either already `out of space` or rapidly approaching that limit.
Read drug codes are excluded in the above snippet and including them would add dozens more! So bottom line, there isn’t enough `space` left in the identifier namespace to fit more content into Read in a meaningful way! A much longer version of this table and the limits on existing Read hierarchies has been compiled by my colleague Malcolm Duncan. In fact, it is his original analysis that provided the inspiration for this article.
So while good in its time (the 1980s), by the early 2000s Read V2 imposed increasing penalties on any host EMR system’s usability, utility and ultimately safety. Meanwhile the constraints on new content addition to Read V2 were clinically, technically and politically unworkable. Some of our team who worked in NHS Digital at that time recollect some of the pragmatic `hacks` that were required to continue publishing Read Codes before their eventual retirement in April 2016.
SNOMED CT Codes (Identifiers)
Let me put this into perspective for you. In comparison to Read Codes, SNOMED CT has identifiers that can be anywhere between 6 and 18 digits long, so they fit in a 64 bit integer. To the computer scientists, the maximum `size/value` for a signed 64 bit integer is `9223372036854775807`, a 19 digit number, so every identifier of up to 18 digits fits. For the nerds reading, that is 9,223,372,036,854,775,807, in words 9.2 quintillion, or ~9.2 million trillion!!
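If you want to check that claim yourself, here is a quick sanity check in Python (the variable names are ours). It confirms that the largest signed 64 bit integer has 19 digits, and that any value of up to 18 digits sits comfortably below it.

```python
INT64_MAX = 2**63 - 1          # largest signed 64-bit integer
MAX_SCTID_DIGITS = 18          # upper bound on identifier length quoted above
largest_18_digit_value = 10**MAX_SCTID_DIGITS - 1

print(INT64_MAX)                            # 9223372036854775807
print(len(str(INT64_MAX)))                  # 19
print(largest_18_digit_value <= INT64_MAX)  # True
```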
Fun fact: if I were creating 500 SNOMED CT codes (identifiers) a second, and for each second I went back in time, I would end up roughly 585 million years in the past before running out of identifiers! That is long before the dinosaurs, which only turned up a few hundred million years later.
Nerd Note: There is a reason why I chose 500 identifiers per second. The effective `ids` that are used up when a new concept is added to SNOMED CT is ~5 (please get in touch if you want to know why). So in reality you could be reading that maths above as `100 SNOMED CT concepts per second`…
Due to some other complex rules, namespaces, checksums and other conventions, the effective number of SNOMED CT codes we could create drops quite a bit, but fear not, we should have enough there before we run into the `insufficient space` problem that Read Codes ran into. For a sense check, the current (June 2022) UK Drug extension has 467,949 concepts (active and inactive together), all of which fit nicely in SNOMED CT’s identifiers. In comparison, the final release of Clinical Terms (CTV3) published in April 2018 had 332,112 concepts! Note, I am deliberately using CTV3 for reference instead of Read Codes V2 because CTV3 was a superset of Read Codes V2, which was retired in April 2016.
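For the curious, here is a rough Python sketch of what those conventions look like inside an identifier. It assumes the published SCTID layout: the last digit is a Verhoeff check digit, the two digits before it are the partition identifier, and extension identifiers carry a 7 digit namespace before that. The function names are ours, and this is an illustration rather than a full validator.

```python
# Standard Verhoeff tables: dihedral-group multiplication and position permutation.
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
# Second partition digit: which kind of component the identifier names.
_COMPONENTS = {"0": "concept", "1": "description", "2": "relationship"}


def has_valid_check_digit(sctid: str) -> bool:
    """Run the Verhoeff check over all digits, rightmost digit first."""
    checksum = 0
    for position, char in enumerate(reversed(sctid)):
        checksum = _D[checksum][_P[position % 8][int(char)]]
    return checksum == 0


def describe_sctid(sctid: str) -> dict:
    """Split an identifier into the parts that carry convention, not meaning."""
    if not (sctid.isascii() and sctid.isdigit() and 6 <= len(sctid) <= 18):
        raise ValueError("expected 6 to 18 digits")
    partition = sctid[-3:-1]
    is_extension = partition[0] == "1"  # long format carries a namespace
    return {
        "component": _COMPONENTS.get(partition[1], "other"),
        "namespace": sctid[-10:-3] if is_extension else None,
        "check_digit_ok": has_valid_check_digit(sctid),
    }


print(describe_sctid("22298006"))
# {'component': 'concept', 'namespace': None, 'check_digit_ok': True}
```

So the digits do follow rules, but those rules only tell you what kind of thing the identifier names and who issued it. Nothing in the number tells you what it means clinically.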
Things to consider
- If you are considering migrating away from a Read based system to something else, please avoid the temptation to extend Read (hopefully that is the biggest take away message)!!
- If you are building a modern EHR system, please avoid such elegant but simplistic design for any identifier system you might consider.
- If you are building a modern EHR system in the UK (and wider), always check whether your system can support SNOMED CT’s 18 digit identifiers.
- If you are dealing with legacy systems or data, then yes, you might still need to cater to Read codes. However, watch out for the caveats described earlier, which result in concepts sitting in inappropriate hierarchies.
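On the 18 digit point, one pitfall we would watch for is any layer that quietly turns identifiers into floating point numbers. The Python sketch below shows the effect with an 18 digit value chosen only for illustration. Whether your own stack does this anywhere is something you would need to check.

```python
sctid = 999000011000000103  # 18-digit example value, used only for illustration

# A 64-bit float has a 53-bit significand, so it cannot hold every 18-digit integer.
round_tripped = int(float(sctid))
print(round_tripped == sctid)    # False

# Keeping the identifier as an integer (or a string) preserves every digit.
print(int(str(sctid)) == sctid)  # True
```

A safe default is to store and exchange identifiers as 64 bit integers or as strings, and never as floating point values.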
If all the above sounds too complicated and you simply just want to use all these code systems doing lookups and crosswalk between them, then please check out Termnexus, our Terminology Server. If you would like to migrate away from legacy content that is based on Read Codes (or some variants/extensions of it) to use SNOMED CT, then our terminology/informatics experts could help. You could leverage Termnexus and our expertise, to take away the pain of mapping from legacy content to SNOMED CT!
Want to read more?
Check out the first two parts of this article below!
Part 1 – https://termlex.com/snomed-ct-an-introduction-part-1/
Part 2 – https://termlex.com/snomed-ct-an-introduction-part-2/
Other Resources
- Termlex SNOMED CT Authoring, Maintenance and Migration services: https://termlex.com/snomed-ct-content-authoring-migration-services/
- Termlex SNOMED CT Implementation and Content Management services: https://termlex.com/snomed-ct/
- NHS Digital — Read codes: https://digital.nhs.uk/services/terminology-and-classifications/read-codes