Saturday, February 17, 2018

If you only have a hammer ...

... everything looks like a nail.
And if you only know XPT format, everything looks like a table...

Why this comparison? The trigger for this was a recent proposal of the CDISC QRS ("Questionnaires, Ratings and Scales") team to add a table to SDTM, a so-called "Trial Lookup Table" ("TL" domain), containing metadata, like information about instruments. They provided a number of examples, one with codes assigned to questions on a questionnaire, one about questions on a questionnaire sharing an identical set of possible answers ("scale"), and one about "logically skipped items". The latter sounds very much like CDISC-ODM "skip questions", but we are unfortunately not allowed to use either ODM or its "tabular" equivalent Dataset-XML for submissions.

Essentially, the table ("domain") they propose is nothing other than an "entity-attribute-value" model, a type of table that can contain almost anything, as it consists of key-value pairs.

For example, for the simple case that a question can have answers ranging from 0 to 4:

 The proposed table is:

That is, they need 5 rows to explain to the reviewer that for that specific question (specified by TLVAR1, TLVARVAL1, ...) the possible values are 0 to 4.
That there is an ODM file with the study design that states exactly this in a much simpler way does not occur to the people who proposed this, as all they know is ... XPT tables!
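
To make the contrast concrete, here is a minimal Python sketch. The row layout, the attribute name "ALLOWED_VALUE" and the question identifier "Q01" are assumptions made up for illustration; they are not the actual columns of the proposed TL domain. It only shows that five key-value rows carry the same information as one codelist entry.

```python
# Hypothetical entity-attribute-value rows: one row per allowed answer.
# The keys and the question ID "Q01" are illustrative assumptions.
eav_rows = [
    {"question": "Q01", "attribute": "ALLOWED_VALUE", "value": str(v)}
    for v in range(5)
]


def collapse_to_codelists(rows):
    """Group the allowed values per question into one codelist-like entry."""
    codelists = {}
    for row in rows:
        if row["attribute"] == "ALLOWED_VALUE":
            codelists.setdefault(row["question"], []).append(row["value"])
    return codelists


codelists = collapse_to_codelists(eav_rows)
print(len(eav_rows), "rows collapse into", codelists)
```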

Fortunately, we do have a few open-minded volunteers within CDISC who think further than tables ("the earth is not flat"), so Sally Cassels of Next Step Clinical Systems immediately demonstrated how this should be done in a much simpler way in the define.xml. For example, for the "scale sample", this information simply goes into a "ValueList" in the define.xml. The human-readable presentation of this (I do want to shield you from the XML although it is extremely simple) is:

and for the scale values (0-4) themselves in the codelist (which was already in the ODM in the study design):

So how do we educate these people, who can only think in terms of tables, that the clinical research world is not flat either? I would propose that every member of any of the SDTM development teams must first attend a define.xml course (where we explain such things very well), before they come up with a "yet-another-table" proposal.

And if you now say: "well Jozef, then you need to take an SDTM training too", I can say "I did"!

Saturday, January 13, 2018

Why changing "Submission Value" into "Preferred Term" is a bad idea

The CDISC-CT team recently published a new Controlled Terminology Package 33 for public review. At the same time, a proposal for changing the column header from "CDISC Submission Value" to "CDISC Preferred Term" was published:

In this blog, I will explain why this is a bad idea and why CDISC members should protest against it.

You can already find my own protest here:

First of all, we need to take into account that CDISC controlled terminology is based on tradition rather than on science. CDISC controlled terminology is a set of "lists", without any relations between the terms. CDISC members can ask to add terms based on their own, local usage of a term.
For example, last autumn, I asked to add "centimeter mercury column" to the "UNIT" list, as in the country I originate from (Belgium) blood pressure is measured (by tradition) in "centimeter mercury column" rather than in "millimeter mercury column". So CDISC added it to the list. What is however not visible from that list is the relation between "centimeter mercury column" and "millimeter mercury column". As a human, I know that 1 cmHg = 10 mmHg. But how does my computer know that? Does CDISC-CT tell it how to convert "pounds per square inch" into "millimeter mercury column"? If CDISC allowed UCUM notation, such unit conversions could easily be automated. And how does my computer know that (for CDISC codes) "SEVERE" is worse than "MODERATE", which is worse than "MILD"? None of this is part of CDISC-CT.
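
As a rough illustration of what such automation could look like, the Python sketch below converts between pressure units by going through pascal. The lookup table with rounded conversion factors is an assumption made for this example; a real UCUM implementation parses unit expressions such as `mm[Hg]`, `cm[Hg]` and `[psi]` rather than listing them one by one.

```python
# Pascal equivalents for a few pressure units, keyed by UCUM-style codes.
# This small table with rounded factors is an illustrative assumption,
# not a UCUM library.
PASCAL_PER_UNIT = {
    "mm[Hg]": 133.322,   # millimeter mercury column
    "cm[Hg]": 1333.22,   # centimeter mercury column
    "[psi]": 6894.757,   # pound per square inch
}


def convert_pressure(value, from_unit, to_unit):
    """Convert a pressure value from one unit to another via pascal."""
    return value * PASCAL_PER_UNIT[from_unit] / PASCAL_PER_UNIT[to_unit]


print(convert_pressure(12, "cm[Hg]", "mm[Hg]"))
print(round(convert_pressure(1, "[psi]", "mm[Hg]"), 2))
```

The point is not the table itself but that a machine-readable unit notation makes the relation between units computable, which a plain list of terms does not.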

Also, CDISC is publishing codelists for things it has no authority in. For example, it publishes "lists" of microorganisms (codelist MICROORG), whereas specialists in the field have developed taxonomies (for example NCBI) and also SNOMED-CT has a full taxonomy of microorganisms: 

The NCBI and SNOMED-CT taxonomies of microorganisms are based on science, whereas the CDISC "list" of microorganisms is based on allowing members to add terms according to how they traditionally name a microorganism locally. In the CDISC-CT list of microorganisms, you will not find any information on how these organisms are related to each other - it is just a list.

There are some cases where these "lists" based on tradition make sense, for example for "vital signs test code" (VSTESTCD/VSTEST), although this is also already covered by a scientific taxonomy developed by LOINC:

We indeed need to realize that LOINC is not yet used in every hospital, although it is mandated to be used in electronic health records in many countries and by the US "Meaningful Use" program, so such a VSTESTCD codelist can be used as a temporary solution, but it should not be forever.

So, the proposal to change the column header from "CDISC Submission Value" to "CDISC Preferred Term" is suggesting that in the whole clinical research process (and thus not only in submissions to regulatory authorities) we should start using terms that are based "on tradition", and forget about all the science. So it suggests that instead of writing "Glucose" in our protocols, we should start writing "GLUC", or instead of writing "measure the number of Metamyelocytes/100 leukocytes" (LOINC code 28541-1) in our protocols, we should put "BASOMM" as that is the CDISC "preferred term" and then also add a "method" from the "METHOD" CDISC codelists, and add additional terms from other CDISC-CT lists to complete the description of "measure the number of Metamyelocytes/100 leukocytes, use LOINC code 28541-1".

Changing the designation "CDISC Submission Value" into "CDISC Preferred Term" would be a very dangerous evolution. It would isolate us further from other standardization organizations with which we overlap in application area. It would send the message "We don't need you" to these SDOs.
And it would mean that CDISC completely "says goodbye" to the use of concepts that are based on science.

A second major problem is that CDISC controlled terminology is tightly bound to the 30-year-old, obsolete SAS Transport 5 format (XPT format), with its 8-character and 40-character limitations. This format is only used within CDISC; no other industry worldwide is using it anymore. For example, CDISC "test codes" (--TESTCD) are limited to 8 characters only, which must be ASCII characters, and may not start with a number. Test names (--TEST) are limited to 40 characters and must be ASCII characters. This has led to some idiotic test codes and names, such as "Corpuscular HGB Conc Distribution Width" as "test name" for "test code" "CHDW" (NCI-ID C139068), where the word "Concentration" needed to be shortened to "Conc" because of the 40-character limitation. Also, "CHDW" is meaningless as a mnemonic, due to the 8-character limitation for --TESTCD.

So, if this proposal were accepted, we would be pinning everything we do in terminology, whether it is in submissions or in non-regulated research, to the outdated XPT format. This means that everything that is "CDISC preferred"
  • is limited to 8 characters when it is a code
  • is limited to 40 characters when it is a name or description
  • is not allowed to have any characters outside the ASCII range, so no Spanish characters like "ñ", "ü", "á", no German characters like "ß", "ü", no Norwegian characters like "å" or "æ", no Japanese, no Chinese, no Arabic, no Korean, no ...
  • may not start with a number
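
The four restrictions above can be written down as a small check. The following Python sketch is only an illustration of the rules as listed here; the function name `xpt_violations` is hypothetical and the code is not taken from any CDISC tool.

```python
def xpt_violations(value, is_code):
    """Return the XPT-derived rules from the list above that `value` breaks."""
    problems = []
    max_len = 8 if is_code else 40  # codes: 8 characters, names: 40 characters
    if len(value) > max_len:
        problems.append(f"longer than {max_len} characters")
    if not value.isascii():
        problems.append("contains non-ASCII characters")
    if value[:1].isdigit():
        problems.append("starts with a number")
    return problems


# The shortened test name from the text fits, the unshortened one does not.
print(xpt_violations("Corpuscular HGB Conc Distribution Width", is_code=False))
print(xpt_violations("Corpuscular HGB Concentration Distribution Width", is_code=False))
print(xpt_violations("Größe", is_code=True))
```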

Do we really want this? Do we really want to say to people who do not submit to regulatory authorities, but do want to use CDISC standards, that they should keep away from LOINC, from UCUM, from SNOMED-CT and NCBI coding, and use CDISC terms instead that
  • are used nowhere else in the world
  • are based on tradition
  • are not based on science at all
Do we want to say to them that their codes should be no longer than 8 characters, and that non-ASCII characters are not allowed as these do not comply with "CDISC preferred"? Should we force them to implement the limitations of the XPT format in their systems? Most probably, they do not use SAS-XPT at all.
This CDISC-CT proposal indeed looks like "megalomania" to me.

It is already bad/sad/mad enough that for submissions, we are obliged to use controlled terminology that is not based on science, and now the CDISC-CT team wants to extend this to everything we do in clinical research. Have they really gone mad?

If you agree and/or feel the same way, please comment directly to CDISC on their JIRA "issue" site: You will need an account, but if you don't have one, you can create one using!default.jspa. Please take into account that this account is not the same as your "CDISC members" account.

Your comments here are of course always welcome!

Sunday, December 10, 2017

SDTM and CDISC-CT: fit for e-Source?

The title of this post should have been something like "CDISC SDTM and Controlled Terminology post-coordinated versus pre-coordinated", but then most people would probably have no idea what I am talking about. So a little bit of explanation first.

CDISC SDTM uses "post-coordinated" controlled terminology. This means that controlled terms are combined "as needed", so that a description can be built "as required". The consequence is that the result is dynamic, the ontology is "what you see", and any combination of terms is possible. So essentially, the combination of e.g. LBTESTCD=Albumin with LBSPEC=Blood and LBMETHOD=dipstick is valid, although you can't test albumin in blood using the dipstick method (that method is only available for albumin in urine).
"Post-coordination" has its advantages. It brings (some) order into chaos. It is especially useful when it is not known in advance (or cannot be envisaged) which tests will be performed.

Most systems in healthcare use "pre-coordination". This means that all possible combinations are assembled in advance and, when meaningful, obtain a single code. So not all combinations are possible. An example of such a system is LOINC. So in LOINC, you won't find a code for "albumin in blood measured using dipstick", but you will find a code (1751-7) for "albumin in serum or plasma measured quantitatively as mass/volume". Pre-coordinated codes are (must be) precise: each code should uniquely describe a term (a test in this case).
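
The difference between the two approaches can be sketched in a few lines of Python. The term lists below are toy assumptions containing only the values mentioned in the text; they are not actual CDISC codelists, and the dictionary is not the LOINC database.

```python
from itertools import product

# Post-coordination: terms come from independent lists, so every
# combination is accepted, whether it is meaningful or not.
ANALYTES = ["Albumin"]
SPECIMENS = ["Blood", "Urine"]
METHODS = ["dipstick"]
post_coordinated = set(product(ANALYTES, SPECIMENS, METHODS))

# Pre-coordination: only combinations that were assembled in advance
# exist, each under one single code (1751-7 is the example from the text).
PRE_COORDINATED = {
    "1751-7": "albumin in serum or plasma measured quantitatively as mass/volume",
}


def is_accepted_post(analyte, specimen, method):
    """True for any combination of known terms, meaningful or not."""
    return (analyte, specimen, method) in post_coordinated


def describe_pre(code):
    """Return the description of a pre-coordinated code, or None if it does not exist."""
    return PRE_COORDINATED.get(code)


print(is_accepted_post("Albumin", "Blood", "dipstick"))
print(describe_pre("1751-7"))
```

The first call accepts a combination that makes no sense clinically, which is exactly the weakness described above; in the pre-coordinated dictionary such a combination simply has no code.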

CDISC SDTM findings domains have been developed to bring "order in chaos". Essentially this means the paper world or the world where protocols do not precisely describe which tests need to be performed. For example, in the famous LZZT protocol we find the following tests defined: "Urinalysis: Color, Specific gravity, pH, Protein, Glucose, Ketones, Bilirubin, ". That's it. So not very precise. The problem with this is that each site can (and probably will) perform different tests. For example, for "glucose in urine", LOINC lists over 20 different tests (even when excluding all the "post" and "challenge" tests). When the data are then submitted, post-coordination is necessary, but the results will not be comparable between sites, studies and sponsors. Even the combination of LBTESTCD (essentially the analyte), LBSPEC (the specimen, e.g. "urine") and LBMETHOD does not at all guarantee a unique combination. So it is no wonder that the FDA recently mandated the use of LBLOINC, i.e. it requires (as of 2020) that the unique LOINC identifier is added as well.

The problem however is not limited to laboratory tests alone. For example, there has been a discussion on the CDISC wiki about the "ebola vital signs CRF", about how the important test "highest temperature in the last 24 hours" must be annotated for SDTM. Using SDTM, it cannot be done, as there is no way to define "in the last 24 hours".

The solution is however simple when using LOINC: the LOINC code 8315-4 "Body temperature 24 hour maximum" very exactly describes this test.

Note that the argument "pre-coordination could result in an explosion of new CT terms ..." is nonsense if CDISC finally allows LOINC to be used (it is not a problem in healthcare ...).
This means that our current SDTM findings variables are not always able to exactly describe tests, even when using post-coordination.

Nowadays, we see that research data are more and more extracted from electronic health records (EHRs) and hospital information systems (HIS), rather than collected separately (e-Source). There are even voices that say: "in 5 years from now, everything will be e-Source". Data from e-Source is almost always pre-coordinated, i.e. using pre-coordinated terminology like LOINC, SNOMED-CT, etc.
When e-Source data is used, and the data is submitted, the pre-coordinated terminology must be translated to post-coordinated terminology, which is arbitrary, ambiguous, and not always possible, as the "highest temperature in the last 24 hours" example clearly shows. For lab tests, we can use the LOINC tests 5792-7, 22705-8 and 25428-4 as an example: all three would be modeled in SDTM as LBTESTCD=GLUC, LBSPEC=URINE and LBMETHOD=TEST STRIP. One can only distinguish by looking at the results themselves and at the units used.
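
The information loss can be demonstrated with a small Python sketch. The three LOINC codes and the SDTM values are the ones named above; the dictionary itself and the function name are assumptions for illustration, not an official mapping.

```python
from collections import defaultdict

# The three LOINC codes from the text, all modeled as the same SDTM triple
# (LBTESTCD, LBSPEC, LBMETHOD).
LOINC_TO_SDTM = {
    "5792-7": ("GLUC", "URINE", "TEST STRIP"),
    "22705-8": ("GLUC", "URINE", "TEST STRIP"),
    "25428-4": ("GLUC", "URINE", "TEST STRIP"),
}


def reverse_mapping(forward):
    """Invert the mapping: list every LOINC code that yields a given triple."""
    backward = defaultdict(list)
    for loinc_code, triple in forward.items():
        backward[triple].append(loinc_code)
    return dict(backward)


candidates = reverse_mapping(LOINC_TO_SDTM)[("GLUC", "URINE", "TEST STRIP")]
print(candidates)
```

Because three distinct codes collapse into one triple, the original test cannot be recovered from the SDTM variables alone.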

Both examples "maximum temperature in the last 24 hours" and "glucose in urine by test strip" demonstrate that information loss is possible or even unavoidable. So, even when the test is exactly described by a pre-coordinated code (LOINC, SNOMED, …), we are forced to submit using a post-coordinated system with loss of information or test uniqueness.

This leads me to an important conclusion: the current SDTM is not fit for use with e-Source.
It is great for the paper world and for classic EDC where data is collected separately from the healthcare world.

How can we do better? Especially when the statement "everything will be e-Source in 5 years from now" becomes true.
In the past, I published an article "An Alternative CDISC-Submission Domain for Laboratory Data (LB) for Use with Electronic Health Record Data" in the "European Journal of BioMedical Informatics" (EJBI) where I proposed that, at least for laboratory data coming from e-Source, the typical LBTESTCD, LBTEST, LBSPEC and LBMETHOD are replaced by a set of variables that align with the 6 dimensions of LOINC.

However, this only provides a solution for laboratory data using LOINC. There are however more coding systems used in e-Source data. For example, for microbiology data, NCBI coding is often used. This means that when using e-Source, data (pre-coordinated) using NCBI coding must be translated to one or more of the SDTM variables in the SDTM domain, which uses its own CDISC controlled terminology, and with guaranteed loss of information, as NCBI is much more specific.
Essentially, all this means that we need an alternative "e-Source" domain for each of the existing SDTM findings domains. These new domains can be much simpler than the existing SDTM domains, as much of the information for which several variables are needed in the "classic" domains can now be in one single variable, the "test code". As these domains need to be "code system neutral", the core variables in these "e-Source" domains would be "test code", "code system" and maybe "test name". The latter is not even necessary, as it has a 1:1 relationship with "test code" and can easily be looked up automatically by computer systems, e.g. using one of the many RESTful web services from NLM, UMLS, NIH, HIPAA etc.
So for example, for the "e-Source LB" domain, the core variables would be:
"Study ID" to "Sponsor-Defined Identifier" and then "test code" and "test system", "original test result", "original result units" (using UCUM). The classic LBCAT, LBSCAT, LBSPEC, LBMETHOD can be removed, as they are all already included in the pre-coordinated "test code". Note that I avoid assigning variable names, as e.g. "LBTESTCD" would mean completely different things in the two variations of the LB domain. In the e-Source domain it would mean "the unique test code", whereas in the classic domain LBTESTCD is essentially misleading, as it specifies the analyte, and not the test (note that --TESTCD has a different meaning depending on the domain in classic SDTM).
In the "e-Source" LB domain, the first records in example 1 of the SDTM-IG (page LB-5) would look like:

Study Identifier | Subject ID | Sequence Number | Test Code | Code System | Original Result | Original Result Units (UCUM)

Using UCUM notation is important, as UCUM notation is almost always used in e-Source, and we don't want information loss or conversion errors. Even more, UCUM allows automated conversions (e.g. for the "standardized result"), using one of the RESTful web services available (NLM and our own one).
The next columns in the "e-Source" LB domain would then be "reference range indicators", and the "standardized results". The latter could then (at least for quantitative results) use e.g. the "LOINC proposed unit".
Similarly, for the Ebola "highest temperature in the last 24 hours", which cannot be exactly described at all in classic SDTM, the "e-Source" VS domain could contain a record like:

Study Identifier | Subject ID | Sequence Number | Test Code | Code System | Original Result | Original Result Units (UCUM)

Also here, VSCAT (and VSSCAT) are not used, as they are already covered by the test code 8315-4. In most cases, even VSPOS will be unnecessary (for e.g. blood pressure), as it is already included in the LOINC code.
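
As a minimal sketch of such a record, the Python dataclass below uses the seven columns listed above. The field names, the study and subject identifiers and the result value are hypothetical; only the LOINC code 8315-4 comes from the text, and "Cel" is the UCUM code for degree Celsius.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ESourceFinding:
    """One row of a code-system-neutral "e-Source" findings domain (illustrative)."""
    study_id: str
    subject_id: str
    sequence: int
    test_code: str
    code_system: str
    original_result: str
    original_units_ucum: str


record = ESourceFinding(
    study_id="STUDY01",         # hypothetical identifier
    subject_id="SUBJ-001",      # hypothetical identifier
    sequence=1,
    test_code="8315-4",         # LOINC "Body temperature 24 hour maximum" (from the text)
    code_system="LOINC",
    original_result="39.2",     # hypothetical value
    original_units_ucum="Cel",  # UCUM code for degree Celsius
)
print(record)
```

Note that no category, position or method field is needed: the pre-coordinated test code carries that information.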

As it is clear that the current SDTM is not fit for use with e-Source, we make a first proposal for a set of "e-Source" findings domains, using pre-coordinated coding systems (as is already used in e-Source), and using UCUM as much as possible for unit notation.
These "e-Source" domains are not meant to replace the "classic" SDTM domains, as these retain their value for the "classic" case where data is collected separately (paper, classic EDC). These "classic" domains can then only be deprecated when "everything is e-Source", so maybe in 5 years from now?
Please note that with this first proposal, I do not encourage the use of "tables" for regulatory submissions. On the contrary, in the longer term, we need to move to the submission of "biomedical concept" data points or "resources".
But that's another discussion.