Sunday, February 5, 2017

SDTM-IG v3.2 Conformance Rules v1.0 - First implementation experiences

This weekend, I started implementing the just-published "SDTM-IG v.3.2 Conformance Rules v.1.0" under the umbrella of the "Open Rules for CDISC Standards" initiative, an initiative of a number of CDISC volunteers (not a formal team) to implement CDISC conformance rules in a vendor-neutral, open (non-proprietary), free, machine-executable but also human-readable format. For this, the W3C open standard "XQuery" was selected, as it makes the rule implementations independent of the tool used, i.e. everyone can build tools that run XQuery and the published rules.

This is not new: the earlier published FDA, PMDA and CDISC-ADaM rules have also been implemented in XQuery (as far as they make sense) and can be downloaded and used by everyone.

The newly published SDTM-IG rules come as an Excel file. This is unfortunate, as this format does not allow the rules to be used directly in software: the rules need to be read (by a human), interpreted, and then "translated" into software. This is far from ideal, as it leaves a lot of "wiggle room" in the interpretation of the rules. Fortunately, however, the team published the rules with sets of pseudocode, with a "precondition" like "VISITNUM ^= null" (meaning: the value of VISITNUM is not null) and the rule itself, like "VISITNUM in SV.VISITNUM". So the rule can be read as "when VISITNUM is not null, then its value must be found in the SV dataset as a VISITNUM value".
This is a great step forward relative to the FDA conformance rules, which are "narrative text only" rules, and which sometimes seem to be just the result of the "CDISC complaint box" at the White Oak Campus.
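The precondition/rule pair quoted above can be sketched in a few lines. This is only a minimal Python illustration of the logic, not the XQuery published by the initiative. The record layout (each dataset as a list of dicts) and the function name `check_visitnum_in_sv` are assumptions made for this example, and the sketch compares against all SV.VISITNUM values without matching per subject, which the pseudocode does not specify.

```python
def check_visitnum_in_sv(records, sv_records):
    """Return the records violating: VISITNUM ^= null -> VISITNUM in SV.VISITNUM.

    Both arguments are lists of dicts keyed by variable name (an assumed,
    simplified stand-in for real SDTM datasets).
    """
    # All non-null VISITNUM values present in the SV dataset
    sv_visitnums = {
        r["VISITNUM"] for r in sv_records if r.get("VISITNUM") not in (None, "")
    }
    violations = []
    for rec in records:
        visitnum = rec.get("VISITNUM")
        # Precondition: the rule only applies when VISITNUM is not null
        if visitnum in (None, ""):
            continue
        # Rule: the value must occur as a VISITNUM value in SV
        if visitnum not in sv_visitnums:
            violations.append(rec)
    return violations


# Hypothetical sample data
sv = [{"USUBJID": "001", "VISITNUM": 1}, {"USUBJID": "001", "VISITNUM": 2}]
lb = [
    {"USUBJID": "001", "LBSEQ": 1, "VISITNUM": 1},
    {"USUBJID": "001", "LBSEQ": 2, "VISITNUM": 3},     # not in SV: violation
    {"USUBJID": "001", "LBSEQ": 3, "VISITNUM": None},  # precondition not met
]
violations = check_visitnum_in_sv(lb, sv)
```

Only the record with LBSEQ 2 is reported: the record with a null VISITNUM is skipped by the precondition rather than counted as a violation.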

It is now Sunday noon, and I have already implemented about 40 rules (there are 400 of them), so there is still a lot of work to do. But I would already like to start sharing my first impressions:
  • Not all rules are implementable. Some of the rules are currently not "programmable", as they require information that is not in the datasets or define.xml. An example is "The sponsor does not have the discretion to exclude permissible variables when they contain data" (rule CG0015). Essentially, this is about traceability back to the study design and collected data sets (both usually in CDISC ODM format). Maybe in the future the "Trace-XML extension", developed by Sam Hume, can help solve this.
    The Excel worksheet has a column "Programmable" (Y/N/C) and a column "Programmable Flag Comment", but I noticed that these are not always correct: I found rules that were stated to be non-programmable but which I think can be programmed, and vice versa.
  • Not all rules are very clear. Most of them are very clear, thanks to the "pseudo-code" that has been published, but sometimes, this is not enough. An example is:
    "Variable Role = IG Role for domains in IG, Role = Model Role for custom domains". Now, the "Role" is not in the dataset, and usually also not in the define.xml, as a "Role" is only necessary for non-standard variables, and is otherwise supposed to be the one from the IG or model. So what is the rule here? What is checked? I have no idea!
    In other cases, I needed to look into the "public comments" document to understand the details of the rule. It would have been great if the published document had also had a "rule details" column with extra explanation, including the answers to the questions from the public review period.
  • A good rule has a precondition, a rule, and a postcondition. The first two are present, but the postcondition is missing. A postcondition describes what needs to be done when the rule is obeyed or violated. In our case, this would normally be an error or warning message.
  • Good rules are written in such a way that they are easy to program. Rules like "VISIT and VISITNUM have a one-to-one relationship" are not ideal: they are better split into 2 rules, one stating something like "for each unique value of VISITNUM, there must be exactly one value of VISIT", and the other one stating "for each unique value of VISIT, there must be exactly one value of VISITNUM". This is easier to implement and, also very important, makes it possible to generate a violation message that is much clearer and more detailed. Also, from the description of some of the other rules, it is clear that the rule developers did not test (by writing code) whether they are easy to implement or not.
There were also a lot of things that I liked a lot:
  • The document does not distinguish between errors and warnings. The word "error" does not even appear in the worksheet. Good rules are clear and can be violated or not (and not something in between). Therefore, setting something to "warning" is never a good idea in rule making, as it usually has no consequences (with the exception of the yellow card in soccer maybe). The use of "Warning" in the FDA rules has generated a lot of confusion. For example, in the Pinnacle21 validation software, you get a warning when the "EPOCH" variable is absent in your dataset ("FDA expected variable not found"), but you also get a warning when it is present ("Model permissible variable added into standard domain"). So, whatever you do, you always get a warning on "EPOCH"!
  • More than in the FDA rules, the define.xml is considered to be "leading". For example rule CG0019: "Each record is unique per sponsor defined key variables as documented in the define.xml". This is not only a rule that can be implemented very well, it is also much better than the (in my opinion not entirely correct) implementation in the Pinnacle21 tool, which usually leads to a large number of false positives, as it completely ignores the information in the define.xml.
    But also here, some further improvement is possible. For example for rule CG0016 "... a null column must still be included in the dataset, and a comment must be included in the define.xml to state that data was not collected". It does not state where the comment in the define.xml must go (my guess: as a def:CommentDef referenced by the ItemDef for that variable, but that isn't said), or what the contents of the comment should be. So how can I implement this rule in software? Is the absence of a def:CommentDef for that variable sufficient to make it a violation?
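
Two of the points in the list above lend themselves to a short sketch: splitting the one-to-one rule into two directional checks, each with its own violation message, and checking record uniqueness against a list of key variables (as in rule CG0019). This is a minimal Python illustration, not the published XQuery. The function names, the list-of-dicts record layout and the sample data are assumptions, and the key variables are hard-coded here, whereas a real implementation would take them from the define.xml.

```python
from collections import Counter, defaultdict


def check_single_valued(records, source, target):
    """For each unique value of `source`, require exactly one value of `target`."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[source]].add(rec[target])
    return [
        f"{source}={src!r} has {len(tgts)} values of {target}: {sorted(tgts, key=str)}"
        for src, tgts in seen.items()
        if len(tgts) != 1
    ]


def check_unique_keys(records, key_variables):
    """Report key combinations that occur in more than one record."""
    counts = Counter(tuple(rec[k] for k in key_variables) for rec in records)
    return [
        f"{n} records share key {dict(zip(key_variables, key))}"
        for key, n in counts.items()
        if n > 1
    ]


# Hypothetical sample data
visits = [
    {"VISITNUM": 1, "VISIT": "SCREENING"},
    {"VISITNUM": 2, "VISIT": "WEEK 1"},
    {"VISITNUM": 2, "VISIT": "WEEK 2"},  # VISITNUM 2 maps to two VISIT values
]
# The "one-to-one" rule expressed as two separate directional rules
visitnum_to_visit = check_single_valued(visits, "VISITNUM", "VISIT")
visit_to_visitnum = check_single_valued(visits, "VISIT", "VISITNUM")

lb = [
    {"USUBJID": "001", "LBTESTCD": "GLUC", "VISITNUM": 1},
    {"USUBJID": "001", "LBTESTCD": "GLUC", "VISITNUM": 1},  # duplicate key
    {"USUBJID": "001", "LBTESTCD": "GLUC", "VISITNUM": 2},
]
duplicates = check_unique_keys(lb, ["USUBJID", "LBTESTCD", "VISITNUM"])
```

With this sample data only the VISITNUM-to-VISIT direction reports a violation, so the message can say exactly which VISITNUM value is ambiguous, which a single "one-to-one" check cannot easily do.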

In the software world, when there is an open specification, there usually is a so-called "reference implementation". This means that everyone is allowed to write their own implementation of the specification, but the results must be exactly the same as those generated by the reference implementation for a well-defined test set. Other implementations may add additional features, excel in performance, and so on.
Ideally, the source code of the reference implementation is open, so that everyone can inspect the details of the implementation.
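
The idea of checking an implementation against a reference implementation on a fixed test set can be sketched as follows. This is a hypothetical Python illustration only: the toy rule ("VISITNUM must not be null"), both implementations and the test cases are invented for this example and do not come from any published rule set.

```python
def differing_test_cases(candidate, reference, test_cases):
    """Return the names of test cases where the two implementations disagree."""
    return [
        name
        for name, records in test_cases.items()
        if candidate(records) != reference(records)
    ]


def reference_rule(records):
    # Toy reference: flags every record without a VISITNUM value
    return {i for i, r in enumerate(records) if r.get("VISITNUM") is None}


def candidate_rule(records):
    # Toy candidate with a deliberate difference: it only flags an explicit
    # null and ignores records where the variable is absent altogether
    return {
        i for i, r in enumerate(records)
        if "VISITNUM" in r and r["VISITNUM"] is None
    }


test_cases = {
    "all_populated": [{"VISITNUM": 1}, {"VISITNUM": 2}],
    "explicit_null": [{"VISITNUM": None}],
    "variable_absent": [{"VISIT": "WEEK 1"}],
}
failed = differing_test_cases(candidate_rule, reference_rule, test_cases)
```

Here the candidate agrees with the reference on two test cases and differs on "variable_absent", which is exactly the kind of interpretation difference a shared test set is meant to expose.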

Also for this kind of rules (as well as the ones from FDA, PMDA, ...) we would like a reference implementation to be published together with the rules. This reference implementation should be completely open, and the rules in it should be written in such a way that they are human-readable and machine-executable at the same time. Our XQuery implementation comes close to this.
The people behind the "Open Rules for CDISC Standards" initiative will surely discuss this with CDISC. So maybe sometime in the future you will hear or read about a reference implementation of the "SDTM-IG v.3.2 Conformance Rules v.1.0" written in XQuery!

The rules that I implemented can currently be downloaded from: "http://xml4pharmaserver.com/RulesXQuery/index.html" (the FDA and PMDA rule implementations can also be found there). You can inspect each of the rules, even if you have never used XQuery before, and you can use them in your own software (even in SAS). You can also try them out with the "Smart Dataset-XML Viewer" by copying the XML file with the rules into the folder "Validation_Rules_XQuery" (just create it if it is not there yet), and the software will "see" them immediately.
We are currently also implementing a RESTful webservice for these rules, allowing applications to always (down)load the latest version of each rule (no more need to "wait until the next release ... maybe next year...")

Keep checking the website, as I intend to make rapid progress with the implementation of these 400 rules. I want to try to add a few new rule implementations every day. I hope to have everything ready by Easter (2017 of course).

And if you like this and would like to cooperate in the "Open Rules for CDISC Standards" initiative, or would like to provide financial support (so that we can outsource part of the work), just mail us, and we will get you involved! Many thanks in advance!

And last but not least: congratulations to the "SDTM Validation team" on this great achievement! We are not completely there yet, but with this publication, a great step forward was made!