Sunday, February 5, 2017

SDTM-IG v3.2 Conformance Rules v1.0 - First implementation experiences

This weekend, I started implementing the just-published "SDTM-IG v.3.2 Conformance Rules v.1.0" under the umbrella of the "Open Rules for CDISC Standards" initiative, an initiative of a number of CDISC volunteers (not a formal team) to implement CDISC conformance rules in a vendor-neutral, open (non-proprietary), free, machine-executable but also human-readable format. For this, the W3C open standard "XQuery" was selected, as it makes the rule implementation independent of the tool used, i.e. everyone can build tools that implement XQuery and the published rules.

This is not new: the earlier published FDA, PMDA and CDISC ADaM rules have also been implemented in XQuery (as far as they make sense) and can be downloaded and used by everyone.

The newly published SDTM-IG rules come as an Excel file. This is unfortunate, as this format does not allow the rules to be used directly in software: the rules need to be read (by a human), interpreted, and then "translated" into software. This is far from ideal, as it leaves a lot of "wiggle room" in the interpretation of the rules. Fortunately, however, the team published the rules with sets of pseudocode, with a "precondition" like "VISITNUM ^= null" (meaning: the value of VISITNUM is not null) and the rule itself, like "VISITNUM in SV.VISITNUM". So the rule can be read as "when VISITNUM is not null, then its value must be found in the SV dataset as a VISITNUM value".
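To illustrate how directly such pseudocode maps to an executable check, here is a sketch of this particular rule's logic in Python. The actual Open Rules implementations are in XQuery against the Dataset-XML files; the in-memory record structures below are a simplification for illustration only.

```python
# Sketch of the rule "when VISITNUM is not null, its value must occur
# in SV.VISITNUM". Datasets are modeled here as lists of dicts; the real
# implementation works on Dataset-XML via XQuery.

def check_visitnum(records, sv_records):
    """Return one violation message per record whose VISITNUM
    is non-null but not found in the SV dataset."""
    sv_visitnums = {r["VISITNUM"] for r in sv_records
                    if r.get("VISITNUM") is not None}
    violations = []
    for i, rec in enumerate(records, start=1):
        visitnum = rec.get("VISITNUM")
        # precondition: VISITNUM ^= null; rule: VISITNUM in SV.VISITNUM
        if visitnum is not None and visitnum not in sv_visitnums:
            violations.append(f"Record {i}: VISITNUM={visitnum} not found in SV")
    return violations

sv = [{"VISITNUM": 1}, {"VISITNUM": 2}]
vs = [{"VISITNUM": 1}, {"VISITNUM": 3}, {"VISITNUM": None}]
print(check_visitnum(vs, sv))  # prints ['Record 2: VISITNUM=3 not found in SV']
```

Note how the precondition and the rule each become a single boolean test, which is exactly why this style of pseudocode is so valuable.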
This is a great step forward relative to the FDA conformance rules, which are "narrative text only" rules, and which sometimes seem to be just the result of the "CDISC complaint box" at the White Oak Campus.

It is now Sunday noon, and I have already implemented about 40 rules (there are 400 of them), so there is still a lot of work to do. But I would already like to share my first impressions:
  • Not all rules are implementable. Some of the rules are currently not "programmable", as they require information that is not in the datasets or the define.xml. An example is "The sponsor does not have the discretion to exclude permissible variables when they contain data" (rule CG0015). Essentially, this is about traceability back to the study design and collected data sets (both usually in CDISC ODM format). Maybe in the future the "Trace-XML extension", developed by Sam Hume, can help solve this.
    The Excel worksheet has a column "Programmable" (Y/N/C) and a column "Programmable Flag Comment", but I noticed that these are not always correct: I found rules that were stated to be non-programmable but which I think can be programmed, and vice versa.
  • Not all rules are very clear. Most of them are very clear, thanks to the "pseudo-code" that has been published, but sometimes, this is not enough. An example is:
    "Variable Role = IG Role for domains in IG, Role = Model Role for custom domains". Now, the "Role" is not in the dataset, and usually also not in the define.xml, as a "Role" is only necessary for non-standard variables, and is otherwise supposed to be the one from the IG or model. So what is the rule here? What is checked? I have no idea!
    In other cases, I needed to look into the "public comments" document to understand the details of the rule. It would have been great if the published document had also had a "rule details" column with extra explanation, including the answers to the questions from the public review period.
  • A good rule has a precondition, a rule, and a postcondition. The first two are present, but the postcondition is missing. The postcondition describes what needs to be done in case the rule is obeyed or violated. In our case, this would normally be an error or warning message.
  • Good rules are written in such a way that they are easy to program. Rules like "VISIT and VISITNUM have a one-to-one relationship" are not ideal: they are better split into two rules, one stating something like "for each unique value of VISITNUM, there must be exactly one value of VISIT", and the other stating "for each unique value of VISIT, there must be exactly one value of VISITNUM". This is easier to implement and, very importantly, allows generating a violation message that is much clearer and more detailed. Also from the description of some of the other rules, it is clear that the rule developers did not test (by writing code) whether they are easy to implement or not.
  • There were also a lot of things that I liked:
  • The document does not distinguish between errors and warnings. The word "error" does not even appear in the worksheet. Good rules are clear and can be violated or not (and not something in between). Therefore setting something to "warning" is never a good idea in rule making, as it usually has no consequences (with the exception of the yellow card in soccer maybe). The use of "Warning" in the FDA rules has generated a lot of confusion. For example, in the Pinnacle21 validation software, you get a warning when the "EPOCH" variable is absent from your dataset ("FDA expected variable not found"), but you also get a warning when it is present ("Model permissible variable added into standard domain"). So, whatever you do, you always get a warning on "EPOCH"!
  • More than in the FDA rules, the define.xml is considered to be "leading". For example, rule CG0019: "Each record is unique per sponsor defined key variables as documented in the define.xml". This is not only a very well implementable rule, it is also much better than the (in my opinion not entirely correct) implementation in the Pinnacle21 tool, which usually leads to a large number of false positives, as it completely ignores the information in the define.xml.
    But also here, some further improvement is possible. For example, rule CG0016: "... a null column must still be included in the dataset, and a comment must be included in the define.xml to state that data was not collected". It does not state where in the define.xml the comment must go (my guess: as a def:CommentDef referenced by the ItemDef for that variable, but that isn't said), or what the contents of the comment should be. So how can I implement this rule in software? Is the absence of a def:CommentDef for that variable sufficient to make it a violation?
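The splitting of a one-to-one rule into two directed checks, as suggested above for VISIT/VISITNUM, can be sketched in a few lines. This is a Python illustration of the logic only (the real rule implementations are in XQuery; the record structure is a simplification):

```python
from collections import defaultdict

def check_one_to_one(records, key_a, key_b):
    """For each unique value of key_a there must be exactly one value of key_b.
    Run twice (a->b and b->a) to cover the full one-to-one relationship,
    yielding a precise violation message for each direction."""
    mapping = defaultdict(set)
    for rec in records:
        mapping[rec[key_a]].add(rec[key_b])
    return [f"{key_a}={a} maps to multiple {key_b} values: {sorted(vals)}"
            for a, vals in mapping.items() if len(vals) > 1]

records = [
    {"VISITNUM": 1, "VISIT": "SCREENING"},
    {"VISITNUM": 1, "VISIT": "BASELINE"},   # violates VISITNUM -> VISIT
    {"VISITNUM": 2, "VISIT": "WEEK 2"},
]
print(check_one_to_one(records, "VISITNUM", "VISIT"))
print(check_one_to_one(records, "VISIT", "VISITNUM"))  # prints []
```

The point is that each direction produces its own, much more specific message than a blanket "VISIT and VISITNUM do not have a one-to-one relationship".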
In the software world, when there is an open specification, there usually is a so-called "reference implementation". This means that everyone is allowed to create their own implementation of the specification, but for a well-defined test set, the results must be exactly the same as those generated by the reference implementation. Other implementations may add additional features, excel in performance, and so on.
Ideally, the source code of the reference implementation is open, so that everyone can inspect the details of the implementation.

Also for this kind of rules (as well as the ones from FDA, PMDA, ...) we would like a reference implementation to be published together with the rules. This reference implementation should be completely open, and written in such a way that the rules are at the same time human-readable and machine-executable. Our XQuery implementation comes close to this.
The people behind the "Open Rules for CDISC Standards" initiative will surely discuss this with CDISC. So maybe somewhere in the future you will hear or read about a reference implementation of the "SDTM-IG v.3.2 Conformance Rules v.1.0" written in XQuery!

The rules that I implemented can currently be downloaded from: "" (the FDA and PMDA rule implementations can also be found there). You can inspect each of the rules, even if you have never used XQuery before; you can use them in your own software (even in SAS); and you can try them out with the "Smart Dataset-XML Viewer" by copying the XML file with the rules into the folder "Validation_Rules_XQuery" (just create it if it is not there yet), and the software will "see" them immediately.
We are currently also implementing a RESTful web service for these rules, allowing applications to always (down)load the latest version of each rule (no more need to "wait until the next release ... maybe next year ...").

Keep pinging the website, as I intend to make rapid progress with the implementation of these 400 rules. I want to try to add a few new rule implementations every day. I hope to have everything ready by Easter (2017 of course).

And if you like this and would like to cooperate in the "Open Rules for CDISC Standards" initiative, or would like to provide financial support (so that we can outsource part of the work), just mail us, and we will get you involved! Many thanks in advance!

And last but not least: congratulations for this great achievement to the "SDTM Validation team"! We are not completely there yet, but with this publication, a great step forward was made!

Thursday, January 5, 2017

Generating Define-XML: the Pinnacle21 roundtrip test

In my previous post, I presented our new "Define.xml Designer" software, implementing all "best practices for generating define.xml", but also allowing the generation of very good define.xml files for legacy studies for which the SAS-XPT files are already present but no define.xml exists yet.

It looks as if many people are, however, still using the "Pinnacle21 Community Define.xml Generator", probably because it is free and uses Excel as input for the tool. The price for that, however, is that there is no user manual, no support, and no graphical user interface. As there is neither a manual nor a GUI, the originators advise users to load an existing define.xml into the tool, generate the Excel worksheet from that, adapt the worksheet for the current study, and then generate the new define.xml from the worksheet with the tool. This usually results in a number of "trial-and-error" cycles, each time changing the worksheet and trying again, until the desired define.xml is obtained. However, for someone who knows the basic principles of XML (my students at the university learn these in less than 3 hours), I presume adapting the define.xml using an XML editor is considerably faster (and one understands what one does!).

A good test for such software is always to do a "round trip", i.e. taking a correct file, loading it into the tool, and then exporting it again. In the case of the Pinnacle21 Define.xml Generator, this means loading an existing define.xml, exporting it to an Excel worksheet, and then generating a new define.xml from that worksheet, without having made any changes to it.
Ideally, the result should be that source define.xml and newly generated define.xml are 100% identical. No information should be lost, and no new information should have been added somehow. Existing information should not have been changed either.

Round-tripping is a typical quality test for software. Loading a file and exporting it again should result in no differences. So we did the test on the Pinnacle21 software (v.2.2.0) using the sample SDTM define.xml 2.0 file that comes with the standards distribution.

What are the results?

Let us first check whether any information was lost in the roundtrip. This is what we found:
  • the "Originator" attribute on the "ODM" element disappears, as well as the "SourceSystem" and "SourceSystemVersion" attributes. These contain important information about who (which organization) and what system generated the define.xml. As there is no manual, we could not find out how to reintroduce this important information using the tool.
  • the "label" of many of the variables had disappeared (the "Description" element under the "ItemDef" element). We found that this is the case when the variable is a "valuelist" variable. Inspection of the worksheet generated by the tool revealed that there is indeed no "Label" column in the "ValueLevel" tab of the worksheet. Maybe one should add one there manually, but as there is no user manual, there is no way to find out. This also means that the define.xml generated this way (without labels for value-level variables) is not only essentially invalid, but also not very usable for reviewers, as they cannot find out what the valuelist variable is about.
  • additionally, all "SASFormatName" attributes disappeared. Now, "SASFormatName" is an optional attribute, but it may be valuable to have it in the define.xml when the define.xml of one study is used as a template for the define.xml of a subsequent (similar) study (reuse).
The Pinnacle21 tool removes some of the important attributes on the ODM element (colored red)

Let us now check whether any information was added (silently) that was not in our original define.xml at all. 
  • Rather surprisingly, we found that a number of variable definitions were automatically added, although they were not in the original define.xml. When a variable is defined once (e.g. STUDYID, USUBJID) and referenced many times (i.e. by each dataset), the Pinnacle21 tool refuses this kind of "reuse" and creates separate variable definitions for STUDYID and USUBJID, a new one for each dataset. So, in our original define.xml we had only 1 definition of STUDYID (with OID "IT.STUDYID"), whereas in the newly generated define.xml we have over 30 of them (with OIDs "IT.TA.STUDYID", "IT.TE.STUDYID", "IT.DM.STUDYID", etc.). The same applies to USUBJID: instead of a single definition of USUBJID, we suddenly have over 30.
Did the tool change any information from our original define.xml file?
We found the following:
  • All OIDs (the identifiers) were altered, except for most of the ones of the valuelists (but not all of them) and of the codelists. It looks as if in many cases the tool assigns the OIDs itself, without the user having any influence on this. As the OIDs are arbitrary, this is not a disaster, but it again means that one cannot use one define.xml as a template for the next one, especially when one has company-standardized OIDs for SDTM, SEND or ADaM variables.
The Pinnacle21 tool changes all the OIDs in the define.xml (or reassigns them)

We were shocked by the finding that the tool also alters the "Study OID" without any notice. In the original define.xml its value is "cdisc01"; in the newly created define.xml it is "CDISC01.SDTM-IG.3.1.2". We again suspect that the user has no influence on the assignment of the "Study OID". The same applies to the OID and Name attributes of the "MetaDataVersion" element and the contents of its "Description" element: all of these were changed by the tool without any notice.

OIDs of "Study" and "MetaDataVersion" have been altered, as well as "MetaDataVersion Name" and the "MetaDataVersion Description"

You might now ask yourself how our own "XML4Pharma Define.xml Designer" scores in the "roundtrip test". Well, you can easily find out by requesting a trial version of the software and performing the roundtrip test yourself. This will also allow you to discover how user-friendly this new software is.

Conclusion: the Pinnacle21 "Define-XML Generator" does a pretty good job of generating a (prototype) define.xml starting from an Excel worksheet. The "round trip test", however, shows that the user does not have any influence at all on how the OIDs are generated. Worse is that the labels for the "ValueList" variables are missing. Maybe this can be circumvented by adding an extra "Label" column in the worksheet for them, but as there is no user manual, there is no way to find out.
This means that the generated define.xml still requires manual editing (best done with an XML editor - there are some free ones). This triggers the question whether taking an existing define.xml and adapting it for a new study with an XML editor isn't the faster way, with the additional advantage that one knows what one is doing.
There is considerably better define.xml-generating software on the market, with nice GUIs and wizards (including our own "Define.xml Designer"). These are not free, but their cost is very reasonable, and e.g. only a fraction of what the "Pinnacle21 Enterprise Edition" costs.

Thursday, October 13, 2016

Units and ODM 2.0

A few of us are already giving thought to what the requirements for a CDISC ODM 2.0 standard should be. Integration with healthcare is one of the main topics. Support for RESTful web services and an additional JSON implementation are surely on the list.

One of the main problems with the current version, ODM 1.3, is the way units of measure are handled. 10 years ago, when ODM 1.3 was developed, we were not aware of UCUM yet, nor of LOINC and other coding systems in healthcare. At that time, we were just starting to experiment with extracting information from electronic health records (EHRs). ODM was very case report form (CRF) centric, without much consideration of how one can (automatically) populate a CRF from an electronic health record or a hospital information system (HIS).

The way units of measure are implemented in ODM is very simple: one just defines a list of units of measure, and then references them later. For example:
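In the original post, this example is a screenshot; a minimal sketch of such ODM 1.3 unit definitions (the OIDs here are hypothetical) looks like:

```xml
<BasicDefinitions>
  <MeasurementUnit OID="MU.IN" Name="inches">
    <Symbol><TranslatedText xml:lang="en">in</TranslatedText></Symbol>
  </MeasurementUnit>
  <MeasurementUnit OID="MU.CM" Name="centimeter">
    <Symbol><TranslatedText xml:lang="en">cm</TranslatedText></Symbol>
  </MeasurementUnit>
</BasicDefinitions>
```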

This defines "inches" and "centimeters". What these exactly mean (e.g. that "centimeter" means 1/100 of a meter and that the latter is an SI unit) is not included, nor is any conversion information (e.g. that 1 inch is 2.54 cm). It does not even state what the property is (in this case, "length").

In the definition of the data point (an ItemDef), these are then referenced, e.g. by:
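Again, the original snapshot is an image; a sketch of such a reference (OIDs hypothetical, other children omitted) could be:

```xml
<ItemDef OID="IT.HEIGHT" Name="Height" DataType="float">
  <Question><TranslatedText xml:lang="en">Body height</TranslatedText></Question>
  <MeasurementUnitRef MeasurementUnitOID="MU.IN"/>
  <MeasurementUnitRef MeasurementUnitOID="MU.CM"/>
</ItemDef>
```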

This states that the height can be expressed either in a unit "inches" or in a unit "centimeter", whatever these may mean - a machine will not really understand. A machine will also not understand that this is about body height. For that we need to add some semantic information, like a LOINC code. Currently, this can be done using the "Alias" child element:
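A sketch of what the (image) example shows, with the LOINC code and the SDTM annotation added as "Alias" elements (the OIDs and the exact annotation text are hypothetical):

```xml
<ItemDef OID="IT.HEIGHT" Name="Height" DataType="float">
  <Question>...</Question>
  <MeasurementUnitRef MeasurementUnitOID="MU.IN"/>
  <MeasurementUnitRef MeasurementUnitOID="MU.CM"/>
  <Alias Context="LOINC" Name="8302-2"/>
  <Alias Context="SDTM" Name="VSORRES where VSTESTCD=HEIGHT"/>
</ItemDef>
```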

(P.S.  some elements have been collapsed for clarity)
Also remark the "SDTM annotation", stating that this datapoint will later come into VSORRES in the case that VSTESTCD has the value "HEIGHT".

So, how is this implemented in the CRF? The ODM doesn't tell us. Are there two checkboxes on the CRF, one with "in" and one with "cm", and does the investigator need to check one of them? Or are there two versions of the CRF, one for Anglo-Saxon countries with "inches" preprinted and one for countries using metric units with "cm" preprinted?

If there is only one unit of measure assigned, the case is clear. For example, for a blood pressure:

with a single reference to a unit of measure "millimeter mercury column":

A computer system, however, does not know what this really means (semantically), e.g. that it represents a pressure. We can add that information by providing the UCUM notation, again using an "Alias":
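For example (a sketch; the OID is hypothetical, and "mm[Hg]" is the official UCUM notation for millimeter of mercury):

```xml
<MeasurementUnit OID="MU.MMHG" Name="millimeter mercury column">
  <Symbol><TranslatedText xml:lang="en">mmHg</TranslatedText></Symbol>
  <Alias Context="UCUM" Name="mm[Hg]"/>
</MeasurementUnit>
```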

And if we look in the UCUM "ucum-essence.xml" file, we get all information for free:

This states that "meter mercury column" is a unit for the property "pressure" and that it is equal to 133.322 kilopascal. For "millimeter mercury column", the system knows that in UCUM there is a prefix "m" with meaning "milli" and value "0.001" (also defined in the ucum-essence.xml).

This also makes it possible to do unit conversions in an automated way, even using publicly available RESTful web services for UCUM unit conversions. That way, a system can easily find out that a blood pressure of 2.5 [psi] (pounds per square inch) corresponds to 129.29 mm[Hg].
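The arithmetic behind that example is just a ratio of the base-unit definitions, sketched here in Python (1 psi = 6894.757 Pa is the standard conversion factor; 133.322 Pa per mm[Hg] follows from the ucum-essence.xml values quoted above):

```python
# UCUM defines both units relative to the pascal, so conversion is a ratio:
# 1 [psi] = 6894.757 Pa, and m[Hg] = 133.322 kPa, so mm[Hg] = 133.322 Pa.
PA_PER_PSI = 6894.757
PA_PER_MMHG = 133.322

def psi_to_mmhg(value_psi):
    """Convert a pressure in [psi] to mm[Hg] via the common base unit (Pa)."""
    return value_psi * PA_PER_PSI / PA_PER_MMHG

print(round(psi_to_mmhg(2.5), 2))  # prints 129.29
```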

So, as an intermediate conclusion, we can state that it is already possible to give semantic meaning to measurements (by providing their LOINC code) and the units in which they are expressed (by providing the UCUM notation), by using the "Alias" mechanism.

This would also enable systems to automatically extract information from EHRs (e.g. using HL7 CDA or FHIR) as in these systems, body height is coded using the LOINC code "8302-2" and the value MUST be given using UCUM notation. For example (as FHIR):

with the LOINC code in the "code" element (middle part of the snapshot), the value in the "value" element (near the bottom) and the (UCUM) unit in the "code" element under "valueQuantity" (lower part of the snapshot).

Is this sufficient?

I do not think so.

"Alias" can be used for anything, and the content of "Context" is not standardized. Also, we should encourage the use of UCUM, as the current codelist for units developed by CDISC is a disaster anyway. Even for pre-clinical studies, to be submitted as SEND, the use of UCUM unis would be a great stepf forward. So we are thinking about "promoting" the UCUM notation to an attribute on MeasurementUnitDef itself, something like (but don't pin me on that!):

However, that doesn't solve everything...

When talking about measurements and units in clinical research, and especially for laboratory tests, I think we can see the following categories:

  • The measurement has no unit. For example: "pH"
  • We know the exact unit of measure in advance. For example, "millimeter mercury column" for a blood pressure. This is covered by the current use of "MeasurementUnit" in ODM. The unit can then be preprinted on the form, and/or stored in the database as the one we know will always be the case
  • There is a choice of units. For example: a choice between "cm" and "inches". This too is covered, except for how it is "rolled out", e.g. by different CRF versions based on culture or country
  • We don't know what units we will get back. This is often the case for lab tests. Unfortunately, most protocols do not provide sufficient details about what exactly should be done; they e.g. simply state "do a glucose in urine test". We can then expect a multitude of units (or their absence) back: one lab will report in mg/dL, another in mmol/L, others will provide ordinal information (1+, 2+, ... - no units), making comparison hard (how will we standardize to --STRESU in SDTM?). In such a case, the unit information is usually a field on the CRF. For example:

The latter is OK in the sense that the "question" about the unit is just another question, but we lose the information that a) it is a unit, and b) it is a unit for the albumin concentration. Of course, this could e.g. be solved by an "Alias", like:
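For example (a sketch only - the "UnitForItem" context and all OIDs are invented here, which is exactly the problem: nothing standardizes them):

```xml
<ItemDef OID="IT.ALB_UNIT" Name="Albumin unit" DataType="text">
  <Question><TranslatedText xml:lang="en">Unit:</TranslatedText></Question>
  <Alias Context="UnitForItem" Name="IT.ALB"/>
</ItemDef>
```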

but this is not a very elegant solution, as the content of "Alias" is not standardized.
In CDA and FHIR, this is easy, as these do not define what is to be measured, just what has been measured. In ODM, it is just a bit more difficult.
Now, I do not know the solution for this, but it is something that we (the XML technologies team) will need to tackle.

Friday, September 30, 2016

Generating Define-XML: new software

Although my posting "Creating define.xml - best and worst practices" is the most-read entry of my blog site, people seem not to learn (or maybe do not want to learn). Almost daily, I read complaints from people who use free-of-charge "black box software" for generating define.xml, starting from Excel worksheets, and who do not get the result they would like to obtain, at least not when viewing the define.xml visualized by the stylesheet. Some do not even realize that what they see in the browser is not the define.xml, but only a visualization of it.

There does not seem to be much user-friendly software on the market for generating or working with Define-XML. As already explained in the above-mentioned blog, the best way to generate a define.xml is "upfront", and a few software packages exist for mapping to SDTM and at the same time generating a good define.xml. One of them is my own SDTM-ETL software.

For people who cannot use this approach (e.g. for legacy studies), there was not much out there yet. They usually used the "Excel" approach, often leading to very bad results.

We recently released new software named the "ODM Study and Define.xml Designer 2016", which can be used both for setting up study designs in ODM format and for generating define.xml files. For the latter, 4 use cases are supported:

  • creating a define.xml from scratch
  • creating a define.xml starting from an SDTM template (SDTM 1.2, 1.3 or 1.4 - SDTM-IG 3.1.2, 3.1.3 or 3.2)
  • creating a define.xml starting from a set of SAS-XPT files
  • starting from an incomplete define.xml file, e.g. generated by other tools
In all cases, the user can choose between define.xml v.1.0 and v.2.0. The upcoming v.2.1 will also be supported as soon as it is published by CDISC. The user can also choose between all CDISC controlled terminology versions released since 2013.

    Unlike the "black box tools", the software comes with a very nice graphical user interface, has very many wizards, and performs validation using validation rules developed by CDISC. For example for generating the define.xml v.2.0 "Where Clauses", there is a wizard:

making it extremely easy to develop "where clauses".

At each moment during the process, the user can inspect the generated define.xml, either as XML, as a tree structure, or visualized in the user's favorite browser, using either the default CDISC stylesheet or their own stylesheet.

The validation features go beyond anything else that is currently available, and validation can be done at different levels. Moreover, unlike with other tools, no false positive errors are generated. This is due to the fact that the developer of the software (well, that's me, a CDISC volunteer for 15 years now) is one of the co-developers of the Define-XML standard and a CDISC-authorized Define-XML trainer (I give most of the CDISC Define-XML trainings in Europe), and thus knows every detail of the standard.

The software is not free of charge, but it is not expensive either. So there is now no excuse anymore for generating bad define.xml files!

Information, including a user manual, can be found at:

Thursday, September 15, 2016

FDA and SAS Transport 5 - survey results

As promised on LinkedIn, I analyzed the results of the survey in which people were asked the question "In my opinion, the FDA should ..." with the following possible answers:
  • Continue requiring SAS-Transport-5 (XPT) as the transport format
  • Move to XML (e.g. CDISC Dataset-XML) as soon as possible
  • Move to RDF as soon as possible
  • Other
People were also asked who they are working for: Pharma Sponsor, CRO, Service Provider, Software Company, or Other.
We had 57 answers (which is considerably less than I had hoped for). Here are the first results:

with a relatively good distribution between all groups (some ticked more than 1 box), with a slight overrepresentation of pharma sponsors (which isn't a surprise, as they do the FDA submissions).

And here come the results of the question on the exchange format:

Over 50% voted for moving to an XML-based format like CDISC Dataset-XML, and about 25% for moving to RDF. A minority of less than 20% voted for continuing the current FDA policy of requiring SAS Transport 5.

I tried to make a detailed analysis looking for relations between the answer about the preferred format and the company type, but didn't find any. The only slight trends I could see (but statistically not significant at all) are that RDF is a bit overrepresented in the "Sponsor" group, and that "SAS-Transport-5" is slightly overrepresented in the "CRO" group. Only 3 (out of the 20) "sponsor voters" voted for "Continue requiring SAS-Transport-5".

The survey also allowed respondents to provide comments. Here are the most interesting ones:
  • If it's not broken, don't fix it. Pharma is a big industry and slow to change/adapt
  • We must move beyond the restrictive row/column structure
  • SDTM is useless and error prone. We need modern data models and semantics
  • Consider JSON also. Get rid of Supplemental domains
  • Going for RDF means that ADaM, SDTM and the rest could all be linked together ...
If anyone would like to analyze the results in more detail, just mail me and I can send the complete results as a spreadsheet, as CSV, or similar.

Saturday, June 4, 2016

MedDRA rules validation in XQuery and the Smart Dataset-XML Viewer

Yesterday, I accomplished something that I believed was difficult, but after all wasn't: implementing the FDA and PMDA MedDRA validation rules in XQuery (it's easy if you know how).

The problem with MedDRA is that it is not open and public - you need a license. After you have got one (I got a free one as I, as a professor in medical informatics, use MedDRA for research), you can download the files. When I did, I expected some modern file format like XML or JSON, but to my surprise, the files come as old-fashioned ASCII (".asc") files with the "$" character as field separator. The explanations that come with the files say that the files can be used to build a relational database. However, the license does not allow me to redistribute the information in any form, so I could not build a RESTful web service that could then be used in the validation. As the other validator also just uses the ".asc" files "as is", I needed to find out how XQuery can read ASCII files that do not represent any XML.
I regret that MedDRA is not open and free for everyone (CDISC controlled terminology is). How can we ever empower patients to report adverse events when each patient separately needs to apply for a MedDRA license? This model is not of this time anymore ...

The FDA and PMDA rule sets each contain about 20 rules that involve MedDRA. One of them is, in my opinion, not implementable in software. Rule FDAC165/(PMDA)SD2006 states "MedDRA coding info should be populated using variables in the Events General Observation Class, but not in SUPPQUAL domains". How on earth can software know whether a variable in a SUPPQUAL domain has to do with MedDRA? The only way I can see is that there is a codelist attached to that variable pointing to MedDRA. If this is not the case, one can only guess (something computers are not so good at).

As MedDRA files are text files that do not represent XML, we cannot use the usual XQuery constructs to read them. Fortunately, XQuery 3.0 comes with the function "unparsed-text-lines()", which (among others) takes a file address as an argument. The file address, however, needs to be formatted as a URI, e.g.:


This function reads the file line by line. If it is combined with the function "tokenize()", which splits strings into tokens based on a field separator, then XQuery can also easily read such old-fashioned text files. So the beginning of our XQuery file (here for rule FDAC350/SD2007), after all the namespace and function declarations, looks like:

The first five lines in this part (18-22) define the location of the define.xml file and of the MedDRA pt.asc (preferred terms) file. For each of them, we use a "base", as we later want to enable these to be passed from an external program.

In line 24, the file is parsed; the result is a sequence of strings "$lines". In lines 26-29, we select the first item in each line (with the "$" character as the field separator). As such, "$ptcodes" now simply consists of all the PT codes (preferred term codes).
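What those XQuery lines do can be illustrated in a few lines of Python (the PT codes below are made up, and real pt.asc rows contain more fields; this is only a sketch of the line-splitting logic):

```python
# Mimic unparsed-text-lines() + tokenize(): read the file line by line,
# split each line on the "$" field separator, and keep the first field,
# which is the MedDRA preferred-term (PT) code.

def extract_pt_codes(lines):
    return {line.split("$")[0] for line in lines if line.strip()}

sample = [
    "10000001$Some preferred term$10000000$$$$$$$$$",    # hypothetical rows
    "10000002$Another preferred term$10000000$$$$$$$$$",
]
print(sorted(extract_pt_codes(sample)))  # prints ['10000001', '10000002']
```

A --PTCD value from a dataset record can then simply be checked for membership in this set.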

Then, the define.xml file is read, and the AE, MH and CE dataset definitions are selected:

An iteration is started over the AE, MH and CE datasets (note that the selection allows for "split" datasets), and for each of them, the define.xml OID of the --PTCD variable is captured, together with the variable name (which can be "AEPTCD", "MHPTCD" or "CEPTCD"). The location of the dataset is then obtained from the "def:leaf" element.

In the next part, we iterate over each record in the dataset and get the value of the --PTCD variable (line 50):

and then check whether the captured value is in the list of allowed PT codes (line 53). If it is not, an error message is generated (lines 53-55).

    That's it! Once you know how it works, it is so easy: it took me less than 15 minutes to develop each of these 20 rules.

    I talked about these XQuery rules implementations with an FDA representative at the European CDISC Interchange in Vienna. When he saw the demo, his face became slightly pale, and he asked me: "Do you know what we paid these guys to implement our rules in software, and you tell me your implementation comes for free and is fully transparent?".

    Beyond being free, open and fully transparent (as if that were not sufficient), these rules have the advantage of being completely independent of the software used to do the validation: anyone can now write their own software without needing to code the rules themselves. You could even create a server that validates your complete repository of submissions during the night. As the messages come as XML, you can easily reuse them in any application that you want (try this with Excel!).

    In the next section, I would like to explain how extremely easy it is to write software for executing the validations. The "Smart Dataset-XML Viewer" allows you to do these validations (but you can also choose not to do any validation at all, or only for some rules), so I just took a few code snippets to explain this. We use the well-known open-source Saxon library for XML parsing and validation, developed by the XML guru Michael Kay, which is available both for Java and for C# (.NET). If you would like to see the complete implementation of our code, just go to the SourceForge site, where you can download the complete source code or just browse through it. The most interesting class is "XQueryValidation" in the package "edu.fhjoanneum.ehealth.smartdatasetxmlviewer.xqueryvalidation".
    Here is a snippet:

    First of all, the file location of the define.xml is transformed to a "URI". A new StringBuffer is prepared to keep all the messages. In the following lines, the Saxon XQuery engine is initialized

    and the base location and file name of the define.xml file is passed to the XQuery engine (remark that the define.xml can also be located in a native XML database, with one collection per submission, something that also the FDA and PMDA could easily do). This "passing" is done in the lines with "exp.bindObject" (in the center of the snippet).
    In case MedDRA is involved in the rule execution, the same is done in the last part of the snippet (whether a rule requires MedDRA is given by the "requiresMedDRA" attribute in the XML file containing all the rules):

    The rule is then executed, and the error messages (as XML strings) captured in the earlier defined StringBuffer:

    So, the contents of the "messageBuffer" StringBuffer is essentially an XML document that can be parsed, written to file, transformed into a table, stored in a native XML database, or ...

    In order to accept the passing of parameters from the program to the XQuery rule, we only need to change the hardcoded file paths and locations to "external" ones, i.e. stating that some program will be responsible for passing the information. In the XQuery itself, this is done by:

    As one sees, lines 17-19 have been commented out, and lines 14-16 declare that the values for the location of the define.xml file and of the directory with the MedDRA files will come from a calling program.

    In the "Smart Dataset-XML Viewer", the user can decide for himself where the MedDRA files are located (so it is not necessary to copy files to the directory of the application), using the button "MedDRA Files Directory":

    A directory chooser then shows up, allowing the user to set where the MedDRA files need to be read from. This can also be a network drive, as is pretty usual in companies.

    If you are interested in implementing these MedDRA validation rules, just download them from our website, or use the RESTful web service to get the latest update.

    Again, the "Smart Dataset-XML Viewer" is completely "open source". Please feel free to use the source code, to extend it, to use parts of it in your own applications, to redistribute it with your own applications, etc. Of course, we highly welcome it when you donate back the source code of extensions that you wrote, so that we can further develop this software.

    Friday, May 20, 2016

    Electronic Health Record Data in SDTM

    The newest publication of the FDA about "Use of Electronic Health Record Data in Clinical Investigations" triggered me to pick up the topic of EHRs in SDTM again. The FDA publication describes the framework in which the use of EHRs in clinical research is allowed and encouraged. Although it does not contain really new information, it should take away the fears of sponsors and investigators about using EHR data in FDA-regulated clinical trials.

    One of the things that will surely happen in the future is that the FDA reviewer wants to see the EHR data point that was used as the source of a data point in the SDTM submission. The investigator will then ask the sponsor, who will then ask the site ...: another delay in bringing this innovative new drug or therapy to the market. In the meantime, patients will die ...

    So can't we have the EHR datapoint in the SDTM itself?

    Of course! It is even very easy, but only if the FDA would finally decide to get rid of SAS-XPT, this ancient binary format with all its stupid limitations.

    Already some years ago, the CDISC XML Technologies Team developed the Dataset-XML standard, as a simple replacement for SAS-XPT. The FDA did a pilot, but since then nothing has happened - "business as usual" seems to have returned.
    Dataset-XML was developed to allow the FDA a smooth transition from XPT to XML. It doesn't change anything about SDTM; it just changes the way the data is transported from A to B. However, Dataset-XML has the potential to do things better, as it isn't bound to the two-dimensional table approach of XPT (which again forces SDTM to be 2-dimensional tables).

    So, let's try to do better!

    Suppose that I do have a VS dataset with a systolic blood pressure for subject "CDISC01.100008" and the data point was retrieved from the EHR of the patient. Forget about adding the EHR data point in the SDTM using ancient SAS-XPT! We need Dataset-XML.

    This is how the SDTM records look:

    Now, the EHR is based on the new HL7-FHIR standard, and the record is very similar to the example observations published with the FHIR specification. How do we get this data point into our SDTM?

    Dataset-XML, as it is based on CDISC ODM, is extensible. This means that XML data from other sources can be embedded as long as the namespace of the embedded XML is different from the ODM namespace. As FHIR has an XML implementation, the FHIR data point can easily be embedded into the Dataset-XML SDTM record.
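    This "foreign namespace" condition is easy to verify programmatically. A small Java/DOM sketch, using a hypothetical, stripped-down fragment in which only the namespaces matter (the real records of course carry many more elements and attributes):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class NamespaceCheck {

    // Parses an XML record and checks that the embedded element lives in a
    // namespace different from the one of the document (ODM) root element.
    static boolean embeddedElementIsForeign(String xml, String foreignNs,
                                            String localName) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);  // required, otherwise namespace URIs are ignored
        Document doc = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();
        Element embedded = (Element) root
                .getElementsByTagNameNS(foreignNs, localName).item(0);
        return embedded != null && !foreignNs.equals(root.getNamespaceURI());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical, stripped-down Dataset-XML record with an embedded
        // FHIR Observation in its own (non-ODM) namespace
        String xml = "<ItemGroupData xmlns='http://www.cdisc.org/ns/odm/v1.3'>"
                   + "<Observation xmlns='http://hl7.org/fhir'/>"
                   + "</ItemGroupData>";
        System.out.println(
            embeddedElementIsForeign(xml, "http://hl7.org/fhir", "Observation"));
    }
}
```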

    In the following example (which you can download from here), I decided to add the FHIR-EHR data point to the SDTM record itself, and not to VSORRES (for which one could also make a case), as I think that the data point belongs to the record, and not to the "original result" - we will discuss this further on.

    The SDTM record then becomes:

    Remark that the "Observation" element "lives" in the HL7 FHIR namespace, not in the ODM namespace.

    continued by:

    Important here is that LOINC coding is used for an exact description of the test (systolic, sitting - LOINC code 8459-0), and that SNOMED-CT is used for coding the body part. This is important - the SDTM and CT teams are still refusing to allow the LOINC code to be used as the unique identifier for the test in VS and LB. Instead, they reinvented the wheel and developed their own list of codes, leading to ambiguities. LOINC coding is mandated to be used in most national EHR systems, including the US Meaningful Use. The same applies to the use of UCUM units.

    Now, if you inspect the record carefully, you will notice that a good amount of the information is present twice. The only information that is NOT in the EHR data point is STUDYID, USUBJID (although ...), DOMAIN, VISITNUM, VISITDY (planned study day) and VSDY (actual study day). STUDYID is an artefact of SAS-XPT, as ODM/Dataset-XML would allow grouping all records per subject (using the ODM SubjectData/@SubjectKey attribute). DOMAIN is also an artefact, as within the dataset, DOMAIN must always be "VS", and it is given by the define.xml anyway, with a reference to the correct file. VSDY is derived and can easily be calculated "on the fly" by the FDA tools. Even VSSEQ is artificial and could easily be replaced by a worldwide unique identifier (making it worldwide referenceable, as in ... FHIR). VISIT (name) is also derived in the case of a planned visit and can be looked up in TV (trial visits).
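    VSDY, for example, follows the usual SDTM study-day convention (the reference start date itself is day 1, there is no day 0, days before it are negative), so a reviewer tool could derive it on the fly. A minimal sketch:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class StudyDay {

    // SDTM study-day convention: no day 0; the reference start date
    // (RFSTDTC) itself is day 1, days before it are negative.
    static long studyDay(LocalDate rfstdtc, LocalDate dtc) {
        long diff = ChronoUnit.DAYS.between(rfstdtc, dtc);
        return diff >= 0 ? diff + 1 : diff;
    }
}
```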

    So, if we allow Dataset-XML to become multidimensional (grouping data by subject), the only SDTM variables that explicitly need to be present are VISITNUM and VISITDY. So essentially, our SDTM record could be reduced to:


    Remark the annotations I made, showing the mapping to SDTM variables.

    If the reviewer still likes to see the record in the classic two-dimensional table way, that's a piece of cake: an extremely simple transformation (e.g. using XSLT) does the job.

    Now, reviewers always complain about file sizes (although reviewers should be forbidden to use "files" anyway), and will surely do so when they see how much "size" the FHIR information takes. But who says that the FHIR information must be in the same file? Can't it just be referenced, or better, can't we state where the information can be found using a secured RESTful web service?
    This is done all the time in FHIR! So we could further reduce our SDTM record to:

    Remark that the "http://..." is not simply an HTTP address: just using it in a browser will not allow you to obtain the subject's data point. The RESTful web service in our case will require authentication, usually using the OAuth2 authentication mechanism.

    Comments are very welcome - as ever ...

    Tuesday, May 3, 2016

    Ask SHARE - SDTM validation rules in XQuery

    This weekend, after returning from the European CDISC Interchange (where I gave a presentation titled "Less is more - A Visionary View of the Future of CDISC Standards"), I continued my work on the implementation of the SDTM validation rules in the open and vendor-neutral XQuery language (also see earlier postings here and here).
    This time, I worked on a rule that is not so easy. It is the PMDA rule CT2001: "Variable must be populated with terms from its CDISC controlled terminology codelist. New terms cannot be added into non-extensible codelists".
    This looks like an easy one at first sight, but it isn't. How does a machine-executable rule know whether a codelist is extensible or not? Looking into an Excel worksheet is not the best way (not least because Excel is not a vendor-neutral standard) and is cumbersome to program (if possible at all). So we do need something better.

    So we developed the human-readable, machine-executable rule (it can be found here) using the following algorithm:

    • the define.xml is taken and an iteration is performed over each dataset (ItemGroupDef)
    • within the dataset definition, an iteration is performed over all the defined variables (ItemRef/ItemDef) and we check whether a codelist is attached to the variable
    • if a codelist is attached, the NCI code of the codelist is taken. A web service request "Get whether a CDISC-CT Codelist is extensible or not" is triggered. Only the codelists that are not extensible are retained. This leads to a list of non-extensible codelists for each dataset
    • the next step could be to inspect each of the (non-extensible) codelists for whether it has "EnumeratedItem" or "CodeListItem" elements carrying the flag 'def:ExtendedValue="Yes"'. This is however not "bullet proof", as sponsors may have added terms and forgotten to add the flag.
    • an alternative would be to use the web service "Get whether a coded value is a valid value for the given codelist" to query whether each of the values in the codelist is really a value of that codelist as published by CDISC. This relies on the values in the dataset itself for the given variable all being present in the codelist (enforced by another rule). The XQuery implementation can be found here.
    • we chose, however, to inspect each value in the dataset of the given variable that has a non-extensible codelist for whether it is an allowed value for that codelist, using the web service "Get whether a coded value is a valid value for the given codelist". If the answer from the web service is "false", an error is returned in XML format (allowing reuse in other applications). The XQuery implementation can be found here.
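    The decision logic of that last step can be sketched in Java as follows. To keep the sketch self-contained, the two web-service lookups are stubbed out as parameters (in the real implementation they are HTTP calls), and the codelist code and values in the example are purely illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

public class Ct2001Sketch {

    // Checks rule CT2001 for one variable: every value must be a valid term
    // of the attached codelist when that codelist is non-extensible.
    static List<String> check(String variable,
                              String codelistNciCode,
                              List<String> values,
                              Predicate<String> isExtensible,       // "is codelist extensible?" lookup
                              Map<String, Set<String>> validTerms)  // "valid values per codelist" lookup
    {
        List<String> errors = new ArrayList<>();
        if (codelistNciCode == null || isExtensible.test(codelistNciCode)) {
            return errors;  // no codelist attached, or extensible: nothing to check
        }
        Set<String> allowed = validTerms.getOrDefault(codelistNciCode, Set.of());
        for (String value : values) {
            if (value != null && !value.isEmpty() && !allowed.contains(value)) {
                errors.add(variable + ": value '" + value
                        + "' is not a valid term of non-extensible codelist "
                        + codelistNciCode);
            }
        }
        return errors;
    }
}
```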
    You can inspect each of the XQuery implementations using NotePad or NotePad++, or a modern XML editor such as Altova XMLSpy, oXygen XML or EditiX.

    A partial view is below:

    What does this have to do with SHARE?


    All of the above-mentioned RESTful web services are based on SHARE content. The latter has been implemented as a relational database (it could, however, also have been a native XML database), and a pretty large number of RESTful web services has been built around it.

    In the future, the SHARE API will deliver such web services directly from the SHARE repository. So our own web services are only a "test bed" for finding out what is possible with SHARE.

    So in future, forget about "black box" validation tools - simply "ask SHARE"!

    Monday, March 28, 2016

    FDA SDTM validation rules, XQuery and the "Smart Dataset-XML Viewer"

    During the last days, I could again make considerable progress in writing FDA and CDISC SDTM and ADaM validation rules in the vendor-neutral XQuery language (a W3C standard).

    With this project, we aim to:
    • come to a real vendor neutral, as well human-readable as machine-executable set of validation rules (no black-box implementations anymore)
    • have rules that are easily readable by persons in the CDISC community, and that can be commented on
    • develop rules that do not lead to false positives
    • come to a reference implementation of the validation rules, meaning that, after acceptance by CDISC, other implementations (e.g. from commercial vendors) always need to come to the same result for the same test case
    • make these rules available through CDISC SHARE, to applications and humans, using RESTful web services and the SHARE API
    I was now also able to implement these rules in the "Smart Dataset-XML Viewer":

    The set of rules itself is provided as an XML file, for which we already have a RESTful web service for rapid updates, meaning that if someone finds a bug or an issue with a rule implementation, it can be updated within hours, and the software can automatically retrieve the corrected rule implementation (no more waiting for the next software release or bug fix).

    In the "Smart Dataset-XML Viewer", the validation is optional, and when the user clicks the button "Validation Rules Selections", all the available rules are listed, and can be selected/deselected, meaning that the user (and not the software) decides for which rules the submission data sets are validated:

    Some of these rules use web services themselves, for example to detect whether an SDTM variable is "required", "expected" or "permissible", something that cannot be obtained from the define.xml.
    A great advantage is that any rule violations are immediately visible in the viewer itself, i.e. the user does not need to retrieve the information from an Excel file anymore and then look up the record manually in the data set.

    At the same time, all violations are gathered into an XML structure, which can easily be (re)used in other applications (we do not consider Excel as a suitable information exchange format between software applications).

    And even better, all this is real "open source" without any license or redistribution limitations, so that people can integrate the "Smart Dataset-XML Viewer", including its XQuery validation, into any other application, even commercial ones.

    I am currently continuing to work on this implementation, and on the validation rules in XQuery. I did most of the FDA-SDTM rules (well, at least those that are not wrong, incomprehensible, or an expectation rather than a rule).
    I also did about 40% of the ADaM 1.3 validation checks, and will start on the CDISC SDTM conformance rules as soon as they are officially published by CDISC.
    I can however use help with the ADaM validation rules, as I lack some suitable real-life test files. So if you do ADaM validation in your company and have some basic XQuery knowledge (or are willing to acquire it), please let me know, so that we can make rapid progress on this.
    Another nice thing about having the rules in XQuery is that companies can easily start developing their own sets of validation rules in this vendor-neutral language, be it for SDTM, SEND or ADaM, and just add them to a specific directory of the "Smart Dataset-XML Viewer", after which they will immediately become available to the viewer.

    I hope to make a first release on SourceForge (application + source code) in the next few weeks, so stay tuned!




    Thursday, February 25, 2016

    Phil's webinar on LOINC and CDISC

    Today, I attended an excellent "CDISC members only" webinar given by Phil Pochon (Covance) on LOINC and its use in CDISC.
    Phil is an early contributor to CDISC (he has been contributing considerably longer than I have), and is very well known for the development of the CDISC Lab Standard and his contributions to SDTM.

    So Phil is one of the people I highly respect.

    Phil explained the concepts of LOINC very well and especially the differences with the CDISC controlled terminology for lab tests and results.

    He also answered the questions that were posed extremely well, giving his opinion about how LOINC should be used in combination with CDISC standards (there isn't a CDISC policy on this yet).

    In this blog, I want to extend on some of the questions that were posed and on which I have a different opinion (Phil knows that).

    There were several questions about how to map information from local labs (such as local lab test codes) to LOINC. Phil gave some suggestions about looking at the units used, the method provided (if any), and so on.
    My opinion about this is: "don't". If the lab cannot provide you with the LOINC code along with the test result, don't try to derive it. Even if "LBLOINC" were to become "expected" in SDTM in the future, I would suggest not trying to derive it (SDTM is about captured data, not about derived data). The reason is that such a derivation may lead to disaster, also because the reviewer at the FDA cannot see whether the given LOINC code comes from the instrument that did the measurement, or was "guessed" by the person that created the SDTM files. This is a serious data quality issue.

    There was a short discussion about whether labs should provide the LOINC code together with each test result for each measurement. My opinion about that is that if your lab cannot provide LOINC codes, you should not work with that lab anymore. Also, sponsors and CROs should have a statement in their data transfer agreements with labs that the latter should not only deliver "a" LOINC code with each result, but should deliver "the correct" LOINC code with each result.

    Phil also answered a question about having LOINC codes specified for lab tests in the protocol. He stated that this is a long-term goal. I would however state that sponsors should start doing this now. Even if not all labs can always provide what (LOINC code) is described in the protocol, giving the (expected/preferred/suggested) LOINC code in the protocol would immediately increase data quality, as the bandwidth of what is finally delivered would surely become smaller. For example, for "glucose in urine", there is a multitude of lab tests, ranging from quantitative to ordinal to qualitative, each with a different LOINC code. It is impossible to bring all these results to a single "standardized value" (as required by SDTM). Providing the (expected/preferred/suggested) LOINC code in the protocol and passing this information to the labs would at least reduce the breadth of different tests that were actually done, making the mapping to SDTM considerably easier, and at the same time improving data quality considerably.

    An interesting question was whether LBLOINC applies to LBORRES (original result) or to LBSTRES(C/N) (standardized result). I checked the SDTM and it is not specified there. It still states "Dictionary-derived LOINC code for LBTEST", which is a disastrous definition, as LBLOINC should be taken from the data transfer itself, and not be derived (probably leading to disaster).
    If I understood it well, Phil suggested applying it to the "standardized" result. For example, if I obtain the glucose value in "moles/L" (e.g. LOINC code 22705-8) but standardize on "mg/dL", this would mean that I need to (automatically?) transform this LOINC code into the alternative one in "mg/dL", which is (I think) 5792-7.
    In my opinion, one should not do this, but use the LOINC code that was delivered by the lab, so the one for the original result. Why? Let me take an example: suppose one of the labs has delivered the value as "ordinal", so using values like +1, +2, etc. (LOINC code 25428-4). How can I ever standardize these to a concentration? If I can, I guess this is highly arbitrary, and thus leads to another decrease in data quality. So I would propose that LBLOINC is always given as the value that is provided by the lab (so for the original result) and that this is clearly explained in the SDTM-IG.

    Another interesting discussion was on the use of the UCUM unit notation. According to Phil, most of the lab units published by the CDISC team in the past are identical with or very similar to the UCUM notation. My experience is different. What was very interesting is that Phil said that when they (the CDISC-CT team) receive a "new term request" for a lab unit, they first look into UCUM to see whether there is one there, and if so, take that one. I am very glad about that!
    But Phil also said that they get many requests for new unit terms that do not fit into the UCUM system (like arbitrary units for some special tests), so that they then develop their own.
    Personally, I think they shouldn't. If a unit term is so special that it does not fit with UCUM, and also cannot be handled by so-called "UCUM annotations" (CDISC should standardize on the annotations, not on the units), then I wonder whether the requested term is good and generic enough to be standardized on at all. After all, the [UNIT] codelist is extensible.

    My personal opinion is still that CDISC should stop the development of lab test (code) terminology and steadily move to LOINC completely. For example, it is my opinion that it should now already be allowed to simply put the LOINC code in LBTESTCD (maybe with an "L" in front, as test codes are still not allowed to start with a number, a rule stemming from the punchcard era...).