Saturday, June 4, 2016

MedDRA rules validation in XQuery and the Smart Dataset-XML Viewer

Yesterday, I accomplished something that I believed would be difficult, but turned out not to be: implementing the FDA and PMDA MedDRA validation rules in XQuery (it's easy if you know how).

The problem with MedDRA is that it is not open and public - you need a license (I got a free one, as I use MedDRA for research in my role as professor in medical informatics). Once you have the license, you can download the files. When I did, I expected some modern file format like XML or JSON, but to my surprise, the files come as old-fashioned ASCII (".asc") files with the "$" character as field separator. The explanations that come with the files state that they can be used to build a relational database. However, the license does not allow me to redistribute the information in any form, so I could not build a RESTful web service that could then be used in the validation. As the other validator also just uses the ".asc" files "as is", I needed to find out how XQuery can read ASCII files that do not represent any XML.
I regret that MedDRA is not open and free for everyone (CDISC controlled terminology is). How can we ever empower patients to report adverse events when each patient separately needs to apply for a MedDRA license? This model no longer fits our time...

The FDA and PMDA rule sets each contain about 20 rules that involve MedDRA. One of them is, in my opinion, not implementable in software. Rule FDAC165/(PMDA)SD2006 states "MedDRA coding info should be populated using variables in the Events General Observation Class, but not in SUPPQUAL domains". How the hell can software know whether a variable in a SUPPQUAL domain has to do with MedDRA? The only way I can see is that there is a codelist attached to that variable pointing to MedDRA. If this is not the case, one can only guess (something computers are not so good at).

As the MedDRA files are text files that do not represent XML, we cannot use the usual XQuery constructs to read them. Fortunately, XQuery 3.0 comes with the function "unparsed-text-lines()", which takes a file address (and optionally an encoding) as arguments. The file address, however, needs to be formatted as a URI, e.g.:
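For example, a file URI could look like this (a hypothetical Windows path - adjust it to wherever your MedDRA files are installed):

```
file:///C:/MedDRA/MedAscii/pt.asc
```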


This function reads the file line by line. Combined with the function "tokenize()", which splits strings into tokens based on a separator, XQuery can then easily read such old-fashioned text files too. So the beginning of our XQuery file (here for rule FDAC350/SD2007), after all the namespace and function declarations, looks like:

The first five lines in this part (18-22) define the location of the define.xml file and of the MedDRA pt.asc (preferred terms) file. For each of them, we use a "base", as we later want to enable these to be passed in from an external program.

In line 24, the file is parsed; the result is a sequence of strings, "$lines". In lines 26-29, we select the first token of each line (with the "$" character as the field separator). As such, "$ptcodes" now simply consists of all the PT codes (preferred term codes).

Then, the define.xml file is read, and the AE, MH and CE dataset definitions are selected:

An iteration is started over the AE, MH and CE datasets (note that the selection allows for "split" datasets), and in each of them, the define.xml OID of the --PTCD variable is captured, together with the variable name (which can be "AEPTCD", "MHPTCD" or "CEPTCD"). The location of the dataset is then obtained from the "def:leaf" element.

In the next part, we iterate over each record in the dataset, and get the value of the --PTCD variable (line 50):

and then check whether the value captured is one in the list of allowed PT codes (line 53). If it is not, an error message is generated (lines 53-55).
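In Python, the core of this check could be sketched like this (illustrative only - the record structure and the message format are assumptions, not the original XQuery implementation):

```python
# For each record, compare the --PTCD value against the set of valid
# MedDRA PT codes and emit an XML error string for each violation.

def check_ptcd(records, ptcodes, varname="AEPTCD"):
    """records: list of dicts (one per dataset record);
    ptcodes: set of valid MedDRA PT codes.
    Returns a list of XML error strings."""
    errors = []
    for seq, rec in enumerate(records, start=1):
        value = rec.get(varname)
        if value is not None and value not in ptcodes:
            errors.append(
                '<error rule="SD2007" record="%d">'
                'value %s for %s is not a valid MedDRA PT code'
                '</error>' % (seq, value, varname))
    return errors
```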

That's it! Once you know how it works, it is so easy: it took me less than 15 minutes to develop each of these 20 rules.

I talked about these XQuery rules implementations with an FDA representative at the European CDISC Interchange in Vienna. When he saw the demo, his face became slightly pale, and he asked me: "Do you know what we paid these guys to implement our rules in software, and you tell me your implementation comes for free and is fully transparent?".

Beyond being free, open and fully transparent (as if that were not sufficient), these rules have the advantage that they are completely independent of the software doing the validation: anyone can now write their own software without needing to code the rules themselves. You could even create a server that validates your complete repository of submissions overnight. As the messages come as XML, you can easily reuse them in any application you want (try this with Excel!).

In the next section, I would like to explain how extremely easy it is to write software for executing the validations. The "Smart Dataset-XML Viewer" allows you to do these validations (but you can choose not to do any validation at all, or only for some rules), so I just took a few code snippets to explain this. We use the well-known open-source Saxon library for XML parsing and validation, developed by the XML guru Michael Kay, which is available both for Java and for C# (.NET). If you would like to see the complete implementation of our code, just go to the SourceForge site, where you can download the complete source code or just browse through it. The most interesting class is "XQueryValidation" in the package "edu.fhjoanneum.ehealth.smartdatasetxmlviewer.xqueryvalidation".
Here is a snippet:

First of all, the file location of the define.xml is transformed into a "URI". A new StringBuffer is prepared to keep all the messages. In the following lines, the Saxon XQuery engine is initialized

and the base location and file name of the define.xml file are passed to the XQuery engine (remark that the define.xml can also be located in a native XML database, with one collection per submission, something that the FDA and PMDA could also easily do). This "passing" is done in the lines with "exp.bindObject" (in the center of the snippet).
In case MedDRA is involved in the rule execution, the same is done in the last part of the snippet (whether a rule requires MedDRA is indicated by the "requiresMedDRA" attribute in the XML file containing all the rules):

The rule is then executed, and the error messages (as XML strings) captured in the earlier defined StringBuffer:

So, the contents of the "messageBuffer" StringBuffer is essentially an XML document that can be parsed, or just written to file, or transformed into a table, or stored in a native XML database, or ...
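Picking information out of such a message document then takes only a few lines in any XML-aware environment - a sketch with Python's ElementTree (the message schema shown is invented for the illustration, not the viewer's actual format):

```python
import xml.etree.ElementTree as ET

# A hypothetical validation result as it might sit in the messageBuffer.
messages = ('<messages>'
            '<error rule="SD2007" record="2">invalid PT code</error>'
            '</messages>')

# Parse the messages and collect the violated rule identifiers.
root = ET.fromstring(messages)
violated_rules = [e.get("rule") for e in root.findall("error")]
```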

In order to accept the passing of parameters from the program to the XQuery rule, we only need to change the hardcoded file paths and locations to "external" ones, i.e. stating that some program will be responsible for passing the information. In the XQuery itself, this is done by:

As one sees, lines 17-19 have been commented out, and lines 14-16 declare that the values for the location of the define.xml file and of the directory with the MedDRA files will come from a calling program.

In the "Smart Dataset-XML Viewer", the user can himself decide where the MedDRA files are located (so it is not necessary to copy files to the directory of the application), using the button "MedDRA Files Directory":

A directory chooser then shows up, allowing the user to set where the MedDRA files need to be read from. This can also be a network drive, as is pretty usual in companies.

If you are interested in implementing these MedDRA validation rules, just download them from our website, or use the RESTful web service to get the latest update.

Again, the "Smart Dataset-XML Viewer" is completely "open source". Please feel free to use the source code, to extend it, to use parts of it in your own applications, to redistribute it with your own applications, etc. Of course, we highly welcome it when you donate back the source code of extensions you wrote, so that we can further develop this software.

Friday, May 20, 2016

Electronic Health Record Data in SDTM

The newest publication of the FDA about "Use of Electronic Health Record Data in Clinical Investigations" triggered me to pick up the topic of EHRs in SDTM again. The FDA publication describes the framework in which the use of EHRs in clinical research is allowed and encouraged. Although it does not contain really new information, it should take away the fears of sponsors and investigators about using EHR data in FDA-regulated clinical trials.

One of the things that will surely happen in the future is that an FDA reviewer wants to see the EHR data point that was used as the source of a data point in the SDTM submission. The reviewer will then ask the sponsor, who will then ask the site...: another delay in bringing this innovative new drug or therapy to the market. In the meantime, patients will die ...

So can't we have the EHR datapoint in the SDTM itself?

Of course! It is even very easy, but only if the FDA would finally decide to get rid of SAS-XPT, this ancient binary format with all its stupid limitations.

Already some years ago, the CDISC XML Technologies Team developed the Dataset-XML standard, as a simple replacement for SAS-XPT. The FDA did a pilot, but since then nothing has happened - "business as usual" seems to have returned.
Dataset-XML was developed to allow the FDA a smooth transition from XPT to XML. It doesn't change anything about SDTM; it just changes the way the data is transported from A to B. However, Dataset-XML has the potential to do things better, as it isn't bound to the two-dimensional table approach of XPT (which in turn forces SDTM into two-dimensional tables).

So, let's try to do better!

Suppose that I do have a VS dataset with a systolic blood pressure for subject "CDISC01.100008" and the data point was retrieved from the EHR of the patient. Forget about adding the EHR data point in the SDTM using ancient SAS-XPT! We need Dataset-XML.

This is how the SDTM records look:

Now, the EHR is based on the new HL7-FHIR standard, and the record is very similar to the example Observation published on the FHIR website. How do we get this data point into our SDTM?

Dataset-XML, as it is based on CDISC ODM, is extensible. This means that XML data from other sources can be embedded as long as the namespace of the embedded XML is different from the ODM namespace. As FHIR has an XML implementation, the FHIR data point can easily be embedded into the Dataset-XML SDTM record.

In the following example (which you can download from here), I decided to attach the FHIR-EHR data point to the SDTM record as a whole, and not to VSORRES (for which one could also make a case), as I think that the data point belongs to the record, and not to the "original result" - we will discuss this further on.

The SDTM record then becomes:

Remark that the "Observation" element "lives" in the HL7 FHIR namespace "http://hl7.org/fhir".
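For illustration, such an embedded record could look roughly like this (an abridged sketch with invented OIDs and values; a real FHIR Observation carries more elements):

```xml
<ItemGroupData ItemGroupOID="IG.VS">
  <ItemData ItemOID="IT.VS.USUBJID" Value="CDISC01.100008"/>
  <ItemData ItemOID="IT.VS.VSTESTCD" Value="SYSBP"/>
  <ItemData ItemOID="IT.VS.VSORRES" Value="122"/>
  <!-- the embedded EHR data point, living in the FHIR namespace -->
  <Observation xmlns="http://hl7.org/fhir">
    <code>
      <coding>
        <system value="http://loinc.org"/>
        <code value="8459-0"/>
        <display value="Systolic blood pressure--sitting"/>
      </coding>
    </code>
    <valueQuantity>
      <value value="122"/>
      <unit value="mm[Hg]"/>
    </valueQuantity>
  </Observation>
</ItemGroupData>
```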

continued by:

Important here is that LOINC coding is used for an exact description of the test (systolic, sitting - LOINC code 8459-0), and that SNOMED-CT is used for coding the body part. This is important - the SDTM and CT teams are still refusing to allow the LOINC code to be used as the unique identifier for the test in VS and LB. Instead, they reinvented the wheel and developed their own list of codes, leading to ambiguities. LOINC coding is mandated in most national EHR systems, including under the US "Meaningful Use" program. The same applies to the use of UCUM units.

Now, if you inspect the record carefully, you will notice that a good amount of the information is present twice. The only information that is NOT in the EHR data point is STUDYID, USUBJID (although...), DOMAIN, VISITNUM, VISITDY (planned study day) and VSDY (actual study day). STUDYID is an artefact of SAS-XPT, as ODM/Dataset-XML would allow grouping all records per subject (using the ODM "SubjectData/@SubjectKey"). DOMAIN is also an artefact, as within the dataset, DOMAIN must always be "VS" and is given by the define.xml anyway with a reference to the correct file. VSDY is derived and can easily be calculated "on the fly" by the FDA tools. Even VSSEQ is artificial and could easily be replaced by a worldwide unique identifier (making it worldwide referenceable, as in ... FHIR). VISIT (name) is also derived in the case of a planned visit and can be looked up in TV (trial visits).

So, if we allow Dataset-XML to become multi-dimensional (grouping data by subject), the only SDTM variables that explicitly need to be present are VISITNUM and VISITDY. So essentially, our SDTM record could be reduced to:


Remark the annotations I made, showing the mapping to the SDTM variables.

If the reviewer still likes to see the record in the classic two-dimensional table way, that's a piece of cake: an extremely simple transformation (e.g. using XSLT) does the job.
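As a sketch of how simple that flattening is (using Python's ElementTree here instead of XSLT, with invented OIDs):

```python
import xml.etree.ElementTree as ET

# One Dataset-XML record, reduced to a flat variable -> value mapping,
# i.e. one row of the classic two-dimensional table.
record = ('<ItemGroupData>'
          '<ItemData ItemOID="IT.VS.VSTESTCD" Value="SYSBP"/>'
          '<ItemData ItemOID="IT.VS.VSORRES" Value="122"/>'
          '</ItemGroupData>')

root = ET.fromstring(record)
# Take the last part of each OID as the column name.
row = {item.get("ItemOID").split(".")[-1]: item.get("Value")
       for item in root.findall("ItemData")}
```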

Now, reviewers always complain about file sizes (however, reviewers should be forbidden to use "files"), and will surely do so when they see how much "size" the FHIR information takes. But who says that the FHIR information must be in the same file? Can't it just be referenced, or better, can't we state where the information can be found using a secured RESTful web service?
This is done all the time in FHIR! So we could further reduce our SDTM record to:

Remark that the "http://..." is not simply an HTTP address: just using it in a browser will not allow you to obtain the subject's data point. The RESTful web service in our case will require authentication, usually using the OAuth2 authentication mechanism.

Comments are very welcome - as ever ...

Tuesday, May 3, 2016

Ask SHARE - SDTM validation rules in XQuery

This weekend, after returning from the European CDISC Interchange (where I gave a presentation titled "Less is more - A Visionary View of the Future of CDISC Standards"), I continued my work on the implementation of the SDTM validation rules in the open and vendor-neutral XQuery language (also see earlier postings here and here).
This time, I worked on a rule that is not so easy. It is the PMDA rule CT2001: "Variable must be populated with terms from its CDISC controlled terminology codelist. New terms cannot be added into non-extensible codelists".
This looks like an easy one at first sight, but it isn't. How does a machine-executable rule know whether a codelist is extensible or not? Looking into an Excel worksheet is not the best way (also because Excel is not a vendor-neutral standard) and is cumbersome to program (if possible at all). So we do need something better.

So we developed the human-readable, machine-executable rule (it can be found here) using the following algorithm:

  • the define.xml is taken and iteration is performed over each dataset (ItemGroupDef)
  • within the dataset definition, an iteration is performed over all the defined variables (ItemRef/ItemDef) and it is checked whether a codelist is attached to the variable
  • if a codelist is attached, the NCI code of the codelist is taken. A web service request "Get whether a CDISC-CT Codelist is extensible or not" is triggered, which returns whether the codelist is extensible or not. Only the codelists that are not extensible are retained. This leads to a list of non-extensible codelists for each dataset
  • the next step could be that each of the (non-extensible) codelists is inspected for whether it has "EnumeratedItem" or "CodeListItem" elements that carry the flag 'def:ExtendedValue="Yes"'. This is however not "bullet proof", as sponsors may have added terms and forgotten to add the flag. 
  • another option would have been to use the web service "Get whether a coded value is a valid value for the given codelist" to query whether each of the values in the codelist is really a value of that codelist as published by CDISC. This relies on the values in the dataset itself for the given variable all being present in the codelist (enforced by another rule). The XQuery implementation can be found here.
  • we chose, however, to inspect each value in the dataset of the given variable that has a non-extensible codelist for whether it is an allowed value for that codelist, by using the web service "Get whether a coded value is a valid value for the given codelist". If the answer from the web service is "false", an error is returned in XML format (allowing reuse in other applications). The XQuery implementation can be found here.
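The algorithm can be sketched in a few lines (a Python illustration; the extensibility lookup is passed in as a function, standing in for the real web-service call, and all names are invented):

```python
# Step 1: keep only the codelists that the web service reports as
# non-extensible. Step 2: flag every dataset value that is not an
# allowed value of its (non-extensible) codelist.

def nonextensible_codelists(nci_codes, is_extensible):
    """is_extensible: callable NCI code -> bool (e.g. a web-service call).
    Returns the codes of the non-extensible codelists only."""
    return [c for c in nci_codes if not is_extensible(c)]

def check_values(values, allowed_values, codelist_code):
    """Return an XML error string for each value not in the codelist."""
    return ['<error codelist="%s">invalid value: %s</error>'
            % (codelist_code, v)
            for v in values if v not in allowed_values]
```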
You can inspect each of the XQuery implementations using NotePad or NotePad++, or a modern XML editor like Altova XMLSpy, oXygen XML or EditiX.

A partial view is below:

What does this have to do with SHARE?


All of the above-mentioned RESTful web services (available from our website) are based on SHARE content. The latter has been implemented as a relational database (it could however also have been a native XML database), and a pretty large number of RESTful web services has been built around it.

In future, the SHARE API will deliver such web services directly from the SHARE repository. So our own web services are only a "test bed" for finding out what is possible with SHARE.

So in future, forget about "black box" validation tools - simply "ask SHARE"!

Monday, March 28, 2016

FDA SDTM validation rules, XQuery and the "Smart Dataset-XML Viewer"

During the last days, I could again make considerable progress in writing FDA and CDISC SDTM and ADaM validation rules in the vendor-neutral XQuery language (a W3C standard).

With this project, we aim to:
  • come to a truly vendor-neutral set of validation rules that is both human-readable and machine-executable (no black-box implementations anymore)
  • have rules that are easily readable by persons in the CDISC community, and that can be commented on
  • develop rules that do not lead to false positives
  • come to a reference implementation of the validation rules, meaning that, after acceptance by CDISC, other implementations (e.g. from commercial vendors) always need to come to the same result for the same test case
  • make these rules available through CDISC SHARE for applications and humans, by using RESTful web services and the SHARE API
I was now also able to implement these rules in the "Smart Dataset-XML Viewer":

The set of rules itself is provided as an XML file, for which we already have a RESTful web service for rapid updates. This means that if someone finds a bug or an issue with a rule implementation, it can be updated within hours, and the software can automatically retrieve the corrected implementation of the rule (no more waiting for the next software release or bug fix).

In the "Smart Dataset-XML Viewer", the validation is optional, and when the user clicks the button "Validation Rules Selections", all the available rules are listed, and can be selected/deselected, meaning that the user (and not the software) decides for which rules the submission data sets are validated:

Some of these rules use web services themselves, for example to detect whether an SDTM variable is "required", "expected" or "permissible", something that cannot be obtained from the define.xml.
A great advantage is that any rule violations are immediately visible in the viewer itself, i.e. the user does not need to retrieve the information from an Excel file anymore and then look up the record manually in the data set.

At the same time, all violations are gathered into an XML structure, which can easily be (re)used in other applications (we do not consider Excel as a suitable information exchange format between software applications).

And even better, all this is real "open source" without any license or redistribution limitations, so that people can integrate the "Smart Dataset-XML Viewer", including its XQuery validation, into any other application, even commercial ones.

I am currently continuing work on this implementation, and on the validation rules in XQuery. I did most of the FDA SDTM rules (well, at least those that are not wrong, incomprehensible, or an expectation rather than a rule).
I also did about 40% of the ADaM 1.3 validation checks, and will start on the CDISC SDTM conformance rules as soon as they are officially published by CDISC.
I can however use help with the ADaM validation rules, as I lack some suitable real-life test files. So if you do ADaM validation in your company and have some basic XQuery knowledge (or are willing to acquire it), please let me know, so that we can make rapid progress on this.
Another nice thing about having the rules in XQuery is that companies can easily start developing their own sets of validation rules in this vendor-neutral language, be it for SDTM, SEND or ADaM, and just add them to a specific directory of the "Smart Dataset-XML Viewer", after which they immediately become available to the viewer.

I hope to make a first release on SourceForge (application + source code) in the next few weeks, so stay tuned!

Thursday, February 25, 2016

Phil's webinar on LOINC and CDISC

Today, I attended an excellent "CDISC members only" webinar given by Phil Pochon (Covance) on LOINC and its use in CDISC.
Phil is an early contributor to CDISC (considerably longer than I am), and very well known for the development of the CDISC Lab Standard and his contributions to SDTM.

So Phil is one of the people I highly respect.

Phil explained the concepts of LOINC very well and especially the differences with the CDISC controlled terminology for lab tests and results.

He also answered the questions that were posed extremely well, giving his opinion about how LOINC should be used in combination with CDISC standards (there isn't a CDISC policy on this yet).

In this blog, I want to extend on some of the questions that were posed and on which I have a different opinion (Phil knows that).

There were several questions about how to map information from local labs (such as local lab test codes) to LOINC. Phil gave some suggestions about looking at the units used, the method provided (if any), and so on.
My opinion about this is: "don't". If the lab cannot provide you the LOINC code with the test result, don't try to derive it. Even if "LBLOINC" were to become "expected" in SDTM in the future, I would suggest not trying to derive it (SDTM is about captured data, not about derived data). The reason is that such a derivation may lead to disaster, also because the reviewer at the FDA cannot see whether the given LOINC code comes from the instrument that did the measurement, or was "guessed" by the person who created the SDTM files. This is a serious data quality issue.

There was a short discussion about whether labs should provide the LOINC code together with each test result for each measurement. My opinion about that is that if your lab cannot provide LOINC codes, you should not work with that lab anymore. Also, sponsors and CROs should have a statement in their data transfer agreements with labs that the latter should not only deliver "a" LOINC code with each result, but should deliver "the correct" LOINC code with each result.

Phil also answered a question about having LOINC codes specified for lab tests in the protocol. He stated that this is a long-term goal. I would however state that sponsors should start doing this now. Even if not all labs can always provide what (LOINC code) is described in the protocol, giving the (expected/preferred/suggested) LOINC code in the protocol would immediately increase data quality, as the bandwidth of what is finally delivered would surely become smaller. For example, for "glucose in urine", there is a multitude of lab tests, ranging from quantitative to ordinal to qualitative, each with a different LOINC code. It is impossible to bring all these results to a single "standardized value" (as is required by SDTM). Providing the (expected/preferred/suggested) LOINC code in the protocol and passing this information to the labs would at least reduce the breadth of different tests that were actually done, making the mapping to SDTM considerably easier, and at the same time improving data quality considerably.

An interesting question was whether LBLOINC applies to LBORRES (original result) or to LBSTRESC/LBSTRESN (standardized result). I checked the SDTM, and it is not specified there. It still states "Dictionary-derived LOINC code for LBTEST", which is a disastrous definition, as LBLOINC should be taken from the data transfer itself, and not be derived (probably leading to disaster).
If I understood it well, Phil suggested applying it to the "standardized" result. For example, if I obtain the glucose value in "moles/L" (e.g. LOINC code 22705-8) but standardize on "mg/dL", this would mean that I need to (automatically?) transform this LOINC code into the alternative one in "mg/dL", which is (I think) 5792-7.
In my opinion, one should not do this, but use the LOINC code that was delivered by the lab, so on the original result. Why? Let me take an example: suppose one of the labs has delivered the value as "ordinal", so using values like +1, +2, etc. (LOINC code 25428-4). How can I ever standardize these to a concentration? If I can, I guess this is highly arbitrary, and thus leads to another decrease in data quality. So I would propose that LBLOINC is always given as the value that is provided by the lab (so for the original result), and that this is clearly explained in the SDTM-IG.

Another interesting discussion was on the use of the UCUM unit notation. According to Phil, most of the lab units published by the CDISC team in the past are identical with, or very similar to, the UCUM notation. My experience is different. What was very interesting is that Phil told us that when they (the CDISC-CT team) receive a "new term request" for a lab unit, they first look into UCUM to see whether there is one there, and if so, take that one. I am very glad about that!
But Phil also told us that they get many requests for new unit terms that do not fit into the UCUM system (like arbitrary units for some special tests), so that they then develop their own.
Personally, I think they shouldn't. If a unit term is so special that it does not fit into UCUM, and also cannot be handled by so-called "UCUM annotations" (CDISC should standardize on the annotations, not on the units), then I wonder whether the requested term is good and generic enough to be standardized on at all. After all, the [UNIT] codelist is extensible.

My personal opinion is still that CDISC should stop the development of lab test (code) terminology and steadily move to LOINC completely. For example, it is my opinion that it should now already be allowed to simply put the LOINC code in LBTESTCD (maybe with an "L" in front, as test codes are still not allowed to start with a number, a rule stemming from the punchcard era...).

Saturday, November 7, 2015

Submission standards and monopolies

This morning, and the last week in general, were frustrating in one respect. I had to retrieve some personal information from the Austrian tax portal "Finanzonline". This came about as I received a letter from the authorities saying that I hadn't reacted to a message sent to me on that portal (I didn't even know that there was such a message, as there never was any e-mail notification). Then I had to reset my password for some reason, so I tried, and expected to receive an e-mail allowing me to do the final step. Instead, I got the message that another paper letter would be sent to me. I wasn't at home when the postman wanted to deliver it, so I had to pick it up at the post office: the postman was not allowed to put it in my post box (as it was "official"). To my surprise, I did not get one new password - I got two.
It then took me another 15 minutes to get the portal entry working, among other things because it wanted me to change these two passwords and required me to enter additional information, i.e. my social security number, which is already known to them (beyond the username and the two passwords). Finally I got into the portal and read the message, saying I needed to provide a few documents. I looked for a PDF upload button. There wasn't one. I looked for an e-mail address (including in the letters I got) where I could send the PDFs to. There wasn't one either. The only thing I found was a phone number: if I had questions, I would need to use that (phone is much more secure than e-mail, isn't it? Just ask Mrs Merkel, Germany's chancellor). Essentially, I experienced the portal as one of the top 3 most user-unfriendly systems I have ever used (the two others being the German tax portal, which is even worse, and the SAP portal at our university).

So it came as a surprise when a colleague told me that this tax portal is a prize winning portal! I googled somewhat and found the following announcement on the portal's website itself:

Even if your German is good (mine is), can you read it? I can't - they did not even get the encoding of the German characters right ... If you got such a product description on Amazon, would you still buy from them? Looks like a prize winner in the land of the half-blind...

Now, what does this have to do with SDTM submission standards?

Just like the tax authorities here, the regulatory authorities for new medications and therapies have a monopoly. They do not need to compete like the Chicago Cubs in baseball (see Wayne's blog). So there is no real drive for IT innovation, except sometimes from the general public via the parliament (see e.g. the article in Medpage Today).

In my not-so-short life, I have always been one of the first to try out new IT systems aimed at making people's lives easier. I started using the internet at a time when you still needed a phone modem, and transferring a 10 MB dataset took a whole day. I started using internet banking at a time when less than 1% of the customers of my bank did. So I have gained quite some experience. My experience with IT systems of organizations that have a monopoly is the following:
  • They haven't got a working portal for their "customers" (example: "Deutsche Rentenversicherung")
  • If they have, it is often extremely user-unfriendly (example: the Austrian tax portal)
  • Within the organization, they use outdated technology
  • They generally distrust e-mail 
  • No publicly described web services are available
  • They have no idea what XML and JSON are, or have prejudices against them
Recognize some things in the scope of SDTM and ADaM submissions?

Suppose these types of organizations each had a large set of web services for which the API is publicly available. Then vendors could compete on tools for interacting with the systems of these organizations. For example, for the Austrian tax authorities, Austrian citizens could then choose between different portals for communicating with the authorities, the portals being owned by different companies or organizations, competing against each other in user-friendliness.

But this doesn't solve the problem of the drive to modernize...

Any exceptions on these? Just let us know...

Saturday, August 8, 2015

SDTM/SEND/ADaM Validation - how it should be done

This is my personal opinion.
It is not the opinion of CDISC, nor the opinion of any of the CDISC development teams I belong or do not belong to (although I guess that some team members will probably strongly agree with it), nor the opinion of OpenCDISC (who I guess will strongly disagree).

How should validation of SDTM, SEND or ADaM datasets be done?
We have the CDISC standards specifications, the implementation guides (further abbreviated as "IG"), and we have the define.xml. Furthermore, we currently have two formats for the datasets themselves: SAS Transport 5 and Dataset-XML. Only the former is currently (unfortunately) accepted by the FDA.

First let me state the most important thing of everything: Define.xml is leading.

This means that what is in the define.xml file that is part of the submission is "the truth". This file contains all the metadata of the submission, whether it be an SDTM, SEND or ADaM submission.
So all submitted datasets should be validated against the metadata in the define.xml.
Now you will say "Wait a minute Jozef, being valid against the metadata in the define.xml doesn't mean that the datasets are in accordance with what is written in the corresponding IG".

That is correct! So the first step should always be to validate the contents of the define.xml against the contents of the IG. For example, if controlled terminology (CT) is expected for a specific variable, this should be declared in the define.xml using a CodeList, and the contents of that CodeList should match the controlled terminology published by CDISC: every entry in the define.xml codelist should appear in the published controlled terminology, unless the latter has been defined as "extensible". This does not mean that the CodeList should have exactly the same contents as the CDISC-CT list, as usually only a fraction of the coded values is used in the actual submission. For example, the CDISC-CT for lab test codes contains several hundred terms, but the define.xml should only list those that have actually been used in the study.
Also the maximal lengths, labels and datatypes from the define.xml should be checked against the IG: for example, a maximum of 8 characters for test codes and 40 for test names. If your define.xml states that the maximal length for LBTESTCD is 7, that is completely OK; if it states that the maximal length of LBTESTCD is 9, that would be a violation.
Unfortunately, the latest IGs still publish datatypes as "character" and "numeric", whereas define.xml uses much better and granular datatypes. Fortunately, the define.xml specification itself provides a mapping (see section 4.2.1 "data type considerations" of the define.xml 2.0 specification). Also other things like the order in which the variables appear can be checked at this point.

Once all the information in the define.xml has been validated against the IG and everything is OK, we can proceed with the second step: validation of the datasets against the define.xml. The largest part of this validation consists of checking whether each variable value is of the correct datatype (as defined in the define.xml), is one of the values in the codelist provided by the define.xml (when applicable), and whether its length does not exceed the one defined in the define.xml. At this point, also the label for each variable (when using SAS Transport 5) can be checked against the one provided in the define.xml, which again must match the one from the IG (at least when it is not a sponsor-defined domain).
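As an illustration, the core of such a step-2 check is small. Here is a minimal Python sketch (not any official tool); the "meta" dict simply stands in for metadata that would be parsed from a real define.xml:

```python
# Hypothetical sketch of step 2: validating dataset values against
# define.xml metadata (datatype, maximum length, codelist).
# The "meta" dict stands in for metadata parsed from a real define.xml.

def validate_value(value, meta):
    """Return a list of human-readable findings for one variable value."""
    findings = []
    # datatype check (only a few define.xml datatypes sketched here)
    if meta["datatype"] == "integer":
        try:
            int(value)
        except ValueError:
            findings.append("value '%s' is not an integer" % value)
    elif meta["datatype"] == "float":
        try:
            float(value)
        except ValueError:
            findings.append("value '%s' is not a float" % value)
    # maximum length check
    if len(value) > meta["length"]:
        findings.append("value '%s' exceeds maximum length %d"
                        % (value, meta["length"]))
    # codelist check (when a codelist is attached in the define.xml)
    codelist = meta.get("codelist")
    if codelist is not None and value not in codelist:
        findings.append("value '%s' not in codelist" % value)
    return findings

# example metadata for LBTESTCD: text, max length 8, a small CT subset
lbtestcd_meta = {"datatype": "text", "length": 8,
                 "codelist": {"ALB", "ALP", "ALT", "AST", "BILI"}}

print(validate_value("ALB", lbtestcd_meta))        # no findings
print(validate_value("GLUCOSE99", lbtestcd_meta))  # too long, not in codelist
```

A real implementation would of course read the datatype, length and codelist from the ItemDef and CodeList elements of the define.xml itself.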

Now there are some "rules" that cannot be described in the define.xml. In most cases these are cross-domain or cross-dataset rules, although there are also some "internal-domain" rules, such as uniqueness of USUBJID in DM, and uniqueness of the combination USUBJID-xxSEQ in each domain. But even then, some of the rules can be checked using define.xml: for example, there is the rule that ARMCD (or ACTARMCD) must be one from the TA (trial arms) domain/dataset. In the define.xml, a codelist will be associated with the variable ARMCD for TA. So one should test whether the value of ARMCD in the TA dataset is one from the associated codelist in the define.xml (note that e.g. "SCRNFAIL" should also appear in the define.xml codelist when there was at least one screen failure). When then in another domain/dataset the "planned" or "actual" ARMCD/ACTARMCD is present, and its value also corresponds to one of the entries in the codelist of the define.xml, the value is valid, even without needing to do any direct cross-dataset validation.

Once validation of the datasets against the define.xml is done (step 2), one can proceed with step 3: validation against any remaining rules that cannot be described in the define.xml. Mostly these will be cross-domain rules, for example: "EX record is present, when subject is not assigned to an arm: Subjects that have withdrawn from a trial before assignment to an Arm (ARMCD='NOTASSGN') should not have any Exposure records" (FDA rule FDAC49). I took one of the FDA rules on purpose here, as it is a clear rule; the SDTM-IG does not really provide such explicit clear rules, so validation software implementors usually base their rules on their own interpretation of the IG, which is very dangerous. It is very important that the rules are clear and not open to different interpretations. Both the FDA and a CDISC team have published, or will be publishing, sets of rules (remark that even the "FDA SDTM rules" are not always clear, and some are even completely wrong in some cases). Ideally, such rules are both human-readable and machine-interpretable and machine-executable. Unfortunately, the published FDA rules for SDTM are only "text", leaving them open again to different implementations. We have however already done some work on an XQuery implementation of the FDA rules, and one of our students recently implemented the CDISC rules for the SDTM DM domain in XQuery. The plan is to extend this work, further develop and publish the XQueries in cooperation with CDISC and the FDA, and make these rules available over RESTful web services (so that they can be retrieved by computers without human intervention).
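To show how little ambiguity such a rule leaves when expressed in code, here is a hedged Python sketch of FDAC49 (our actual implementations are in XQuery; the dict-based records below just stand in for data read from the datasets):

```python
# Minimal sketch of FDA rule FDAC49: subjects with ARMCD='NOTASSGN' in DM
# should not have any records in EX. DM and EX are reduced to lists of
# dicts here; a real implementation would read them from the datasets.

def check_fdac49(dm_records, ex_records):
    """Return the USUBJIDs violating the rule."""
    not_assigned = {r["USUBJID"] for r in dm_records
                    if r["ARMCD"] == "NOTASSGN"}
    exposed = {r["USUBJID"] for r in ex_records}
    return sorted(not_assigned & exposed)

dm = [{"USUBJID": "S001", "ARMCD": "A"},
      {"USUBJID": "S002", "ARMCD": "NOTASSGN"},
      {"USUBJID": "S003", "ARMCD": "NOTASSGN"}]
ex = [{"USUBJID": "S001", "EXTRT": "DRUG X"},
      {"USUBJID": "S002", "EXTRT": "DRUG X"}]

print(check_fdac49(dm, ex))  # ['S002']: not assigned to an arm, yet exposed
```

A machine-executable formulation like this is exactly what removes the room for diverging vendor interpretations.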

The figure below depicts (of course simplified) the process again:

The figure speaks of SDTM, but the same is equally applicable for SEND and ADaM.

Short summary:
  • step 1: for SDTM/SEND/ADaM submissions, validate the contents of the define.xml against the IG
  • step 2: validate the datasets against the define.xml
  • step 3: for all other "rules" (mostly cross-dataset/domain validation), validate the datasets against rules published by CDISC and FDA, if possible using human-readable, machine-executable rules.
Comments are of course very welcome!

Saturday, June 6, 2015

FDA starts embracing LOINC

Two friendly colleagues from pharma companies pointed me to very recent FDA publications about LOINC, the worldwide coding system for laboratory test codes in healthcare.
The first is the "Study Standards Resources" page mentioning LOINC as a standard used by the FDA (better "was", as the mention was removed a few days later). The second, a few days later, is however much more important. It is an article in the Federal Register titled "Electronic Study Data Submissions; Data Standards; Support for the Logical Observation Identifiers Names and Codes".

You might want to read it before continuing with this blog entry.

I have been "fighting" for several years now to have LOINC recognized as the identifying coding system for lab tests in SDTM submissions. CDISC however developed its own coding system for lab tests, which is inconsistent and does not allow tests to be uniquely identified. My formal request to put the further development of CDISC-CT lab codes (LBTESTCD/LBTEST) on hold, and gradually move to LOINC, did however not make me many friends in the CDISC Controlled Terminology Team, on the contrary.

In their newest Federal Register publication, the FDA requests comments on their initiative, especially about "the Agency recognizes that the high level of granularity inherent in LOINC has presented coding challenges and that these challenges have led to the creation of subsets of LOINC to help facilitate coding" with the specific question:

"Should FDA identify a LOINC subset for its use case?"

I think this is a good idea. LOINC has over 72,000 test codes. LOINC itself already published a "top 2000+ LOINC codes" list, i.e. the 2,000 most used codes. Also in Austria, where the use of LOINC coding is mandatory in the national electronic health record system (ELGA), a subset has been published which should preferentially be used. And then we have the very old "LOINC codes for common CDISC tests", which has unfortunately not been maintained in recent years.

It is important that such a subset be a "recommendation"; otherwise people will start "pushing" tests whose codes are not in the list into one of the test codes that is in the list, thus essentially falsifying the information. If the FDA would recommend the 2000+ list, or would pick up the old CDISC list again (and modernize it), this would be a very wise step, as e.g. the 2000+ list covers 98% of all the tests done in hospitals (and probably also in research).

There is however no technical limitation to allowing the full LOINC list of test codes, as there are now RESTful web services available for looking up test codes and their meaning. You can find the complete list here. The National Library of Medicine even has a LOINC web service with very detailed information about many of the tests, which is queried using the LOINC code. This web service, and other CDISC-CT web services, have already been implemented in the "Smart Dataset-XML Viewer", a software tool for inspecting SDTM submissions in Dataset-XML format. They can also be implemented easily in any modern software tool, including the ones used by the FDA.
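As a sketch of what consuming such a lookup service amounts to in code: the HTTP call itself is omitted here, and the JSON field names are an assumption for illustration only, not the real contract of the NLM service:

```python
import json

# Sketch of consuming a LOINC lookup web service. The service call itself
# is omitted; "response" mimics the general shape of a JSON answer from
# such a service (the field names are an assumption, not the real API).

response = json.loads("""
{
  "results": [
    {"loinc_num": "2345-7", "long_common_name":
     "Glucose [Mass/volume] in Serum or Plasma"}
  ]
}
""")

def names_for(resp):
    """Map each LOINC code in the response to its long common name."""
    return {r["loinc_num"]: r["long_common_name"] for r in resp["results"]}

print(names_for(response))
```

In a tool like the "Smart Dataset-XML Viewer", such a lookup replaces the need to ship redundant test-name columns in the datasets themselves.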

The FDA starting to embrace LOINC (will CDISC soon follow?) is a major step forward. I have been "fighting" for this for several years, so you can imagine that my weekend is now already perfect ...

Tuesday, April 21, 2015

The DataSet-XML FDA Pilot - report and first comments

The FDA very recently published a report on their pilot of the use of Dataset-XML as an alternative to SAS-XPT (SAS Transport 5) for SDTM, SEND and ADaM submissions. Here is a short summary of this report together with my comments.

The pilot was limited to SDTM datasets. The FDA departments that were involved were CDER and CBER. The major objectives of the pilot were:
  • ensuring that data integrity was maintained when going from SAS datasets (sas7bdat) to Dataset-XML and back.
  • ensuring that the Dataset-XML format supports longer variable names, labels and text fields than SAS-XPT does (which has limitations of 8 characters for names, 40 for labels and 200 for text fields).
Comment: Unfortunately, the following was not part of the testing in the pilot:
  • the capability of transporting non-ASCII-7 characters (another limitation of XPT)
Six sponsors were selected out of fourteen candidates. The selection criteria can be found in the report and will not be discussed here.

Two types of tests were performed:
a) test whether the Dataset-XML files can be transformed into sas7bdat and can be read by FDA data analysis software (e.g. JMP)
b) test whether data integrity is preserved when converting sas7bdat files to Dataset-XML files and then back to sas7bdat files

Comment: Although the report doesn't mention this at all, I heard that one of the submissions was generated using Java/Groovy. This also proves that correct Dataset-XML files can be generated by tools other than statistical software (which was pretty hard with XPT). I.m.o. this is a valuable result that should have been mentioned.

Further, sponsors were asked to submit Dataset-XML files that contain variable names longer than 8 characters, variable labels longer than 40 characters, and text content longer than 200 characters. The goal of this was to test whether Dataset-XML files can (sic) "facilitate a longer variable name (>8 characters), a longer label name (>40 characters) and longer text fields (>200 characters)".

Comment: well, that is something we already know for many many years ...

Issues found by the FDA


During the pilot, a number of issues were encountered, which could all be resolved.
  • Initially, testing was not successful due to a memory issue caused by the large dataset size. This issue was resolved after the SAS tool was updated to address the high memory consumption.
Comment: Well-designed modern software that parses XML should not use more memory than software parsing text files or SAS-XPT. See e.g. my comments about memory usage with technologies like VTD-XML in an earlier blog. It is a myth that processing large XML files consumes much memory: XML technologies like SAX are even known for using only small amounts of memory. The issue could however quickly be resolved by the SAS people that were cooperating in the pilot (more about that later).
  • Encoding problems in a define.xml file
Comment: this has nothing to do with Dataset-XML itself. What happened was that a define.xml used curly quotes ("MS Office quotes") as the delimiters in XML attributes. Probably the define.xml was created either by copy-paste from a Word document or generated from an Excel file. These "curly quotes" are non-standard and surely not supported by XML.
Generating define.xml from Excel files or Word documents is extremely bad practice. See my blog entry "Creating define.xml - best and worst practices". Ideally, define.xml files should be created even before the study starts, e.g. as a specification of what SDTM datasets are expected as a result of the study.

  • A problem with an invalid (?) variable label in the define.xml
Comment: The FDA found out that "Dataset-XML requires consistency between the data files and define.xml". Now, there is something strange about the statement about the variable label, as labels do not appear at all in Dataset-XML files. What I understood is that the define.xml file that came with the Dataset-XML files had one label that was not consistent with the label in the original XPT file. With Dataset-XML, define.xml becomes "leading", and that is exactly how it HAS to be. With XPT, there is a complete disconnect between the data files and the define.xml.
So yes, Dataset-XML requires that you put effort into providing a correct define.xml file (as it is leading), and that is a good thing.
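The curly-quote problem mentioned above is easy to reproduce: XML requires straight quotes around attribute values, so any conforming parser must reject a file with curly ones. A small Python demonstration:

```python
import xml.etree.ElementTree as ET

# Demonstration of the curly-quote problem: XML requires straight quotes
# (" or ') around attribute values; the "MS Office" curly quotes make the
# file not well-formed, so every XML parser must reject it.

good = '<ItemDef OID="IT.LBTESTCD" Name="LBTESTCD"/>'
bad = '<ItemDef OID=\u201cIT.LBTESTCD\u201d Name=\u201cLBTESTCD\u201d/>'

ET.fromstring(good)  # parses fine
try:
    ET.fromstring(bad)
    print("parsed")
except ET.ParseError as e:
    print("rejected:", e)
```

This is also why validating against the XML-Schema (or even just checking well-formedness) before submission catches such problems immediately.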

File sizes


A major concern of the FDA is and always has been the file sizes of Dataset-XML files. Yes, XML files usually are larger than the corresponding XPT files. However, this does not have to be the case.
The largest files in an SDTM submission usually are the SUPPQUAL files.
SUPPQUAL files can be present for several reasons:
  • text values longer than 200 characters. Starting from the 201st character, everything is "banned" to SUPPQUAL records. This is not at all necessary when using Dataset-XML.
  • non-standard variables (NSVs). According to the SDTM-IG, NSVs may not appear in the "parent" dataset, but must be provided in the very inefficient SUPPQUAL datasets. The latter can then also grow quickly in size. If they would be allowed to remain in the parent dataset (and marked as an NSV in the define.xml) we would mostly not need SUPPQUAL datasets at all, and so the largest files would disappear from our submission. Unfortunately the report does not give us any information about what the largest files were.
Let's take an example: I took a classic submission which has an LB.xpt file with laboratory data of 33MB and a SUPPLB.xpt file of 55MB. So the SUPPQUAL file SUPPLB.xpt is considerably larger, although it only contains data for 2 variables (the LB.xpt file has data for 23 variables). The corresponding Dataset-XML files have sizes of 66 and 40 MB, so they are somewhat larger than the XPT files. If one now brings the 2 NSVs back into the parent records, the Dataset-XML file is 80MB in size (and there is no SUPPLB.xml file), so smaller than the sum of the LB.xpt and SUPPLB.xpt files.
Of course, one could also move NSVs to the parent dataset when using XPT.

In the pilot, the FDA observed file size increases (relative to XPT) up to 264%, and considers this to be a problem. Why?
It cannot be memory consumption when loading in modern analysis tools. As I have shown before, modern XML technologies like SAX and VTD-XML are known for their low memory consumption.
Disk costs can also not be an issue. The largest submission was 17GB in size, which comes at a disk cost of 0.51 US$ (3 dollar cents per GB).
So what is the issue? Here is the citation from the report:
"Based on the file size observations, Dataset-XML produces much larger files than XPORT, which may impact the Electronic Submissions Gateway (ESG)".
OK, can't the ESG handle a 17GB submission? If not, let us zip the Dataset-XML files. Going back to my 80MB LB.xml file, when I do a simple zipping, it reduces to 2MB, so the file size is reduced by a factor of 40! If I would do the same with my 17GB submission, it would reduce to a mere 425MB (for the whole submission), something the ESG can surely handle. So, what's the problem?
Wait a minute Jozef! Didn't we tell you that the ESG does not accept .zip files?
A few thoughts:
  • would the ESG accept .zzz files? (a little trick we sometimes use to get zipped files through e-mail filters: just rename the .zip file into .zzz and it passes...)
  • would the ESG accept .rar files? RAR is another compression format, that is also very efficient.
  • why does the ESG not accept .zip files? We will ask the FDA. Is it fear of viruses? Also PDFs can contain viruses, and modern virus scanners can easily scan .zip files for viruses before unzipping. Or is it superstition? Or is it the misbelief that zipping can change the file contents?
  • modern software tools like the "Smart Dataset-XML Viewer" can parse zipped XML files without the need of unzipping the files first. Also SAS can read zipped files.
Compression of XML files is extremely efficient, so those claiming that large files sizes can lead to problems (I cannot see why) should surely use a compression method like ZIP or RAR.
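To see why compression is so effective here: Dataset-XML repeats the same element and attribute names on every record, which is exactly the kind of redundancy DEFLATE handles best. A small Python sketch with a synthetic file (the exact factor will differ for real submissions):

```python
import io
import zipfile

# Sketch showing why compression takes the sting out of XML file sizes:
# Dataset-XML is highly repetitive (the same element and attribute names
# on every record), which DEFLATE compresses extremely well.
# A small synthetic "LB" file stands in for a real 80 MB one.

record = ('<ItemGroupData ItemGroupOID="LB">'
          '<ItemData ItemOID="IT.LBTESTCD" Value="GLUC"/>'
          '</ItemGroupData>\n')
xml_data = ('<?xml version="1.0"?><ODM>' + record * 10000 + '</ODM>').encode()

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("lb.xml", xml_data)

ratio = len(xml_data) / len(buf.getvalue())
print("compression factor: %.0f" % ratio)
```

Fully repetitive data like this compresses by far more than the factor of 40 I observed on my real LB.xml file, where the varying values limit the gain.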

 A few things that were not considered in the pilot:


  • data quality
    The combination of Dataset-XML and define.xml allows performing better data quality checks than when using XPT. Tools can easily validate the contents of the Dataset-XML against the metadata in the define.xml. With XPT this is much harder, as it needs a lot of "hard coding". Although OpenCDISC supports Dataset-XML, it does not (yet) validate Dataset-XML against the information in the define.xml file (or only to a very limited extent).
  • the fact that define.xml is "leading" brings a lot of new opportunities. For example, the define.xml can be used to automatically generate a relational database (e.g. by transformation into SQL "CREATE TABLE" statements), and the database can then be automatically filled from the information in the Dataset-XML files (e.g. by transformation into SQL "INSERT" statements). This is also possible with XPT, but much much harder when not having SAS available.
  • this brings us to another advantage of Dataset-XML. As it is XML, it is really an "open" and "vendor neutral" format. So software vendors, but also the FDA itself, could generate software tools to do very smart things with the contents of the submission. Also this seems not to have been considered during the course of the pilot.
  • non-ASCII support. As already stated, XPT only supports ASCII-7. This means that e.g. special Spanish characters are not supported (there are 45 million Spanish-speaking people in the US). XML can (and does by default) use UTF-8 encoding (essentially this means "Unicode"), supporting a much larger character set. This is one of the main reasons why the Japanese PMDA is so interested in Dataset-XML: XML easily supports Japanese characters, whereas there is no support at all in ASCII-7.
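As a sketch of the define.xml-to-database idea mentioned above: the fragment below is a heavily stripped-down, namespace-free stand-in for a real define.xml, so this is an illustration of the principle rather than a working converter:

```python
import xml.etree.ElementTree as ET

# Sketch of turning define.xml metadata into a SQL "CREATE TABLE"
# statement. The snippet is a heavily stripped-down define.xml fragment
# (namespaces and most attributes omitted for brevity).

define_snippet = """
<Study>
  <ItemGroupDef OID="IG.LB" Name="LB">
    <ItemRef ItemOID="IT.USUBJID"/>
    <ItemRef ItemOID="IT.LBTESTCD"/>
  </ItemGroupDef>
  <ItemDef OID="IT.USUBJID" Name="USUBJID" DataType="text" Length="20"/>
  <ItemDef OID="IT.LBTESTCD" Name="LBTESTCD" DataType="text" Length="8"/>
</Study>
"""

SQL_TYPES = {"text": "VARCHAR", "integer": "INTEGER", "float": "FLOAT"}

def create_table(xml_text):
    """Generate one CREATE TABLE statement from the first ItemGroupDef."""
    root = ET.fromstring(xml_text)
    items = {i.get("OID"): i for i in root.iter("ItemDef")}
    group = root.find("ItemGroupDef")
    cols = []
    for ref in group.iter("ItemRef"):
        item = items[ref.get("ItemOID")]
        sql_type = SQL_TYPES[item.get("DataType")]
        if sql_type == "VARCHAR":
            sql_type += "(%s)" % item.get("Length")
        cols.append("%s %s" % (item.get("Name"), sql_type))
    return "CREATE TABLE %s (%s);" % (group.get("Name"), ", ".join(cols))

print(create_table(define_snippet))
# CREATE TABLE LB (USUBJID VARCHAR(20), LBTESTCD VARCHAR(8));
```

The corresponding "INSERT" statements can then be generated the same way from the ItemData elements in the Dataset-XML files.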


What's next?


 Very little information is provided in the report. It only states:
"FDA envisages conduction several pilots to evaluate new transport formats before a decision is made to support a new format". So it might be that the FDA conducts other pilots with other proposed formats such as semantic web technology, or even maybe CSV (comma separated values).
There are also some rumours about a Dataset-XML pilot for SEND files.



The pilot was successful as the major goals were reached: ensuring data integrity during transport, and ensuring that Dataset-XML supports longer variable names, labels and text values.
The FDA keeps repeating its concerns about file sizes, although these can easily be overcome by allowing NSVs to be kept in the parent dataset, and by allowing compression techniques, which are very efficient for XML files.

Some further personal remarks


I have always found it very strange that the FDA complains about file sizes. It has been the FDA who has been asking for new derived variables in SDTM datasets (like the --DY variables) and for duplicate information (e.g. test name --TEST, which has a 1:1 relationship with test code --TESTCD). Derivations like the --DY calculation can easily be done by the FDA tools themselves (it is also one of the features of the "Smart Dataset-XML Viewer"), and e.g. the test name can easily be retrieved using a web service (see here and here). Removing these unnecessary derived or redundant variables from the SDTM would reduce the file sizes by at least 30-40%.
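For illustration, the --DY derivation the FDA tools could do themselves is essentially a one-liner; this Python sketch follows the usual SDTM convention that there is no study day 0:

```python
from datetime import date

# Sketch of the SDTM study-day derivation: --DY is the difference between
# the date part of --DTC and the date part of RFSTDTC, plus 1 when on or
# after the reference start date (there is no study day 0).

def study_day(dtc, rfstdtc):
    delta = (date.fromisoformat(dtc) - date.fromisoformat(rfstdtc)).days
    return delta + 1 if delta >= 0 else delta

print(study_day("2015-04-21", "2015-04-21"))  # 1  (reference start day)
print(study_day("2015-04-23", "2015-04-21"))  # 3
print(study_day("2015-04-20", "2015-04-21"))  # -1 (day before reference)
```

If a viewer or review tool computes this on the fly, there is no need to transport the --DY columns at all.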



Special thanks are due to the SAS company and its employee Lex Jansen, who is a specialist in both SAS and XML (well, I would even state that he is a "guru"). Lex spent a lot of time working together with the FDA people and resolving the issues. Also special thanks are due to a number of FDA people, whom I cannot mention by name here, for their openness and the many good discussions with a number of the CDISC XML Technology team volunteers.

Friday, March 27, 2015

Creating define.xml - best and worst practices

Define.xml is a CDISC standard, based on XML, allowing sponsors to provide metadata for SDTM, ADaM and SEND electronic submissions to the FDA and the Japanese Ministry of Health.
But define.xml is more than that: it is also a very good way to exchange metadata for SDTM, ADaM and SEND datasets between partners in the clinical trial world.

As a CDISC define.xml instructor, I am often asked what the best practices are to generate define.xml files that not only conform to the standard, but also correctly and concisely describe the metadata for SDTM, ADaM or SEND datasets.

In this blog, I will discuss a few of what are, in my opinion, the best practices for generating define.xml. Although never asked for, I will also list what i.m.o. are the worst practices for creating define.xml.

In the following, I will use SDTM a lot as an example, but usually the same applies to ADaM and SEND.

Best practices

If you are a sponsor outsourcing the generation of SDTM datasets (e.g. to a CRO), the best practice is to generate a define.xml that can be used by the company that will need to provide the SDTM datasets, as a specification of what will have to be delivered. This means that it might be an extremely good idea to generate a define.xml for a study, even before the study has started. If you had similar studies before for which you do already have a define.xml, this is pretty straightforward. Your define.xml that you supply as a specification does not need to be complete, but if you designed your study well (taking into account that you want to submit it later), it should be nearly-complete.

Here is a slide set by my colleague Philippe Verplancke about this, even showing how a define.xml can be used as a direct source for CRF design for the next study.

Another best practice is to design your study using a study design tool, generate an ODM metadata file from it, and do the mapping between study design and SDTM even before the study starts. There are several good software tools on the market for doing so. I have a customer who is even doing the complete mapping between ODM and SDTM long before the study starts, as a quality control for the study design. The idea is very simple: "if we cannot map the study design to SDTM now (before the study starts), we will be in big trouble later". The define.xml itself is used to store the mapping instructions.

Even if your study has already started, it is a very good practice to do this mapping long before the database closes. Partial data can be used to test the mappings, and if you do it well, you can just run the mappings once (maybe taking half an hour or so) after database closure. And you then already have your define.xml - you only need to remove the mapping instructions from it.

The reason for this is that the best tools on the market all use the same method: while generating the mappings (usually using drag-and-drop for generating mapping scripts that can then be refined), a define.xml is kept in the background (also keeping the mappings) and automatically synchronized. So if you change something in the mapping, this is automatically reflected in the underlying define.xml.

Whether you use a special tool or a statistical package, what is essential is that you have a process in place where your define.xml is fully synchronized with your generated or to-be-generated SDTM datasets.

Other good practices

Learn XML or have someone in your team that has XML knowledge. My undergraduate students learn XML in just two lectures (each of 90 minutes), and one or two exercise afternoons. Learning XML is easy.
With a little XML knowledge, you can generate a define.xml file for your SDTM datasets starting from an existing sample file (like the one published by the CDISC define.xml team). Estimated effort: about 1-2 days. Use an XML editor (and not NotePad or WordPad or so - otherwise disaster is guaranteed) - there are even some very good XML editors that are free.
When you edit a define.xml file, be sure that you regularly validate it against the define.xml XML-Schema (most XML editors have this functionality), which ensures that your basic structure is correct. Read the "define.xml validation whitepaper" for more information. There are some good tools on the market for validating define.xml files; however, do avoid those that do not allow you to validate your define.xml against the define.xml XML-Schema (such tools have implemented non-official "rule interpretations" which are the vendor's own interpretations of the define.xml specification).
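As a first line of defence when editing by hand, even a plain well-formedness check catches many problems. A stdlib-only Python sketch (full validation against the define.xml XML-Schema would need a schema-aware library such as lxml, which is left out here):

```python
import xml.etree.ElementTree as ET

# First line of defence when editing define.xml by hand: check that the
# file is at least well-formed XML. Full validation against the
# define.xml XML-Schema needs a schema-aware library (e.g. lxml's
# XMLSchema class), which is beyond this stdlib-only sketch.

def is_well_formed(xml_text):
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<ODM><Study OID='ST.001'/></ODM>"))  # True
print(is_well_formed("<ODM><Study OID='ST.001'></ODM>"))   # False: unclosed
```

Most XML editors run exactly this kind of check continuously, which is one more reason not to edit define.xml in NotePad or WordPad.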

There are also a few tools on the market that allow you to generate / edit define.xml in a user-friendly way without needing to see the XML itself (several new ones have been announced). As long as such a tool allows you to validate your define.xml against the XML-Schema and against other rules from the define.xml specification, allows you to inspect the source XML, and allows you to inspect the result using the stylesheet, such a tool may be a good choice.

Bad practices

A bad practice that is often followed is to try to generate a define.xml file from a set of SDTM files (XPT files) using a "black box" tool (post-generation). You can use such a tool to generate a "first guess" of a define.xml file for your study, but you will still need either to edit it using an XML editor, or to use a tool mentioned before, to complete and fine-tune the define.xml.
Many users however expect such tools to generate a correct and ready-to-submit define.xml file. They inspect the result using the stylesheet (without inspecting the source XML nor validating it against the XML-Schema), and are surprised that it doesn't work or doesn't provide what they expect.
One such tool allows you to write the "instructions" in an Excel worksheet and then use that to "automatically" create a define.xml file. There is however no manual on how to write these instructions; you are told to take an existing define.xml file (for example one published by CDISC), convert that to an Excel worksheet with such "instructions", and learn from that how to write or complete the worksheet for your own submission. In the time needed to find out how that works, you could already have become an XML expert!

Many people think that define.xml is what they see in the browser (using the stylesheet), not realizing (or not wanting to realize) that there is an XML file behind that: a file that is machine-readable and that can be used for much more than just displaying the study metadata. Unfortunately, most reviewers at the FDA do not realize this either.

Worst practices

Some companies have similar processes in place where the define.xml is generated post-SDTM, usually using Excel worksheets, or even using Word documents. Especially the latter is extremely dangerous, as XML usually uses UTF-8 encoding (the international standard), and your Word document might be using an encoding that is incompatible with UTF-8. So if you copy-and-paste from a Word document into an XML document, and you have characters that go beyond ASCII-7, don't be surprised when you see the strangest things in the result.

Short summary:
  • create your define.xml before your study starts
  • use a tool or process that keeps your define.xml in sync with your mappings between study design and SDTM
  • post-SDTM generation of define.xml is a "last resort" method - try to avoid it
  • do not expect "black-box" tools to generate a "submission-ready" define.xml file for you
  • avoid the use of Excel worksheets and Word documents to generate a define.xml
  • learn XML - it is not difficult! Even if you use good tools, it will help you to understand what you have created