Informs Data Mining Contest

Tuesday, August 5, 2008

Part 1 submission instructions

After some false starts, I believe we can specify the exact form of Part 1 submissions. Our universe consists of patient visits; that is, there is one answer for each time a patient is admitted to the hospital (so some patients will have multiple entries which, as you've seen, may have different answers). Also, in order to judge by AUC, we need a ranked list rather than just a series of yes/no answers.
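To see why a ranking is enough for AUC, here is a minimal sketch that scores a ranked list against known labels. It is an illustration only, not the organizers' scoring script (which is not published here); the function name and the toy visit IDs are made up. AUC is computed as the fraction of (pneumonia, non-pneumonia) visit pairs in which the pneumonia visit is ranked higher.

```python
def auc_from_ranking(ranked_ids, positive_ids):
    """AUC of a ranked list (most likely first), given the true positive IDs.

    Counts, for every positive, how many negatives are ranked below it,
    then divides by the total number of positive/negative pairs.
    """
    positives = set(positive_ids)
    n_pos = sum(1 for visit_id in ranked_ids if visit_id in positives)
    n_neg = len(ranked_ids) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    correct_pairs = 0
    negatives_below = n_neg  # negatives not yet passed while walking down the list
    for visit_id in ranked_ids:
        if visit_id in positives:
            correct_pairs += negatives_below
        else:
            negatives_below -= 1
    return correct_pairs / (n_pos * n_neg)


# Toy data: four hypothetical visits, two of which truly are pneumonia.
toy_ranking = ["a", "b", "c", "d"]
toy_positives = {"a", "c"}
toy_auc = auc_from_ranking(toy_ranking, toy_positives)
```

A list of plain yes/no answers would leave the order within each group undefined, which is why the submission has to be a full ordering.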

So, to enter Part 1, email me (nick-street@uiowa.edu) a file containing:

17,687 lines (one for each record in the test hospital file)
Each line contains one EVNTIDX (the unique identifier on a patient visit), and nothing else
The visits are ordered by probability of a pneumonia diagnosis (primary or secondary), in descending order

For example, if you decide that patient 58603044 is very likely to have pneumonia every time he/she is admitted, and patient 30015017 is very unlikely to have it, then your file might look like

586030440136
586030440129
586030440150
...<17,683>...
300150170132
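If your model produces a score per visit, a file in this format can be written with a few lines of code. This is a sketch under assumptions: the function names, the output path, and the scores below are invented for illustration, and EVNTIDX values are kept as strings so they are written exactly as they appear in the test file.

```python
def rank_visits(scores):
    """Return EVNTIDX values ordered by descending score.

    scores: dict mapping EVNTIDX (string) -> estimated pneumonia probability.
    """
    return sorted(scores, key=scores.get, reverse=True)


def write_submission(scores, path):
    """Write one EVNTIDX per line, most likely pneumonia visit first."""
    ranked = rank_visits(scores)
    with open(path, "w", newline="\n") as f:
        for evntidx in ranked:
            f.write(evntidx + "\n")
    return len(ranked)  # number of lines written


# Made-up scores for the four visits shown in the example above.
example_scores = {
    "300150170132": 0.01,
    "586030440136": 0.97,
    "586030440129": 0.95,
    "586030440150": 0.93,
}
example_ranking = rank_visits(example_scores)
```

For a real entry, it is worth checking that the returned line count is 17,687 and that no EVNTIDX appears twice before sending the file.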

One submission will be judged per team - if you send subsequent entries before the deadline, the new one will replace the previous ones.

Monday, August 4, 2008

Test files

The test files have been posted on the "Datasets and Documentation" page. As suggested earlier, the following cleaning has been done to (I hope) remove any leakers:

Hospital: All pneumonia ICD codes have been removed.
Medications: All records containing an ICD code for pneumonia have been removed. Also, the CCC codes (an alternative coding for diagnosis) have been removed from all records.
Conditions: All records containing an ICD code for pneumonia have been removed.
Demographics: No cleaning.
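If you want your training data to look like the cleaned test files, the two kinds of cleaning above can be sketched roughly as follows. This is not the script that was used to produce the test files. In particular, the code set (three-digit ICD-9 codes 480 through 486), the helper names, and the use of an empty string for a removed code are all assumptions.

```python
# Assumed pneumonia code set: three-digit ICD-9 codes 480-486.
PNEUMONIA_ICD9 = {str(code) for code in range(480, 487)}


def is_pneumonia(code):
    """True if the first three characters of the code fall in the assumed set."""
    return str(code).strip()[:3] in PNEUMONIA_ICD9


def drop_pneumonia_records(records, icd_fields):
    """Medications/Conditions style: remove whole records that carry a pneumonia code."""
    return [
        rec for rec in records
        if not any(is_pneumonia(rec.get(field, "")) for field in icd_fields)
    ]


def blank_pneumonia_codes(records, icd_fields, blank=""):
    """Hospital style: keep every record but blank out pneumonia codes."""
    cleaned = []
    for rec in records:
        rec = dict(rec)  # copy so the input is left untouched
        for field in icd_fields:
            if is_pneumonia(rec.get(field, "")):
                rec[field] = blank
        cleaned.append(rec)
    return cleaned


# Tiny made-up records to show both behaviours.
conditions = [{"ICD9CODX": "486"}, {"ICD9CODX": "401"}]
hospital = [{"ICD1X": "486", "ICD2X": "530"}]
conditions_clean = drop_pneumonia_records(conditions, ["ICD9CODX"])
hospital_clean = blank_pneumonia_codes(hospital, ["ICD1X", "ICD2X"])
```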

Monday, July 21, 2008

Follow Up Questions on Part 1 and Part 2 Objectives

Thanks much, Nick, for your detailed responses. Here are a couple of follow-up questions:

1. Our objective in this contest is not to eliminate the process of primary diagnosis through predictive modeling or some protocol. This means every individual will go through primary diagnosis, during which a physician will determine the actual condition of the patient (whether or not the individual has Pneumonia). Also, the likelihood of an individual contracting nosocomial Pneumonia is to be predicted as part of the solution to Part 2 of the contest. Given these, is Part 1 of the contest relevant?

2. My understanding of the objective of Part 2 is this: the protocols developed for Part 2 will only be used to rank-order individuals by their risk of contracting nosocomial Pneumonia, and a subset of the high-risk population will then be targeted for preventive measures based on the cost model. In other words, these protocols will not be used to predict the likelihood of infection at the moment individuals are admitted to the hospital, because information from the hospital and medication files such as NUMNIGHT, ANYOPER, and other medication-related variables (which are important for determining the risk) will not be available at the time of admission. Is my understanding of the purpose of Part 2 correct?

3. Other than DUPERSID, the year, and RXICD1X, is there a common ID between the hospital file and the medication file by which they can be merged? I tried merging by LINKIDX from the medication file and EVNTIDX from the hospital file, but no match was found. For example, for DUPERSID 20072019 there are three records in the hospital file, diagnosed with ICD1X codes 577 and 530 (the latter repeated twice). In the medication file, for the same DUPERSID, there are 29 records (2 records for 577 and 27 records for 530). How do we relate the various prescriptions for 530 in the medication file to the corresponding two records in the hospital file? Merging by DUPERSID, year, and RXICD1X will not help here, as it will be a many-to-many merge.

4. The variable rxtotal in the medication file is missing for more than 95% of the records. Isn’t it the total medication charges charged by the hospital which is important for the cost model?

5. There are a few variables that the data dictionary says have been edited or imputed. For example, NUMNIGHX from the hospital file and RXFORM, RXFRMUNT, etc. from the medication file are imputed variables. Is it safe not to worry about the imputation procedure?

6. In the medication file, there is a series of MULTUM THERAPEUTIC CLASS variables. Take TC1 as an example. The data dictionary lists the categories 1, 20, 28, 40, and 57 for this variable. However, in the data, 50% of records take values like 19, 97, 122, 133, 242, etc. Should we proceed as is, irrespective of what the category means for this variable?
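The many-to-many concern in question 3 can be made concrete by counting the rows a merge would produce before running it. The sketch below uses only the record counts quoted in that question; the year is left out of the key because it is not given there, and the helper names are illustrative rather than part of any contest tooling.

```python
from collections import Counter


def merged_row_count(left_keys, right_keys):
    """Rows an inner join would produce: sum over keys of left count * right count."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    return sum(n_left * right[key] for key, n_left in left.items())


def many_to_many_keys(left_keys, right_keys):
    """Keys that occur more than once on both sides (the ambiguous matches)."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    return sorted(key for key in left if left[key] > 1 and right[key] > 1)


# Keys are (DUPERSID, diagnosis code), using the counts quoted in question 3.
hospital_keys = [("20072019", "577")] + [("20072019", "530")] * 2
medication_keys = [("20072019", "577")] * 2 + [("20072019", "530")] * 27

rows_after_merge = merged_row_count(hospital_keys, medication_keys)
ambiguous_keys = many_to_many_keys(hospital_keys, medication_keys)
```

Here the 3 hospital records and 29 medication records would expand to 1 × 2 + 2 × 27 = 56 merged rows, and every prescription for code 530 would be attached to both hospital stays.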

Medication dataset questions

1. We observed a variable name ‘LINKIDX’ in the medication dataset. It is defined as ‘ID FOR LINKAGE TO COND/OTH EVENT FILES.’ We are not sure how it links the files. It appears to be made by DUID + PID + X. What is ‘X?’

2. It has variables ‘RXCCC1X’, ‘RXCCC2X’, and ‘RXCCC3X’. They are defined as ‘MODIFIED CLINICAL CLASS CODE.’ Is it a modification of the ICD9 codes in the Hospital/Conditions datasets? If so, how are they modified? How are they related to each other as well as to the ICD9 codes in the Conditions and Hospital datasets?

3. Similarly, is there a relation between RXICD1X, RXICD2X, and RXICD3X and the ICD9 codes in other files? If they are related, what is the relationship? Are they related to RXCCC1X, RXCCC2X, and RXCCC3X too?

4. Which variable(s) will be removed from the testing dataset?

Friday, July 18, 2008

Commentary

Before diving into the latest run of questions (and corrections), I want to share a few high-level thoughts about how this contest is set up.

The main goal that Patricia and I had in mind when defining this problem was to have a significant decision or policy problem that could be at least partially solved using (in part) data mining techniques. After all, INFORMS is all about making better decisions, right? Unfortunately, this goal is in many ways inconsistent with the usual construction of data mining contests, which usually have well-defined targets (to make scoring easier, among other reasons). Although our clinical goal is clear (design an effective plan for prophylactic treatment of infections), the data recorded by the hospital is not complete enough to perfectly 'answer' the underlying classification problem (who is likely to get the infection?).

This was the motivation for dividing the contest into two parts. As mentioned in previous answers, we have simplified Part 1 down to: Find the patients who have been (or will be) diagnosed with pneumonia. That way, we will have a clear, unambiguous set of labels with which to judge the results, even though the clinical relevance of the problem is substantially reduced. So, this part should be accessible for anybody willing to go through the process of cleaning, transforming, etc. on this (fairly noisy) data set.

Part 2 gets to the heart of the real problem, and has generated most of the questions. The bottom line is, there is no perfect way to tell which pneumonia cases are nosocomial and which ones are not, because this information does not get directly recorded. There are some cases that should be obvious, based on the coding of primary vs. secondary diagnoses, patient records vs. hospital records, etc., but that leaves quite a number that are still ambiguous. Resolving these cases - that is, finding the people who are likely to contract nosocomial pneumonia, and figuring out how much it costs to treat them for this infection - is the core of Part 2. So, while we'll try to pass along some pointers, we can't give you the rule to create target labels in Part 2, since a) there isn't a perfect one, and b) creating a good one is part of the contest.

Unfortunately, none of the organizers are domain experts; although we have some significant experience mining medical data, none of us are clinicians. One good piece of meta-advice (on any real-world problem) is to find such an expert.

----

Last thing - sorry about the slowness of the answers - we'll do better down the home stretch.

Thursday, July 17, 2008

Questions on Defining Nosocomial Pneumonia

In response to the question on how to identify Nosocomial Pneumonia patients, several ways to determine or validate this were suggested. Some questions on them follow.

1. Patricia suggested: “Examine the DRG code for the patient - is it pneumonia or something else? Is pneumonia primary or secondary? If the DRG is for something else, the pneumonia is nosocomial.”

a. Is the DRG code identified by the variable “RXNAME” in the medication file (if I am right, “RXNDC” is the corresponding code for each RXNAME)? If so, how do we identify from the name which ones are for Pneumonia and which are not?

2. There are 384 patients with one or more diagnoses suggesting Pneumonia, of which 257 patients have SPECCOND = 1 (meaning Pneumonia is due to the hospital stay) and the remaining 124 patients have a missing SPECCOND. Does this mean these 257 patients are nosocomial Pneumonia patients and the rest are not?

Data Questions on Part 1 of the Contest

I understand the patient data spans across two years (2003 and 2004). Can you throw more light on data that would answer the following questions?

This is a repeat question about what exactly the variable ICD9CODX in the conditions data means. The data dictionary defines this as “Patient Condition”. Does this mean “Patient Condition for which the individual visited the hospital”? If so, doesn’t this confirm that the Pneumonia patients in this file are not nosocomial?
While I understand that the variable ICD9CODX in the conditions file and the diagnosis variables ICD1X to ICD4X in the hospital file give the condition of the patient over the course of the year, what is the relationship between the values of these variables in the two files? For example, there are cases where the conditions in the conditions file do not exactly match the conditions in the hospital file (DUPERSID = “20048015”).
I was of the understanding that the conditions file would be used to identify the Pneumonia patients, who could then be classified as nosocomial or not. But out of 971 unique patients identified with Pneumonia (ICD9CODX between 479 and 487), there are 693 patients who are not identified with Pneumonia by the primary or secondary diagnosis code in the hospital file. Is the hospital file complete?
For the conditions and demographic files, it is clear what duplicate records for the unique identifier (DUPERSID) mean. However, the same is not clear for the hospital and medication files. For example, DUPERSID “20048015” has four records in the conditions file, one for each value of ICD9CODX: 401, 195, 787 and 300. For the same DUPERSID, four records are also found in the hospital file. Of these, two records have ICD1X = 195 and two records have ICD1X = 560.

(i) Why are there duplicates for ICD1X when there is only one record for a condition in the conditions file? What does the duplicate record for the same condition mean in the hospital file?

(ii) The condition 560 found in the hospital file is not found in the conditions file. Why could this happen?

(iii) The conditions 401, 787 and 300 present in the conditions file are missing in the hospital file. How do we explain this?

(iv) There are 41 records for DUPERSID (20048015) in the medication file. How do we interpret the duplicate records in the file?

(v) In general, how do we interpret the duplicates for an individual in the hospital and the medications files?

For about 83% of records, many key variables like SPECCOND and RSNINHOS, along with most of the other variables in the hospital file, are missing. How do we interpret the missing values?
For 10% of the records in the demographic file, EDUCYEAR takes the value -1. The value -1 is not specified in the data dictionary. Does -1 indicate “Inapplicable”?
What is the relation between the Poverty and Income variables? 35% of individuals with 0 income are classified with Poverty = 4 and 5. So, I am just curious to understand whether poverty is based on a broader classification.
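The mismatch described for DUPERSID 20048015 can be reproduced with plain set operations, which is a quick way to find other patients with the same pattern. The sketch uses only the code values quoted in that question; only ICD1X is compared because the values of ICD2X to ICD4X are not given there, and the function name is illustrative.

```python
def compare_codes(conditions_codes, hospital_codes):
    """Split the distinct codes into shared, conditions-only, and hospital-only."""
    cond = set(conditions_codes)
    hosp = set(hospital_codes)
    return {
        "both": sorted(cond & hosp),
        "conditions_only": sorted(cond - hosp),
        "hospital_only": sorted(hosp - cond),
    }


# Values quoted above for DUPERSID 20048015.
conditions_icd9codx = ["401", "195", "787", "300"]  # one record per condition
hospital_icd1x = ["195", "195", "560", "560"]       # one record per hospital stay

overlap = compare_codes(conditions_icd9codx, hospital_icd1x)
```

For this patient the only shared code is 195, with 560 appearing only in the hospital file and 300, 401 and 787 only in the conditions file, which matches sub-questions (ii) and (iii).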