Tuesday, August 5, 2008

Part 1 submission instructions

After some false starts, I believe we can specify the exact form of Part 1 submissions. Our universe consists of patient visits - that is, one answer for each time a patient is admitted to the hospital (so, some patients will have multiple entries, which as you've seen, may have different answers). Secondly, in order to judge by AUC, we do need a ranked list, rather than just a series of yes/no answers.
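To see why a ranked list is enough for AUC, here is a minimal sketch of how AUC can be computed from a ranking alone. It is not the organizers' scoring code, which is not described here; the function name and the visit IDs are hypothetical, and the list is assumed to have no ties.

```python
def auc_from_ranking(ranked_ids, positive_ids):
    """AUC for a list ranked from most to least likely positive.

    For every negative item, count how many positives are ranked above it,
    then divide by (number of positives * number of negatives).
    """
    positives = set(positive_ids)
    n_pos = sum(1 for item in ranked_ids if item in positives)
    n_neg = len(ranked_ids) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    positives_seen = 0
    correctly_ordered_pairs = 0
    for item in ranked_ids:
        if item in positives:
            positives_seen += 1
        else:
            # Every positive already seen outranks this negative.
            correctly_ordered_pairs += positives_seen
    return correctly_ordered_pairs / (n_pos * n_neg)


# Hypothetical visit identifiers, not real EVNTIDX values.
ranking = ["v1", "v2", "v3", "v4"]
truth = {"v1", "v3"}
example_auc = auc_from_ranking(ranking, truth)  # 3 of 4 pairs ordered correctly
```

A set of yes/no answers gives only one operating point, while the ranking lets every possible cutoff be evaluated.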

So, to enter Part 1, email me (nick-street@uiowa.edu) a file containing:
  • 17,687 lines (one for each record in the test hospital file)
  • Each line contains one EVNTIDX (the unique identifier on a patient visit), and nothing else
  • The visits are ordered by probability of a pneumonia diagnosis (primary or secondary), in descending order
For example, if you decide that patient 58603044 is very likely to have pneumonia every time he/she is admitted, and patient 30015017 is very unlikely to have it, then your file might look like


One submission will be judged per team - if you send subsequent entries before the deadline, the new one will replace the previous ones.
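As a rough sketch of the file format described above, the snippet below turns a mapping from EVNTIDX to an estimated pneumonia probability into the required text: one EVNTIDX per line, most likely first. The function name, the optional line-count check, and the example IDs and scores are all hypothetical.

```python
def format_submission(scores, expected_count=None):
    """Return submission text: one EVNTIDX per line, highest score first.

    scores maps EVNTIDX (string) -> estimated probability of pneumonia.
    expected_count, if given, guards against missing or extra visits.
    """
    if expected_count is not None and len(scores) != expected_count:
        raise ValueError(f"expected {expected_count} visits, got {len(scores)}")
    # sorted() is stable, so tied scores keep their original order.
    ranked = sorted(scores, key=lambda evntidx: scores[evntidx], reverse=True)
    return "".join(f"{evntidx}\n" for evntidx in ranked)


# Made-up identifiers and scores, for illustration only.
example_scores = {"EV001": 0.10, "EV002": 0.85, "EV003": 0.40}
example_text = format_submission(example_scores, expected_count=3)
```

For the real test file, `expected_count` would be 17,687, and the returned text would be written to the file that gets emailed.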

Monday, August 4, 2008

Test files

The test files have been posted on the "Datasets and Documentation" page. As suggested earlier, the following cleaning has been done to (I hope) remove any leakers:

Hospital: All pneumonia ICD codes have been removed.
Medications: All records containing an ICD code for pneumonia have been removed. Also, the CCC codes (an alternative coding for diagnosis) have been removed from all records.
Conditions: All records containing an ICD code for pneumonia have been removed.
Demographics: No cleaning.

Monday, July 21, 2008

Follow Up Questions on Part1 and Part 2 Objectives

Thanks much Nick for your detailed responses. Here are a couple of follow up questions

1. Our objective in this contest is not to eliminate the process of primary diagnosis through predictive modeling or with some protocol. This means every individual will go through primary diagnosis, during which a physician will determine the actual condition of the patient (whether an individual has Pneumonia or not). Also, the likelihood of an individual contracting Nosocomial Pneumonia is to be predicted as part of the solution to Part 2 of the contest. Given these, is Part 1 of the contest relevant?

2. My understanding of the objective of Part 2 is – The protocols developed for Part 2 will only be used to rank order the individuals based on their risk of contracting nosocomial Pneumonia and further a subset of high risk population will be targeted for preventive measure based on the cost model. In other words, these protocols will not be used to predict the likelihood of contracting the infection for individuals when they are admitted to the hospital as the information from hospital and medication files like NUMNIGHT, ANYOPER, and other medication related variables (which are important for determining the risk) will not be available at the time of admission to the hospital. Is my understanding of the purpose of Part 2 correct?

3. Other than the DUPERSID, the year and the RX1CD1X, is there a common ID between the hospital file and the medication file by which they can be merged? I tried merging by LINKIDX from the medication file and EVNTIDX from the hospital file, but no match was found. For example, for the DUPERSID 20072019, there are three records in the hospital file diagnosed with ICD1X codes 577 and 530 (the latter repeated twice). In the medications file, for the same DUPERSID, there are 29 records (2 records for 577 and 27 records for 530). How do we relate the various prescriptions for 530 in the medication file to the corresponding two records in the hospital file? Merging by DUPERSID, year and RX1CD1X will not help here, as it will be a many-to-many merge.

4. The variable rxtotal in the medication file is missing for more than 95% of the records. Isn’t it the total medication charges charged by the hospital which is important for the cost model?

5. There are a few variables in the data dictionary which says the variable has been edited or imputed. For example, NUMNIGHX from the hospital file and RXFORM, RXFRMUNT, etc. from the medication file are imputed variables. Is it safe not to worry about the imputation procedure?

6. In the medication file, there is a series of MULTUM THERAPEUTIC CLASS variables. I am referring to TC1, for example. The data dictionary lists the following categories for this variable: 1, 20, 28, 40, and 57. However, in the data, 50% of records take values like 19, 97, 122, 133, 242, etc. Should we proceed as is, irrespective of what the category means for this variable?

Medication dataset questions

1. We observed a variable name ‘LINKIDX’ in the medication dataset. It is defined as ‘ID FOR LINKAGE TO COND/OTH EVENT FILES.’ We are not sure how it links the files. It appears to be made by DUID + PID + X. What is ‘X?’

2. It has variables ‘RXCCC1X’, ‘RXCCC2X’, and ‘RXCCC3X’. They are defined as ‘MODIFIED CLINICAL CLASS CODE.’ Are they a modification of the ICD9 codes in the Hospital/Conditions datasets? If so, how are they modified? How are they related to each other, as well as to the ICD9 codes in the Conditions and Hospital datasets?

3. Similarly, is there a relation between RXICD1X, RXICD2X, and RXICD3X, and the ICD9 codes in other files? If they are related, what is the relationship? Are they related to RXCCC1X, RXCCC2X, and RXCCC3X too?

4. Which variable(s) will be removed from the testing dataset?

Friday, July 18, 2008


Before diving into the latest run of questions (and corrections), I want to share a few high-level thoughts about how this contest is set up.

The main goal that Patricia and I had in mind when defining this problem was to have a significant decision or policy problem that could be at least partially solved using (in part) data mining techniques. After all, INFORMS is all about making better decisions, right? Unfortunately, this goal is in many ways inconsistent with the usual construction of data mining contests, which usually have well-defined targets (to make scoring easier, among other reasons). Although our clinical goal is clear (design an effective plan for prophylactic treatment of infections), the data recorded by the hospital is not complete enough to perfectly 'answer' the underlying classification problem (who is likely to get the infection?).

This was the motivation for dividing the contest into two parts. As mentioned in previous answers, we have simplified Part 1 down to: Find the patients who have been (or will be) diagnosed with pneumonia. That way, we will have a clear, unambiguous set of labels with which to judge the results, even though the clinical relevance of the problem is substantially reduced. So, this part should be accessible for anybody willing to go through the process of cleaning, transforming, etc. on this (fairly noisy) data set.

Part 2 gets to the heart of the real problem, and has generated most of the questions. The bottom line is, there is no perfect way to tell which pneumonia cases are nosocomial and which ones are not, because this information does not get directly recorded. There are some cases that should be obvious, based on the coding of primary vs. secondary diagnoses, patient records vs. hospital records, etc., but that leaves quite a number that are still ambiguous. Resolving these cases - that is, finding the people who are likely to contract nosocomial pneumonia, and figuring out how much it costs to treat them for this infection - is the core of Part 2. So, while we'll try to pass along some pointers, we can't give you the rule to create target labels in Part 2, since a) there isn't a perfect one, and b) creating a good one is part of the contest.

Unfortunately, none of the organizers are domain experts; although we have some significant experience mining medical data, none of us are clinicians. One good piece of meta-advice (on any real-world problem) is to find such an expert.


Last thing - sorry about the slowness of the answers - we'll do better down the home stretch.

Thursday, July 17, 2008

Questions on Defining Nosocomial Pneumonia

In response to the question on how to identify Nosocomial Pneumonia patients, several ways to determine or validate this were suggested. Some questions about them follow.

1. Patricia suggested “ Examine the DRG code for the patient-is it pneumonia or something else? Is pneumonia primary or secondary? If the DRG is for something else, the pneumonia is nosocomial.”

a. Is the DRG code identified by the variable “RXNAME” (if I am right, “RXNDC” is the corresponding code for each RXNAME) in the medication file? If so, how do we identify from the name which is for Pneumonia versus not?

2. There are 384 patients with one or more diagnoses suggesting Pneumonia, of which 257 patients have SPECCOND = 1 (meaning Pneumonia is due to the hospital stay) and the remaining 124 patients have missing SPECCOND. Does this mean these 257 patients are nosocomial Pneumonia patients and the rest are not?

Data Questions on Part 1 of the Contest

I understand the patient data spans across two years (2003 and 2004). Can you throw more light on data that would answer the following questions?

  • This is a repeat of the question on what exactly the variable ICD9CODX in the conditions data means. The data dictionary defines this as “Patient Condition”. Does this mean “Patient Condition for which the individual visited the hospital?” If so, doesn’t this confirm that the Pneumonia patients in this file are not nosocomial?

  • While I understand that the variable ICD9CODX in the conditions file and the diagnosis variables ICD1X to ICD4X in the hospital file give the condition of the patient in the course of the year, what is the relationship between the values of these variables in the two files? For example, there are cases where the conditions in the conditions file do not exactly match the conditions in the hospital file (DUPERSID = “20048015”).

  • I was of the understanding that conditions file will be used for identifying the Pneumonia patients which then can be used to classify if it was nosocomial or not. But, out of 971 unique patients identified with Pneumonia (ICD9CODX between 479 and 487), there are 693 patients who are not identified with Pneumonia using the primary or secondary diagnosis code in the hospital file. Is the hospital file complete?

  • For the conditions and demographic files, it is clear what the duplicate records by the unique identifier (DUPERSID) mean. However, the same is not quite clear for the Hospital and Medication files. For example, the DUPERSID “20048015” has four records in the conditions file, one for each value of ICD9CODX. The values of ICD9CODX in the conditions file are 401, 195, 787 and 300. However, for the same DUPERSID, four records are found in the Hospital file. Of these, two records have ICD1X equal to 195 and two records have ICD1X equal to 560.

(i) Why are there duplicates for ICD1X when there is only one record for a condition in the conditions file? What does a duplicate record for the same condition mean in the hospital file?

(ii) The condition 560 found in the hospital file is not found in the conditions file. Why could this happen?

(iii) The conditions 401, 787 and 300 present in the conditions file are missing in the hospital file. How do we explain

(iv) There are 41 records for DUPERSID (20048015) in the medication file. How do we interpret the duplicate records in the file?

(v) In general, how do we interpret the duplicates for an individual in the hospital and the medications files?

  • For about 83% of records, many key variables such as SPECCOND and RSNINHOS, along with most of the other variables in the hospital file, are missing. How do we interpret the missing values?

  • There are 10% of records in the Demographic file for which EDUCYEAR takes the value -1. -1 is not specified in the data dictionary. Does -1 indicate “Inapplicable”?

  • What is the relation between the Poverty and Income variables? 35% of individuals with 0 income are classified with Poverty = 4 and 5. So, I am just curious to understand if poverty is based on a broader classification.

Thursday, June 19, 2008

Questions on data set and nosocomial pneumonia as a response

We published this post previously, on June 11, but it was placed as a comment to another posting. We hope it will appear as a separate thread this time. We have a few questions, listed below:

1. Most of the patients in the hospital data set have records in the conditions data set. However, the conditions data set covers many more patients. Assuming that the DUPERSID is the unique identifier for an individual patient, and noting that there are multiple records per DUPERSID, the conditions data set covers 43,151 patients, while the hospital data set covers 11,846 patients. Of these 11,846 patients, 11,724 are represented in the conditions data set.

Both the conditions and the hospital data sets contain information on patients with pneumonia. There are 971 patients in the conditions data set with at least one ICD9CODX value between 480 and 486. In the hospital data set, there are 278 unique DUPERSIDs with at least one of Icd1x, Icd2x, Icd3x, and Icd4x between 480 and 486. (These 278 patients all appear to be in the conditions data base as well, with a code of pneumonia for at least one record.)

Our first question relates to the relevance of the conditions data base. Since there does not appear to be any way to determine if pneumonia for these patients was nosocomial, the conditions data set seems not to be relevant to classification into a "likely to contract nosocomial pneumonia" group. Are we correct in this conclusion?
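For anyone wanting to reproduce counts like the ones above, a minimal sketch of the counting step follows. It assumes each hospital record is a dict (as `csv.DictReader` would yield), that the diagnosis columns hold three-digit codes as text, and that non-numeric codes such as V09 can simply be skipped. The helper names and the sample rows are hypothetical.

```python
PNEUMONIA_CODES = range(480, 487)  # 480 through 486 inclusive
ICD_COLUMNS = ("ICD1X", "ICD2X", "ICD3X", "ICD4X")


def is_pneumonia(code):
    """True if the code parses as an integer between 480 and 486."""
    try:
        return int(str(code).strip()) in PNEUMONIA_CODES
    except ValueError:
        # Non-numeric codes (e.g. "V09") and blanks are not pneumonia.
        return False


def pneumonia_patients(rows, columns=ICD_COLUMNS):
    """Set of DUPERSIDs with a pneumonia code in any of the given columns."""
    return {
        row["DUPERSID"]
        for row in rows
        if any(is_pneumonia(row.get(column, "")) for column in columns)
    }


# Made-up records, for illustration only.
sample_rows = [
    {"DUPERSID": "A", "ICD1X": "486", "ICD2X": "-1", "ICD3X": "", "ICD4X": ""},
    {"DUPERSID": "A", "ICD1X": "195", "ICD2X": "", "ICD3X": "", "ICD4X": ""},
    {"DUPERSID": "B", "ICD1X": "V09", "ICD2X": "401", "ICD3X": "", "ICD4X": ""},
    {"DUPERSID": "C", "ICD1X": "530", "ICD2X": "", "ICD3X": "480", "ICD4X": ""},
]
sample_patients = pneumonia_patients(sample_rows)
```

The same function can be applied to the conditions data by passing `columns=("ICD9CODX",)`.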

2. Related to the first question, what exactly does the variable ICD9CODX in the conditions data set represent? How does it relate to the four diagnosis codes in the hospital data?

3. Back to the hospital data set. What exactly does SPECCOND represent? How does it help identify the nosocomial pneumonia cases? The definition is vague, and does not correlate directly with RSNINHOS. For example, there are 238 records where SPECCOND is 2, so the hospital stay is not related to the condition. One would suspect that nosocomial pneumonia cases would be included in this number. However, of the 257 records that show pneumonia codes in one of the four ICDx variables and where SPECCOND is not missing, all have SPECCOND = 1.

So, SPECCOND does not help in determining which cases are nosocomial. Now RSNINHOS may help, as indicated in the blog. However, if one looks at those patients with a pneumonia code and with a 1, 3, 4, or 5 RSNINHOS code, one finds exactly 19 such patients. As for ANYOPER, there are 15 patients with operations who had pneumonia as well. These 15 patients overlap with the 19 patients, giving us a total of 24 patients who we can be reasonably sure contracted nosocomial pneumonia. (This does not seem like a big improvement over the situation with 20 MRSA patients!) Is our logic in reaching this conclusion flawed?

4. In the medication data set Rx0304, it is not clear what the nine variables on the Type of Pharmacy Provider (PHARTP1 through PHARTP9) represent. How do these nine variables relate to a single prescription?

Friday, April 18, 2008

Two more questions

1. We only found 20 MRSA cases (by searching for V09) in the hospital file, and only 9 patients in the patient conditions file. Are these all of the MRSA cases, or are there more in the data set?
2. For Part 2, is the ultimate objective to reduce the total cost for MRSA (prophylactic + treatment), or the costs for all the other infections?

Tuesday, April 15, 2008


1. MRSA is represented by ICD9 code V09.0. It can be identified either in the hospital file using one of the ICD9 fields, or in the patient conditions file, which lists all conditions for which a patient was diagnosed for that year.

2. There is a cost of preventive care that you can determine from the prescription database. It generally requires either the antibiotic vancomycin or the antibiotic Zyvox (linezolid). Then there is the added cost of MRSA, which is not uniform; rather, it depends upon the patient's condition and the procedures performed. You will need to estimate this added cost from the data. Both emergency and elective patients have some risk of MRSA. The hospital will want to use the preventive treatment if it costs less than the added cost of MRSA.

3. The fields that were not specifically identified in the data dictionary are not needed for the analysis.
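One way to read the cost comparison in answer 2 is sketched below. Weighting the added cost by a per-patient infection risk is an assumption, not something stated above, as is the simplification that prophylaxis fully prevents the infection. The function name and all numbers are hypothetical.

```python
def should_give_prophylaxis(prophylaxis_cost, infection_risk, added_infection_cost):
    """Treat when the preventive cost is below the expected added cost of infection.

    Assumption: expected added cost = infection_risk * added_infection_cost,
    and prophylaxis removes that risk entirely.
    """
    if not 0.0 <= infection_risk <= 1.0:
        raise ValueError("infection_risk must be a probability")
    return prophylaxis_cost < infection_risk * added_infection_cost


# Made-up figures: a 2% risk of a 10,000 added cost is worth 200 in expectation.
treat_cheap = should_give_prophylaxis(100.0, 0.02, 10000.0)
treat_expensive = should_give_prophylaxis(300.0, 0.02, 10000.0)
```

Since the added cost depends on the patient's condition and procedures, both the risk and the added cost would have to be estimated per patient from the data.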

Thursday, April 10, 2008

Misc. Questions

1. Part 1. We'd like to make sure which variable in the data set can identify the patient diagnosed with MRSA.

2. Part 2. Shall we minimize the total cost of medication for all patients, or only for those who were admitted for an elective surgery?

3. Some unclear variables:
Hospital0304: VARPSU, VARSTR(for which PSU)

Monday, April 7, 2008

Data Issues In RX0304

This file is a CSV in which a field is quoted if it contains a comma.

1. There are a few instances where a quote is not part of a matching pair which throws the values off by one column. Typically the pattern is a GX qualifier followed by a dimension that has a quote in it, but it is the same quote that is used to delimit the field.

Here are the line numbers that I have checked that have this issue.

40490, 347931, 347932, 347933, 347934, 347935, 371124, 371125, 371126, 371144, 371152, 417874, 417875, 417876, 469465, 469466, 566255, 621479...there are more later.

2. There are two instances where PHARTP1 has an invalid value of "%".

139722, 139723

Be aware that these can cause problems with import routines.
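A quick way to find the remaining unmatched-quote lines is to flag every line with an odd number of quote characters. This is only a sketch: it assumes one record per physical line (a quoted field containing a line break would be flagged too), and the function name and sample lines are hypothetical.

```python
def lines_with_unbalanced_quotes(lines, quote='"'):
    """Return 1-based numbers of lines holding an odd number of quote characters."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.count(quote) % 2 == 1
    ]


# Made-up lines: the second has a lone quote inside a dimension.
sample_lines = [
    '1,"TABLET, COATED",30',
    '2,GX 4"X4,10',
    '3,PLAIN,5',
]
suspect_lines = lines_with_unbalanced_quotes(sample_lines)

# On the real file this could be run as (file name is an assumption):
# with open("RX0304.csv", newline="") as handle:
#     print(lines_with_unbalanced_quotes(handle))
```

The flagged lines can then be repaired by hand or skipped before handing the file to an import routine.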


Tuesday, April 1, 2008


Thanks for checking out the blog for the 2008 INFORMS Data Mining Contest. Please post questions here so they can be answered publicly, and check back for updates. Good luck!