Friday, April 18, 2008

Two more questions

1. We found only 20 MRSA cases (by searching for V09) in the hospital file, and only 9 patients in the patient conditions file. Are these all of the MRSA cases, or are there more in the data set?
2. For part 2, is the ultimate objective to reduce the total cost for MRSA (prophylactic + treatment), or the costs for all other infections?


Dmitriy said...

That's an excellent question.

Nick said...

Apologies for the delay - we're looking into this one.

Nick said...

OK, we're back.

It was an excellent question. Skewed distributions make for interesting problems, but this one seems a little extreme. As a result, we have decided to change the target variable to a different, more common infection - pneumonia.

Patients with pneumonia can be identified by the ICD-9 codes in the range 480-486.
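As a minimal sketch of that rule, the check below flags a single ICD-9 code as pneumonia. The function name is hypothetical, and it assumes codes arrive either as three-digit categories or in decimal form (for example "482.1"); anything non-numeric, such as V09 or a negative missing-value code, is treated as not pneumonia.

```python
def is_pneumonia(code):
    """Return True if an ICD-9 code falls in the 480-486 range."""
    text = str(code).strip()
    # Keep only the category before any decimal point, e.g. "482.1" -> "482".
    category = text.split(".")[0]
    if not category.isdigit():
        # V codes, E codes, blanks, and negative missing-value codes land here.
        return False
    return 480 <= int(category) <= 486


# Example calls on a few code formats.
flags = [is_pneumonia(c) for c in ("486", "482.1", "V09", "-1", 487)]
```

The range check is inclusive at both ends, so 480 and 486 both count.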

Unless I'm missing something (else), the challenge problems remain otherwise unchanged, and the clinical relevance is approximately the same.

Sorry for the confusion.

Igor said...


I just wanted to clarify the task. If I understand the modified #1 correctly, the task is to predict which patients will acquire pneumonia during their hospital stay. If so, which variable in the data file indicates whether a patient acquired pneumonia in the hospital or entered the hospital with it? My closest guess is SPECCOND in the hospital file, but it is unclear. Please advise...


Nick said...

Returning to Caren's second question: What we're asking you to do in part 2 is to estimate the cost specifically related to nosocomial pneumonia, and trade that off against the cost of the proposed prophylactic regimen.
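One way to frame this trade-off is as a comparison of expected cost per patient with and without prophylaxis. The sketch below is illustrative only: the function names, the parameters, and in particular the risk-reduction term are assumptions and not part of the challenge statement, and any numbers passed in would have to be estimated from the data.

```python
def expected_cost_per_patient(p_infection, treatment_cost,
                              prophylaxis_cost=0.0, risk_reduction=0.0):
    """Expected cost for one patient.

    p_infection      -- assumed probability of nosocomial pneumonia (0..1)
    treatment_cost   -- assumed cost of treating one case
    prophylaxis_cost -- assumed cost of the prophylactic regimen (0 if none)
    risk_reduction   -- assumed fraction of cases prevented (0..1)
    """
    residual_risk = p_infection * (1.0 - risk_reduction)
    return prophylaxis_cost + residual_risk * treatment_cost


def prophylaxis_pays_off(p_infection, treatment_cost,
                         prophylaxis_cost, risk_reduction):
    """True if the regimen lowers the expected cost per patient."""
    without = expected_cost_per_patient(p_infection, treatment_cost)
    with_regimen = expected_cost_per_patient(
        p_infection, treatment_cost, prophylaxis_cost, risk_reduction)
    return with_regimen < without
```

Under this framing, the regimen is worthwhile only when its cost is below the expected treatment cost it avoids, which is why a risk model from part 1 matters: targeting high-risk patients raises p_infection for the treated group.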

As for Igor's question - how do you know if the patient picked up the infection in the hospital, or came in with it - there is no perfect answer to this, but we will follow up with some guidance soon.

Nick said...


For part 1, we're going to stick with any diagnosis of pneumonia as the target, just so we have a definite target variable and can evaluate the results unambiguously.

For part 2, we're looking for something more realistic, which requires estimating the costs for nosocomial pneumonia, which is not clearly identified in the dataset. See next post, passed along from Pat Cerrito.

Nick said...

I think that there are three variables that need to be considered here instead of just one:

Also provided are the following unedited variables: hospital inpatient stays related to a medical condition (SPECCOND); the reason the person entered the hospital (RSNINHOS); any operation or surgery performed while the respondent was in the hospital (ANYOPER)

SPECCOND is a yes or no and relates to the value of RSNINHOS:


Patients entering for 1, 3, 4, 5 with a diagnosis of pneumonia have a nosocomial infection. Patients entering for 2 may or may not have a nosocomial infection. Pneumonia is not treated via surgery.
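The rule above can be written down as a small lookup. This is a sketch under stated assumptions: the function name and the returned labels are hypothetical, RSNINHOS is assumed to be stored as an integer, and any value other than 1 through 5 is simply reported as unknown rather than interpreted.

```python
NOSOCOMIAL_REASONS = {1, 3, 4, 5}   # entry reasons treated as nosocomial above
AMBIGUOUS_REASONS = {2}             # may or may not be nosocomial


def nosocomial_status(rsninhos, has_pneumonia):
    """Classify one hospital stay using the RSNINHOS rule quoted above."""
    if not has_pneumonia:
        return "no pneumonia"
    if rsninhos in NOSOCOMIAL_REASONS:
        return "nosocomial"
    if rsninhos in AMBIGUOUS_REASONS:
        return "undetermined"
    # Missing or other codes are not covered by the rule.
    return "unknown"
```

Keeping "undetermined" as its own label avoids silently counting the ambiguous stays on either side when estimating costs.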

Marie and Michele said...

1. Most of the patients in the hospital data set have records in the conditions data set. However, the conditions data set covers many more patients. Assuming that the DUPERSID is the unique identifier for an individual patient, and noting that there are multiple records per DUPERSID, the conditions data set covers 43,151 patients, while the hospital data set covers 11,846 patients. Of these 11,846 patients, 11,724 are represented in the conditions data set.

Both the conditions and the hospital data sets contain information on patients with pneumonia. There are 971 patients in the conditions data set with at least one ICD9CODX value between 480 and 486. In the hospital data set, there are 278 unique DUPERSIDs with at least one of Icd1x, Icd2x, Icd3x, and Icd4x between 480 and 486. (These 278 patients all appear to be in the conditions data base as well, with a code of pneumonia for at least one record.)

Our first question relates to the relevance of the conditions data base. Since there does not appear to be any way to determine if pneumonia for these patients was nosocomial, the conditions data set seems not to be relevant to classification into a "likely to contract nosocomial pneumonia" group. Are we correct in this conclusion?

2. Related to the first question, what exactly does the variable ICD9CODX in the conditions data set represent? How does it relate to the four diagnosis codes in the hospital data?

3. Back to the hospital data set. What exactly does SPECCOND represent? How does it help identify the nosocomial pneumonia cases? The definition is vague, and does not correlate directly with RSNINHOS. For example, there are 238 records where SPECCOND is 2, so the hospital stay is not related to the condition. One would suspect that nosocomial pneumonia cases would be included in this number. However, of the 257 records that show pneumonia codes in one of the four ICDx variables and where SPECCOND is not missing, all have SPECCOND = 1.

So, SPECCOND does not help in determining whether a case is nosocomial. Now RSNINHOS may help, as indicated in the blog. However, if one looks at those patients with a pneumonia code and with a 1, 3, 4, or 5 RSNINHOS code, one finds exactly 19 such patients. As for ANYOPER, there are 15 patients with operations who had pneumonia as well. These 15 patients overlap with the 19 patients, giving us a total of 24 patients who we can be reasonably sure contracted nosocomial pneumonia. (This does not seem like a big improvement over the situation with 20 MRSA patients!) Is our logic in reaching this conclusion flawed?

4. In the medication data set Rx0304, it is not clear what the nine variables on the Type of Pharmacy Provider (PHARTP1 through PHARTP9) represent. How do these nine variables relate to a single prescription?
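The patient counts in items 1 and 3 above come down to set operations on DUPERSID. A minimal sketch of that logic follows, using made-up rows rather than the real files; the helper names and the toy records are hypothetical, and it assumes the ICD fields hold three-digit codes as text. The last lines check the arithmetic in item 3: groups of 19 and 15 patients combining to 24 imply an overlap of 10.

```python
ICD_FIELDS = ("Icd1x", "Icd2x", "Icd3x", "Icd4x")


def in_pneumonia_range(code):
    """True for a numeric ICD-9 category between 480 and 486 inclusive."""
    text = str(code).strip()
    return text.isdigit() and 480 <= int(text) <= 486


def pneumonia_patients(records, fields):
    """Unique DUPERSIDs with a pneumonia code in any of the given fields."""
    return {
        rec["DUPERSID"]
        for rec in records
        if any(in_pneumonia_range(rec.get(f, "")) for f in fields)
    }


# Toy rows for illustration only (NOT taken from the data set).
hospital = [
    {"DUPERSID": "A", "Icd1x": "486"},
    {"DUPERSID": "A", "Icd1x": "V09"},                  # second stay, same patient
    {"DUPERSID": "B", "Icd1x": "250", "Icd2x": "481"},  # secondary diagnosis
    {"DUPERSID": "C", "Icd1x": "401"},
]
conditions = [
    {"DUPERSID": "A", "ICD9CODX": "486"},
    {"DUPERSID": "B", "ICD9CODX": "250"},
    {"DUPERSID": "D", "ICD9CODX": "482"},
]

hospital_ids = pneumonia_patients(hospital, ICD_FIELDS)
condition_ids = pneumonia_patients(conditions, ("ICD9CODX",))
# Hospital pneumonia patients with no pneumonia record in conditions.
unmatched = hospital_ids - condition_ids


def implied_overlap(size_a, size_b, union_size):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return size_a + size_b - union_size


overlap = implied_overlap(19, 15, 24)
```

Counting on sets of identifiers rather than on rows matters here because both files carry multiple records per DUPERSID.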