Informs Data Mining Contest: Follow Up Questions on Part 1 and Part 2 Objectives

Monday, July 21, 2008

Follow Up Questions on Part 1 and Part 2 Objectives

Thanks very much, Nick, for your detailed responses. Here are a few follow-up questions.

1. Our objective in this contest is not to eliminate primary diagnosis through predictive modeling or some protocol. That means every individual will still go through primary diagnosis, during which the physician determines the patient's actual condition (whether or not the individual has pneumonia). In addition, the likelihood of an individual contracting nosocomial pneumonia is to be predicted as part of the solution to Part 2 of the contest. Given these points, is Part 1 of the contest relevant?

2. My understanding of the objective of Part 2 is as follows: the protocols developed for Part 2 will only be used to rank-order individuals by their risk of contracting nosocomial pneumonia, and a high-risk subset of the population will then be targeted for preventive measures based on the cost model. In other words, these protocols will not be used to predict the likelihood of infection at the time individuals are admitted to the hospital, because information from the hospital and medication files, such as NUMNIGHT, ANYOPER, and other medication-related variables (which are important for determining the risk), will not be available at admission. Is my understanding of the purpose of Part 2 correct?

3. Other than DUPERSID, the year, and RX1CD1X, is there a common ID between the hospital file and the medication file by which they can be merged? I tried merging on LINKIDX from the medication file and EVNTIDX from the hospital file, but no match was found. For example, for DUPERSID 20072019 there are three records in the hospital file, with ICD1X codes 577 (once) and 530 (twice). In the medication file, the same DUPERSID has 29 records (2 for 577 and 27 for 530). How do we relate the various prescriptions for 530 in the medication file to the corresponding two records in the hospital file? Merging by DUPERSID, year, and RX1CD1X will not help here, as it will be a many-to-many merge.

4. The variable rxtotal in the medication file is missing for more than 95% of the records. Isn’t it the total medication charge billed by the hospital, which is important for the cost model?

5. A few variables in the data dictionary are described as edited or imputed. For example, NUMNIGHX from the hospital file and RXFORM, RXFRMUNT, etc. from the medication file are imputed variables. Is it safe not to worry about the imputation procedure?

6. In the medication file, there is a series of MULTUM THERAPEUTIC CLASS variables; take TC1 as an example. The data dictionary lists the categories 1, 20, 28, 40, and 57 for this variable. However, in the data, 50% of records take values such as 19, 97, 122, 133, and 242. Should we proceed as is, regardless of what the categories mean for this variable?
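
To make the many-to-many concern in question 3 concrete, here is a minimal sketch in plain Python. It rebuilds only the record counts quoted for DUPERSID 20072019; the YEAR value, the RXID field, and the `join_on_key` helper are assumptions made for illustration and are not part of the contest files.

```python
from collections import defaultdict

# Toy rows shaped like the DUPERSID 20072019 example in question 3.
# YEAR=2007 and the RXID field are placeholders, not values from the contest data.
hospital = [
    {"DUPERSID": 20072019, "YEAR": 2007, "ICD1X": "577", "EVNTIDX": "H1"},
    {"DUPERSID": 20072019, "YEAR": 2007, "ICD1X": "530", "EVNTIDX": "H2"},
    {"DUPERSID": 20072019, "YEAR": 2007, "ICD1X": "530", "EVNTIDX": "H3"},
]
medication = [
    {"DUPERSID": 20072019, "YEAR": 2007, "RX1CD1X": code, "RXID": f"R{i}"}
    for i, code in enumerate(["577"] * 2 + ["530"] * 27)
]


def join_on_key(left, right, left_key, right_key):
    """Inner join: pair every left row with every right row sharing its key."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[f] for f in right_key)].append(row)
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in index.get(tuple(l_row[f] for f in left_key), [])
    ]


joined = join_on_key(
    hospital,
    medication,
    left_key=("DUPERSID", "YEAR", "ICD1X"),
    right_key=("DUPERSID", "YEAR", "RX1CD1X"),
)
pairs_530 = sum(1 for row in joined if row["ICD1X"] == "530")
print(len(joined), pairs_530)  # 56 54
```

The two stays for 530 each pair with all 27 prescriptions, so the join yields 54 rows for that code without saying which prescription belongs to which stay.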

6 comments:

Nick said...: Lots of excellent questions... here are some long-winded answers that I hope will help.

1. Well, relevance is a relative term. Part 1 is not strictly relevant to the broader task of predicting nosocomial pneumonia, no. However, being able to successfully predict a given diagnosis could have several uses, e.g.,

a) as a 2nd opinion (not replacing primary diagnosis, but helping it), or a training tool;
b) as a Medicare fraud detector (through examination of false negatives);
c) as a check on possible under-detection of a difficult-to-diagnose disease (through examination of false positives).

I'm probably missing others... in short, such a predictor could still be useful. More to the point, for purposes of this contest, as I mentioned before, we wanted some fairly easy "truth" to predict, so that half the contest would have an easily-quantifiable score.

2. Yes, you're right, there are some variables here that allow you to predict backwards in time. Strictly speaking, we should remove all variables that are not known at the time of admission. We're not going to do that, partly because it's too hard, but mainly because there are plenty of fields that would still be ambiguous, e.g., you might know at admission that you're going to have an operation, so should ANYOPER be removed or not? We're largely punting these questions, and I hope you'll find that the contest tasks are still sufficiently challenging. Yes, I believe your understanding of Part 2 is correct.

3. Apologies for the redundancy, but yeah, this is another excellent question, at least if I'm interpreting it correctly. If you want to match the prescribed medications with a particular hospital stay, my best suggestion is to line up the dates (along with the person IDs)... in general prescriptions given at discharge are probably filled a few days later (patient may not get to the pharmacy for a while, doctor may give some samples, etc.). Unfortunately there is no perfect way to know if a particular drug was prescribed for a particular hospital stay because, well, pharmacies don't record that sort of thing.

4. Not charged by the hospital. RXTOTAL is the total charged by the pharmacy. When it's missing, my best suggestion is to estimate it by summing the totals received from the various payers. As mentioned in another post, we don't have information on drug charges from the hospital - in my understanding, hospitals do not generally charge separately for medications. So, for the cost model, you have to figure out what meds were prescribed for pneumonia treatment, and how much they cost, then assume that in a case of nosocomial pneumonia, that cost would be charged by the hospital.

5. Yes it's safe. This is mostly done to standardize different reporting mechanisms in different states. Imputation was done by the data provider.

6. It's probably safe to ignore these. Our data coder (Patricia) decided not to include a complete set of definitions for the different values because she believes that these variables will be of little or no use to the task.; July 27, 2008 at 7:00 PM
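
The fallback suggested in answer 4 (estimate a missing RXTOTAL by summing what the individual payers paid) could be sketched roughly as follows. The payer field names here are hypothetical stand-ins, since the thread does not name the actual columns, and the sketch assumes a missing value has already been converted to `None`.

```python
# Hypothetical payer columns; the real medication file will use its own names.
PAYER_FIELDS = ("PAID_SELF", "PAID_MEDICARE", "PAID_PRIVATE")


def estimate_rxtotal(record, payer_fields=PAYER_FIELDS):
    """Return RXTOTAL if present, else the sum of the known payer amounts."""
    total = record.get("RXTOTAL")
    if total is not None:
        return total
    amounts = [record[f] for f in payer_fields if record.get(f) is not None]
    # With no payer amounts either, the charge stays unknown.
    return sum(amounts) if amounts else None


reported = {"RXTOTAL": 40.0, "PAID_SELF": 10.0, "PAID_MEDICARE": 30.0}
missing = {"RXTOTAL": None, "PAID_SELF": 10.0, "PAID_MEDICARE": 25.5, "PAID_PRIVATE": 0.0}
unknown = {"RXTOTAL": None}

print(estimate_rxtotal(reported))  # 40.0
print(estimate_rxtotal(missing))   # 35.5
print(estimate_rxtotal(unknown))   # None
```

Returning `None` when nothing is known keeps such records visibly unestimated instead of treating them as zero-cost prescriptions.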
Sri said...: Nick, excellent level of detail in your responses. Your answers gave clear direction on several issues. Here are my thoughts based on your responses:

1. I was worried because the objective of Part 1 is to predict the likelihood that a patient has pneumonia at the time the patient visits the hospital (i.e., before the primary diagnosis), which constrains us to the demographics file alone (the hospital and medication records come after the primary diagnosis). Since its relevance to the contest seemed greatly reduced, I wanted to confirm with you that we still need to proceed. It is clear now, and I agree with all the purposes you stated for this model.

2. Revisiting my own question on Part 2, I guess the model serves a broader scope.
a. For existing patients who did not contract nosocomial pneumonia, this model helps rank-order them and take preventive measures for high-risk individuals.
b. For existing nosocomial patients, the model is not meant to be used, since the hospital has to provide the treatment anyway: the individuals acquired the infection because of the hospital stay.
c. For “new patients”, this model can be used to predict the risk earlier. For the hospital and medication variables that will not be available at the time of admission, we can always use “what if” criteria to evaluate the risk. For example, after the primary diagnosis, we would know approximately how many nights the individual has to stay, whether he has to be operated on, and which medicines are prescribed for the diagnosis. So we can assess the risk if the individual has to stay for 2 nights, has to be operated on, and is prescribed the medicine “Y”. Furthermore, if the hospital stay and medication variables don’t turn out to be predictive, then there is absolutely no issue.

My question in the blog covered only the scenarios (a) and (b).

3. You got my question right. My team did look for a date variable in the files. However, a date variable is found only in the medication file, where approximately 28% of the prescriptions are dated between 1932 and 2002. We observed the following in the hospital data:
a. There are multiple records for the same combination of individual, primary diagnosis, and year.
b. The hospital file has only the year in which the data were collected and no date variable.
c. More often than not, the duplicate records for the same individual under the combination mentioned above fall within the same year.

It is not clear how using the date variable will help identify the right medication for the corresponding hospital stay. Out of the 18400 unique combinations of DUPERSID, year, and ICD1X, only 7837 are found in the medication file. Of these 7837 combinations, 2046 have multiple records in both files, which is where merging becomes a problem. Sorting by date in the medication file will not help, as 99% of the duplicates are within the same year. Other than dropping these 2046 combinations from our analyses, I do not see a close approximation for identifying which medicines correspond to ICD1X if we were to use the medication file.

No comments on (4) - (6)
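
The kind of count described in point 3 (key combinations present in both files, and those with multiple records on both sides) can be reproduced along these lines. This is a sketch on invented keys; the `ambiguous_keys` helper and the sample tuples are illustrative only and do not come from the contest data.

```python
from collections import Counter


def ambiguous_keys(hospital_keys, medication_keys):
    """Return (shared, ambiguous) sets of keys.

    shared: keys that occur in both files.
    ambiguous: shared keys with more than one record in BOTH files,
    i.e. the combinations where a merge cannot pair records one-to-one.
    """
    h_counts = Counter(hospital_keys)
    m_counts = Counter(medication_keys)
    shared = set(h_counts) & set(m_counts)
    ambiguous = {k for k in shared if h_counts[k] > 1 and m_counts[k] > 1}
    return shared, ambiguous


# Each key is (DUPERSID, year, ICD1X); the values are made up.
hospital_keys = [
    (1, 2005, "486"),
    (2, 2005, "530"), (2, 2005, "530"),
    (3, 2006, "577"), (3, 2006, "577"),
    (4, 2006, "401"),
]
medication_keys = [
    (1, 2005, "486"), (1, 2005, "486"),
    (2, 2005, "530"), (2, 2005, "530"), (2, 2005, "530"),
    (3, 2006, "577"),
]

shared, ambiguous = ambiguous_keys(hospital_keys, medication_keys)
print(len(shared), len(ambiguous))  # 3 1
```

Only the key with repeats on both sides is flagged; a single stay with many prescriptions, or several stays with one prescription, still merges without ambiguity about the pairing.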

1. Regarding duplicates for the same primary diagnosis in the hospital file: 5663 out of 33292 (17%) records are duplicates on ALL variables except EVNTIDX. Are these valid records? Does EVNTIDX indicate two completely different events (for example, would multiple hospital stays by the same individual within the same year have different EVNTIDX values)? There are no duplicates by EVNTIDX.

2. The data dictionary for the Conditions file describes DUID as Household identifier, while the hospital file defines it as Dwelling Unit Id. Is it something like a ward (a group of beds) where the patients are hospitalized? It doesn’t seem so, because there are approximately 5000 unique DUID values in the hospital file, which would mean at least 5000 beds, whereas the contest description says it is a 1000-bed hospital. We are trying to see whether we can get some predictive information by understanding what DUID means.
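
For reference, a check like the one in question 1 (records identical on every variable except EVNTIDX) might be sketched as below. The rows and the `group_repeats` helper are invented for illustration; only the field names EVNTIDX, DUPERSID, ICD1X, and NUMNIGHX come from the thread.

```python
from collections import defaultdict


def group_repeats(rows, id_field="EVNTIDX"):
    """Group event IDs of rows that agree on every field except id_field."""
    groups = defaultdict(list)
    for row in rows:
        # Signature = all (field, value) pairs other than the event ID.
        signature = tuple(sorted((k, v) for k, v in row.items() if k != id_field))
        groups[signature].append(row[id_field])
    return [ids for ids in groups.values() if len(ids) > 1]


rows = [
    {"EVNTIDX": "E1", "DUPERSID": 10, "YEAR": 2005, "ICD1X": "530", "NUMNIGHX": 2},
    {"EVNTIDX": "E2", "DUPERSID": 10, "YEAR": 2005, "ICD1X": "530", "NUMNIGHX": 2},
    {"EVNTIDX": "E3", "DUPERSID": 10, "YEAR": 2005, "ICD1X": "530", "NUMNIGHX": 5},
    {"EVNTIDX": "E4", "DUPERSID": 11, "YEAR": 2005, "ICD1X": "486", "NUMNIGHX": 2},
]

print(group_repeats(rows))  # [['E1', 'E2']]
```

Grouping the event IDs instead of dropping rows keeps every record available, which matters if the repeats turn out to be distinct events.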

More later on the Cost model….; July 28, 2008 at 10:28 AM
Nick said...: forwarded responses:

1. A patient can be readmitted for the same problem. In this case, all the variables will be the same except for the event id.

2. DUID is household identifier, or the dwelling unit for a family. It has to do with the living arrangements and has nothing to do with the hospital.; August 5, 2008 at 8:30 AM
Anonymous said...: The medications file contains two useful pieces of information. First, if the patient was prescribed an antibiotic just prior to admission, that patient probably already had pneumonia. Second, the treatment for pneumonia costs money, and the medications file will give an indication of this cost. Since the hospital file does not specifically list antibiotic treatment costs, this figure must be derived.; August 12, 2008 at 8:43 AM