Friday, July 18, 2008


Before diving into the latest run of questions (and corrections), I want to share a few high-level thoughts about how this contest is set up.

The main goal that Patricia and I had in mind when defining this problem was to pose a significant decision or policy problem that could be at least partially solved using data mining techniques. After all, INFORMS is all about making better decisions, right? Unfortunately, this goal is in many ways inconsistent with the typical construction of data mining contests, which usually have well-defined targets (to make scoring easier, among other reasons). Although our clinical goal is clear (design an effective plan for prophylactic treatment of infections), the data recorded by the hospital is not complete enough to perfectly 'answer' the underlying classification problem (who is likely to get the infection?).

This was the motivation for dividing the contest into two parts. As mentioned in previous answers, we have simplified Part 1 down to: Find the patients who have been (or will be) diagnosed with pneumonia. That way, we will have a clear, unambiguous set of labels with which to judge the results, even though the clinical relevance of the problem is substantially reduced. So, this part should be accessible to anybody willing to go through the process of cleaning and transforming this (fairly noisy) data set.
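To make the Part 1 target concrete, here is a minimal sketch of how such a patient-level label might be derived. It is an illustration, not the contest's scoring rule: it assumes diagnoses are recorded as ICD-9-CM codes (where 480 through 486 cover pneumonia), and the record layout, `PNEUMONIA_PREFIXES`, `has_pneumonia`, and `label_patients` are all hypothetical names made up for this example. The actual files may code things differently.

```python
# Assumption: diagnoses are ICD-9-CM codes stored as strings, with or
# without the decimal point (e.g. "486", "482.9", "4829").
# ICD-9-CM categories 480-486 are the pneumonia categories.
PNEUMONIA_PREFIXES = ("480", "481", "482", "483", "484", "485", "486")


def has_pneumonia(codes):
    """Return True if any diagnosis code falls in a pneumonia category."""
    return any(
        str(code).strip().startswith(PNEUMONIA_PREFIXES)
        for code in codes
        if code  # skip missing or empty entries
    )


def label_patients(records):
    """Collapse visit-level records into one True/False label per patient.

    `records` is a hypothetical iterable of (patient_id, diagnosis_codes)
    pairs; a patient is labeled True if any of their records carries a
    pneumonia code.
    """
    labels = {}
    for patient_id, codes in records:
        labels[patient_id] = labels.get(patient_id, False) or has_pneumonia(codes)
    return labels
```

Note that this only covers the "has been diagnosed" half of the target; finding patients who will be diagnosed later depends on how records are linked over time, which this sketch does not attempt.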

Part 2 gets to the heart of the real problem, and has generated most of the questions. The bottom line is that there is no perfect way to tell which pneumonia cases are nosocomial and which are not, because this information is not directly recorded. Some cases should be obvious, based on the coding of primary vs. secondary diagnoses, patient records vs. hospital records, etc., but that leaves quite a number that are still ambiguous. Resolving these cases (that is, finding the people who are likely to contract nosocomial pneumonia, and figuring out how much it costs to treat them for this infection) is the core of Part 2. So, while we'll try to pass along some pointers, we can't give you the rule to create target labels in Part 2, since a) there isn't a perfect one, and b) creating a good one is part of the contest.

Unfortunately, none of the organizers are domain experts; although we have some significant experience mining medical data, none of us are clinicians. One good piece of meta-advice (on any real-world problem) is to find such an expert.


Last thing - sorry about the slowness of the answers - we'll do better down the home stretch.
