Thursday, July 17, 2008

Data Questions on Part 1 of the Contest

I understand the patient data spans across two years (2003 and 2004). Can you throw more light on data that would answer the following questions?

  • To repeat an earlier question: what exactly does the variable ICD9CODX in the conditions data mean? The data dictionary defines it as “Patient Condition”. Does this mean “Patient Condition for which the individual visited the hospital”? If so, doesn’t this confirm that the Pneumonia patients in this file are not nosocomial?

  • I understand that ICD9CODX in the conditions file and the diagnosis variables ICD1X to ICD4X in the hospital file give the patient’s conditions over the course of the year, but what is the relationship between the values of these variables in the two files? For example, there are cases where the conditions in the conditions file do not exactly match the conditions in the hospital file (DUPERSID = “20048015”).

  • My understanding was that the conditions file would be used to identify the Pneumonia patients, who could then be classified as nosocomial or not. However, of the 971 unique patients identified with Pneumonia (ICD9CODX between 479 and 487), 693 are not identified with Pneumonia by the primary or secondary diagnosis code in the hospital file. Is the hospital file complete?

  • For the conditions and demographic files, it is clear what duplicate records for the unique identifier (DUPERSID) mean. However, the same is not clear for the hospital and medication files. For example, DUPERSID “20048015” has four records in the conditions file, one for each value of ICD9CODX: 401, 195, 787 and 300. The same DUPERSID also has four records in the hospital file. Of these, two records have ICD1X = 195 and two have ICD1X = 560.

(i) Why are there duplicates for ICD1X when there is only one record for a condition in the conditions file? What does a duplicate record for the same condition mean in the hospital file?

(ii) The condition 560 found in the hospital file is not found in the conditions file. Why could this happen?

(iii) The conditions 401, 787 and 300 present in the conditions file are missing in the hospital file. How do we explain this?

(iv) There are 41 records for DUPERSID (20048015) in the medication file. How do we interpret the duplicate records in the file?

(v) In general, how do we interpret the duplicates for an individual in the hospital and the medications files?

  • For about 83% of records, many key variables such as SPECCOND and RSNINHOS, along with most of the other variables in the hospital file, are missing. How do we interpret the missing values?

  • For 10% of the records in the demographic file, EDUCYEAR takes the value -1, which is not specified in the data dictionary. Does -1 indicate “Inapplicable”?

  • What is the relation between the Poverty and Income variables? 35% of individuals with 0 income are classified with Poverty = 4 or 5, so I am curious whether poverty is based on a broader classification.


Nick said...

Rather than going point by point here, I will try to summarize the role of the ICD9 codes in the different files and hope that answers most of your questions - please follow up if I miss something.

The conditions file records all of the conditions that the patient was diagnosed with at any point in the year. There is no preset limit on the number of these.

Records in the hospital file record no more than four diagnoses for each patient - presumably the ones the physician thought were most relevant. In most cases, the primary diagnosis is the reason for the hospitalization (see SPECCOND). Logically, you would expect this list to be a subset of the list in the conditions file, but as you point out, this is not always the case - most likely because of either a coding error or an omission in the conditions file.

There are fewer patients in the hospital file than the conditions file because not everybody gets admitted.
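To see how far the hospital diagnoses depart from the "subset of the conditions file" expectation, one option is to collect the codes per DUPERSID from each file and take set differences. The sketch below is only an illustration: the toy rows reuse the values quoted in the question for DUPERSID 20048015, while the list-of-dicts layout and the helper names (codes_by_patient, compare_files) are assumptions, not the contest's file format. Only ICD1X values were quoted, so ICD2X to ICD4X are simply absent here.

```python
from collections import defaultdict

# Diagnosis columns in the hospital file, as named in the thread.
DIAG_COLS = ("ICD1X", "ICD2X", "ICD3X", "ICD4X")

# Toy rows built from the values quoted for DUPERSID 20048015 (layout assumed).
conditions = [
    {"DUPERSID": "20048015", "ICD9CODX": "401"},
    {"DUPERSID": "20048015", "ICD9CODX": "195"},
    {"DUPERSID": "20048015", "ICD9CODX": "787"},
    {"DUPERSID": "20048015", "ICD9CODX": "300"},
]
hospital = [
    {"DUPERSID": "20048015", "ICD1X": "195"},
    {"DUPERSID": "20048015", "ICD1X": "195"},
    {"DUPERSID": "20048015", "ICD1X": "560"},
    {"DUPERSID": "20048015", "ICD1X": "560"},
]


def codes_by_patient(rows, columns):
    """Collect the set of non-missing codes per DUPERSID."""
    codes = defaultdict(set)
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value not in (None, ""):
                codes[row["DUPERSID"]].add(value)
    return codes


def compare_files(conditions, hospital):
    """For each hospitalized patient, list codes found in only one file."""
    cond = codes_by_patient(conditions, ("ICD9CODX",))
    hosp = codes_by_patient(hospital, DIAG_COLS)
    report = {}
    for pid, hosp_codes in hosp.items():
        cond_codes = cond.get(pid, set())
        report[pid] = {
            "hospital_only": sorted(hosp_codes - cond_codes),
            "conditions_only": sorted(cond_codes - hosp_codes),
        }
    return report


report = compare_files(conditions, hospital)
print(report["20048015"])
```

A non-empty "hospital_only" list marks a patient for whom the subset expectation fails (here, code 560). Entries under "conditions_only" are expected, since the hospital file holds at most four diagnoses per record.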

Nick said...

In response to other questions from Sri's post:

- There are often many records for a particular ID in the medications file simply because some people have a lot of prescriptions (refills, etc. get recorded separately).

- As far as we know, missing values have no specific semantics beyond "not entered."

- Any negative value in a field like education should be considered missing.

- Ideally, "income" is in fact separate from "class," although you would expect them to be strongly related... think of people living on pensions, etc. Roughly speaking, income is based on recent wages, poverty is a level of (lack of) wealth.
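As a rough illustration of the points above about negative codes and repeated IDs, the sketch below recodes negative values to missing and counts records per DUPERSID. The toy rows, the IDs "A", "B" and "C", and the helper name clean_value are all hypothetical. The recoding is meant only for fields like EDUCYEAR, where a negative value cannot be a real measurement.

```python
from collections import Counter


def clean_value(value):
    """Map blanks and negative numeric codes (e.g. -1) to None."""
    if value is None or value == "":
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        return value  # non-numeric text is left untouched
    return None if number < 0 else value


# Hypothetical demographic rows; only the column names come from the thread.
demographics = [
    {"DUPERSID": "A", "EDUCYEAR": 12},
    {"DUPERSID": "B", "EDUCYEAR": -1},
    {"DUPERSID": "C", "EDUCYEAR": ""},
]
educ = [clean_value(row["EDUCYEAR"]) for row in demographics]
missing_share = sum(v is None for v in educ) / len(educ)

# Hypothetical medication rows: one row per recorded prescription or refill.
medications = [{"DUPERSID": "A"}, {"DUPERSID": "A"}, {"DUPERSID": "B"}]
records_per_id = Counter(row["DUPERSID"] for row in medications)

print(educ, missing_share, records_per_id)
```

Counting rows per ID this way treats repeated IDs in the medications file as separate prescription events rather than as duplicates to be removed.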

Sri said...

Thanks, Nick, for answering the questions from my team.