Thanks very much, Nick, for your detailed responses. Here are a few follow-up questions.
1. Our objective in this contest is not to eliminate primary diagnosis through predictive modeling or some protocol. That is, every individual will still go through primary diagnosis, during which a physician will determine the patient's actual condition (whether or not the individual has pneumonia). Also, the likelihood of an individual contracting nosocomial pneumonia is to be predicted as part of the solution to Part 2 of the contest. Given these points, is Part 1 of the contest relevant?
2. My understanding of the objective of Part 2 is as follows: the protocols developed for Part 2 will only be used to rank-order individuals by their risk of contracting nosocomial pneumonia, and a high-risk subset will then be targeted for preventive measures based on the cost model. In other words, these protocols will not be used to predict the likelihood of infection at the time of admission, because information from the hospital and medication files such as NUMNIGHT, ANYOPER, and other medication-related variables (which are important for determining the risk) will not be available at admission. Is my understanding of the purpose of Part 2 correct?
3. Other than the DUPERSID, the year and the RX1CD1X, is there a common ID between the hospital file and the medication file by which they can be merged? I tried merging on LINKIDX from the medication file and EVNTIDX from the hospital file, but no match was found. For example, for DUPERSID 20072019 there are three records in the hospital file, with ICD1X codes of 577 (once) and 530 (twice). In the medication file, the same DUPERSID has 29 records (2 for 577 and 27 for 530). How do we relate the various prescriptions for 530 in the medication file to the corresponding two records in the hospital file? Merging by DUPERSID, year and RX1CD1X will not help here, as it would be a many-to-many merge.
4. The variable rxtotal in the medication file is missing for more than 95% of the records. Isn’t this the total amount charged by the hospital for medication, which is important for the cost model?
5. A few variables are described in the data dictionary as edited or imputed. For example, NUMNIGHX from the hospital file and RXFORM, RXFRMUNT, etc. from the medication file are imputed variables. Is it safe to ignore the imputation procedure?
6. In the medication file, there is a series of MULTUM THERAPEUTIC CLASS variables; take TC1 as an example. The data dictionary lists the categories 1, 20, 28, 40, and 57 for this variable. However, in the data, 50% of records take values such as 19, 97, 122, 133, and 242. Should we proceed as is, regardless of what the categories mean for this variable?
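To make question 3 concrete, here is a minimal sketch in Python (standard library only) of why merging on DUPERSID and the diagnosis code is many to many. The rows are toy data built only from the counts quoted in question 3; the event IDs H1 to H3 are made up for illustration, and the year key is omitted for brevity.

```python
from collections import Counter

# Toy rows shaped like the DUPERSID 20072019 example.
# The EVNTIDX values (H1..H3) are hypothetical placeholders.
hospital = [
    {"DUPERSID": "20072019", "ICD1X": "577", "EVNTIDX": "H1"},
    {"DUPERSID": "20072019", "ICD1X": "530", "EVNTIDX": "H2"},
    {"DUPERSID": "20072019", "ICD1X": "530", "EVNTIDX": "H3"},
]
medication = (
    [{"DUPERSID": "20072019", "RX1CD1X": "577"} for _ in range(2)]
    + [{"DUPERSID": "20072019", "RX1CD1X": "530"} for _ in range(27)]
)


def key_counts(rows, code_field):
    """Count rows per (person, diagnosis code) key."""
    return Counter((row["DUPERSID"], row[code_field]) for row in rows)


hosp_counts = key_counts(hospital, "ICD1X")
med_counts = key_counts(medication, "RX1CD1X")

# Keys that repeat on BOTH sides are the many-to-many cases.
ambiguous = {
    key: (hosp_counts[key], med_counts[key])
    for key in hosp_counts
    if hosp_counts[key] > 1 and med_counts[key] > 1
}

# Number of rows an inner join on these keys would produce.
joined_rows = sum(hosp_counts[key] * med_counts[key] for key in hosp_counts)

print(ambiguous)    # {('20072019', '530'): (2, 27)}
print(joined_rows)  # 56
```

With these counts the join yields 56 rows from 29 prescriptions: each of the 27 prescriptions for 530 is attached to both hospital stays, which is exactly the ambiguity that an event-level link would have to resolve.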
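For question 6, the mismatch between the data and the data dictionary can be measured with a simple set comparison. This is only a sketch: the documented categories are the ones quoted in question 6, while the `tc1_values` list is hypothetical sample data standing in for the real TC1 column.

```python
from collections import Counter

# Categories listed for TC1 in the data dictionary (quoted in question 6).
documented = {1, 20, 28, 40, 57}

# Hypothetical TC1 values; the real ones would be read from the medication file.
tc1_values = [1, 19, 20, 97, 122, 133, 242, 19, 57, 97]

counts = Counter(tc1_values)

# Values that appear in the data but not in the data dictionary, with frequencies.
undocumented = {value: n for value, n in counts.items() if value not in documented}

# Share of records carrying an undocumented value.
share = sum(undocumented.values()) / len(tc1_values)

print(sorted(undocumented))  # [19, 97, 122, 133, 242]
print(share)                 # 0.7
```

A frequency table of the undocumented values like this one may make it easier to tell whether they are a few stray codes or whole categories that the dictionary omits.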