Monday, April 7, 2008

Data Issues In RX0304

The file is CSV; a field is quoted only when it contains a comma.

1. There are a few instances where a quote is not part of a matching pair, which throws the values off by one column. Typically the pattern is a GX qualifier followed by a dimension that contains a quote, and that quote is the same character used to delimit the field.

These are the line numbers I have checked that have this issue:

40490, 347931, 347932, 347933, 347934, 347935, 371124, 371125, 371126, 371144, 371152, 417874, 417875, 417876, 469465, 469466, 566255, 621479...there are more later.
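
One way to look for further lines of this kind is to flag any line with an odd number of quote characters, or whose field count differs from the first line's. This is only a rough sketch, not the method used to build the list above. The function name `suspicious_lines` and the sample rows are made up for illustration, and it assumes one record per physical line.

```python
import csv

def suspicious_lines(lines, expected_fields=None):
    """Return 1-based numbers of lines that look misquoted.

    A line is flagged when it has an odd number of quote characters or
    when its parsed field count differs from expected_fields. If
    expected_fields is None, the first line's field count is used.
    """
    flagged = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        fields = next(csv.reader([text]), [])
        if expected_fields is None:
            expected_fields = len(fields)
        odd_quotes = text.count('"') % 2 == 1
        if odd_quotes or len(fields) != expected_fields:
            flagged.append(number)
    return flagged

# Made-up sample rows, not taken from RX0304.
sample = [
    'ID,DESC,QTY',
    '1,"TUBE, FLEXIBLE",3',
    '2,"TUBE, 5" LONG",4',  # inch mark inside a quoted field
]
print(suspicious_lines(sample))
```

An open file object can be passed in place of `sample`, since iterating over a file yields its lines. An even number of stray quotes on one line would slip past the first check, so this is a heuristic rather than a guarantee.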

2. There are two instances where PHARTP1 has an invalid value of "%":

139722, 139723

Be aware of both issues, since they can cause problems with import routines.
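
A quick way to spot odd values such as "%" before importing is to tally the distinct values in a column and look at the rare ones. The sketch below assumes the file has a header row naming the column PHARTP1. The helper name `tally_column` and the sample data are made up, and since the valid PHARTP1 codes are not listed here, it only counts values rather than validating them.

```python
import csv
import io
from collections import Counter

def tally_column(rows, column):
    """Count how often each distinct value appears in the named column."""
    return Counter(row.get(column) for row in rows)

# Made-up sample; only the column name PHARTP1 and the "%" value
# come from the post.
sample = io.StringIO('ID,PHARTP1\n1,3\n2,%\n3,3\n')
counts = tally_column(csv.DictReader(sample), "PHARTP1")
print(counts.most_common())
```

Running the tally after the quote problem has been repaired would also help tell a genuine "%" in the raw data apart from one that only appears because columns shifted.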



Nick said...

Turns out that our conversion from a proprietary format to CSV was not as problem-free as one would hope. Sorry about that. On the other hand, some data cleaning should be expected, right?

I found 76 instances of a quote (") used to mean "inches" tucked inside a pair of quotes delimiting a string, along with one other very interesting usage. Much more common (20K instances or so) was a comma (,) inside a string, which, depending on your conversion method, might cause the record to look like it has an extra field.
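
The extra-field effect from an embedded comma is easy to reproduce. A small sketch, using a made-up record rather than one from the file: splitting on commas yields four pieces, while Python's `csv` module honors the quotes and yields three.

```python
import csv

# Made-up record with a comma inside a quoted string.
line = '1001,"TABLET, CHEWABLE",30'

naive = line.split(",")            # ignores the quoting
parsed = next(csv.reader([line]))  # treats the quoted text as one field

print(len(naive), naive)
print(len(parsed), parsed)
```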

The "%" issue will need to be addressed by our data expert who is not currently available.

Igor said...

For those interested, a cleaned file can be downloaded from my site at

http: // / files /

after removing the spaces from the URL. The file is about 19 MB.

The "quotes" issue is fixed. The % issue no longer appears once the quotes are fixed, so the two must have been related.



Dmitriy said...


Just beginning the import routines. I don't think the quote issue is affecting the % issue, since % is in the raw file.

Nick, any news from your data expert?



Dmitriy said...

Also, Igor - how come the cleaned file doubled in size?