Has anyone written a validation (regex or otherwise) for ICD-10-CM?
I'm not interested in the trivial solution (3-7 alphanumeric), I'd like to know how incorporation of the 7th digit requirement was handled.
Update for COVID diagnosis codes: The letter "U" is valid as of 2020.
I just wrote a regular expression for all 2016 ICD10 codes:
/^[A-TV-Z]\d[0-9AB](?:\.([\dA-KXZ]|[\dA-KXZ][\dAX-Z]|[\dA-KXZ][\dAX-Z][\dX]|[\dA-KXZ][\dAX-Z][\dX][0-59A-HJKMNP-S]))?$/
This regex assumes the dot is present after the third character when it is supposed to be; the CDC distributes the code lists with dots omitted.
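A minimal sketch of how this regex might be used from Python, including the dot handling mentioned above. The helper names `with_dot` and `looks_like_icd10` are made up for illustration, and the pattern is copied from the answer unchanged apart from dropping the `/.../` delimiters.

```python
import re

# Pattern from the answer above, minus the /.../ delimiters.
# re.ASCII keeps \d from matching non-ASCII digits.
ICD10_2016 = re.compile(
    r"^[A-TV-Z]\d[0-9AB](?:\.([\dA-KXZ]|[\dA-KXZ][\dAX-Z]|"
    r"[\dA-KXZ][\dAX-Z][\dX]|[\dA-KXZ][\dAX-Z][\dX][0-59A-HJKMNP-S]))?$",
    re.ASCII,
)

def with_dot(code):
    """Insert the dot after the third character if the code lacks one."""
    if "." in code or len(code) <= 3:
        return code
    return code[:3] + "." + code[3:]

def looks_like_icd10(code):
    """True if the code has a plausible 2016 shape; this does not prove it exists."""
    return ICD10_2016.fullmatch(with_dot(code.strip().upper())) is not None

print(looks_like_icd10("H40.11X0"))  # dotted form
print(looks_like_icd10("T6401XA"))   # dotless form, as in the CDC lists
print(looks_like_icd10("U07.1"))     # rejected: the 2016 pattern excludes U
```

Since the pattern was built from the 2016 code set, the first character class would need widening to accept the "U" codes mentioned in the update.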
In the research I've done regarding ICD10 code structure, not all the rules and pitfalls are documented. This regex was constructed according to the codes that actually exist, because documentation found online for ICD10 structure doesn't tell the entire story.
First character is alpha, except U.
Second char is numeric.
Third char is numeric, A, or B (these letters were recently added).
Dot for codes longer than 3 characters (not referred to as a character in any descriptions of code rules).
Fourth char is numeric or letters ABCDEFGHIJKXZ.
Fifth char is numeric or letters AXYZ.
Sixth char is numeric or X.
X is used as a placeholder when it appears as the fourth, fifth, or sixth character (but never as the last character).
Seventh char is more complex than any reference suggests. A, D, and S usually denote initial encounter, subsequent encounter, and sequela. Certain other sets of codes have their own extensions; for bone fractures these are ABCDEFGHJKMNPQRS, where A, D, and S still express the encounter type but may confer additional info. Codes also exist that use digits 01234 in this position.
Laterality is not straightforward at all. Documentation states that 1 == right, 2 == left, which is usually true. However, 3 == bilateral, 9 == unspecified (5th char) and 0 == unspecified (6th char) are not always true.
There are many codes where laterality is represented in conjunction with something else, often which limb. In these codes, left, right, unspecified is expressed using 1,2,3; 4,5,6; 7,8,9 to represent the other factor. A doubly unspecific code using 0 may also be present.
Furthermore, the character expressing laterality is not always the last character of the first six.
The descriptions of some lateral codes suggest an additional "other" side.
ICD10 is actually a tree of codes where usable codes are the leaves, with each node containing metadata that applies to itself and all descendants.
As said in other answers, some codes may look like an ICD10 code but are actually invalid. However, the CDC does publish a flat list of all codes at
http://www.cdc.gov/nchs/icd/icd10cm.htm
The descriptions in this list drop the non-ASCII (UTF-8 encoded) characters that appear in ~50 codes, such as
H81.01 Ménière's disease, right ear
but the list does contain descriptions of all 69823 usable codes. So you can tell right away that the maximum possible code cardinality of 26*10*10*10*10*10*26 is much greater than 69823, so a regex alone is right out for checking validity.
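The arithmetic behind that comparison can be checked quickly. This is only a sketch that takes the formula and the code count exactly as quoted above; the variable names are made up.

```python
# Upper bound on strings matching the naive shape quoted above.
pattern_space = 26 * 10 * 10 * 10 * 10 * 10 * 26
usable_codes = 69823  # count quoted above for the flat file

print(pattern_space)  # 67600000
print(f"share of the pattern space that is a real code: {usable_codes / pattern_space:.4%}")
```

Only a tiny share of well-formed strings are real codes, so a shape check cannot stand in for a lookup.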
In order to get all of the 7th character information, parsing of the XML and expanding based on 'rules' is required. And if you are looking for metadata on each of the codes, the flat code file does not have it. You will have to parse the XML for that metadata (or use an API, etc.)
An example is best:
<diag>
  <name>H40.11</name>
  <desc>Primary open-angle glaucoma</desc>
  <inclusionTerm>
    <note>Chronic simple glaucoma</note>
  </inclusionTerm>
  <sevenChrNote>
    <note>One of the following 7th characters is to be assigned to code H40.11 to designate the stage of glaucoma</note>
  </sevenChrNote>
  <sevenChrDef>
    <extension char="0">stage unspecified</extension>
    <extension char="1">mild stage</extension>
    <extension char="2">moderate stage</extension>
    <extension char="3">severe stage</extension>
    <extension char="4">indeterminate stage</extension>
  </sevenChrDef>
</diag>
In your XML parsing, to correctly get the 7th character, you must parse the string "One of the following 7th characters is to be assigned to code H40.11 to designate the stage of glaucoma" and expand the code H40.11 to each <extension> under the <sevenChrDef>. So with the above example, you will get each of the codes:
H40.11X0 Primary open-angle glaucoma, stage unspecified
H40.11X1 Primary open-angle glaucoma, mild stage
H40.11X2 Primary open-angle glaucoma, moderate stage
H40.11X3 Primary open-angle glaucoma, severe stage
H40.11X4 Primary open-angle glaucoma, indeterminate stage
The X is a 'placeholder' to ensure 7 character code length.
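The expansion above could be sketched in Python roughly as follows. This assumes the node has the shape shown in the example (trimmed here to the elements the expansion needs) and uses only the standard library; `pad_and_extend` and `expand` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the H40.11 node shown above.
SNIPPET = """
<diag>
  <name>H40.11</name>
  <desc>Primary open-angle glaucoma</desc>
  <sevenChrDef>
    <extension char="0">stage unspecified</extension>
    <extension char="1">mild stage</extension>
    <extension char="2">moderate stage</extension>
    <extension char="3">severe stage</extension>
    <extension char="4">indeterminate stage</extension>
  </sevenChrDef>
</diag>
"""

def pad_and_extend(name, seventh):
    """Pad the code with X up to six characters, then append the 7th character."""
    padded = name.replace(".", "").ljust(6, "X")
    return padded[:3] + "." + padded[3:] + seventh

def expand(diag):
    """Return (code, description) pairs, one per <extension> of this node."""
    name = diag.findtext("name")
    desc = diag.findtext("desc")
    return [
        (pad_and_extend(name, ext.get("char")), f"{desc}, {ext.text}")
        for ext in diag.findall("sevenChrDef/extension")
    ]

for code, description in expand(ET.fromstring(SNIPPET)):
    print(code, description)
```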
It gets worse...
Consider the code branch starting with T64:
<diag>
  <name>T64</name>
  <desc>Toxic effect of aflatoxin and other mycotoxin food contaminants</desc>
  <sevenChrNote>
    <note>The appropriate 7th character is to be added to each code from category T64</note>
  </sevenChrNote>
  <sevenChrDef>
    <extension char="A">initial encounter</extension>
    <extension char="D">subsequent encounter</extension>
    <extension char="S">sequela</extension>
  </sevenChrDef>
  <diag>
    <name>T64.0</name>
    <desc>Toxic effect of aflatoxin</desc>
    <diag>
      <name>T64.01</name>
      <desc>Toxic effect of aflatoxin, accidental (unintentional)</desc>
    </diag>
    <diag>
      <name>T64.02</name>
      <desc>Toxic effect of aflatoxin, intentional self-harm</desc>
    </diag>
    <diag>
      <name>T64.03</name>
      <desc>Toxic effect of aflatoxin, assault</desc>
    </diag>...
T64 is not a leaf node and is therefore not billable. However, it still has 7th character metadata. This means that you must apply or 'multiply' each leaf code beneath it by the extensions in its <sevenChrDef> (A, D, and S), obtaining the codes:
T6401XA Toxic effect of aflatoxin, accidental (unintentional), initial encounter
T6401XD Toxic effect of aflatoxin, accidental (unintentional), subsequent encounter
T6401XS Toxic effect of aflatoxin, accidental (unintentional), sequela
T6402XA Toxic effect of aflatoxin, intentional self-harm, initial encounter
T6402XD Toxic effect of aflatoxin, intentional self-harm, subsequent encounter
T6402XS Toxic effect of aflatoxin, intentional self-harm, sequela
T6403XA Toxic effect of aflatoxin, assault, initial encounter
T6403XD Toxic effect of aflatoxin, assault, subsequent encounter
T6403XS Toxic effect of aflatoxin, assault, sequela
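A rough sketch of that 'multiplication' as a recursive walk, assuming that a node without its own <sevenChrDef> inherits the nearest ancestor's. The snippet below is the T64 excerpt from above, closed off after the three codes shown so that it parses; `leaf_codes` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# The T64 excerpt above, trimmed and closed so that it is well-formed.
SNIPPET = """
<diag>
  <name>T64</name>
  <desc>Toxic effect of aflatoxin and other mycotoxin food contaminants</desc>
  <sevenChrDef>
    <extension char="A">initial encounter</extension>
    <extension char="D">subsequent encounter</extension>
    <extension char="S">sequela</extension>
  </sevenChrDef>
  <diag>
    <name>T64.0</name>
    <desc>Toxic effect of aflatoxin</desc>
    <diag><name>T64.01</name><desc>Toxic effect of aflatoxin, accidental (unintentional)</desc></diag>
    <diag><name>T64.02</name><desc>Toxic effect of aflatoxin, intentional self-harm</desc></diag>
    <diag><name>T64.03</name><desc>Toxic effect of aflatoxin, assault</desc></diag>
  </diag>
</diag>
"""

def leaf_codes(diag, inherited=()):
    """Yield (code, description) for every leaf under diag, dots omitted."""
    own = [(e.get("char"), e.text) for e in diag.findall("sevenChrDef/extension")]
    extensions = own or inherited  # assumption: the nearest definition wins
    children = diag.findall("diag")
    if children:
        for child in children:
            yield from leaf_codes(child, extensions)
        return
    bare = diag.findtext("name").replace(".", "")
    desc = diag.findtext("desc")
    if not extensions:
        yield bare, desc  # leaf with no 7th character requirement
        return
    padded = bare.ljust(6, "X")  # X placeholder up to six characters
    for char, label in extensions:
        yield padded + char, f"{desc}, {label}"

for code, description in leaf_codes(ET.fromstring(SNIPPET)):
    print(code, description)
```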
We will hopefully get permission to reprint/supplement the ICD10 codes in JSON format where each code has explicit metadata, but until then this is your best bet.
If all you need is to determine validity of an ICD10 code, just load up the first column of the flat file (separated by \r).
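A minimal sketch of that lookup, assuming each line of the flat file starts with the dotless code followed by whitespace and the description. The sample lines stand in for the real file, and `load_codes` and `is_valid` are hypothetical names.

```python
def load_codes(lines):
    """Collect the first whitespace-delimited column of each non-empty line."""
    return {line.split(None, 1)[0] for line in lines if line.strip()}

def is_valid(code, codes):
    """Look the code up after dropping the dot, since the flat file omits it."""
    return code.replace(".", "").strip().upper() in codes

# Stand-in for the flat file; splitlines() also copes with bare \r separators.
sample = (
    "T6401XA Toxic effect of aflatoxin, accidental (unintentional), initial encounter\r"
    "T6401XD Toxic effect of aflatoxin, accidental (unintentional), subsequent encounter\r"
)
codes = load_codes(sample.splitlines())

print(is_valid("T64.01XA", codes))
print(is_valid("T64.01", codes))
```

With the real file, passing an open file object to `load_codes` should work the same way, since Python's default text mode also treats `\r` as a line ending.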