Medical information extraction using Python

Tags:

I am a nurse and I know python but I am not an expert, just used it to process DNA sequences
We got hospital records written in human languages and I am supposed to insert these data into a database or csv file but they are more than 5000 lines and this can be so hard. All the data are written in a consistent format let me show you an example

11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later

I should get the following data

Sex: Male
Symptoms: Nausea
    Vomiting
Death: True
Death Time: 11/11/2010 - 01:00pm

Another example

11/11/2010 - 09:00am : She got heart burn, vomiting of blood and died 1 hours later in the operation room

And I get

Sex: Female
Symptoms: Heart burn
    Vomiting of blood
Death: True
Death Time: 11/11/2010 - 10:00am

the order is not consistent by when I say in ....... so in is a keyword and all the text after is a place until i find another keyword
At the beginnning He or She determine sex, got ........ whatever follows is a group of symptoms that i should split according to the separator which can be a comma, hypen or whatever but it's consistent for the same line
died ..... hours later also should get how many hours, sometimes the patient is stil alive and discharged ....etc
That's to say we have a lot of conventions and I think if i can tokenize the text with keywords and patterns i can get the job done. So please if you know a useful function/modules/tutorial/tool for doing that preferably in python (if not python so a gui tool would be nice)

Some few information:

there are a lot of rules to express various medical data but here are few examples
- Start with the same date/time format followed by a space followd by a colon followed by a space followed by He/She followed space followed by rules separated by and
- Rules:
    * got <symptoms>,<symptoms>,....
    * investigations were done <investigation>,<investigation>,<investigation>,......
    * received <drug or procedure>,<drug or procedure>,.....
    * discharged <digit> (hour|hours) later
    * kept under observation
    * died <digit> (hour|hours) later
    * died <digit> (hour|hours) later in <place>
other rules do exist but they follow the same idea

779

asked Oct 25 '10 02:10

Nurse

2 Answers

This uses dateutil to parse the date (e.g. '11/11/2010 - 09:00am'), and parsedatetime to parse the relative time (e.g. '4 hours later'):

import dateutil.parser as dparser
import parsedatetime.parsedatetime as pdt
import parsedatetime.parsedatetime_consts as pdc
import time
import datetime
import re
import pprint
pdt_parser = pdt.Calendar(pdc.Constants())   
record_time_pat=re.compile(r'^(.+)\s+:')
sex_pat=re.compile(r'\b(he|she)\b',re.IGNORECASE)
death_time_pat=re.compile(r'died\s+(.+hours later).*$',re.IGNORECASE)
symptom_pat=re.compile(r'[,-]')

def parse_record(astr):    
    match=record_time_pat.match(astr)
    if match:
        record_time=dparser.parse(match.group(1))
        astr,_=record_time_pat.subn('',astr,1)
    else: sys.exit('Can not find record time')
    match=sex_pat.search(astr)    
    if match:
        sex=match.group(1)
        sex='Female' if sex.lower().startswith('s') else 'Male'
        astr,_=sex_pat.subn('',astr,1)
    else: sys.exit('Can not find sex')
    match=death_time_pat.search(astr)
    if match:
        death_time,date_type=pdt_parser.parse(match.group(1),record_time)
        if date_type==2:
            death_time=datetime.datetime.fromtimestamp(
                time.mktime(death_time))
        astr,_=death_time_pat.subn('',astr,1)
        is_dead=True
    else:
        death_time=None
        is_dead=False
    astr=astr.replace('and','')    
    symptoms=[s.strip() for s in symptom_pat.split(astr)]
    return {'Record Time': record_time,
            'Sex': sex,
            'Death Time':death_time,
            'Symptoms': symptoms,
            'Death':is_dead}


if __name__=='__main__':
    tests=[('11/11/2010 - 09:00am : He got nausea, vomiting and died 4 hours later',
            {'Sex':'Male',
             'Symptoms':['got nausea', 'vomiting'],
             'Death':True,
             'Death Time':datetime.datetime(2010, 11, 11, 13, 0),
             'Record Time':datetime.datetime(2010, 11, 11, 9, 0)}),
           ('11/11/2010 - 09:00am : She got heart burn, vomiting of blood and died 1 hours later in the operation room',
           {'Sex':'Female',
             'Symptoms':['got heart burn', 'vomiting of blood'],
             'Death':True,
             'Death Time':datetime.datetime(2010, 11, 11, 10, 0),
             'Record Time':datetime.datetime(2010, 11, 11, 9, 0)})
           ]

    for record,answer in tests:
        result=parse_record(record)
        pprint.pprint(result)
        assert result==answer
        print

yields:

{'Death': True,
 'Death Time': datetime.datetime(2010, 11, 11, 13, 0),
 'Record Time': datetime.datetime(2010, 11, 11, 9, 0),
 'Sex': 'Male',
 'Symptoms': ['got nausea', 'vomiting']}

{'Death': True,
 'Death Time': datetime.datetime(2010, 11, 11, 10, 0),
 'Record Time': datetime.datetime(2010, 11, 11, 9, 0),
 'Sex': 'Female',
 'Symptoms': ['got heart burn', 'vomiting of blood']}

Note: Be careful parsing dates. Does '8/9/2010' mean August 9th, or September 8th? Do all the record keepers use the same convention? If you choose to use dateutil (and I really think that's the best option if the date string is not rigidly structured) be sure to read the section on "Format precedence" in the dateutil documentation so you can (hopefully) resolve '8/9/2010' properly. If you can't guarantee that all the record keepers use the same convention for specifying dates, then the results of this script would have be checked manually. That might be wise in any case.

108

answered Oct 05 '22 04:10

unutbu

Here are some possible way you can solve this -

Using Regular Expressions - Define them according to the patterns in your text. Match the expressions, extract pattern and you repeat for all records. This approach needs good understanding of the format in which the data is & of course regular expressions :)
String Manipulation - This approach is relatively simpler. Again one needs a good understanding of the format in which the data is. This is what I have done below.
Machine Learning - You could define all you rules & train a model on these rules. After this the model tries to extract data using the rules you provided. This is a lot more generic approach than the first two. Also the toughest to implement.

See if this work for you. Might need some adjustments.

new_file = open('parsed_file', 'w')
for rec in open("your_csv_file"):
    tmp = rec.split(' : ')
    date = tmp[0]
    reason = tmp[1]

    if reason[:2] == 'He':
        sex = 'Male'
        symptoms = reason.split(' and ')[0].split('He got ')[1]
    else:
        sex = 'Female'
        symptoms = reason.split(' and ')[0].split('She got ')[1]
    symptoms = [i.strip() for i in symptoms.split(',')]
    symptoms = '\n'.join(symptoms)
    if 'died' in rec:
        died = 'True'
    else:
        died = 'False'
    new_file.write("Sex: %s\nSymptoms: %s\nDeath: %s\nDeath Time: %s\n\n" % (sex, symptoms, died, date))

Ech record is newline separated \n & since you did not mention one patient record is 2 newlines separated \n\n from the other.

LATER: @Nurse what did you end up doing? Just curious.

answered Oct 05 '22 04:10

Srikar Appalaraju

Related questions
                            
                                AttributeError: module 'cv2.cv2' has no attribute 'cv'
                            
                                Python pandas - new column's value if the item is in the list
                            
                                Find indices of duplicate rows in pandas DataFrame
                            
                                Changing the fill_values in a SparseDataFrame - replace throws TypeError
                            
                                VS Code shows an error message at print statement in python 2.7
                            
                                numpy.unique sort based on counts
                            
                                How to construct an in-memory virtual file system and then write this structure to disk
                            
                                Why is Cython so much slower than Numba when iterating over NumPy arrays?
                            
                                I'm unable to install opencv-contrib-python in docker
                            
                                How to redirect/render Pyodide output in browser?
                            
                                Combine year, month and day in Python to create a date
                            
                                Get the nearest distance with two geodataframe in pandas
                            
                                How to solve 'Getting Default Adapter failed' error when launching Chrome and try to access a webpage using the ChromeDriver using Selenium
                            
                                Filtering by relation count in SQLAlchemy
                            
                                Highlighting unmatched brackets in vim
                            
                                django binary (no source code) deployment
                            
                                Extract the fields of a C struct
                            
                                First Order Logic Engine
                            
                                Python base classes share attributes? [duplicate]
                            
                                Python: Can a class forbid clients setting new attributes?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Medical information extraction using Python

Tags:

python

parsing

machine-learning

nlp

information-extraction

Nurse

People also ask

2 Answers

unutbu

Srikar Appalaraju

Recent Activity

Donate For Us