I have some text that was generated by another system. It combined some words together in what I assume was a by-product of word wrapping. So something simple like 'the dog' is combined into 'thedog'.
I checked the ASCII and Unicode string to see if there was some unseen character in there, but there wasn't. A confounding problem is that this is medical text, and corpora to check against aren't readily available. So, a real example is: '...test to rule out SARS versus pneumonia' ends up as '...versuspneumonia'.
Anyone have a suggestion for finding and separating these?
The main challenge is information overload, which makes it difficult to access a specific, important piece of information within vast datasets. Semantic and contextual understanding is both essential and challenging for summarisation systems because of quality and usability issues.
The usual text preprocessing steps are tokenization, lower casing, and stop-word removal.
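As a small sketch, here are those three steps in plain Python (the stop-word set below is just a placeholder, not a real list):

    import re

    STOP_WORDS = {"the", "a", "an", "to", "of", "is"}  # placeholder set, not exhaustive

    def preprocess(text):
        # Tokenization: pull out runs of letters, digits, and apostrophes
        tokens = re.findall(r"[A-Za-z0-9']+", text)
        # Lower casing
        tokens = [t.lower() for t in tokens]
        # Stop-word removal
        return [t for t in tokens if t not in STOP_WORDS]

    print(preprocess("Test to rule out SARS versus pneumonia"))
    # -> ['test', 'rule', 'out', 'sars', 'versus', 'pneumonia']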
One of the words in a sentence acts as the root, and all the other words are directly or indirectly linked to it through their dependencies. These dependencies represent relationships among the words in a sentence, and dependency grammars are used to infer the structure and semantic dependencies between the words.
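As a quick illustration, a dependency parse can be inspected with spaCy, assuming spaCy and its en_core_web_sm model are installed (the example sentence is mine, not from the original text):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English model, assumed installed
    doc = nlp("The test rules out SARS versus pneumonia.")
    for token in doc:
        # each token is linked to its head word by a labelled dependency
        print(token.text, token.dep_, "->", token.head.text)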
Word tokenization (also called word segmentation) is the problem of dividing a string of written language into its component words. In English and many other languages using some form of Latin alphabet, space is a good approximation of a word divider.
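That is exactly why the garbled output above is awkward: once the space is gone, a plain whitespace split cannot recover the boundary. A tiny sketch of the failure:

    text = "test to rule out SARS versuspneumonia"
    print(text.split())
    # -> ['test', 'to', 'rule', 'out', 'SARS', 'versuspneumonia']
    # 'versuspneumonia' stays fused; a dictionary-based segmenter is needed to split it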
This may be of interest to you: http://www.perlmonks.org/?node_id=336331
You can probably use the medical nature of the text to your advantage by using two dictionaries: one containing only medical terminology and one containing general English.
If you can isolate the medical words and then run the rest of the string against the general dictionary, you should get some decent results.
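For what it's worth, here is a rough sketch of that two-dictionary idea in Python: a recursive segmenter that accepts a split only if every piece appears in either word list. The MEDICAL and GENERAL sets below are tiny placeholders you would replace with a real medical lexicon and a general English dictionary (e.g. /usr/share/dict/words):

    from functools import lru_cache

    # Placeholder word lists; swap in real medical and general dictionaries
    MEDICAL = {"sars", "pneumonia", "versus"}
    GENERAL = {"the", "dog", "test", "to", "rule", "out"}

    def is_word(s):
        s = s.lower()
        return s in MEDICAL or s in GENERAL

    @lru_cache(maxsize=None)
    def segment(s):
        # Return a list of words if s splits cleanly into dictionary words, else None
        if not s:
            return []
        for i in range(len(s), 0, -1):  # prefer the longest leading word
            head, rest = s[:i], s[i:]
            if is_word(head):
                tail = segment(rest)
                if tail is not None:
                    return [head] + tail
        return None

    print(segment("versuspneumonia"))  # -> ['versus', 'pneumonia']
    print(segment("thedog"))           # -> ['the', 'dog']

A longest-match-first split works for these examples, but for ambiguous strings you might prefer to score candidate splits by word frequency rather than taking the first split that succeeds.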