How to separate words in a "sentence" with spaces?

Background

I am looking to automate the creation of Domains in JasperServer. Domains are a "view" of data for creating ad hoc reports. The names of the columns must be presented to the user in a human-readable fashion.

Problem

There are over 2,000 pieces of data that the organization could theoretically want to include on a report. The data are sourced from columns with non-human-friendly names such as:

payperiodmatchcode
labordistributioncodedesc
dependentrelationship
actionendoption
actionendoptiondesc
addresstype
addresstypedesc
historytype
psaddresstype
rolename
bankaccountstatus
bankaccountstatusdesc
bankaccounttype
bankaccounttypedesc
beneficiaryamount
beneficiaryclass
beneficiarypercent
benefitsubclass
beneficiaryclass
beneficiaryclassdesc
benefitactioncode
benefitactioncodedesc
benefitagecontrol
benefitagecontroldesc
ageconrolagelimit
ageconrolnoticeperiod

Question

How would you automatically change such names to:

pay period match code
labor distribution code desc
dependent relationship

Ideas

Use Google's "Did you mean" engine; however, I think that violates their TOS:

lynx -dump «url» | grep "Did you mean" | awk ...

Languages

Any language is fine, but a text-processing language such as Perl would probably be well suited. (The column names are English-only.)

Unnecessary Perfection

The goal is not 100% perfection in breaking words apart; the following outcome is acceptable:

enrollmenteffectivedate -> Enrollment Effective Date
enrollmentenddate -> Enroll Men Tend Date
enrollmentrequirementset -> Enrollment Requirement Set

No matter what, a human will need to double-check the results and correct many. Whittling a set of 2,000 results down to 600 edits would be a dramatic time savings. To fixate on some cases having multiple possibilities (e.g., therapistname) is to miss the point altogether.

asked Oct 04 '10 15:10

Dave Jarvis

2 Answers

Sometimes, brute-forcing is acceptable:

#!/usr/bin/perl

use strict; use warnings;
use File::Slurp;

my $dict_file = '/usr/share/dict/words';

my @identifiers = qw(
    payperiodmatchcode labordistributioncodedesc dependentrelationship
    actionendoption actionendoptiondesc addresstype addresstypedesc
    historytype psaddresstype rolename bankaccountstatus
    bankaccountstatusdesc bankaccounttype bankaccounttypedesc
    beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
    beneficiaryclass beneficiaryclassdesc benefitactioncode
    benefitactioncodedesc benefitagecontrol benefitagecontroldesc
    ageconrolagelimit ageconrolnoticeperiod
);

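# Domain-specific abbreviations that the system dictionary lacks.
my @mydict = qw( desc );

# Build one alternation of every word longer than two characters,
# longest first, with regex metacharacters escaped.
my $pat = join('|',
    map quotemeta,
    sort { length $b <=> length $a || $a cmp $b }
    grep { 2 < length }
    (@mydict, map { chomp; $_ } read_file $dict_file)
);

my $re = qr/$pat/;

for my $identifier ( @identifiers ) {
    my @stack;
    print "$identifier : ";
    # Repeatedly strip a dictionary word from the end of the identifier.
    while ( $identifier =~ s/($re)\z// ) {
        unshift @stack, $1;
    }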
    # mark suspicious cases
    unshift @stack, '*', $identifier if length $identifier;
    print "@stack\n";
}

Output:

payperiodmatchcode : pay period match code
labordistributioncodedesc : labor distribution code desc
dependentrelationship : dependent relationship
actionendoption : action end option
actionendoptiondesc : action end option desc
addresstype : address type
addresstypedesc : address type desc
historytype : history type
psaddresstype : * ps address type
rolename : role name
bankaccountstatus : bank account status
bankaccountstatusdesc : bank account status desc
bankaccounttype : bank account type
bankaccounttypedesc : bank account type desc
beneficiaryamount : beneficiary amount
beneficiaryclass : beneficiary class
beneficiarypercent : beneficiary percent
benefitsubclass : benefit subclass
beneficiaryclass : beneficiary class
beneficiaryclassdesc : beneficiary class desc
benefitactioncode : benefit action code
benefitactioncodedesc : benefit action code desc
benefitagecontrol : benefit age control
benefitagecontroldesc : benefit age control desc
ageconrolagelimit : * ageconrol age limit
ageconrolnoticeperiod : * ageconrol notice period
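
Since the question accepts any language, the same suffix-stripping idea can be sketched in Python. This is a minimal sketch, not a port of the script above: the WORDS list is a hypothetical stand-in for /usr/share/dict/words plus the custom "desc" entry, and the function names build_suffix_pattern and split_identifier are made up for illustration.

```python
import re

# Hypothetical stand-in for /usr/share/dict/words plus the custom "desc" entry.
WORDS = ["pay", "period", "match", "code", "desc", "address", "type",
         "age", "limit", "role", "name"]


def build_suffix_pattern(words):
    """Compile an alternation of the words (3+ letters) anchored at the end."""
    usable = sorted({w for w in words if len(w) > 2}, key=lambda w: (-len(w), w))
    return re.compile("(?:" + "|".join(map(re.escape, usable)) + r")\Z")


def split_identifier(identifier, pattern):
    """Peel known words off the end; flag any unmatched remainder with '*'."""
    parts = []
    rest = identifier
    while rest:
        m = pattern.search(rest)
        if m is None:
            break
        parts.insert(0, m.group(0))
        rest = rest[:m.start()]
    if rest:
        # Same convention as the Perl output: '*' marks a suspicious case.
        parts[:0] = ["*", rest]
    return parts


PATTERN = build_suffix_pattern(WORDS)
print(" ".join(split_identifier("psaddresstype", PATTERN)))  # * ps address type
```

As in the Perl version, two-letter fragments such as "ps" are deliberately excluded from the word list, so they surface as flagged leftovers for a human to review.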

See also A Spellchecker Used to Be a Major Feat of Software Engineering.

answered Oct 14 '22 02:10

Sinan Ünür

I reduced your list to the 32 atomic terms that I was concerned about and arranged them longest-first in a regex:

use strict;
use warnings;

my $qr 
    = qr/ \G # right after last match
          ( distribution 
          | relationship 
          | beneficiary 
          | dependent 
          | subclass 
          | account
          | benefit 
          | address 
          | control 
          | history
          | percent 
          | action 
          | amount
          | conrol 
          | option 
          | period 
          | status 
          | class 
          | labor 
          | limit 
          | match 
          | notice
          | bank
          | code 
          | desc 
          | name 
          | role 
          | type 
          | age 
          | end 
          | pay
          | ps 
          )
    /x;

while ( <DATA> ) {
    chomp;
    print;
    # In list context, m//g returns every captured term in order.
    print ' -> ', join( ' ', m/$qr/g ), "\n";
}

__DATA__
payperiodmatchcode
labordistributioncodedesc
dependentrelationship
actionendoption
actionendoptiondesc
addresstype
addresstypedesc
historytype
psaddresstype
rolename
bankaccountstatus
bankaccountstatusdesc
bankaccounttype
bankaccounttypedesc
beneficiaryamount
beneficiaryclass
beneficiarypercent
benefitsubclass
beneficiaryclass
beneficiaryclassdesc
benefitactioncode
benefitactioncodedesc
benefitagecontrol
benefitagecontroldesc
ageconrolagelimit
ageconrolnoticeperiod
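
The anchored scan above can also be sketched in Python, which has no \G but lets a compiled pattern match at an explicit position. This is a rough sketch under assumptions: TERMS holds only a handful of the 32 terms, the scan function is hypothetical, and unlike the Perl loop it also returns whatever text was left unmatched.

```python
import re

# A subset of the terms from the regex above (an assumption for brevity).
TERMS = ["distribution", "beneficiary", "benefit", "address", "control",
         "conrol", "notice", "period", "limit", "code", "desc", "type",
         "age", "ps"]

# Longest first, so a longer term wins over any shorter term that is its prefix.
TERM_RE = re.compile("|".join(map(re.escape, sorted(TERMS, key=len, reverse=True))))


def scan(identifier):
    """Return (words, leftover): words matched back-to-back from the start."""
    words = []
    pos = 0
    while pos < len(identifier):
        # match() only succeeds exactly at pos, which plays the role of \G.
        m = TERM_RE.match(identifier, pos)
        if m is None:
            break
        words.append(m.group(0))
        pos = m.end()
    return words, identifier[pos:]


words, leftover = scan("ageconrolagelimit")
print(" ".join(words))  # age conrol age limit
```

A non-empty leftover signals that the term list needs another entry, which is the kind of case a human would want to double-check.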

answered Oct 14 '22 03:10

Axeman

Tags: bash, awk, perl, text-segmentation, nlp