Looking to automate creating Domains in JasperServer. Domains are a "view" of data for creating ad hoc reports. The names of the columns must be presented to the user in a human-readable fashion.
There are over 2,000 possible pieces of data that the organization could theoretically want to include on a report. The data are sourced from non-human-friendly names such as:
payperiodmatchcode labordistributioncodedesc dependentrelationship actionendoption actionendoptiondesc addresstype addresstypedesc historytype psaddresstype rolename bankaccountstatus bankaccountstatusdesc bankaccounttype bankaccounttypedesc beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass beneficiaryclass beneficiaryclassdesc benefitactioncode benefitactioncodedesc benefitagecontrol benefitagecontroldesc ageconrolagelimit ageconrolnoticeperiod
How would you automatically change such names to space-separated words (for example, payperiodmatchcode to pay period match code)?
One idea is to use Google's "Did you mean" engine; however, I think that violates their TOS:
lynx -dump «url» | grep "Did you mean" | awk ...
Any language is fine, but a text-processing language such as Perl would probably be well-suited. (The column names are English-only.)
The goal is not 100% perfection in breaking words apart; the following outcome is acceptable:
No matter what, a human will need to double-check the results and correct many. Whittling a set of 2,000 results down to 600 edits would be a dramatic time savings. To fixate on some cases having multiple possibilities (e.g., therapistname) is to miss the point altogether.
Sometimes, brute-forcing is acceptable:
#!/usr/bin/perl
use strict; use warnings;
use File::Slurp;

my $dict_file = '/usr/share/dict/words';

my @identifiers = qw(
    payperiodmatchcode labordistributioncodedesc dependentrelationship
    actionendoption actionendoptiondesc addresstype addresstypedesc
    historytype psaddresstype rolename bankaccountstatus
    bankaccountstatusdesc bankaccounttype bankaccounttypedesc
    beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
    beneficiaryclass beneficiaryclassdesc benefitactioncode
    benefitactioncodedesc benefitagecontrol benefitagecontroldesc
    ageconrolagelimit ageconrolnoticeperiod
);

# extra terms that are not in the system dictionary
my @mydict = qw( desc );

# one alternation of all words longer than two letters, longest first
my $pat = join('|',
    map quotemeta,
    sort { length $b <=> length $a || $a cmp $b }
    grep { 2 < length }
    (@mydict, map { chomp; $_ } read_file $dict_file)
);

my $re = qr/$pat/;

for my $identifier ( @identifiers ) {
    my @stack;
    print "$identifier : ";
    # peel known words off the end of the identifier
    while ( $identifier =~ s/($re)\z// ) {
        unshift @stack, $1;
    }
    # mark suspicious cases
    unshift @stack, '*', $identifier if length $identifier;
    print "@stack\n";
}
Output:
payperiodmatchcode : pay period match code
labordistributioncodedesc : labor distribution code desc
dependentrelationship : dependent relationship
actionendoption : action end option
actionendoptiondesc : action end option desc
addresstype : address type
addresstypedesc : address type desc
historytype : history type
psaddresstype : * ps address type
rolename : role name
bankaccountstatus : bank account status
bankaccountstatusdesc : bank account status desc
bankaccounttype : bank account type
bankaccounttypedesc : bank account type desc
beneficiaryamount : beneficiary amount
beneficiaryclass : beneficiary class
beneficiarypercent : beneficiary percent
benefitsubclass : benefit subclass
beneficiaryclass : beneficiary class
beneficiaryclassdesc : beneficiary class desc
benefitactioncode : benefit action code
benefitactioncodedesc : benefit action code desc
benefitagecontrol : benefit age control
benefitagecontroldesc : benefit age control desc
ageconrolagelimit : * ageconrol age limit
ageconrolnoticeperiod : * ageconrol notice period
See also A Spellchecker Used to Be a Major Feat of Software Engineering.
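For anyone who would rather use Python (the question says any language is fine), here is a rough sketch of the same brute-force idea: repeatedly strip the longest known word off the end of the identifier and flag whatever is left over. The `WORDS` list is a tiny hypothetical stand-in for `/usr/share/dict/words`, and the names `build_suffix_pattern`, `split_identifier`, and `describe` are made up for illustration, so treat this as a starting point rather than a finished tool.

```python
import re

# Hypothetical stand-in for /usr/share/dict/words plus the extra term "desc".
# A real run would load the system dictionary instead of this short list.
WORDS = ["pay", "period", "match", "code", "address", "type", "role", "name",
         "age", "limit", "control", "desc"]


def build_suffix_pattern(words, min_len=3):
    """Compile an alternation of the words, anchored at the end of the string."""
    usable = sorted({w.lower() for w in words if len(w) >= min_len},
                    key=lambda w: (-len(w), w))
    return re.compile("(" + "|".join(map(re.escape, usable)) + r")\Z")


def split_identifier(identifier, pattern):
    """Peel known words off the end; return (words, unmatched_prefix)."""
    words = []
    rest = identifier
    while rest:
        m = pattern.search(rest)  # leftmost match ending at \Z = longest suffix
        if not m:
            break
        words.insert(0, m.group(1))
        rest = rest[:m.start()]
    return words, rest


def describe(identifier, pattern):
    """Format one result line, marking suspicious cases with '*'."""
    words, rest = split_identifier(identifier, pattern)
    if rest:
        words = ["*", rest] + words
    return f"{identifier} : {' '.join(words)}"


if __name__ == "__main__":
    pat = build_suffix_pattern(WORDS)
    for name in ["payperiodmatchcode", "psaddresstype", "ageconrolagelimit"]:
        print(describe(name, pat))
    # payperiodmatchcode : pay period match code
    # psaddresstype : * ps address type
    # ageconrolagelimit : * ageconrol age limit
```

Counting the lines that contain `*` gives a quick estimate of how many names still need a human to look at them.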
I reduced your list to 32 atomic terms that I was concerned about and put them in longest-first arrangement in a regex:
use strict;
use warnings;

my $qr
    = qr/ \G # right after last match
          ( distribution
          | relationship
          | beneficiary
          | dependent
          | subclass
          | account
          | benefit
          | address
          | control
          | history
          | percent
          | action
          | amount
          | conrol
          | notice
          | option
          | period
          | status
          | class
          | labor
          | limit
          | match
          | bank
          | code
          | desc
          | name
          | role
          | type
          | age
          | end
          | pay
          | ps
          )
        /x;

while ( <DATA> ) {
    chomp;
    print;
    print ' -> ', join( ' ', m/$qr/g ), "\n";
}
__DATA__
payperiodmatchcode
labordistributioncodedesc
dependentrelationship
actionendoption
actionendoptiondesc
addresstype
addresstypedesc
historytype
psaddresstype
rolename
bankaccountstatus
bankaccountstatusdesc
bankaccounttype
bankaccounttypedesc
beneficiaryamount
beneficiaryclass
beneficiarypercent
benefitsubclass
beneficiaryclass
beneficiaryclassdesc
benefitactioncode
benefitactioncodedesc
benefitagecontrol
benefitagecontroldesc
ageconrolagelimit
ageconrolnoticeperiod
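A similar left-to-right scan can be sketched in Python. The standard `re` module has no `\G` anchor, so this version assumes a loop that calls `match` at the current position instead, which keeps the matches contiguous in the same way. The term list is copied from the regex above; `split_left_to_right` is a hypothetical helper name, and returning the unmatched tail is my own addition so that unknown names are easy to spot.

```python
import re

# The 32 atomic terms from the regex above.
TERMS = """
distribution relationship beneficiary dependent subclass account benefit
address control history percent action amount conrol notice option period
status class labor limit match bank code desc name role type age end pay ps
""".split()

# Longest terms first, so a longer term always wins over a shorter prefix.
PATTERN = re.compile(
    "|".join(re.escape(t) for t in sorted(TERMS, key=len, reverse=True))
)


def split_left_to_right(identifier):
    """Consume known terms from the start; return (words, unmatched_tail)."""
    words = []
    pos = 0
    while pos < len(identifier):
        m = PATTERN.match(identifier, pos)  # must match exactly at pos
        if not m:
            break
        words.append(m.group(0))
        pos = m.end()
    return words, identifier[pos:]


if __name__ == "__main__":
    for name in ["benefitagecontrol", "ageconrolnoticeperiod", "therapistname"]:
        words, tail = split_left_to_right(name)
        print(name, "->", " ".join(words), f"(unmatched: {tail})" if tail else "")
```

A non-empty tail means the term list is missing a word, so those names are the ones to review by hand or to use for extending the list.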