Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

A spell corrector code in Delphi?

Tags:

delphi

This question is to discuss how to code a spell corrector and is not duplicate of Delphi Spell Checker component.

Two years ago, I found and used the code of spell corrector by Peter Norvig at his website in Python. But the performance seemed not high. Quite interestingly, more languages that implement the same task have been appended in his webpage list recently.

Some lines in Peter's page include syntax like:

[a + c + b     for a, b in splits for c in alphabet]

How to translate it into delphi?

I am interested in how Delphi expert at SO will use the same theory and do the same task with some suitable lines and possible mediocre or better performance. This is not to downvote any language but to learn to compare how they implement the task differently.

Thanks so much in advance.

[Edit]

I will quote Marcelo Toledo who contributes C version, as saying "...While the purpose of this article [C version] was to show the algorithms, not to highlight Python...". Though his C version is with the second most lines, according to his article, his version is high performance when the dictionary file is huge. So this question is not highlight any language but to ask for delphi solution and it is not at all intended for competition, though Peter is influential in directing Google Research.

[Update]

I was enlightened by David's suggestion and studied theory and routine of Peter's page. A very rough and inefficient routine was done, slightly different from other languages, mine is GUI's. I am a beginner and learner in Delphi, I dare not post my complete code ( it is poorly written). I will outline my idea of how I did it. Your comment is welcome so that the routine will be improved.

My hardware and software is old. This is enough to my work (my specialization is not in computer or program related)

AMD Athlon Dual Core Processor
2.01 Ghz, 480 Memory
Windows XP SP2
IDE Delphi 7.0

This is the snapshot and record of processing time of 'correct' word. I tried Gettickcount, Tdatetime, and Queryperformancecounter to track correct time for word, but gettickcount and Tdatetime will output o ms for each check, so I have to use Queryperformancecounter. Maybe there are other ways to do it more precisely.

The total lines is 72, not including function that records check time. Number of lines may not be yardstick as mentioned above by Marcelo. The post is discuss how to do the task differently. Delphi Experts at SO will of course use minimum lines to do it with best performance.

Spell Checker

procedure Tmajorform.FormCreate(Sender: TObject);
begin
loaddict;
end;

procedure Tmajorform.loaddict;
var
fs: TFilestream;
templist: TStringlist;
p1: tperlregex;
w1: string;
begin
//load that big.txt (6.3M, is Adventures of Sherlock Holmes)
//templist.loadfromstream
//Use Tperlregex to tokenize ( I used regular expression by [Jan Goyvaerts][5])
//The load and tokenize time is about 7-8 seconds on my machine, Maybe there are other ways to
//speed up loading and tokenizing.
end;

procedure Tmajorform.edits1(str: string);
var
i: integer;
ch: char;
begin 
// This is to simulate Peter's page in order to fast generate all possible combinations.
// I do not know how to use set in delphi. I used array.
// Peter said his routine edits1 would generate 494 elements of 'something'. Mine will 
// generate 469. I do not know why. Before duplicate ignore, mine is over 500. After setting
// duplicate ignore, there are 469 unique elements for 'something'.
end;

procedure Tmajorform.correct(str: string);
var
i, j: integer;
begin
//This is a loop and binary search to add candidate word into list.
end;

procedure Tmajorform.Button2Click(Sender: TObject);
var
str: string;
begin
// Trigger correct(str: string);
end;

It seems by Tfilestream it can increase loading by 1-2 second. I tried using CreateFileMapping method but failed and it seemed a little complicated. Maybe there are other ways to load huge file fast. Because this big.txt will not be big considering availability of corpus, there should be more efficient way to load larger and larger file.

Another point is Delphi 7.0 does not have built-in regular expression. I have a look at other languages that do spell check at Perter's page, they are largely Directly calling their built-in regular expression. Of course, real expert does not need any built-in class or library and can build by himself. To beginner, some classes or libraries are convenience.

Your comment is welcome.

[Update]

I continued the research and further included edits2 function (edit distance 2). This will increase about another 12 lines of code. Peter said edit distance 2 would include almost all possibilities. 'something' will have 114,324 possibilities. My function will generate 102,727 UNIQUE possibilities for it. Of course, suggested words will also include more.

If with edits2, reponse time for correction obviously delay as it increases data by about 200 times. But I find some suggested corrections are obviously impossibilities as a typist will not type a error word that will be in the long corrected word list. So, edit distance 1 will be better provided that the big.txt file is sufficiently big to include more correct words.

Below is the snapshot of tracking edits 2 correct time.

enter image description here

like image 232
Dylan Avatar asked Aug 17 '11 07:08

Dylan


People also ask

What is the spell check command?

Check and correct the spelling and grammar Open the document you want to check for spelling or grammar mistakes, and then press F7.

What is the spell check tool?

In software, a spell checker (or spelling checker or spell check) is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic dictionary, or search engine.

How do I enable correction in VS Code?

Right click on word and click show spell check configuration info from menu . A spell checker tab will appear then click on file info and deselect spell checker enabled for file type(in my case it was typescript).

Does AbiWord have a spell checker?

AbiWord supports spell checking in many languages, and allows you to have different parts of a single document in different languages.


1 Answers

This is a Python list comprehension. It forms the Cartesian product of splits and alphabets.

Each item of splits is a tuple which is unpacked into a and b. Each item of alphabet is put into a variable called c. Then the 3 variables are concatenated, assuming that they are strings. The result of the list comprehension expression is a list containing elements of the form a + c + b, one element for each item in the Cartesian product.

In Python it could be written equivalently as

res = []
for a, b in splits:
  for c in alphabets:
    res.append(a + c + b)

In Delphi it would be

res := TStringList.Create;
for split in splits do
  for c in alphabets do
    res.Add(split.a + c + split.b);

I suggest you read up on Python list comprehensions to get a better understanding of this very powerful Python feature.

like image 172
David Heffernan Avatar answered Oct 14 '22 14:10

David Heffernan