I am using an OCR algorithm (Tesseract based) which has difficulty recognizing certain characters. I have partially solved that by creating my own "post-processing hash table" which maps misrecognized characters to the correct ones. For example, since the text is just numbers, I have figured out that if there is a Q character in the text, it should be 9 instead.
However, I have a more serious problem with the 6 and 8 characters, since both of them are recognized as B. Since I know what I am looking for (when I am translating the image to text) and the strings are fairly short (6~8 digits), I thought I would create strings with all possible combinations of 6 and 8 and compare each of them to the one I am looking for.
So for example, I have the following string recognized by the OCR:
L0B7B0B5
So each B here can be 6 or 8.
Now I want to generate a list like the one below:
L0878085
L0878065
L0876085
L0876065
.
.
So it's a kind of binary table with 3 digits, which in this case gives 8 options. But the number of B characters in the string can be other than 3 (it can be any number).
I have tried to use the Python itertools module with something like this:
list(itertools.product(*["86"] * 3))
Which will provide the following result:
[('8', '8', '8'), ('8', '8', '6'), ('8', '6', '8'), ('8', '6', '6'), ('6', '8', '8'), ('6', '8', '6'), ('6', '6', '8'), ('6', '6', '6')]
which I assume I can later use to replace the B characters. However, for some reason I can't make itertools work in my environment. I assume it has something to do with the fact that I am using Jython and not pure Python.
I would be happy to hear any other ideas on how to complete this task. Maybe there is a simpler solution I didn't think of?
itertools.product accepts a repeat keyword that you can use:
In [92]: from itertools import product
In [93]: word = "L0B7B0B5"
In [94]: subs = product("68", repeat=word.count("B"))
In [95]: list(subs)
Out[95]:
[('6', '6', '6'),
('6', '6', '8'),
('6', '8', '6'),
('6', '8', '8'),
('8', '6', '6'),
('8', '6', '8'),
('8', '8', '6'),
('8', '8', '8')]
Then one fairly concise way to make the substitutions is a reduction with the string replace method, replacing one B at a time (reduce is a builtin in Python 2; in Python 3 it must be imported from functools):
In [97]: subs = product("68", repeat=word.count("B"))
In [98]: [reduce(lambda s, c: s.replace('B', c, 1), sub, word) for sub in subs]
Out[98]:
['L0676065',
'L0676085',
'L0678065',
'L0678085',
'L0876065',
'L0876085',
'L0878065',
'L0878085']
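The reduction above can be wrapped in a small function so it can be reused for any number of B characters. This is only a sketch: the function name expand_candidates, its parameters, and the expected value are made up for illustration and are not part of the original question or of any library. Importing reduce from functools works on Python 2.6+ and on Python 3.

```python
from functools import reduce  # builtin on Python 2, lives in functools on Python 3
from itertools import product


def expand_candidates(word, placeholder="B", choices="68"):
    """Return every string obtained by replacing each placeholder with one of choices.

    Hypothetical helper: assumes none of the choices equals the placeholder.
    """
    count = word.count(placeholder)
    return [
        # Replace the leftmost remaining placeholder with each character of sub in turn.
        reduce(lambda s, c: s.replace(placeholder, c, 1), sub, word)
        for sub in product(choices, repeat=count)
    ]


candidates = expand_candidates("L0B7B0B5")
expected = "L0876085"  # hypothetical value being searched for
is_match = expected in candidates
```

Since the expected string is known in advance, a membership test on the candidate list is enough to decide whether the OCR output is compatible with it.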
Another method, using a couple more functions from itertools (izip_longest is the Python 2 name; in Python 3 it is called zip_longest):
In [90]: from itertools import chain, izip_longest
In [91]: subs = product("68", repeat=word.count("B"))
In [92]: [''.join(chain(*izip_longest(word.split('B'), sub, fillvalue=''))) for sub in subs]
Out[92]:
['L0676065',
'L0676085',
'L0678065',
'L0678085',
'L0876065',
'L0876085',
'L0878065',
'L0878085']
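If itertools really is unavailable in the Jython environment, the same list can be built with plain loops. The following is a minimal sketch under that assumption; the function name expand_without_itertools and its parameters are hypothetical. It walks the string once and, at each B, doubles the list of partial results.

```python
def expand_without_itertools(word, placeholder="B", choices="68"):
    """Build all substitutions character by character, without itertools."""
    results = [""]
    for ch in word:
        # A placeholder branches into every choice; any other character is kept as is.
        options = choices if ch == placeholder else ch
        results = [prefix + opt for prefix in results for opt in options]
    return results


variants = expand_without_itertools("L0B7B0B5")
```

This uses only list comprehensions and string concatenation, so it should behave the same on Python 2 and Python 3. The list grows as 2**n for n placeholders, which is fine for strings of 6 to 8 characters.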