I have a grayscale image of a comic strip page that features several dialogue bubbles (speech balloons): enclosed areas with a white background and solid black borders that contain text, i.e. something like this:
I want to detect these regions and create a mask (binary is OK) that covers all the inside regions of the dialogue bubbles, i.e. something like:
The same image with the mask overlaid, to be totally clear:
So my basic idea of the algorithm was something like:

1. Detect white pixels that are likely inside bubbles (this step works).
2. Use a flood fill or some sort of graph traversal: start from every white pixel detected as a pixel-inside-bubble in step 1, but work on the initial image, flooding white pixels (which are supposed to be inside the bubble) and stopping at dark pixels (which are supposed to be borders or text).
3. Use some sort of binary_closing operation to remove dark areas inside bubbles (i.e. regions that correspond to text). This part works OK.
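For reference, the closing step can be sketched as below; the toy mask and the 7×7 structuring element are assumptions (the element just has to be larger than the holes left by the text strokes):

```python
import numpy as np
from scipy import ndimage

# Toy binary mask: a filled "bubble" region with a hole where text was.
mask = np.ones((20, 20), dtype=bool)
mask[8:12, 8:12] = False  # hole left by dark text pixels

# Closing (dilation followed by erosion) with a structuring element larger
# than the hole fills it in.
closed = ndimage.binary_closing(mask, structure=np.ones((7, 7), dtype=bool))

print(closed[8:12, 8:12].all())  # True: the text hole is filled
```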
So far, steps 1 and 3 work, but I'm struggling with step 2. I'm currently working with scikit-image, and I don't see any ready-made flood fill algorithm implemented there. Obviously, I could use something trivial like a breadth-first traversal, basically as suggested here, but it's really slow when done in Python. I suspect that the intricate morphology machinery, like binary_erosion or generate_binary_structure in ndimage or scikit-image, could do the job, but I struggle to understand all the morphology terminology and, basically, how to implement such a custom flood fill with it (i.e. starting from the step 1 image, working on the original image, and producing the result in a separate output image).
I'm open to any suggestions, including ones in OpenCV, etc.
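One way to get a fast flood fill without hand-writing a BFS is morphological reconstruction (conditional dilation): the seed pixels are dilated repeatedly, but only within a mask of allowed pixels, which is exactly "flood white, stop at dark". scipy.ndimage.binary_propagation implements this in vectorized form. A minimal sketch on a toy image; the threshold of 128 and the single seed pixel are assumptions standing in for the step 1 output:

```python
import numpy as np
from scipy import ndimage

# Toy grayscale page: white background, one bubble with a black border,
# white interior, and a dark "text" pixel inside.
img = np.full((12, 12), 255, dtype=np.uint8)
img[2, 2:10] = img[9, 2:10] = 0   # top/bottom border
img[2:10, 2] = img[2:10, 9] = 0   # left/right border
img[5, 5] = 0                     # "text"

white = img > 128                 # pixels the flood is allowed to pass through

# Stand-in for the step 1 result: one seed pixel known to be inside the bubble.
seeds = np.zeros_like(white)
seeds[4, 4] = True

# Conditional dilation: grow the seeds, constrained to white pixels.
# This is a vectorized flood fill; it stops at the black border and at text.
inside = ndimage.binary_propagation(seeds, mask=white)

print(inside[3, 3])   # True: reached another pixel inside the bubble
print(inside[0, 0])   # False: the outside was never reached
```

Recent scikit-image releases also ship `skimage.segmentation.flood_fill`, which may be an option if upgrading is possible.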
Even though your actual question concerns step 2 of your processing pipeline, I would like to suggest another approach that might, imho, be simpler, since you stated that you are open to suggestions.
Using the image from your original step 1, you could create an image without text in the bubbles.
(image: the implemented text removal)
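A minimal sketch of that text-removal idea, assuming step 1 yields a boolean mask of text pixels (the toy image and mask below are made up for illustration):

```python
import numpy as np

# Toy grayscale image: a bubble border plus a "line of text" inside it.
img = np.full((10, 10), 255, dtype=np.uint8)
img[1, 1:9] = img[8, 1:9] = 0   # top/bottom border
img[1:9, 1] = img[1:9, 8] = 0   # left/right border
img[4, 3:7] = 0                 # text inside the bubble

# Stand-in for the step 1 output: a boolean mask of detected text pixels.
text_mask = np.zeros_like(img, dtype=bool)
text_mask[4, 3:7] = True

# Paint the text pixels white so only the bubble border remains dark.
no_text = img.copy()
no_text[text_mask] = 255

print((no_text[1, 1:9] == 0).all())    # True: border preserved
print((no_text[4, 3:7] == 255).all())  # True: text erased
```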
Detect edges on the original image with the text removed. This should work well for the speech bubbles, as the bubble edges are pretty distinct.
(image: edge detection result)
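As a stand-in for a full edge detector (skimage.feature.canny would be the usual choice), a Sobel gradient magnitude already picks out the sharp bubble border; the threshold of 100 is an assumption tuned to this toy contrast:

```python
import numpy as np
from scipy import ndimage

# Toy text-free image: white page, black bubble border (as after text removal).
img = np.full((10, 10), 255, dtype=float)
img[1, 1:9] = img[8, 1:9] = 0
img[1:9, 1] = img[1:9, 8] = 0

# Gradient magnitude highlights the sharp border transitions.
gx = ndimage.sobel(img, axis=1)
gy = ndimage.sobel(img, axis=0)
edges = np.hypot(gx, gy) > 100  # threshold is an assumption

print(edges[2, 4])  # True: right next to the border
print(edges[4, 4])  # False: flat interior
```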
Finally, use the edge image and the initially detected "text locations" to find those areas within the edge image that contain text.
(image: watershed segmentation result)
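A watershed sketch using scipy.ndimage.watershed_ift (scikit-image's watershed would work the same way); the marker positions below are assumptions: in practice the inside marker would come from the detected text locations and the outside marker from the page background:

```python
import numpy as np
from scipy import ndimage

# Toy edge image: high values on the bubble border, zero elsewhere.
edges = np.zeros((10, 10), dtype=np.uint8)
edges[1, 1:9] = edges[8, 1:9] = 255
edges[1:9, 1] = edges[1:9, 8] = 255

# Markers: label 1 inside the bubble (e.g. from a text location),
# label 2 outside on the page background.
markers = np.zeros((10, 10), dtype=np.int16)
markers[4, 4] = 1
markers[0, 0] = 2

# The watershed floods from the markers and meets at the edge ridge,
# producing a label image in which label 1 is the bubble interior.
labels = ndimage.watershed_ift(edges, markers)

print(labels[3, 3])  # 1: interior claimed by the inside marker
print(labels[0, 9])  # 2: the rest of the page
```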
I am sorry for this very general answer, but it is too late here for actual coding. If the question is still open and you want more hints concerning my suggestion, I will elaborate in more detail. In any case, you could definitely have a look at the region-based segmentation section in the scikit-image docs.