How to convert image to table

I have an image of a table (in my case a .gif) and want to extract the table it contains into a spreadsheet (ideally .ods).

Is there any way to do so? Doing it manually is ruled out, since the table has more than 1000 rows and 6 columns.

Here is part of the image of the table: [image]

Masclins asked Apr 19 '17 14:04


2 Answers

You will be able to get most of it through OCR, but you'll need to manually verify the data and fix some inaccuracies that will be there. It definitely won't be perfect.

The first thing to do is to make sure the OCR software gets a good-quality image.

Here's what I did with your sample png (I'm using Windows):

  1. I opened the image in GIMP.
  2. Removed the orange/blue backgrounds:

    a) Select -> By Color and clicked the blue background

    b) I held down Shift and clicked the orange background (this will add it to the current selection)

    c) Edit -> Fill With BG Color (this sets it to white)

    d) Ctrl-Shift-A to cancel the selection

  3. I removed the partially cut off '305' line:

    a) I used the Rectangle Select tool from the palette to select that line, then filled the selection with BG Color, as above

  4. Let's remove the table border:

    a) Click the 'Fuzzy Select' tool button from the palette

    b) Click somewhere on the table border (you should see the 'marching ants' instead of the border)

    c) Edit -> Fill With BG Color

    d) Ctrl-Shift-A to cancel the selection again

  5. We need to increase the number of pixels that the numbers use so that the OCR can better detect their shapes

    a) Image -> Scale Image. I chose to scale by 1000% with Linear Interpolation (the other interpolations won't work as well)

  6. Download and install Tesseract from GitHub

    a) At the command prompt, type the following (keep the double quotes to cope with spaces in the paths, and change the paths as necessary):

       "D:\Program Files (x86)\Tesseract-OCR\tesseract" "d:\temp\your_image.png" "d:\temp\your_txt_file_output"

  7. The output will be a text file with a .txt extension appended to the name you gave. It will still have a few artifacts, but we can easily correct those in Notepad++ (or similar):

    a) The commas were seen as full-stops, so I did a Find and Replace of "." with "," (I'm assuming you don't have any decimal points in the data!)

    b) There were some spaces before a few commas, so I did Find and Replace " ," with "," (note I included a space before the comma in the Find)

    c) There were still some spaces in the numbers, so I did a Find and Replace of " " with "" (a space with an empty replace)
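For a table this size, the Tesseract call in step 6 and the find-and-replace passes in step 7 could also be scripted. The following is a minimal Python sketch, not part of the workflow I actually used: the function names `run_tesseract` and `clean_ocr_text` are hypothetical, the sample string is made up for illustration, and it assumes, as in 7a, that the data contains no real decimal points.

```python
import subprocess
from pathlib import Path


def run_tesseract(tesseract_exe: str, image_path: str, output_base: str) -> str:
    """Run `tesseract <image> <output_base>` and return the recognised text.

    Tesseract itself appends ".txt" to the output base name.
    """
    subprocess.run([tesseract_exe, image_path, output_base], check=True)
    return Path(output_base + ".txt").read_text(encoding="utf-8")


def clean_ocr_text(text: str) -> str:
    """Apply the three find-and-replace passes from step 7, in order."""
    text = text.replace(".", ",")   # 7a: full stops were really commas
    text = text.replace(" ,", ",")  # 7b: remove a space before a comma
    text = text.replace(" ", "")    # 7c: remove any remaining spaces
    return text


# Made-up sample of raw OCR output, for illustration only.
raw = "910.820 .000\n2 .500\n"
print(clean_ocr_text(raw))
```

To run it on a real image, the text would come from something like `run_tesseract(r"D:\Program Files (x86)\Tesseract-OCR\tesseract", r"d:\temp\your_image.png", r"d:\temp\your_txt_file_output")`, with the paths changed as necessary.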

The manual steps above gave the following result:

298
299
300
301
302
303
304

910,820,000
920,820,000
930,820,000
941,820,000
952,820,000
983,820,000
9?4,820,000

210,000
220,000
220,000
220,000
220,000
220,000
220,000

2,500
2,500
3,000
3,000
3,000
3,000
3,000

19,000
19,000
20,000
20,000
20,000
20,000
20,000

Note the question mark in place of the 7 in the last line of the second block of text. Things like that still need to be tidied up.
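With more than 1000 rows, spotting leftovers like that by eye is tedious. One possible aid is a small pattern check that lists every line that is not a plain number with optional comma thousands separators. This is only a sketch: the name `find_suspect_values` and the pattern are my own assumptions about what a valid value looks like in this table. It also only catches malformed values, so a digit misread as another digit still has to be found by comparing against the image.

```python
import re

# Either plain digits (e.g. 298) or comma-grouped thousands (e.g. 910,820,000).
NUMBER_PATTERN = re.compile(r"\d+|\d{1,3}(,\d{3})*")


def find_suspect_values(text: str) -> list[tuple[int, str]]:
    """Return (line number, value) for every non-empty line that is not a number."""
    suspects = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        value = line.strip()
        if value and not NUMBER_PATTERN.fullmatch(value):
            suspects.append((line_number, value))
    return suspects


# Two lines taken from the second block above, then a line from the third.
sample = "983,820,000\n9?4,820,000\n\n210,000\n"
print(find_suspect_values(sample))  # [(2, '9?4,820,000')]
```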

Lastly, copy and paste the rows of text into your spreadsheet.
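If the output comes back as in the result above, with what appears to be one block of text per table column, the blocks can be turned back into rows before importing. The sketch below assumes exactly that layout (blocks separated by blank lines, all of the same length); `blocks_to_rows` and `rows_to_csv` are hypothetical helper names, and the sample uses the first two lines of the first three blocks.

```python
import csv
import io
import re


def blocks_to_rows(text: str) -> list[list[str]]:
    """Split text into blank-line-separated blocks and zip them into rows."""
    blocks = [
        [line.strip() for line in block.splitlines() if line.strip()]
        for block in re.split(r"\n\s*\n", text.strip())
    ]
    lengths = {len(block) for block in blocks}
    if len(lengths) != 1:
        # A missing or extra line in one column would shift every later row.
        raise ValueError(f"blocks have different lengths: {sorted(lengths)}")
    return [list(row) for row in zip(*blocks)]


def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialise rows as CSV; values containing commas are quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


sample = "298\n299\n\n910,820,000\n920,820,000\n\n210,000\n220,000\n"
print(rows_to_csv(blocks_to_rows(sample)))
```

The resulting CSV text can be saved to a file, opened in LibreOffice Calc, and saved from there as .ods. The length check is deliberate: if OCR dropped a line in one column, it is better to stop than to silently misalign the table.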

K Scandrett answered Sep 18 '22 03:09


I wanted to post another option I eventually found online:

https://convertio.co/es/ocr/

That said, I think K Scandrett's answer deserves to be the accepted one, since it doesn't rely on a website that might go down.

Masclins answered Sep 19 '22 03:09