
Plot digitization - scraping sample values from an image of a graph

This isn't really "OCR", since it's not recognizing characters, but it's the same idea applied to curves. Anyone know of an image-processing library or established algorithm for retrieving the values from a (raster) plot image? For instance, in this graph, it's hard for me to read exact values by eye because there are such large gaps between the gridlines:

[image: the example graph]

I can use a straight edge or whatever, but it's still going to be error-prone. It would be great if there were software that could just take a screenshot of any old graph and automatically convert it into a table of values or a function that could be queried.
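To make the "function that could be queried" idea concrete, here is a minimal sketch that assumes the curve has already been reduced to a table of (x, y) samples by some tool. The name `make_lookup` and the sample values are hypothetical, and linear interpolation between neighbouring samples is just one simple choice.

```python
from bisect import bisect_left
from typing import Callable, List, Tuple


def make_lookup(samples: List[Tuple[float, float]]) -> Callable[[float], float]:
    """Turn a table of digitized (x, y) samples into a queryable function.

    Values between samples are linearly interpolated; queries outside the
    digitized range raise ValueError rather than extrapolating.
    """
    pts = sorted(samples)
    if len(pts) < 2:
        raise ValueError("need at least two samples")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]

    def lookup(x: float) -> float:
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("x is outside the digitized range")
        i = bisect_left(xs, x)
        if xs[i] == x:
            return ys[i]  # exact hit on a sample
        # x lies strictly between xs[i - 1] and xs[i]
        x0, x1 = xs[i - 1], xs[i]
        y0, y1 = ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return lookup


# Hypothetical samples read off a graph
curve = make_lookup([(0.0, 10.0), (2.0, 30.0), (4.0, 20.0)])
```

Calling `curve(1.0)` then returns the interpolated value halfway between the first two samples.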

Seems to be called "curve recognition"? Could also be used for extracting data from the curves in scientific papers for which the underlying data is not published.

And it's ok to have some human guidance. There's no reason OCR couldn't read the "100" and match it up with the line, for instance, but it's ok to have a human give the lines numerical values after the machine has extracted the curve's path relative to the gridlines. I'm mostly interested in the function of tracing the curve relative to the grid, even if the grid is tilted, rotated, or warped in a non-affine way.
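The human-guided step could look something like the sketch below: a person labels three non-collinear pixel positions with their data values, and an affine map is solved from them. This is only an illustration under stated assumptions: the names `fit_affine` and `to_data` and the reference points are hypothetical, both axes are assumed linear, and an affine map covers a tilted or rotated grid but not non-affine warping.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def fit_affine(pixels: List[Point], values: List[Point]) -> Callable[[Point], Point]:
    """Solve the affine map pixel -> data from exactly three reference points."""
    (x1, y1), (x2, y2), (x3, y3) = pixels
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-12:
        raise ValueError("reference points are collinear")

    def solve(v1: float, v2: float, v3: float) -> Tuple[float, float, float]:
        # Cramer's rule for a*x + b*y + c = v at the three reference points
        a = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        b = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        c = (x1 * (y2 * v3 - y3 * v2) - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return a, b, c

    ax, bx, cx = solve(*(v[0] for v in values))  # coefficients for data x
    ay, by, cy = solve(*(v[1] for v in values))  # coefficients for data y

    def to_data(p: Point) -> Point:
        px, py = p
        return (ax * px + bx * py + cx, ay * px + by * py + cy)

    return to_data


# Hypothetical calibration: pixel positions a human matched to axis values.
# Pixel y grows downward, so the data origin sits at the bottom-left pixel.
to_data = fit_affine(
    pixels=[(100.0, 400.0), (500.0, 400.0), (100.0, 100.0)],
    values=[(0.0, 0.0), (10.0, 0.0), (0.0, 100.0)],
)
```

With this calibration, a traced pixel such as `(300.0, 250.0)` maps to the middle of both axis ranges. A warped grid would need more reference points and a richer model than three points can determine.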

Update:

There is now a Wikipedia article called Converting scanned graphs to data with a bunch of software in the links. Also some software on alternativeto.net. I guess the theory belongs on http://dsp.stackexchange.com now, while the software solutions belong on http://superuser.com?

endolith asked Nov 01 '09 18:11


1 Answer

This is extremely hard and error-prone. (We do this sort of thing a lot in chemistry where we try to analyze chemistry.) It depends critically on various parameters and conditions.

  1. Is the image a bitmap (pixels only) or vectors (EMF, WMF, SVG, PS, PDF...)? Vectors are vastly better than pixels. We tackle vectors (including PDF) but don't touch pixels. Some of our collaborators will try to use pixels but only on fairly recent documents.
  2. If you are stuck with pixels then are your images all from the same source? If so you have a small chance of extracting font information. I am afraid your image is so poor that it would require a great deal of work. However if you can work out the font you have a chance of extracting text and numbers if all docs are from the same source. You could use heuristics (rules such as where the numbers might be) or machine-learning (a list of features on which the methods can be trained).
  3. Your image appears to have been scanned (as the axes are pixelated). That makes it even worse. What appears to be a straight line to the eye is horrible for a machine. Is your image skewed on the page? You may have to deskew it.
  4. If you have a model for the lines and curves then you may have a chance of fitting the model's expected parameters to the image. But it's not trivial.
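As a small illustration of the deskewing mentioned in point 3, the sketch below estimates the skew angle from two points picked on an axis that should be horizontal, then rotates traced coordinates by the opposite angle. The names `skew_angle` and `deskew` and the sample points are hypothetical, and this corrects pure rotation only, not the scanner distortions that make real images hard.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def skew_angle(p_left: Point, p_right: Point) -> float:
    """Angle in radians of the line through two points that should be level."""
    return math.atan2(p_right[1] - p_left[1], p_right[0] - p_left[0])


def deskew(points: List[Point], angle: float, origin: Point = (0.0, 0.0)) -> List[Point]:
    """Rotate points by -angle about origin, undoing a skew of +angle."""
    c, s = math.cos(-angle), math.sin(-angle)
    ox, oy = origin
    out = []
    for x, y in points:
        dx, dy = x - ox, y - oy
        out.append((ox + c * dx - s * dy, oy + s * dx + c * dy))
    return out


# Hypothetical: two points picked on the x-axis of a slightly rotated scan
angle = skew_angle((0.0, 0.0), (100.0, 10.0))
levelled = deskew([(0.0, 0.0), (100.0, 10.0)], angle)
```

After the rotation both axis points share the same y coordinate, so the axis is level again. In practice the angle would more likely be estimated from many pixels along the axis rather than from two hand-picked points.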

I'm sorry to be pessimistic. If you really want the info then it can be done with a lot of investment or collaboration with groups which do this sort of thing.

peter.murray.rust answered Sep 21 '22 16:09