Extracting tables from JPEG images into a data frame in R

Tags:

r

I have the following two links:

https://pbs.twimg.com/media/Dv3pIsIUwAEdu--.jpg:large

https://pbs.twimg.com/media/Dv3lKfjV4AAkIpY.jpg:large

The data is presented as a table, but in JPEG images. I want to capture this information and turn it into a data frame or tibble.

I tried using tesseract, but the results were not good. My code is below:

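library(tesseract)
# create the English OCR engine used below
eng <- tesseract("eng")
# input_1 was not defined in the original post; assumed here to be the first link
input_1 <- "https://pbs.twimg.com/media/Dv3pIsIUwAEdu--.jpg:large"
text <- ocr_data(input_1, engine = eng)
text <- tesseract::ocr_data("https://pbs.twimg.com/media/Dv3lKfjV4AAkIpY.jpg:large", engine = eng)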

Any ideas?


asked Jan 02 '19 02:01

cephalopod

1 Answer

Try some preprocessing, such as converting the image to black and white and removing the grid. This should get you started:

library(magrittr)
library(magick)
#> Linking to ImageMagick 6.9.9.38
#> Enabled features: cairo, fontconfig, freetype, fftw, ghostscript, lcms, pango, rsvg, webp, x11
#> Disabled features:

# download file
url <- "https://pbs.twimg.com/media/Dv3pIsIUwAEdu--.jpg:large"
download.file(url, destfile = "table.jpg", mode = "wb")  # binary mode keeps the image intact on Windows

# convert to black and white
convert_bw <- 'convert table.jpg -fill white -fuzz 20% +opaque "#000000" table_bw.jpg'
system(convert_bw)

# remove grid
remove_grid <- "convert table_bw.jpg -negate -define morphology:compose=darken -morphology Thinning 'Rectangle:1x80+0+0<' -negate table_wo_grid.jpg"
system(remove_grid)

# read img and ocr
data <- image_read("table_wo_grid.jpg") %>%
  image_crop(geometry_area(0, 0, 80, 25)) %>%
  image_ocr() %>%
  stringi::stri_split(fixed = "\n")

head(data[[1]])
#> [1] "10/3/2013 112.32 -0.12 0.11 0.04 0.55 0.05 0.45 555 155 5.55 143,115 23,439 505"         
#> [2] "10/5/2013 112.94 -0.44 0.15 0.04 0.53 0.05 0.45 1,572 2,255 0.75 143,091 23,335 504"     
#> [3] "10/4/2013 115.53 -0.47 0.10 0.04 0.55 0.05 0.45 27,212 4,955 775,473 142,357 27,334 5 22"
#> [4] "10/5/2013 115.35 -0.57 0.00 0.04 0.51 0.05 0.29 25,522 5,312 4.05 131,320 25,340 513"    
#> [5] "10/2/2013 114.42 -0.51 0.01 0.04 0.44 0.05 0.19 470 994 0.47 121,250 25,901 74.53"       
#> [6] "9/23/2013 11495 -0.03 0.07 0.04 0.57 0.05 0.11 20,075 594 50 55 121,437 25,341 774773"

Created on 2019-01-02 by the reprex package (v0.2.1)
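
To get from these OCR lines to a data frame, one option is to split each line on whitespace and keep only the leading fields, since the trailing columns do not always come out as a consistent number of tokens (compare rows 3 and 6 above). The following is a minimal base-R sketch, not a definitive parser. It assumes the first three tokens of each line are the date, the price, and the change; the helper name `parse_ocr_lines` and the column names are hypothetical, and the sample lines are copied from the output above.

```r
# Hypothetical helper: turn whitespace-separated OCR lines into a data frame.
# Assumes the first three tokens of each line are date, price, and change.
parse_ocr_lines <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]            # drop empty lines
  tokens <- strsplit(lines, "[[:space:]]+")
  tokens <- tokens[lengths(tokens) >= 3]   # skip lines too short to parse
  data.frame(
    Date   = vapply(tokens, `[`, character(1), 1),
    Price  = vapply(tokens, `[`, character(1), 2),
    Change = vapply(tokens, `[`, character(1), 3),
    stringsAsFactors = FALSE
  )
}

# Sample lines copied from the OCR output above
ocr_lines <- c(
  "10/3/2013 112.32 -0.12 0.11 0.04 0.55 0.05 0.45 555 155 5.55 143,115 23,439 505",
  "10/5/2013 112.94 -0.44 0.15 0.04 0.53 0.05 0.45 1,572 2,255 0.75 143,091 23,335 504",
  "9/23/2013 11495 -0.03 0.07 0.04 0.57 0.05 0.11 20,075 594 50 55 121,437 25,341 774773"
)

df <- parse_ocr_lines(ocr_lines)
str(df)
```

With the objects from the code above, the call would be `parse_ocr_lines(data[[1]])`. All columns stay as character so that OCR errors such as the missing decimal point in `11495` remain visible for later cleaning.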

EDIT: Transformation without system calls

library(magrittr)
library(magick)
#> Linking to ImageMagick 6.9.9.38
#> Enabled features: cairo, fontconfig, freetype, fftw, ghostscript, lcms, pango, rsvg, webp, x11
#> Disabled features:

# download file
url <- "https://pbs.twimg.com/media/Dv3pIsIUwAEdu--.jpg:large"
download.file(url, destfile = "table.jpg", mode = "wb")  # binary mode keeps the image intact on Windows

# preprocessing
img <- image_read("table.jpg") %>% 
  image_transparent("white", fuzz=82) %>% 
  image_background("white") %>%
  image_negate() %>%
  image_morphology(method = "Thinning", kernel = "Rectangle:20x1+0+0^<") %>%
  image_negate() %>%
  image_crop(geometry_area(0, 0, 80, 25)) 

img

# read img and ocr
data <- img %>%
  image_ocr() 

# some wrangling
data %>%
  stringi::stri_split(fixed = "\n") %>%
  purrr::map(~ stringi::stri_split(str = ., fixed = "‘")) %>%
  .[[1]] %>%
  purrr::map_df(~ tibble::tibble(Date = .[1], Price = .[2], Change = .[3])) %>%
  dplyr::glimpse()
#> Observations: 61
#> Variables: 3
#> $ Date   <chr> "10/3/2013", "10/5/2013", "10/4/2013", "10/5/2013", "10...
#> $ Price  <chr> "11232", "11294", "11553", "11535", "114.42", "11495", ...
#> $ Change <chr> " -0.12", " -0.44", " -0.47", " -0.57", " -0.51", " -0....

Created on 2019-01-03 by the reprex package (v0.2.1)
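
The three columns above are still character vectors, and some prices have lost their decimal point during OCR (for example "11232" next to "114.42"). The sketch below shows one way to convert them to proper types. It rests on an assumption that is not confirmed by the source: every price has exactly two decimals, so a value without a "." can be rescaled by 100. The helper `fix_price` is hypothetical, and the sample values are copied from the `glimpse()` output above.

```r
# Sample values copied from the glimpse() output above
raw <- data.frame(
  Date   = c("10/3/2013", "10/5/2013", "10/2/2013"),
  Price  = c("11232", "11294", "114.42"),
  Change = c(" -0.12", " -0.44", " -0.51"),
  stringsAsFactors = FALSE
)

# Hypothetical helper. Assumption: prices always have two decimals, so a
# value with no "." lost its decimal point in OCR and is divided by 100.
fix_price <- function(x) {
  num <- as.numeric(trimws(x))
  ifelse(grepl(".", x, fixed = TRUE), num, num / 100)
}

clean <- data.frame(
  Date   = as.Date(raw$Date, format = "%m/%d/%Y"),  # month/day/year
  Price  = fix_price(raw$Price),
  Change = as.numeric(trimws(raw$Change))
)
str(clean)
```

This only repairs formatting problems. Digits that were misread by the OCR step itself cannot be detected this way and still need a check against the original image.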


answered Sep 25 '22 04:09

Birger
