
Tesseract in R not recognizing "&"

I am a beginner in R programming and am supposed to write code that reads text from images. I am using the tesseract and magick packages for this, and I am facing an issue where the code converts an "&" to "8:". I have attached the image that I am using as input: Image used for processing

Below is the code that I am running:

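library(magick)     # image_read(), image_resize(), ...; also provides %>%
library(tesseract)  # OCR engine used by image_ocr()

# Upscale, convert to grayscale, trim the borders, then run OCR
test2 <- image_read("C:/Users/admin/Desktop/testimage.jpg") %>%
  image_resize("2000") %>%
  image_convert(colorspace = 'gray') %>%
  image_trim() %>%
  image_ocr()
cat(test2)
write.table(test2, "C:/Users/admin/Desktop/output2.txt", sep="\t")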

Below is the output that I am getting:

No relation between boycotting
panchayat polls 8: Article 35A:
Subramanian Swamy

I have referred to the following source to gain some understanding but did not find any suitable solution for this specific problem.

I have also gone through this website but did not find much help in reading in special characters.

Any help would be appreciated.

H Dave asked Sep 19 '18 16:09


1 Answer

Can you use ImageMagick with a TIF instead of a JPG to do the same? I used the code below and it worked.

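# Same pipeline as in the question, but reading a TIF and upscaling to 4000 px
test20 <- image_read("E:/xx/image.tif") %>%
  image_resize("4000") %>%
  image_convert(colorspace = 'gray') %>%
  image_trim() %>%
  image_ocr()
cat(test20)
# Write test20, the result computed above
write.table(test20, "E:/xx/output.txt", sep="\t")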
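If only a JPG is available, one way to try the TIF route is to re-save the image with magick first. The following is a rough sketch, not a tested fix for the "&" problem: `convert_to_tiff` and `ocr_image` are hypothetical helper names made up for illustration, and the paths in the usage lines are placeholders. Keep in mind that re-saving a JPG as TIF cannot bring back detail already lost to JPEG compression, so exporting the TIF from the original source is likely the better option when that is possible.

```r
library(magick)  # image_read(), image_write(), image_ocr(); also provides %>%

# Hypothetical helper: re-save an image as TIFF and return the new path.
convert_to_tiff <- function(input_path, output_path) {
  img <- image_read(input_path)
  image_write(img, path = output_path, format = "tiff")
  output_path
}

# Hypothetical helper: the preprocessing steps from the answer, then OCR.
# image_ocr() requires the tesseract package to be installed.
ocr_image <- function(path, width = "4000") {
  image_read(path) %>%
    image_resize(width) %>%
    image_convert(colorspace = "gray") %>%
    image_trim() %>%
    image_ocr()
}

# Usage (placeholder paths, adjust to your own files):
# tif_path <- convert_to_tiff("C:/Users/admin/Desktop/testimage.jpg",
#                             "C:/Users/admin/Desktop/testimage.tif")
# cat(ocr_image(tif_path))
```

Comparing the OCR text from the JPG and from the TIF side by side should show whether the file format or the resize width is what changes how the "&" is read.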
Ronnie answered Oct 28 '22 03:10