Okay. So I have about 250,000 high-resolution images. What I want to do is go through all of them and find the ones that are corrupted. If you know what 4scrape is, then you know the nature of the images I have.
Corrupted, to me, means that when the image is loaded into Firefox, it says:
The image “such and such image” cannot be displayed, because it contains errors.
Now, I could select all of my 250,000 images (~150 GB) and drag-and-drop them into Firefox. That would be bad, though, because I don't think Mozilla designed Firefox to open 250,000 tabs. No, I need a way to programmatically check whether an image is corrupted.
Does anyone know a PHP or Python library which can do something along these lines? Or an existing piece of software for Windows?
I have already removed obviously corrupted images (such as ones that are 0 bytes) but I'm about 99.9% sure that there are more diseased images floating around in my throng of a collection.
If the image is a JPEG, use this:
// JPEGImageDecoder and FileImageSource are in the internal sun.awt.image package
JPEGImageDecoder decoder = new JPEGImageDecoder(new FileImageSource(f), new FileInputStream(f));
decoder.produceImage();
If it throws an exception, the image is corrupted.
An easy way would be to try loading and verifying the files with PIL (Python Imaging Library).
from PIL import Image
v_image = Image.open(file)
v_image.verify()
Catch the exceptions raised by either call; a file that raises one is corrupted or unreadable.
From the documentation:
im.verify()
Attempts to determine if the file is broken, without actually decoding the image data. If this method finds any problems, it raises suitable exceptions. This method only works on a newly opened image; if the image has already been loaded, the result is undefined. Also, if you need to load the image after using this method, you must reopen the image file.
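To run that check over a whole collection, something like the following might work. It is only a sketch: the names `pil_is_ok` and `find_corrupt` are made up for illustration, and it assumes Pillow is installed. Since `verify()` does not decode the image data, the sketch reopens each file and calls `load()` as well, on the assumption that a full decode will catch truncated files that `verify()` lets through.

```python
import os


def pil_is_ok(path):
    """Return True if Pillow can both verify and fully decode the file."""
    # Imported lazily so find_corrupt() can be used with other checkers too.
    from PIL import Image
    try:
        with Image.open(path) as im:
            im.verify()   # structural check, no pixel decoding
        with Image.open(path) as im:
            im.load()     # reopen (required after verify) and force a full decode
        return True
    except Exception:
        # Different formats raise different exception types, so catch broadly.
        return False


def find_corrupt(root, is_ok=pil_is_ok):
    """Walk `root` and return the sorted paths for which is_ok(path) is false."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not is_ok(path):
                bad.append(path)
    return sorted(bad)
```

Calling `find_corrupt("D:/images")` (a placeholder path) would return the list of suspect files. The checker is passed in as an argument so that a different test can be swapped in without touching the directory walk.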
I suggest you check out ImageMagick for this: http://www.imagemagick.org/
It includes a tool called identify, which you can either drive from a script (reading its stdout) or use through the programming interface it provides.
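Driving identify from a script could look roughly like this in Python. This is a sketch under assumptions: the function name `passes_external_check` is hypothetical, it assumes `identify` is on the PATH, and it assumes the tool signals an unreadable file with a non-zero exit status. Plain `identify` may read little more than the header, so check ImageMagick's documentation for options that force a fuller read before relying on it.

```python
import subprocess


def passes_external_check(path, command=("identify",)):
    """Run `command` with `path` appended; treat exit status 0 as 'readable'."""
    result = subprocess.run(
        [*command, path],
        stdout=subprocess.DEVNULL,  # only the exit status is used here
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

If the command is not installed, `subprocess.run` raises `FileNotFoundError`, which is deliberately left uncaught so that a missing tool is not mistaken for a corrupted image.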
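In PHP, with exif_imagetype(), which reads the first bytes of the file and checks its signature:
if (exif_imagetype($filename) === false)
{
    unlink($filename); // signature not recognized: treat the image as corrupted
}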
EDIT: Or you can try to fully load the image with ImageCreateFromString():
if (ImageCreateFromString(file_get_contents($filename)) === false)
{
    unlink($filename); // image is corrupted
}
An image resource will be returned on success. FALSE is returned if the image type is unsupported, the data is not in a recognized format, or the image is corrupt and cannot be loaded.
If your exact requirement is that the image displays correctly in Firefox, you may have a difficult time: the only way to be sure would be to link against the exact same image-loading source code that Firefox uses.
Basic image corruption (file is incomplete) can be detected simply by trying to open the file using any number of image libraries.
However, many images can fail to display simply because they exercise a part of the file format that the particular viewer you are using can't handle (GIF in particular has a lot of these edge cases, but you can also find JPEGs and the rare PNG file that can only be displayed in specific viewers). There are also some ugly JPEG edge cases where the file appears to be uncorrupted in viewer X, but in reality the file has been cut short and only displays correctly because very little information has been lost. Firefox can show some cut-off JPEGs correctly (you get a grey bottom), but with others Firefox seems to load them halfway and then displays the error message instead of the partial image.
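For the cut-short JPEG case specifically, one cheap heuristic is to check that the file starts with the JPEG start-of-image marker (FF D8) and ends with the end-of-image marker (FF D9). The sketch below is not what Firefox does, and the name `jpeg_markers_present` is made up for illustration. It will wrongly flag valid files that carry extra bytes after the end marker, and it cannot detect damage in the middle of a file, so treat it as a quick first pass only.

```python
import os

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def jpeg_markers_present(path):
    """Return True if the file begins with SOI and ends with EOI."""
    with open(path, "rb") as f:
        head = f.read(2)
        f.seek(0, os.SEEK_END)
        if f.tell() < 4:          # too short to hold both markers
            return False
        f.seek(-2, os.SEEK_END)   # jump to the last two bytes
        tail = f.read(2)
    return head == SOI and tail == EOI
```

Only four bytes per file are read, so this pass stays fast even over a very large collection.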