I know how to read binary files in Python using NumPy's np.fromfile() function. The issue is that when I do so, the array contains exceedingly large numbers, of the order of 10^100 or so, along with random nan and inf values.
I need to apply machine learning algorithms to this dataset, but I cannot work with the data as it is: the nan values prevent me from normalising it. I've tried np.nan_to_num(), but that doesn't seem to work: after applying it, my min and max values are about 3e-38 and 3e+38 respectively, so I still could not normalize the data.
Is there any way to scale this data down? If not, how should I deal with this?
Thank you.
EDIT:
Some context: I'm working on a malware classification problem. My dataset consists of live malware binaries. They are files of the type .exe, .apk, etc. My idea is to store these binaries as a numpy array, convert it to a grayscale image, and then perform pattern analysis on it.
To open a file in binary format, add 'b' to the mode parameter. Hence the "rb" mode opens the file in binary format for reading, while the "wb" mode opens the file in binary format for writing. Unlike text files, binary files are not human-readable. When opened using any text editor, the data is unrecognizable.
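As a quick sketch of the difference (the path /bin/ls is just an example; the output shown assumes a Linux executable, which starts with the ELF magic bytes):
>>> with open('/bin/ls', 'rb') as f:  # 'rb' = read in binary mode
...     raw = f.read()
...
>>> type(raw)  # binary mode yields bytes, not str
<class 'bytes'>
>>> raw[:4]
b'\x7fELF'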
If you want to make an image out of a binary file, you need to read it in as integers, not floats. Currently, the most common format for images is unsigned 8-bit integers.
As an example, let's make an image out of the first 10,000 bytes of /bin/bash:
>>> import numpy as np
>>> import cv2
>>> xbash = np.fromfile('/bin/bash', dtype='uint8')
>>> xbash.shape
(1086744,)
>>> cv2.imwrite('bash1.png', xbash[:10000].reshape(100,100))
In the above, we used the OpenCV library to write the integers to a PNG file. Any of several other imaging libraries could have been used.
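For example, Pillow could produce the same grayscale PNG (a sketch; the filename bash1_pil.png is just an example):
>>> from PIL import Image
>>> import numpy as np
>>> xbash = np.fromfile('/bin/bash', dtype='uint8')
>>> img = Image.fromarray(xbash[:10000].reshape(100, 100))  # 2D uint8 array -> 8-bit grayscale image
>>> img.save('bash1_pil.png')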
This is what the first 10,000 bytes of bash "look" like:

[100×100 grayscale image rendered from the first 10,000 bytes of /bin/bash]