I have a large plain text document (UTF-8) that contains letters, numbers, spaces, and special characters etc.
I want to convert all the individual characters in the text document into numbers, and then represent the document as a numpy array.
Can I use the built-in Python ord() function for this?
My understanding is that it returns an integer representing the Unicode code point of a character, but it only takes one character at a time, so I'm wondering if there's a better way to convert a large text document to numbers.
Or can I just iterate through the entire document, calling ord() on each character?
Edit:
I basically want to do exactly what this tool does, but natively in Python: https://www.browserling.com/tools/text-to-ascii
This is what I currently have:
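import numpy as np

def convert_to_ascii(text):
    # Build a comma-separated string of code points, one per character
    return ",".join(str(ord(char)) for char in text)

with open('test.txt', 'r', encoding='utf-8') as myfile:
    data = myfile.read()

x = convert_to_ascii(data)  # keep the result so it can be parsed below
values = [int(i) for i in x.split(',')]
array = np.array(values)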
Is there a better way to do this?
I've been working on the same issue, and came across a much simpler and faster technique, demonstrated below:
import numpy as np
text = 'abcABC00'
letter_array = np.fromiter(text, dtype='c')
letter_array.shape, letter_array.dtype
# ((8,), dtype('S1'))
ascii_array = letter_array.view(np.int8)
ascii_array.shape, ascii_array.dtype, ascii_array
# ((8,), dtype('int8'), array([97, 98, 99, 65, 66, 67, 48, 48], dtype=int8))
I included the intermediate values just to show what's going on, but the production code could be reduced to a single line:
ascii_array = np.fromiter(text, dtype='c').view(np.int8)
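One caveat: int8 only holds values up to 127, so the example above suits plain ASCII text. Since the question mentions a UTF-8 document, here is a minimal sketch of two alternatives that I believe handle non-ASCII characters as well. The sample string is just an illustration, not taken from the question. The first array holds one integer per UTF-8 byte; the second holds one integer per character, matching what ord() would return.

```python
import numpy as np

# Illustrative sample text containing non-ASCII characters
text = 'abcABC00 é€'

# One integer (0-255) per UTF-8 byte; non-ASCII characters span several bytes
byte_array = np.frombuffer(text.encode('utf-8'), dtype=np.uint8)

# One integer per character, equal to ord(char).
# '<u4' is little-endian uint32, matching the 'utf-32-le' encoding (no BOM).
codepoint_array = np.frombuffer(text.encode('utf-32-le'), dtype='<u4')

print(byte_array)
print(codepoint_array)
```

Both arrays are read-only views of the encoded bytes, so call .copy() on them if you need to modify the values. For a file on disk, np.fromfile('test.txt', dtype=np.uint8) should give the raw byte values directly, without going through a Python string first.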