
Convert text document to numpy array of ASCII numbers in python

I have a large plain text document (UTF-8) that contains letters, numbers, spaces, and special characters.

I want to convert all the individual characters in the text document into numbers, and then represent the document as a numpy array.

Can I use the built-in Python ord() function for this?

My understanding is that it returns an integer representing the Unicode code point of a character, but it only takes one character at a time, and I'm wondering if there's a better way to convert a large text document to numbers.

Or can I just iterate through the entire document with the ord() function?

edit

I basically want to do exactly what this tool does, but natively in Python: https://www.browserling.com/tools/text-to-ascii

This is what I currently have

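import numpy as np

def convert_to_ascii(text):
    # Build a comma-separated string of code points, one per character
    return ",".join(str(ord(char)) for char in text)

with open('test.txt', 'r') as myfile:
    data = myfile.read()

# Keep the result; the original call discarded it and then used an undefined x
x = convert_to_ascii(data)

# Parse the string back into integers
values = [int(i) for i in x.split(',')]

array = np.array(values)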

Is there a better way to do this?

asked Apr 16 '26 19:04 by borkbork



1 Answer

I've been working on the same issue, and came across a much simpler and faster technique, demonstrated below:

import numpy as np

text = 'abcABC00'

letter_array = np.fromiter(text, dtype='c')
letter_array.shape, letter_array.dtype

    ((8,), dtype('S1'))


ascii_array = letter_array.view(np.int8)
ascii_array.shape, ascii_array.dtype, ascii_array

    ((8,), dtype('int8'), array([97, 98, 99, 65, 66, 67, 48, 48], dtype=int8))

I included intermediate values just to show what's going on, but the production code could be reduced to a single line:

ascii_array = np.fromiter(text, dtype='c').view(np.int8)
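One caveat: the one-liner above appears to assume the text is pure ASCII, since each element is a single byte viewed as a signed 8-bit integer. The question mentions a UTF-8 document, so here is a minimal sketch of two alternatives that should also cope with non-ASCII characters. The sample string and the variable names are made up for illustration; which option fits depends on whether you want one number per character or one number per byte.

```python
import numpy as np

# Hypothetical sample text; the last two characters are non-ASCII.
text = "abcABC00 é€"

# Option 1: Unicode code points, one integer per character.
# These are the same values ord() returns, collected without an
# intermediate Python list or string.
code_points = np.fromiter(map(ord, text), dtype=np.uint32, count=len(text))

# Option 2: raw UTF-8 bytes, one integer per byte.
# Non-ASCII characters occupy more than one byte, so this array can be
# longer than the string. The resulting array is read-only.
utf8_bytes = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)

print(code_points)
print(utf8_bytes)
```

For a file, the same idea should apply: read the text with open('test.txt', encoding='utf-8') for Option 1, or read the raw bytes with open('test.txt', 'rb') and pass them straight to np.frombuffer for Option 2.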
answered Apr 19 '26 10:04 by Bill Smith



