This might be a very broad question. I wanted to create a way to represent strings that would support O(1) append, O(1) append to the left, and O(1) comparison while maintaining O(N) slicing and O(1) indexing. My idea is that I would store unicode characters as their unicode number, and use mathematical operations to add and delete characters from either end. I called it NumString:
class Numstring:
def __init__(self, init_str=""):
self.num = 0
self.length = 0
for char in init_str:
self.append(char)
def toString(self, curr=None):
if curr is None:
curr = self.num
retlst = []
while curr:
char = chr(curr % 10000)
retlst.append(char)
curr = curr // 10000
return "".join(retlst)
def appendleft(self, char):
assert len(char) == 1
self.num *= 10000
self.num += ord(char)
self.length += 1
def append(self, char):
self.num += ord(char)*(10000**self.length)
self.length += 1
def pop(self):
# self.num = self.num % (10000**self.length-1)
self.num = self.num % 10000**(self.length-1)
self.length -= 1
def popleft(self):
self.num = self.num // 10000
self.length -= 1
def compare(self, numstring2):
return self.num == numstring2.num
def size(self):
return self.length
def get(self, start, end=None):
if start >= self.length:
raise IndexError("index out of bounds")
if end and end > self.length + 1:
raise IndexError("index out of bounds")
if end is not None:
if start == end:
return ""
curr = self.num
curr = curr // (10000 ** start)
curr = curr % 10000**(end)
return self.toString(curr)
else:
curr = self.num
curr = curr // (10000 ** start)
char = chr(curr % 10000)
return char
Here is the profiling code:
import string
import time
import random
from NumString import Numstring
import numpy as np
import matplotlib.pyplot as plt
if __name__ == '__main__':
numstring_times = []
regstring_times = []
numstring = Numstring("abc")
regstring = "abc"
for i in range(0, 10000):
char_to_add = random.choice(string.ascii_letters)
start_time = time.time()
numstring.append(char_to_add)
end_time = time.time()
numstring_times.append(end_time-start_time)
start_time = time.time()
regstring += char_to_add
end_time = time.time()
regstring_times.append(end_time-start_time)
plt.plot(numstring_times, label="Numstring")
plt.plot(regstring_times, label="Regular strings")
plt.legend()
plt.show()
It works the way I want it to, but when I tested the time it takes to append to the string and my Numstring class performed way worse. I understood that there would be overhead, but I didn't expect the overall trend to be way worse than concatenating strings in python.
Curiously, comparison times are actually better for Numstrings than regular strings:
I've realized I don't really understand how integer operations in Python are implemented and what the limitations of infinite integer precision are. Could somebody help me understand what I'm missing?
Python uses the ob_digit array to store each digit of the number separately in different index locations. Additionally, the ob_size variable is used to store two values. It stores the length of the ob_digit array and the sign of the integer (positive or negative).
Python supports a "bignum" integer type which can work with arbitrarily large numbers. In Python 2.5+, this type is called long and is separate from the int type, but the interpreter will automatically use whichever is more appropriate.
In Python, we can perform floor division (also sometimes known as integer division) using the // operator. This operator will divide the first argument by the second and round the result down to the nearest whole number, making it equivalent to the math. floor() function.
int
s are essentially represented as strings of digits (in a higher base than 10), and most operations on them, including addition and multiplication, take time proportional to the number of digits they contain or worse. Because the base you’re using (10000) doesn’t match the base int
s use, operations like multiplying or dividing by the base become a complex operation instead of a simple copying of bytes of memory. So it’s pretty much a less efficient reimplementation of the operations strings already do (which is what you found by benchmarking) and it doesn’t support all of Unicode (which goes up to 0x10FFFF, not 10000).
Hint for a data structure that actually has the properties you’re looking for, though: a deque based on a circular buffer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With