Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How best to store large sequences of text in Python?

I recently discovered that a student of mine was doing an independent project in which he was using very large strings (2-4MB) as values in a dictionary.

I've never had a reason to work with such large blocks of text and it got me wondering if there were performance issues associated with creating such large strings.

Is there a better way of doing it than to simply create a string? I realize this question is largely context dependent, but I'm looking for generalized answers that may cover more than one possible use-case.

If you were working with that much text, how would you store it in your code, and would you do anything different than if you were simply working with an ordinary string of only a few characters?

like image 401
temporary_user_name Avatar asked Nov 11 '22 15:11

temporary_user_name


1 Answers

It depends a lot on what you're doing with the strings. I'm not exactly sure how Python stores strings but I've done a lot of work on XEmacs (similar to GNU Emacs) and on the underlying implementation of Emacs Lisp, which is a dynamic language like Python, and I know how strings are implemented there. Strings are going to be stored as blocks of memory similar to arrays. There's not a huge issue creating large arrays in Python, so I don't think simply storing the strings this way will cause performance issues. Some things to consider though:

  1. How are you building up the string? If you build up piece-by-piece by simply appending to ever larger strings, you have an O(N^2) algorithm that will be very slow. Java handles this with a StringBuilder class. I'm not sure if there's an exact equivalent in Python but you can simply create an array with all the parts you want to join together, then join at the end using ''.join(array).

  2. Do you need to search the string? This isn't related to creating the strings but it's something to consider. Searching will in general be O(n) in the size of the string; there are speedups that make it O(n/m) where m is the size of the substring you're searching for, but that's about it. The main consideration here is whether to store one big string or a series of substrings. If you need to search all the substrings, that won't help much over searching a big string, but it's possible you might know in advance that some parts don't need to be searched.

  3. Do you need to access substrings? Again, this isn't related to creating the strings, it's something to consider. Accessing a substring by position is just a matter of indexing to the right memory location, but if you need to take large substrings, it may be inefficient, and you might be able to speed things up by storing your string as an array of substrings, and then creating a new string as another array with some of the strings shared. However, doing it this way takes work, and shouldn't be done unless it's really necessary.

In sum, I think for simple cases it's fine to have large strings like this, but you should think about the sorts of operations you're going to perform and what their O(...) time is.

like image 199
Urban Vagabond Avatar answered Nov 14 '22 23:11

Urban Vagabond