I have a list of strings, each of which must be no longer than X characters. Each string can contain several sentences (separated by punctuation such as periods). I need to split any string longer than X characters using this logic:
I have to divide it into the minimum number of parts (starting from 2) such that every chunk is no longer than X and the chunks are as similar in length as possible (ideally identical), while respecting punctuation. For example, given Hello. How are you?, I can't split it into Hello. Ho and w are you?; it has to be Hello. and How are you?, because that is the closest to two equal parts without losing the sense of the sentences.
max = 10
strings = ["Hello. How are you? I'm fine", "other string containing dots", "another string containing dots"]
for string in strings:
    if len(string) > max:
        pass  # algorithm to chunk it
In this case, I have to divide the first string, Hello. How are you? I'm fine, into 3 parts, because with 2 parts one of the chunks would be longer than 10 characters (max).
Is there a smart existing solution? Or does anyone know how to do that?
An example function for chunking strings by punctuation (e.g. ".", ",", ";", "?") within minimum and maximum character lengths; in other words, prioritizing punctuation over character length:
import numpy as np

def chunkingStringFunction(strings, charactersDefiningChunking = [".", ",", ";", "?"], numberOfMaximumCharactersPerChunk = None, numberOfMinimumCharactersPerChunk = None, **kwargs):
    if numberOfMaximumCharactersPerChunk is None:
        numberOfMaximumCharactersPerChunk = 100
    if numberOfMinimumCharactersPerChunk is None:
        numberOfMinimumCharactersPerChunk = 2
    storingChunksOfString = []
    for string in strings:
        chunkingStartingAtThisIndex = 0
        indexingCharactersInStrings = 0
        while indexingCharactersInStrings < len(string) - 1:
            indexingCharactersInStrings += 1
            # Candidate chunk: from the current start up to and including the current index.
            currentChunk = string[chunkingStartingAtThisIndex:indexingCharactersInStrings + 1]
            if len(currentChunk) >= numberOfMinimumCharactersPerChunk and len(currentChunk) <= numberOfMaximumCharactersPerChunk:
                # Position in `string` of the first occurrence of each chunking character
                # (start - 1 when the character is absent from the candidate).
                indexesForStops = []
                for indexingCharacterDefiningChunking in range(len(charactersDefiningChunking)):
                    indexesForStops.append(currentChunk.find(charactersDefiningChunking[indexingCharacterDefiningChunking]) + chunkingStartingAtThisIndex)
                # Cut at the latest of those positions.
                indexesForStops = np.max(indexesForStops, axis = None)
                addChunk = string[chunkingStartingAtThisIndex:indexesForStops + 1]
                if len(addChunk) > 1 and addChunk != " ":
                    storingChunksOfString.append(addChunk)
                    chunkingStartingAtThisIndex = indexesForStops + 1
                    indexingCharactersInStrings = chunkingStartingAtThisIndex
        # Note: text after the last chunking character, or past the maximum
        # length without one, is never appended.
    return storingChunksOfString
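The same punctuation-first idea can also be sketched more compactly with a regular expression. This is only a minimal sketch, not the function above: `greedy_chunks` and `max_len` are names made up for illustration. It cuts the text at the chunking characters and then packs the pieces greedily, so it keeps trailing text but makes no attempt to balance the chunk lengths, and a single piece longer than `max_len` is returned as is.

```python
import re

def greedy_chunks(text, max_len, delimiters=".,;?"):
    """Hypothetical helper: cut at delimiters, then pack pieces greedily."""
    d = re.escape(delimiters)
    # Each piece ends with its delimiter(s); a trailing piece may have none.
    pieces = re.findall("[^{0}]*[{0}]+|[^{0}]+".format(d), text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + piece).strip()
        if current and len(candidate) > max_len:
            # Adding this piece would exceed the limit: close the current chunk.
            chunks.append(current)
            current = piece.strip()
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(greedy_chunks("Hello. How are you? I'm fine", 20))
# ['Hello. How are you?', "I'm fine"]
```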
Alternatively, to prioritize character length: start from a target (average) chunk length and, from that position, look for the nearest of the defined chunking characters:
def chunkingStringFunction(strings, charactersDefiningChunking = [".", ",", ";", "?"], averageNumberOfCharactersPerChunk = None, **kwargs):
    if averageNumberOfCharactersPerChunk is None:
        averageNumberOfCharactersPerChunk = 10
    storingChunksOfString = []
    for string in strings:
        lastIndexChunked = 0
        for indexingCharactersInString in range(1, len(string), 1):
            chunkStopsAtADefinedCharacter = False
            if indexingCharactersInString - lastIndexChunked == averageNumberOfCharactersPerChunk:
                indexingNumberOfCharactersAwayFromAverageChunk = 1
                # Widen a window around the current index until it contains a chunking character.
                while chunkStopsAtADefinedCharacter == False:
                    indexingNumberOfCharactersAwayFromAverageChunk += 1
                    # Clamp the window so it never reaches back into text that was already chunked.
                    windowStart = max(indexingCharactersInString - indexingNumberOfCharactersAwayFromAverageChunk, lastIndexChunked)
                    windowEnd = indexingCharactersInString + indexingNumberOfCharactersAwayFromAverageChunk + 1
                    for thisCharacter in charactersDefiningChunking:
                        findingAChunkCharacter = string[windowStart:windowEnd].find(thisCharacter)
                        if findingAChunkCharacter > -1:
                            storingChunksOfString.append(string[lastIndexChunked:windowStart + findingAChunkCharacter + 1])
                            lastIndexChunked = windowStart + findingAChunkCharacter + 1
                            chunkStopsAtADefinedCharacter = True
                            break
                    # Stop searching once the window covers the whole remainder without a match
                    # (otherwise a string with no chunking character would loop forever).
                    if windowStart == lastIndexChunked and windowEnd >= len(string):
                        break
            elif indexingCharactersInString == len(string) - 1 and lastIndexChunked != len(string) - 1 and len(string[lastIndexChunked:indexingCharactersInString + 1]) != 0:
                # End of the string: keep whatever has not been chunked yet.
                storingChunksOfString.append(string[lastIndexChunked:indexingCharactersInString + 1])
    return storingChunksOfString
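Neither function above looks for the minimum number of parts or tries to balance them, which is what the question asks for. A rough sketch of that, under some assumptions: `split_pieces`, `balanced_chunks` and `max_len` are hypothetical names, "as similar as possible" is read as the smallest difference between the longest and shortest chunk, and the search is brute force, so it only suits strings with a modest number of sentences. It returns None when no punctuation-only split fits. That is the case for the question's example with a limit of 10, since "How are you?" alone is 12 characters.

```python
import re
from itertools import combinations

def split_pieces(text, delimiters=".,;?"):
    """Cut text into pieces that end at a delimiter (or at the end of the text)."""
    d = re.escape(delimiters)
    found = re.findall("[^{0}]*[{0}]+|[^{0}]+".format(d), text)
    return [piece.strip() for piece in found if piece.strip()]

def balanced_chunks(text, max_len, delimiters=".,;?"):
    """Fewest chunks of at most max_len characters, as even in length as possible."""
    pieces = split_pieces(text, delimiters)
    if not pieces:
        return []
    # Try 1 part, then 2, 3, ... and stop at the first count that fits.
    for parts in range(1, len(pieces) + 1):
        best = None
        # Every way to place parts - 1 cuts between consecutive pieces.
        for cuts in combinations(range(1, len(pieces)), parts - 1):
            bounds = (0,) + cuts + (len(pieces),)
            chunks = [" ".join(pieces[a:b]) for a, b in zip(bounds, bounds[1:])]
            lengths = [len(chunk) for chunk in chunks]
            if max(lengths) > max_len:
                continue
            spread = max(lengths) - min(lengths)
            if best is None or spread < best[0]:
                best = (spread, chunks)
        if best is not None:
            return best[1]
    return None  # at least one piece is longer than max_len on its own

print(balanced_chunks("Hello. How are you? I'm fine", 13))
# ['Hello.', 'How are you?', "I'm fine"]
print(balanced_chunks("aa. bb. cc. dd.", 7))
# ['aa. bb.', 'cc. dd.']
```

Chunks are rebuilt by joining the pieces with a single space, so the original spacing between sentences is normalized.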