How to efficiently search for list elements in a string in Python

Tags:

python

list

I have a list of concepts (concepts) and a list of sentences (sentences) as follows.

concepts = [['natural language processing', 'text mining', 'texts', 'nlp'], ['advanced data mining', 'data mining', 'data'], ['discourse analysis', 'learning analytics', 'mooc']]


sentences = ['data mining and text mining', 'nlp is mainly used by discourse analysis community', 'data mining in python is fun', 'mooc data analysis involves texts', 'data and data mining are both very interesting']

In a nutshell, I want to find the concepts in the sentences. More specifically, given a list in concepts (e.g., ['natural language processing', 'text mining', 'texts', 'nlp']), I want to identify these concepts in a sentence and replace them with the list's first element (i.e. natural language processing).

Example: if we consider the sentence data mining and text mining, the result should be advanced data mining and natural language processing (because the first elements of the lists containing data mining and text mining are advanced data mining and natural language processing, respectively).

The results for the above dummy data should be:

['advanced data mining and natural language processing', 'natural language processing is mainly used by discourse analysis community', 'advanced data mining in python is fun', 'discourse analysis advanced data mining analysis involves natural language processing', 'advanced data mining and advanced data mining are both very interesting']

I am currently doing this using regex as follows:

import re

concepts_re = []

for terms in concepts:
    terms_re = "|".join(re.escape(term) for term in terms)
    concepts_re.append(terms_re)

sentences_mapping = []

for sentence in sentences:
    for terms in concepts:
        if len(terms) > 1:
            for term in terms:
                if term in sentence:
                    # Replace any member of this concept group with its first element
                    sentence = re.sub(concepts_re[concepts.index(terms)], terms[0], sentence)
    sentences_mapping.append(sentence)

In my real data set I have about 8 million concepts, so my approach is very inefficient and takes around 5 minutes to process one sentence. I would like to know if there is an efficient way of doing this in Python.

For those who would like to process a longer list of concepts to measure the time, I have attached one here: https://drive.google.com/file/d/1OsggJTDZx67PGH4LupXIkCTObla0gDnX/view?usp=sharing

I am happy to provide more details if needed.


EmJ




3 Answers

The solution provided below has approximately O(n) runtime complexity, where n is the number of tokens in each sentence.

For 5 million sentences and your concepts.txt it performs the required operations in ~30 seconds; see the basic test in the third section.

When it comes to space complexity, you will have to keep a nested dictionary structure (let's simplify it like this for now); say it's O(c*u), where u is the number of unique tokens for concepts of a certain token-wise length and c is the length of the concept.

It's hard to pinpoint exact complexities, but it goes roughly like this (for your example data and the concepts.txt you provided those estimates are pretty accurate, and we will get to the gory details as we go through the implementation).

I assume you can split your concepts and sentences on whitespace; if that's not the case, I would advise you to take a look at spaCy, which provides a more intelligent way to tokenize your data.
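
For reference, a minimal spaCy tokenization sketch (assuming the en_core_web_sm model has been downloaded via python -m spacy download en_core_web_sm):

import spacy

# Tokenization only needs the pipeline's tokenizer, but loading the small
# English model is the simplest way to get started.
nlp = spacy.load("en_core_web_sm")

tokens = [token.text for token in nlp("data mining in python is fun")]
print(tokens)  # ['data', 'mining', 'in', 'python', 'is', 'fun']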

1. Introduction

Let's take your example:

concepts = [
    ["natural language processing", "text mining", "texts", "nlp"],
    ["advanced data mining", "data mining", "data"],
    ["discourse analysis", "learning analytics", "mooc"],
]

As you said, each element from concepts has to be mapped to the first one, so in Pythonish it would go roughly along these lines:

mapping = {}
for concept in concepts:
    for alias in concept[1:]:
        mapping[alias] = concept[0]
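
For the example concepts, this mapping would be:

{
    "text mining": "natural language processing",
    "texts": "natural language processing",
    "nlp": "natural language processing",
    "data mining": "advanced data mining",
    "data": "advanced data mining",
    "learning analytics": "discourse analysis",
    "mooc": "discourse analysis",
}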

The task would be easy if all of the concepts were a single token long (which is not the case here) and were unique. Let's focus on the second issue and one particular (slightly modified) example concept to see my point:

["advanced data mining", "data something", "data"]

Here data would be mapped to advanced data mining, BUT data something, which contains data, should be matched before it. If I understand you correctly, you would want this sentence:

"Here is data something and another data"

To be mapped onto:

"Here is advanced data mapping and another advanced data mining"

Instead of naive approach:

"Here is advanced data mapping something and another advanced data mining"

See that for the second example we only mapped data, not data something.

To prioritize data something (and other concepts fitting this pattern) I have used an array structure filled with dictionaries, where concepts that are longer token-wise come earlier in the array.

To continue our example, such an array would look like this:

structure = [
    {"data": {"something": "advanced data mining"}},
    {"data": "advanced data mining"},
]

Notice that if we go through tokens in this order (e.g. first going through the first dictionary with consecutive tokens and, if no match is found, moving on to the second dictionary and so on), we will match the longest concepts first.

2. Code

Okay, I hope you get the basic idea (if not, post a comment down below and I will try to explain unclear parts in more detail).

Disclaimer: I'm not particularly proud of this code, but it gets the job done and could have been worse, I suppose.

2.1 Hierarchical dictionary

First, let's get the longest concept token-wise (excluding the first element, as it's our target and we do not have to change it ever):

def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])

Using this information, we can initialize our structure by creating as many dictionaries as there are different token-wise lengths of concepts (in the example above that would be 2, and the same holds for all of your data; concepts of any length would work, though):

def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]
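
For the example concepts, these helpers give:

longest = get_longest(concepts)
print(longest)                                  # 2 ("text mining", "data mining", ...)
print(init_hierarchical_dictionaries(longest))  # [(1, {}), (0, {})]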

Notice that I am storing the length alongside each dictionary in the array; IMO it makes traversal easier, though you could go without it after some changes to the implementation.

Now that we have those helper functions, we can create the structure from the list of concepts:

def create_hierarchical_dictionaries(concepts: List[List[str]]):
    # Initialization
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)

    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Initialize dictionary; get the one with corresponding length.
            # The longer, the earlier it is in the hierarchy
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # All of the tokens except the last one are another dictionary mapping to
            # the next token in concept.
            for token in tokens[:-1]:
                # setdefault prevents concepts sharing a prefix from overwriting each other
                current_dictionary = current_dictionary.setdefault(token, {})

            # Last token is mapped to the first concept
            current_dictionary[tokens[-1]] = concept[0].split()

    return hierarchical_dictionaries

This function creates our hierarchical dictionary; see the comments in the source code for some explanation. You may want to wrap this structure in a custom class, which should make it easier to use.

This is essentially the same object as described in 1. Introduction.
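
For instance, for the example concepts from section 1 the returned structure looks like this (each final token maps to the target concept split into tokens):

create_hierarchical_dictionaries(concepts)
# [
#     (1, {'text': {'mining': ['natural', 'language', 'processing']},
#          'data': {'mining': ['advanced', 'data', 'mining']},
#          'learning': {'analytics': ['discourse', 'analysis']}}),
#     (0, {'texts': ['natural', 'language', 'processing'],
#          'nlp': ['natural', 'language', 'processing'],
#          'data': ['advanced', 'data', 'mining'],
#          'mooc': ['discourse', 'analysis']}),
# ]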

2.2 Traversing through dictionaries

This part is harder, but let's take a top-down approach this time. We will start easy:

def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)

Given the hierarchical dictionaries, this creates a generator which transforms each sentence according to the concept mapping.

Now the traverse function:

def traverse(sentence: str, hierarchical_dictionaries):
    # Get all tokens in the sentence
    tokens = sentence.split()
    output_sentence = []
    # Initialize index to the first token
    index = 0
    # Until any tokens left to check for concepts
    while index < len(tokens):
        # Iterate over hierarchical dictionaries (elements of the array)
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            # New index is returned based on match and token-wise length of concept
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            # Concept was found in current hierarchical_dictionary_tuple, let's add it
            # to output
            if concept is not None:
                output_sentence.extend(concept)
                # No need to check other hierarchical dictionaries for matching concept
                break
        # Token (and its next tokens) do not match with any concept, keep the original
        else:
            output_sentence.append(tokens[index])
        # Increment index in order to move to the next token
        index += 1

    # Join list of tokens into a sentence
    return " ".join(output_sentence)

Once again, if you're not sure what's going on, post a comment.
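
As a quick check with the example data from section 1:

hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
print(traverse("data mining and text mining", hierarchical_dictionaries))
# advanced data mining and natural language processing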

Using this approach, pessimistically, we will perform O(n*c!) checks, where n is the number of tokens in the sentence and c is the token-wise length of the longest concept (hence the factorial). This case is extremely unlikely to happen in practice: every token in the sentence would have to almost perfectly fit the longest concept, and all shorter concepts would have to be prefixes of the longest one (like super data mining, super data and data).

It would be much closer to O(n) for any practical problem; as I've said before, using the data you provided in the .txt file it's O(3 * n) worst case, usually O(2 * n).

Traversing through each dictionary:

def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Get the level of nested dictionaries and initial dictionary
    length, current_dictionary = hierarchical_dictionary_tuple
    # inner_index will loop through tokens until match or no match was found
    inner_index = index
    for _ in range(length):
        # Get next nested dictionary and move inner_index to the next token
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        # If no match was found in any level of dictionary
        # Return current index in sentence and None representing lack of concept.
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None

    # If everything went fine through all nested dictionaries, check whether
    # last token corresponds to concept
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    # If so, return inner_index (we have moved length tokens, so we have to update it)
    return inner_index, concept

This constitutes "the meat" of my solution.

3. Results

Now the whole source code is provided below (concepts.txt is the file you provided):

import ast
import time
from typing import List


def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])


def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]


def create_hierarchical_dictionaries(concepts: List[List[str]]):
    # Initialization
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)

    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Initialize dictionary; get the one with corresponding length.
            # The longer, the earlier it is in the hierarchy
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # All of the tokens except the last one are another dictionary mapping to
            # the next token in concept.
            for token in tokens[:-1]:
                # setdefault prevents concepts sharing a prefix from overwriting each other
                current_dictionary = current_dictionary.setdefault(token, {})

            # Last token is mapped to the first concept
            current_dictionary[tokens[-1]] = concept[0].split()

    return hierarchical_dictionaries


def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Get the level of nested dictionaries and initial dictionary
    length, current_dictionary = hierarchical_dictionary_tuple
    # inner_index will loop through tokens until match or no match was found
    inner_index = index
    for _ in range(length):
        # Get next nested dictionary and move inner_index to the next token
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        # If no match was found in any level of dictionary
        # Return current index in sentence and None representing lack of concept.
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None

    # If everything went fine through all nested dictionaries, check whether
    # last token corresponds to concept
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    # If so, return inner_index (we have moved length tokens, so we have to update it)
    return inner_index, concept


def traverse(sentence: str, hierarchical_dictionaries):
    # Get all tokens in the sentence
    tokens = sentence.split()
    output_sentence = []
    # Initialize index to the first token
    index = 0
    # Until any tokens left to check for concepts
    while index < len(tokens):
        # Iterate over hierarchical dictionaries (elements of the array)
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            # New index is returned based on match and token-wise length of concept
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            # Concept was found in current hierarchical_dictionary_tuple, let's add it
            # to output
            if concept is not None:
                output_sentence.extend(concept)
                # No need to check other hierarchical dictionaries for matching concept
                break
        # Token (and its next tokens) do not match with any concept, keep the original
        else:
            output_sentence.append(tokens[index])
        # Increment index in order to move to the next token
        index += 1

    # Join list of tokens into a sentence
    return " ".join(output_sentence)


def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)


def sanity_check():
    concepts = [
        ["natural language processing", "text mining", "texts", "nlp"],
        ["advanced data mining", "data mining", "data"],
        ["discourse analysis", "learning analytics", "mooc"],
    ]
    sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]

    targets = [
        "advanced data mining and natural language processing",
        "natural language processing is mainly used by discourse analysis community",
        "advanced data mining in python is fun",
        "discourse analysis advanced data mining analysis involves natural language processing",
        "advanced data mining and advanced data mining are both very interesting",
    ]

    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)

    results = list(embed_sentences(sentences, hierarchical_dictionaries))
    if results == targets:
        print("Correct results")
    else:
        print("Incorrect results")


def speed_check():
    with open("./concepts.txt") as f:
        concepts = ast.literal_eval(f.read())

    initial_sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]

    sentences = initial_sentences.copy()

    for i in range(1_000_000):
        sentences += initial_sentences

    start = time.time()
    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
    middle = time.time()
    letters = []
    for result in embed_sentences(sentences, hierarchical_dictionaries):
        letters.append(result[0].capitalize())
    end = time.time()
    print(f"Time for hierarchical creation {(middle-start) * 1000.0} ms")
    print(f"Time for embedding {(end-middle) * 1000.0} ms")
    print(f"Overall time elapsed {(end-start) * 1000.0} ms")


def main():
    sanity_check()
    speed_check()


if __name__ == "__main__":
    main()

Results of the speed check are provided below:

Time for hierarchical creation 107.71822929382324 ms
Time for embedding 30460.427284240723 ms
Overall time elapsed 30568.145513534546 ms

So, for 5 million sentences (the 5 sentences you provided, concatenated 1 million times) and the concepts file you provided (1.1 MB), it takes approximately 30 seconds to perform the concept mapping, which isn't bad, I suppose.

The dictionary should take, in the worst case, about as much memory as your input file (concepts.txt in this case), but it will usually be lower or much lower, as it depends on the combination of concept lengths and the number of unique words within them.


Szymon Maszke


Use a suffix array approach.

First, sanitize your data by replacing all whitespace characters with a character that you know won't be part of any concept or sentence (skip this step if your data is already sanitized).

Then build suffix arrays for all the sentences. This takes O(n log n) time per sentence; there are also algorithms that can do it in O(n) time using suffix trees.

Once you have your suffix arrays ready for all the sentences, just perform a binary search for your concepts.

You can further optimize your search using an LCP array; see Kasai's algorithm for constructing it.

Using both the LCP and suffix arrays, the time complexity of the search can be brought down to O(n).

Edit: This approach is commonly used in genome sequence alignment and is quite popular. You should easily find implementations that suit you.
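
A minimal sketch of the idea in Python, using a naive suffix-array construction for clarity (a real implementation would use one of the O(n log n) or O(n) algorithms mentioned above); the underscore stands in for the sanitized whitespace:

def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix they begin.
    # Fine for a sketch; use an O(n log n) / O(n) algorithm for real workloads.
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurs(text, suffix_array, pattern):
    # Binary search for the first suffix >= pattern, then check it starts with pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(suffix_array) and text[suffix_array[lo]:].startswith(pattern)

sentence = "data_mining_and_text_mining"   # whitespace already replaced by '_'
sa = build_suffix_array(sentence)
print(occurs(sentence, sa, "text_mining"))  # True
print(occurs(sentence, sa, "nlp"))          # False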


HariUserX


import re
concepts = [['natural language processing', 'text mining', 'texts', 'nlp'], ['advanced data mining', 'data mining', 'data'], ['discourse analysis', 'learning analytics', 'mooc']]
sentences = ['data mining and text mining', 'nlp is mainly used by discourse analysis community', 'data mining in python is fun', 'mooc data analysis involves texts', 'data and data mining are both very interesting']

replacementDict = {concept[0] : concept[1:] for concept in concepts}

finderAndReplacements = [(re.compile('(' + '|'.join(replacees) + ')'), replacement)
                         for replacement, replacees in replacementDict.items()]

def sentenceReplaced(findRegEx, replacement, sentence):
    return findRegEx.sub(replacement, sentence, count=0)

def sentencesAllReplaced(sentences, finderAndReplacements=finderAndReplacements):
    for regex, replacement in finderAndReplacements:
        sentences = [sentenceReplaced(regex, replacement, sentence) for sentence in sentences]
    return sentences

print(sentencesAllReplaced(sentences))

  • Setup: I preferred the concepts represented as a dict where the keys are the replacements and the values are the replacees; this is stored in replacementDict.
  • Compile a matching regular expression for each intended replacement group and store it along with its intended replacement in the finderAndReplacements list.
  • The sentenceReplaced function returns an input sentence after substitution is performed. (The order of application here is irrelevant, so parallelization should be possible if we take care to avoid race conditions.)
  • Finally, we cycle through and find/replace for each sentence. (A great deal of parallel structure would offer improved performance.)
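
For the example concepts and sentences above, this prints the expected result from the question:

['advanced data mining and natural language processing', 'natural language processing is mainly used by discourse analysis community', 'advanced data mining in python is fun', 'discourse analysis advanced data mining analysis involves natural language processing', 'advanced data mining and advanced data mining are both very interesting']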

I'd love to see some thorough benchmarking/testing/reporting, because I'm sure there are a lot of subtleties depending on the nature of this task's inputs (concepts, sentences) and the hardware running it.

In the case where the sentences are the dominant input component compared to the concept replacements, I believe compiling the regular expressions will be advantageous. When sentences are few and concepts many, especially if most concepts do not appear in any sentence, compiling these matchers would be a waste. And if there are very many replacees for every replacement, the compiled method may perform poorly or even error out... (Varying assumptions about input parameters offer a multitude of trade-off considerations, as is often the case.)


rigsby