Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python pandas counting the number of unique string sources for substrings

Let's say I have a list of 5 character strings like:

AAAAB
BBBBA
BBBBA
ABBBB

And I want to find and count every possible 4 character substring and keep track of the number of unique 5 character strings they come from. Meaning while BBBB is found in three different string sources there is only two unique sources.

Example output:

    substring    repeats    unique sources
0     AAAA          1              1
1     AAAB          1              1
2     BBBB          3              2
3     BBBA          2              1
4     ABBB          1              1

I have managed to do that on a small scale with just Python, a dictionary that gets updated, and two lists for comparing already existing substrings and full length strings. However, when applying that to my full data set (~160 000 full length strings (12 character) producing 150 million substrings (4 character)) the constant dictionary updating and list comparison process is too slow (my script's been running for a week now). Counting the number of substrings present across all full length strings is easily and cheaply accomplished in both Python and pandas.

So my question is: How do I efficiently count and update the count for unique full length sources for substrings in my DataFrame?

like image 285
Terribliny Avatar asked Nov 07 '22 01:11

Terribliny


1 Answers

TLDR: Here is an attempt that takes an estimated ~2 hours on my computer for the scale of data you describe.

import numpy as np
import pandas as pd

def substring_search(fullstrings, sublen=4):
    '''
    fullstrings: array like of strings
    sublen: length of substring to search
    '''
    # PART 1: FIND SUBSTRINGS

    # length of full strings, assumes all are same
    strsize = len(fullstrings[0])

    # get unique strings, # occurences
    strs, counts = np.unique(fullstrings, return_counts=True)
    fullstrings = pd.DataFrame({'string':strs,
                                'count':counts})
    unique_n = len(fullstrings)

    # create array to hold substrings
    substrings = np.empty(unique_n * (strsize - sublen + 1), dtype=str)
    substrings = pd.Series(substrings)

    # slice to find each substring
    c = 0
    while c + sublen <= strsize:
        sliced = fullstrings['string'].str.slice(c, c+sublen)
        s = c * unique_n
        e = s + unique_n
        substrings[s: e] = sliced
        c += 1

    # take the set of substrings, save in output df
    substrings = np.unique(substrings)
    output = pd.DataFrame({'substrings':substrings,
                           'repeats': 0,
                           'unique_sources': 0})

    # PART 2: CHECKING FULL STRINGS FOR SUBSTRINGS

    for i, s in enumerate(output['substrings']):
        # check which fullstrings contain each substring
        idx = fullstrings['string'].str.contains(s)
        count = fullstrings['count'][idx].sum()
        output.loc[i, 'repeats'] = count
        output.loc[i, 'unique_sources'] = idx.sum()
    print('Finished!')

    return output

Applied to your example:

>>> example = ['AAAAB', 'BBBBA', 'BBBBA', 'ABBBB']
>>> substring_search(example)

  substrings  repeats  unique_sources
0       AAAA        1               1
1       AAAB        1               1
2       ABBB        1               1
3       BBBA        2               1
4       BBBB        3               2

Explanation

The basic idea in the above code is to loop over all the unique substrings, and (for each of them) check against the list of full strings using pandas str methods. This saves one for loop (i.e. you don't loop over each full string for each substring). The other idea is to only check unique full strings (in addition to unique substrings); you save the number of occurrences of each full string beforehand and correct the count at the end.

The basic structure is:

  1. Get the unique strings in the input, and record the number of times each occurs.
  2. Find all unique substrings in the input (I do this using pandas.Series.str.slice)
  3. Loop over each substring, and use pandas.Series.str.contains to (element-wise) check the full strings. Since these are unique and we know the number of times each occurred, we can fill both repeats and unique_sources.

Testing

Here is code I used to create larger input data:

n = 100
size = 12

letters = list(string.ascii_uppercase[:20])
bigger = [''.join(np.random.choice(letters, size)) for i in range(n)]

So bigger is n size-length strings:

['FQHMHSOIEKGO',
 'FLLNCKAHFISM',
 'LDKKRKJROIRL',
 ...
 'KDTTLOKCDMCD',
 'SKLNSAQQBQHJ',
 'TAIAGSIEQSGI']

With modified code that prints the progress (posted below), I tried with n=150000 and size=12, and got this initial output:

Starting main loop...
5%, 344.59 seconds
10.0%, 685.28 seconds

So 10 * 685 seconds / 60 (seconds/minute) = ~114 minutes. So 2 hours is not ideal, but practically more useful than 1-week. I don't doubt that there is some much cleverer way to do this, but maybe this can be helpful if nothing else is posted.

If you do use this code, you may want to verify that the results are correct with some smaller examples. One thing I was unsure of is whether you want to count whether a substring just appears in each full string (i.e. contains), or if you wanted the number of times it appears in a full string (i.e. count). That would at least hopefully be a small change.

Here is the additional code for printing progress while doing the search; there are just additional statements in #PART 2:

def substring_search_progress(fullstrings, sublen=4):
    '''
    fullstrings: array like of strings
    sublen: length of substring to search
    '''
    # PART 1: FIND SUBSTRINGS

    # length of full strings, assumes all are same
    strsize = len(fullstrings[0])

    # get unique strings, # occurences
    strs, counts = np.unique(fullstrings, return_counts=True)
    fullstrings = pd.DataFrame({'string':strs,
                                'count':counts})
    unique_n = len(fullstrings)

    # create array to hold substrings
    substrings = np.empty(unique_n * (strsize - sublen + 1), dtype=str)
    substrings = pd.Series(substrings)

    # slice to find each substring
    c = 0
    while c + sublen <= strsize:
        sliced = fullstrings['string'].str.slice(c, c+sublen)
        s = c * unique_n
        e = s + unique_n
        substrings[s: e] = sliced
        c += 1

    # take the set of substrings, save in output df
    substrings = np.unique(substrings)
    output = pd.DataFrame({'substrings':substrings,
                           'repeats': 0,
                           'unique_sources': 0})

    # PART 2: CHECKING FULL STRINGS FOR SUBSTRINGS

    # for marking progress
    total = len(output)
    every = 5
    progress = every

    # main loop
    print('Starting main loop...')
    start = time.time()
    for i, s in enumerate(output['substrings']):

        # progress
        if (i / total * 100) > progress:
            now = round(time.time() - start, 2)
            print(f'{progress}%, {now} seconds')
            progress = (((i / total * 100) // every) + 1) * every

        # check which fullstrings contain each substring
        idx = fullstrings['string'].str.contains(s)
        count = fullstrings['count'][idx].sum()
        output.loc[i, 'repeats'] = count
        output.loc[i, 'unique_sources'] = idx.sum()
    print('Finished!')

    return output
like image 127
Tom Avatar answered Nov 14 '22 23:11

Tom