
Group and classify words as well as characters

I need to split each line on the slash and then report the words grouped by their tags. This is the Hunspell dictionary format. I tried to find a class on GitHub that would do this, but could not find one.

# vi test.txt
test/S
boy
girl/SE
home/
house/SE123
man/E
country
wind/ES

The code:

from collections import defaultdict

myl = defaultdict(list)

with open('test.txt') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            myl[tags].append(l.split('/')[0])
            for t in tags:
                myl[t].append(l.split('/')[0])
        except IndexError:
            pass  # line has no '/'

output:

defaultdict(list,
            {'S': ['test', 'test', 'girl', 'house', 'wind'],
             'SE': ['girl'],
             'E': ['girl', 'house', 'man', 'man', 'wind'],
             '': ['home'],
             'SE123': ['house'],
             '1': ['house'],
             '2': ['house'],
             '3': ['house'],
             'ES': ['wind']})

The SE group should have three words: 'girl', 'wind' and 'house'. There should be no ES group because it is the same as SE, and SE123 should remain as is. How do I achieve this?


Update:

I have managed to add bigrams, but how do I add 3-, 4- and 5-grams?

from collections import defaultdict
import nltk

myl = defaultdict(list)

with open('hi_IN.dic') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            ntags = ''.join(sorted(tags))
            myl[ntags].append(l.split('/')[0])
            for t in tags:
                myl[t].append(l.split('/')[0])
            # nltk.bigrams works on any sequence, including a string
            bigrm = list(nltk.bigrams(tags))
            nlist = [x + y for x, y in bigrm]
            for t1 in nlist:
                t1a = ''.join(sorted(t1))
                myl[t1a].append(l.split('/')[0])
        except IndexError:
            pass  # line has no '/'

I guess it would help if I sort the tags at source:

with open('test1.txt', 'w') as nf:
    with open('test.txt') as f:
        for l in f:
            l = l.rstrip()
            try:
                tags = l.split('/')[1]
            except IndexError:
                nline = l
            else:
                ntags = ''.join(sorted(tags))
                nline = l.split('/')[0] + '/' + ntags
            nf.write(nline + '\n')

This will create a new file, test1.txt, with sorted tags. But the problem of trigrams and longer n-grams is still not resolved.
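Rather than adding bigrams, trigrams and so on one length at a time, every subset of the sorted tag string can be generated in one pass with itertools.combinations. This is only a sketch of the idea (shown here on the sample lines from test.txt inlined for illustration, not run against a full dictionary):

```python
from collections import defaultdict
from itertools import combinations

lines = """test/S
boy
girl/SE
home/
house/SE123
man/E
country
wind/ES""".splitlines()

myl = defaultdict(list)
for l in lines:
    try:
        word, tags = l.rstrip().split('/', 1)
    except ValueError:
        continue  # no '/' at all, e.g. 'boy'
    ntags = ''.join(sorted(tags))          # 'SE123' -> '123ES'
    # index the word under every subset of its sorted tags:
    # 1-grams, 2-grams, ... up to the full tag string
    for n in range(1, len(ntags) + 1):
        for combo in combinations(ntags, n):
            myl[''.join(combo)].append(word)

print(myl['ES'])   # ['girl', 'house', 'wind'] -- the sorted form of 'SE'
```

Note that a word with k tags is indexed under 2^k − 1 subsets, so for long tag strings the lazy-intersection approach shown in the answers below scales better.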


I downloaded a sample file:

!wget https://raw.githubusercontent.com/wooorm/dictionaries/master/dictionaries/en-US/index.dic

The report using the grep command (on index1.dic, the sorted-tags version of the downloaded file) is correct:

!grep 'P.*U' index1.dic

CPU/M
GPU
aware/PU
cleanly/PRTU
common/PRTUY
conscious/PUY
easy/PRTU
faithful/PUY
friendly/PRTU
godly/PRTU
grateful/PUY
happy/PRTU
healthy/PRTU
holy/PRTU
kind/PRTUY
lawful/PUY
likely/PRTU
lucky/PRTU
natural/PUY
obtrusive/PUY
pleasant/PTUY
prepared/PU
reasonable/PU
responsive/PUY
righteous/PU
scrupulous/PUY
seemly/PRTU
selfish/PUY
timely/PRTU
truthful/PUY
wary/PRTU
wholesome/PU
willing/PUY
worldly/PTU
worthy/PRTU

The Python report using bigrams on the sorted-tags file does not contain all the words listed above:

myl['PU']

['aware',
 'aware',
 'conscious',
 'faithful',
 'grateful',
 'lawful',
 'natural',
 'obtrusive',
 'prepared',
 'prepared',
 'reasonable',
 'reasonable',
 'responsive',
 'righteous',
 'righteous',
 'scrupulous',
 'selfish',
 'truthful',
 'wholesome',
 'wholesome',
 'willing']
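The shortfall has a simple cause: bigrams only pair adjacent characters, so a word like happy/PRTU is indexed under 'PR', 'RT' and 'TU' but never under 'PU'. Non-adjacent pairs require combinations rather than bigrams, as a quick check shows:

```python
from itertools import combinations

tags = 'PRTU'   # sorted tag string of 'happy'

# adjacent pairs only, what bigrams produce
bigrams = [tags[i:i + 2] for i in range(len(tags) - 1)]
# all unordered pairs, adjacent or not
pairs = [''.join(c) for c in combinations(tags, 2)]

print(bigrams)   # ['PR', 'RT', 'TU'] -- 'PU' is missing
print(pairs)     # ['PR', 'PT', 'PU', 'RT', 'RU', 'TU']
```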
asked Nov 01 '18 by shantanuo

3 Answers

Try this:

myl = dict()
with open('test.txt') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            myl.setdefault(tags, [])
            myl[tags].append(l.split('/')[0])
            for t in tags:
                myl.setdefault(t, [])
                myl[t].append(l.split('/')[0])
        except IndexError:
            pass  # line has no '/'

keys = myl.keys()
for k1 in keys:
    for k2 in keys:
        # if every character of k1 occurs in k2, k1 is a subset of k2
        if len(set(k1).intersection(k2)) == len(set(k1)) and k1 != k2:
            myl[k1].extend([myk2v for myk2v in myl[k2] if myk2v not in myl[k1]])
print(myl)

Output

{'S': ['test', 'test', 'girl', 'house', 'wind'], 'SE': ['girl', 'house', 'wind'], 'E': ['girl', 'house', 'man', 'man', 'wind'], '': ['home', 'test', 'test', 'girl', 'house', 'wind', 'man', 'man'], 'SE123': ['house'], '1': ['house'], '2': ['house'], '3': ['house'], 'ES': ['wind', 'girl', 'house']}

In the last two for loops, we compare, for each pair of keys, the intersection of set(k1) and set(k2). If the length of the intersection equals the length of set(k1), then k1 is a subset of k2, so every word filed under k2 also belongs under k1 and is appended (skipping duplicates).
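For illustration, the same subset test can also be spelled with set.issubset, which accepts any iterable, so the key strings can be passed directly:

```python
k1, k2 = 'SE', 'SE123'

# the intersection trick: lengths are equal exactly when k1 is a subset of k2
assert len(set(k1).intersection(k2)) == len(set(k1))

# the same test, written with issubset
assert set(k1).issubset(k2)

# 'ES' and 'SE' normalise to the same set, which is why the two
# groups end up with the same words
assert set('ES') == set('SE')
print('subset checks pass')
```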

answered Nov 09 '22 by myhaspldeep


If I understand it correctly, this is more a matter of constructing a data structure that, for a given tag, produces the correct list. We can do this by building a dictionary that only takes singular tags into account, and computing the intersection when a person queries on multiple tags. This makes the representation compact, and it is easy to extract, for example, all the elements with tag AC: this will list elements with tags ABCD, ACD, ZABC, etc.

We can thus construct a parser:

from collections import defaultdict

class Hunspell(object):

    def __init__(self, data):
        self.data = data

    def __getitem__(self, tags):
        if not tags:
            return self.data.get(None, [])

        elements = [self.data.get(tag, ()) for tag in tags]
        data = set.intersection(*map(set, elements))
        return [e for e in self.data.get(tags[0], ()) if e in data]

    @staticmethod
    def load(f):
        data = defaultdict(list)
        for line in f:
            try:
                element, tags = line.rstrip().split('/', 1)
                for tag in tags:
                    data[tag].append(element)
                data[None].append(element)
            except ValueError:
                pass  # line with no slash at all
        return Hunspell(dict(data))

The list processing at the end of __getitem__ is done to retrieve the elements in the correct order.

We can then load the file into memory with:

>>> with open('test.txt') as f:
...     h = Hunspell.load(f)

and query it for arbitrary keys:

>>> h['SE']
['girl', 'house', 'wind']
>>> h['ES']
['girl', 'house', 'wind']
>>> h['1']
['house']
>>> h['']
['test', 'girl', 'home', 'house', 'man', 'wind']
>>> h['S3']
['house']
>>> h['S2']
['house']
>>> h['SE2']
['house']
>>> h[None]
['test', 'girl', 'home', 'house', 'man', 'wind']
>>> h['4']
[]

Querying for non-existing tags results in an empty list. We thus postpone the "intersection" work until lookup time. We could in fact precompute all possible intersections, but this would result in a large data structure, and perhaps a large amount of work.
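The size concern is easy to quantify: a word with k distinct tags would have to be indexed under 2^k − 1 non-empty tag subsets. A rough sketch of the count (the helper name subset_count is made up for illustration):

```python
from itertools import combinations

def subset_count(tags):
    # number of non-empty tag subsets a word would be indexed under
    return sum(1 for n in range(1, len(tags) + 1)
                 for _ in combinations(tags, n))

print(subset_count('SE'))      # 3
print(subset_count('PRTUY'))   # 31  (2**5 - 1)
```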

answered Nov 08 '22 by Willem Van Onsem


A much simpler version of what Willem Van Onsem already answered:

from collections import defaultdict

data = defaultdict(list)
with open('test.txt') as f:
    for line in f:
        try:
            element, tags = line.rstrip().split('/', 1)
            for tag in tags:
                data[tag].append(element)
            # record the element once, whether or not it has tags
            data[None].append(element)
        except ValueError:
            pass  # line with no slash at all

def parse(data, tag):
    if tag is None or tag == '':
        return set(data[None])
    elements = [set(data[tag_i]) for tag_i in tag]
    return set.intersection(*elements)



>>> parse(data, 'ES')
{'girl', 'house', 'wind'}
>>> parse(data, None)
{'test', 'girl', 'home', 'house', 'man', 'wind'}
>>> parse(data, '')
{'test', 'girl', 'home', 'house', 'man', 'wind'}
answered Nov 09 '22 by Pramit Sawant