I need to split on the slash and then report the tags. This is the Hunspell dictionary format. I tried to find a class on GitHub that would do this, but could not find one.
# vi test.txt
test/S
boy
girl/SE
home/
house/SE123
man/E
country
wind/ES
The code:
from collections import defaultdict

myl = defaultdict(list)
with open('test.txt') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            myl[tags].append(l.split('/')[0])
            for t in tags:
                myl[t].append(l.split('/')[0])
        except:
            pass
Output:
defaultdict(list,
{'S': ['test', 'test', 'girl', 'house', 'wind'],
'SE': ['girl'],
'E': ['girl', 'house', 'man', 'man', 'wind'],
'': ['home'],
'SE123': ['house'],
'1': ['house'],
'2': ['house'],
'3': ['house'],
'ES': ['wind']})
The SE group should have 3 words: 'girl', 'wind' and 'house'. There should be no ES group because it is the same as "SE", and SE123 should remain as is. How do I achieve this?
Update:
I have managed to add bigrams, but how do I add 3-, 4- and 5-grams?
from collections import defaultdict
import nltk

myl = defaultdict(list)
with open('hi_IN.dic') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            ntags = ''.join(sorted(tags))
            myl[ntags].append(l.split('/')[0])
            for t in tags:
                myl[t].append(l.split('/')[0])
            bigrm = list(nltk.bigrams([i for i in tags]))
            nlist = [x + y for x, y in bigrm]
            for t1 in nlist:
                t1a = ''.join(sorted(t1))
                myl[t1a].append(l.split('/')[0])
        except:
            pass
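For completeness, itertools.combinations can generate all of the 3-, 4- and 5-gram keys (and beyond) in one pass, since it yields every subset of a given length and preserves the sorted order of its input. The following is only a rough, untested sketch of that idea using the same test.txt as above:

from collections import defaultdict
from itertools import combinations

myl = defaultdict(list)
with open('test.txt') as f:
    for l in f:
        l = l.rstrip()
        try:
            word, tags = l.split('/', 1)
        except ValueError:
            continue  # line without tags
        tags = ''.join(sorted(tags))
        # every combination of 1..len(tags) tags, already in sorted order
        for n in range(1, len(tags) + 1):
            for combo in combinations(tags, n):
                myl[''.join(combo)].append(word)

# queries must use sorted keys as well, e.g. myl[''.join(sorted('SE'))]

Note that an entry with k tags produces 2**k - 1 keys this way, so it grows quickly for heavily tagged entries.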
I guess it would help if I sort the tags at source:
with open('test1.txt', 'w') as nf:
    with open('test.txt') as f:
        for l in f:
            l = l.rstrip()
            try:
                tags = l.split('/')[1]
            except IndexError:
                nline = l
            else:
                ntags = ''.join(sorted(tags))
                nline = l.split('/')[0] + '/' + ntags
            nf.write(nline + '\n')
This creates a new file test1.txt with sorted tags, but the problem of trigrams and longer n-grams is still not resolved.
I downloaded a sample file:
!wget https://raw.githubusercontent.com/wooorm/dictionaries/master/dictionaries/en-US/index.dic
The report using the "grep" command on index1.dic (the dictionary rewritten with sorted tags, as above) is correct:
!grep 'P.*U' index1.dic
CPU/M
GPU
aware/PU
cleanly/PRTU
common/PRTUY
conscious/PUY
easy/PRTU
faithful/PUY
friendly/PRTU
godly/PRTU
grateful/PUY
happy/PRTU
healthy/PRTU
holy/PRTU
kind/PRTUY
lawful/PUY
likely/PRTU
lucky/PRTU
natural/PUY
obtrusive/PUY
pleasant/PTUY
prepared/PU
reasonable/PU
responsive/PUY
righteous/PU
scrupulous/PUY
seemly/PRTU
selfish/PUY
timely/PRTU
truthful/PUY
wary/PRTU
wholesome/PU
willing/PUY
worldly/PTU
worthy/PRTU
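For comparison, the same check can be written directly in Python by testing the tag field for both letters. This is only a sketch, assuming the downloaded index.dic; it checks the tags only, so it would not list CPU/M and GPU, which grep matches on the word itself:

with open('index.dic') as f:
    for line in f:
        line = line.rstrip()
        try:
            word, tags = line.split('/', 1)
        except ValueError:
            continue  # no tags on this line
        if 'P' in tags and 'U' in tags:
            print(line)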
The Python report using bigrams on the sorted-tags file does not contain all the words listed above:
myl['PU']
['aware',
'aware',
'conscious',
'faithful',
'grateful',
'lawful',
'natural',
'obtrusive',
'prepared',
'prepared',
'reasonable',
'reasonable',
'responsive',
'righteous',
'righteous',
'scrupulous',
'selfish',
'truthful',
'wholesome',
'wholesome',
'willing']
Try this:
myl = dict()
with open('test.txt') as f:
    for l in f:
        l = l.rstrip()
        try:
            tags = l.split('/')[1]
            myl.setdefault(tags, [])
            myl[tags].append(l.split('/')[0])
            for t in tags:
                myl.setdefault(t, [])
                myl[t].append(l.split('/')[0])
        except:
            pass

keys = myl.keys()
for k1 in keys:
    for k2 in keys:
        if len(set(k1).intersection(k2)) == len(set(k1)) and k1 != k2:
            myl[k1].extend([myk2v for myk2v in myl[k2] if myk2v not in myl[k1]])

print(myl)
Output
{'S': ['test', 'test', 'girl', 'house', 'wind'], 'SE': ['girl', 'house', 'wind'], 'E': ['girl', 'house', 'man', 'man', 'wind'], '': ['home', 'test', 'test', 'girl', 'house', 'wind', 'man', 'man'], 'SE123': ['house'], '1': ['house'], '2': ['house'], '3': ['house'], 'ES': ['wind', 'girl', 'house']}
In the last two for loops of the program, the sets of k1 and k2 are built first, and then the length of their intersection is compared with the length of set k1. If they are equal, every tag of k1 also occurs in k2, so the words stored under key k2 also belong under key k1 and are therefore appended to the value of k1.
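In other words, the condition is a subset test; a quick illustration of why 'SE' picks up the words stored under 'SE123' and 'ES':

>>> len(set('SE').intersection('SE123')) == len(set('SE'))
True
>>> len(set('SE').intersection('ES')) == len(set('SE'))
True
>>> set('SE').issubset('SE123')   # an equivalent way to write the same check
True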
If I understand it correctly, this is more a matter of constructing a data structure that, for a given tag, constructs the correct list. We can do this by constructing a dictionary that only takes singular tags into account. Later, when a person queries on multiple tags, we calculate the intersection. This makes the representation compact as well as easy to query: for example, extracting all the elements with tag AC will also list elements with tag ABCD, ACD, ZABC, etc.
We can thus construct a parser:
from collections import defaultdict

class Hunspell(object):

    def __init__(self, data):
        self.data = data

    def __getitem__(self, tags):
        if not tags:
            return self.data.get(None, [])
        elements = [self.data.get(tag, ()) for tag in tags]
        data = set.intersection(*map(set, elements))
        return [e for e in self.data.get(tags[0], ()) if e in data]

    @staticmethod
    def load(f):
        data = defaultdict(list)
        for line in f:
            try:
                element, tags = line.rstrip().split('/', 1)
                for tag in tags:
                    data[tag].append(element)
                data[None].append(element)
            except ValueError:
                pass  # element with no tags
        return Hunspell(dict(data))
The list processing at the end of __getitem__ is done to retrieve the elements in the correct order.
We can then load the file into memory with:
>>> with open('test.txt') as f:
... h = Hunspell.load(f)
and query it for arbitrary keys:
>>> h['SE']
['girl', 'house', 'wind']
>>> h['ES']
['girl', 'house', 'wind']
>>> h['1']
['house']
>>> h['']
['test', 'girl', 'home', 'house', 'man', 'wind']
>>> h['S3']
['house']
>>> h['S2']
['house']
>>> h['SE2']
['house']
>>> h[None]
['test', 'girl', 'home', 'house', 'man', 'wind']
>>> h['4']
[]
Querying for non-existing tags results in an empty list. Here we thus postpone the "intersection" process until the moment of the query. We could in fact generate all possible intersections in advance, but this would result in a large data structure, and perhaps a large amount of work.
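To get a feel for why precomputing is expensive: a single entry with k tags would need a key for every non-empty subset of those tags, i.e. 2**k - 1 keys (purely illustrative):

>>> from itertools import combinations
>>> tags = 'PRTUY'   # five tags, as in common/PRTUY above
>>> sum(1 for n in range(1, len(tags) + 1) for _ in combinations(tags, n))
31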
A much simpler version of what Willem Van Onsem already answered:
from collections import defaultdict

data = defaultdict(list)
with open('test.txt') as f:
    for line in f.readlines():
        try:
            element, tags = line.rstrip().split('/', 1)
            print(element)
            for tag in tags:
                data[tag].append(element)
            data[None].append(element)
        except ValueError:
            pass

def parse(data, tag):
    if tag is None or tag == '':
        return set(data[None])
    elements = [set(data[tag_i]) for tag_i in tag]
    return set.intersection(*map(set, elements))
>>> parse(data, 'ES')
{'girl', 'house', 'wind'}
>>> parse(data, None)
{'girl', 'house', 'man', 'wind'}
>>> parse(data, '')
{'girl', 'house', 'man', 'wind'}