Given a textfile of lines of 3-tuples:
(0, 12, Tokenization)
(13, 15, is)
(16, 22, widely)
(23, 31, regarded)
(32, 34, as)
(35, 36, a)
(37, 43, solved)
(44, 51, problem)
(52, 55, due)
(56, 58, to)
(59, 62, the)
(63, 67, high)
(68, 76, accuracy)
(77, 81, that)
(82, 91, rulebased)
(92, 102, tokenizers)
(103, 110, achieve)
(110, 111, .)
(0, 3, But)
(4, 14, rule-based)
(15, 25, tokenizers)
(26, 29, are)
(30, 34, hard)
(35, 37, to)
(38, 46, maintain)
(47, 50, and)
(51, 56, their)
(57, 62, rules)
(63, 71, language)
(72, 80, specific)
(80, 81, .)
(0, 2, We)
(3, 7, show)
(8, 12, that)
(13, 17, high)
(18, 26, accuracy)
(27, 31, word)
(32, 35, and)
(36, 44, sentence)
(45, 57, segmentation)
(58, 61, can)
(62, 64, be)
(65, 73, achieved)
(74, 76, by)
(77, 82, using)
(83, 93, supervised)
(94, 102, sequence)
(103, 111, labeling)
(112, 114, on)
(115, 118, the)
(119, 128, character)
(129, 134, level)
(135, 143, combined)
(144, 148, with)
(149, 161, unsupervised)
(162, 169, feature)
(170, 178, learning)
(178, 179, .)
(0, 2, We)
(3, 12, evaluated)
(13, 16, our)
(17, 23, method)
(24, 26, on)
(27, 32, three)
(33, 42, languages)
(43, 46, and)
(47, 55, obtained)
(56, 61, error)
(62, 67, rates)
(68, 70, of)
(71, 75, 0.27)
(76, 77, ‰)
(78, 79, ()
(79, 86, English)
(86, 87, ))
(87, 88, ,)
(89, 93, 0.35)
(94, 95, ‰)
(96, 97, ()
(97, 102, Dutch)
(102, 103, ))
(104, 107, and)
(108, 112, 0.76)
(113, 114, ‰)
(115, 116, ()
(116, 123, Italian)
(123, 124, ))
(125, 128, for)
(129, 132, our)
(133, 137, best)
(138, 144, models)
(144, 145, .)
The goal is to achieve two different data types:
sents_with_positions: a list of lists of tuples, where each tuple looks like a line of the textfile
sents_words: a list of lists of strings made up of only the third element of the tuples from each line of the textfile
E.g. from the input textfile:
sents_words = [
('Tokenization', 'is', 'widely', 'regarded', 'as', 'a', 'solved',
'problem', 'due', 'to', 'the', 'high', 'accuracy', 'that', 'rulebased',
'tokenizers', 'achieve', '.'),
('But', 'rule-based', 'tokenizers', 'are', 'hard', 'to', 'maintain', 'and',
'their', 'rules', 'language', 'specific', '.'),
('We', 'show', 'that', 'high', 'accuracy', 'word', 'and', 'sentence',
'segmentation', 'can', 'be', 'achieved', 'by', 'using', 'supervised',
'sequence', 'labeling', 'on', 'the', 'character', 'level', 'combined',
'with', 'unsupervised', 'feature', 'learning', '.')
]
sents_with_positions = [
[(0, 12, 'Tokenization'), (13, 15, 'is'), (16, 22, 'widely'),
(23, 31, 'regarded'), (32, 34, 'as'), (35, 36, 'a'), (37, 43, 'solved'),
(44, 51, 'problem'), (52, 55, 'due'), (56, 58, 'to'), (59, 62, 'the'),
(63, 67, 'high'), (68, 76, 'accuracy'), (77, 81, 'that'),
(82, 91, 'rulebased'), (92, 102, 'tokenizers'), (103, 110, 'achieve'),
(110, 111, '.')],
[(0, 3, 'But'), (4, 14, 'rule-based'), (15, 25, 'tokenizers'),
(26, 29, 'are'), (30, 34, 'hard'), (35, 37, 'to'), (38, 46, 'maintain'),
(47, 50, 'and'), (51, 56, 'their'), (57, 62, 'rules'),
(63, 71, 'language'), (72, 80, 'specific'), (80, 81, '.')],
[(0, 2, 'We'), (3, 7, 'show'), (8, 12, 'that'), (13, 17, 'high'),
(18, 26, 'accuracy'), (27, 31, 'word'), (32, 35, 'and'),
(36, 44, 'sentence'), (45, 57, 'segmentation'), (58, 61, 'can'),
(62, 64, 'be'), (65, 73, 'achieved'), (74, 76, 'by'), (77, 82, 'using'),
(83, 93, 'supervised'), (94, 102, 'sequence'), (103, 111, 'labeling'),
(112, 114, 'on'), (115, 118, 'the'), (119, 128, 'character'),
(129, 134, 'level'), (135, 143, 'combined'), (144, 148, 'with'),
(149, 161, 'unsupervised'), (162, 169, 'feature'), (170, 178, 'learning'),
(178, 179, '.')]
]
I have been doing it by iterating through the lines of the textfile and collecting the tuples of each sentence in a temporary list; when I reach an empty line, I append that list to sents_with_positions, and I append the last elements of the tuples for each sentence to sents_words.
Code:
sents_with_positions = []
sents_words = []
_sent = []
for line in _input.split('\n'):
    if len(line.strip()) > 0:
        line = line[1:-1]
        start, _, next = line.partition(',')
        end, _, next = next.partition(',')
        text = next.strip()
        _sent.append((int(start), int(end), text))
    else:
        sents_with_positions.append(_sent)
        sents_words.append(list(zip(*_sent))[2])
        _sent = []
But is there a simpler or cleaner way to achieve the same output? Maybe through regexes? Or some itertools trick?
Note that there are cases where there are tricky tuples in the lines of the textfile, e.g.
(86, 87, ))        # Sometimes the token/word is a bracket
(96, 97, ()
(87, 88, ,)        # Sometimes the token/word is a comma
(29, 33, Café)     # The token/word is unicode (sometimes accented), so [a-zA-Z] might be insufficient
(2, 3, 2)          # Sometimes the token/word is a number
(47, 52, 3,000)    # Sometimes the token/word is a number/word with a comma
(23, 29, (e.g.))   # Sometimes the token/word contains a bracket.
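Since the question mentions itertools, here is a minimal sketch (my own addition, not from the original post) that uses itertools.groupby to split the lines into sentences at the blank lines; the helper name parse_sents is made up for illustration:

from itertools import groupby

def parse_sents(text):
    sents_with_positions = []
    sents_words = []
    # Group consecutive lines by whether they are non-empty: each truthy group
    # is one sentence, each falsy group is a run of blank separator lines.
    for has_content, lines in groupby(text.split('\n'), key=lambda l: bool(l.strip())):
        if not has_content:
            continue
        sent = []
        for line in lines:
            # Strip the surrounding parentheses, then split on the first two commas
            # so commas inside the token (e.g. "3,000") are preserved.
            start, _, rest = line.strip()[1:-1].partition(',')
            end, _, token = rest.partition(',')
            sent.append((int(start), int(end), token.strip()))
        sents_with_positions.append(sent)
        sents_words.append(tuple(t[2] for t in sent))
    return sents_with_positions, sents_words

This keeps the str.partition parsing from the original code but removes the manual empty-line bookkeeping.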
This is, in my opinion, a little more readable and clear, but it may be a little less performant, and it assumes the input file is correctly formatted (e.g. empty lines are really empty, while your code works even if there is some random whitespace in the "empty" lines). It leverages regex groups; they do all the work of parsing the lines, and we just convert start and end to integers.
import re

line_regex = re.compile(r'^\((\d+), (\d+), (.+)\)$', re.MULTILINE)

sents_with_positions = []
sents_words = []
for section in _input.split('\n\n'):
    words_with_positions = [
        (int(start), int(end), text)
        for start, end, text in line_regex.findall(section)
    ]
    words = tuple(t[2] for t in words_with_positions)
    sents_with_positions.append(words_with_positions)
    sents_words.append(words)
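As a quick sanity check (my own addition, not part of the original answer), the greedy (.+) group also copes with the tricky tokens listed in the question:

import re

line_regex = re.compile(r'^\((\d+), (\d+), (.+)\)$', re.MULTILINE)
tricky = '(86, 87, ))\n(96, 97, ()\n(87, 88, ,)\n(29, 33, Café)\n(2, 3, 2)\n(47, 52, 3,000)\n(23, 29, (e.g.))'
for start, end, text in line_regex.findall(tricky):
    print(start, end, repr(text))
# The third group comes out as ')', '(', ',', 'Café', '2', '3,000' and '(e.g.)'.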