Unpacking tuple-like textfile

Given a textfile of lines of 3-tuples:

(0, 12, Tokenization)
(13, 15, is)
(16, 22, widely)
(23, 31, regarded)
(32, 34, as)
(35, 36, a)
(37, 43, solved)
(44, 51, problem)
(52, 55, due)
(56, 58, to)
(59, 62, the)
(63, 67, high)
(68, 76, accuracy)
(77, 81, that)
(82, 91, rulebased)
(92, 102, tokenizers)
(103, 110, achieve)
(110, 111, .)

(0, 3, But)
(4, 14, rule-based)
(15, 25, tokenizers)
(26, 29, are)
(30, 34, hard)
(35, 37, to)
(38, 46, maintain)
(47, 50, and)
(51, 56, their)
(57, 62, rules)
(63, 71, language)
(72, 80, specific)
(80, 81, .)

(0, 2, We)
(3, 7, show)
(8, 12, that)
(13, 17, high)
(18, 26, accuracy)
(27, 31, word)
(32, 35, and)
(36, 44, sentence)
(45, 57, segmentation)
(58, 61, can)
(62, 64, be)
(65, 73, achieved)
(74, 76, by)
(77, 82, using)
(83, 93, supervised)
(94, 102, sequence)
(103, 111, labeling)
(112, 114, on)
(115, 118, the)
(119, 128, character)
(129, 134, level)
(135, 143, combined)
(144, 148, with)
(149, 161, unsupervised)
(162, 169, feature)
(170, 178, learning)
(178, 179, .)

(0, 2, We)
(3, 12, evaluated)
(13, 16, our)
(17, 23, method)
(24, 26, on)
(27, 32, three)
(33, 42, languages)
(43, 46, and)
(47, 55, obtained)
(56, 61, error)
(62, 67, rates)
(68, 70, of)
(71, 75, 0.27)
(76, 77, ‰)
(78, 79, ()
(79, 86, English)
(86, 87, ))
(87, 88, ,)
(89, 93, 0.35)
(94, 95, ‰)
(96, 97, ()
(97, 102, Dutch)
(102, 103, ))
(104, 107, and)
(108, 112, 0.76)
(113, 114, ‰)
(115, 116, ()
(116, 123, Italian)
(123, 124, ))
(125, 128, for)
(129, 132, our)
(133, 137, best)
(138, 144, models)
(144, 145, .)

The goal is to achieve two different data types:

  • sents_with_positions: a list of lists of tuples, where each tuple corresponds to one line of the textfile
  • sents_words: a list of tuples of strings, made up of only the third element of each tuple from the textfile

E.g. From the input textfile:

sents_words = [
    ('Tokenization', 'is', 'widely', 'regarded', 'as', 'a', 'solved',
     'problem', 'due', 'to', 'the', 'high', 'accuracy', 'that', 'rulebased',
     'tokenizers', 'achieve', '.'),
    ('But', 'rule-based', 'tokenizers', 'are', 'hard', 'to', 'maintain', 'and',
     'their', 'rules', 'language', 'specific', '.'),
    ('We', 'show', 'that', 'high', 'accuracy', 'word', 'and', 'sentence',
     'segmentation', 'can', 'be', 'achieved', 'by', 'using', 'supervised',
     'sequence', 'labeling', 'on', 'the', 'character', 'level', 'combined',
     'with', 'unsupervised', 'feature', 'learning', '.')
]

sents_with_positions = [
    [(0, 12, 'Tokenization'), (13, 15, 'is'), (16, 22, 'widely'),
     (23, 31, 'regarded'), (32, 34, 'as'), (35, 36, 'a'), (37, 43, 'solved'),
     (44, 51, 'problem'), (52, 55, 'due'), (56, 58, 'to'), (59, 62, 'the'),
     (63, 67, 'high'), (68, 76, 'accuracy'), (77, 81, 'that'),
     (82, 91, 'rulebased'), (92, 102, 'tokenizers'), (103, 110, 'achieve'),
     (110, 111, '.')],
    [(0, 3, 'But'), (4, 14, 'rule-based'), (15, 25, 'tokenizers'),
     (26, 29, 'are'), (30, 34, 'hard'), (35, 37, 'to'), (38, 46, 'maintain'),
     (47, 50, 'and'), (51, 56, 'their'), (57, 62, 'rules'),
     (63, 71, 'language'), (72, 80, 'specific'), (80, 81, '.')],
    [(0, 2, 'We'), (3, 7, 'show'), (8, 12, 'that'), (13, 17, 'high'),
     (18, 26, 'accuracy'), (27, 31, 'word'), (32, 35, 'and'),
     (36, 44, 'sentence'), (45, 57, 'segmentation'), (58, 61, 'can'),
     (62, 64, 'be'), (65, 73, 'achieved'), (74, 76, 'by'), (77, 82, 'using'),
     (83, 93, 'supervised'), (94, 102, 'sequence'), (103, 111, 'labeling'),
     (112, 114, 'on'), (115, 118, 'the'), (119, 128, 'character'),
     (129, 134, 'level'), (135, 143, 'combined'), (144, 148, 'with'),
     (149, 161, 'unsupervised'), (162, 169, 'feature'), (170, 178, 'learning'),
     (178, 179, '.')]
]

I have been doing it by:

  • iterating through each line of the textfile, processing the tuple, and appending it to a list to build sents_with_positions
  • and, while appending each processed sentence to sents_with_positions, appending the last elements of the tuples of each sentence to sents_words

Code:

sents_with_positions = []
sents_words = []
_sent = []
for line in _input.split('\n'):
    if len(line.strip()) > 0:
        line = line[1:-1]  # strip the surrounding parentheses
        start, _, rest = line.partition(',')  # renamed to avoid shadowing the built-in next()
        end, _, rest = rest.partition(',')
        text = rest.strip()
        _sent.append((int(start), int(end), text))
    else:
        # blank line: flush the accumulated sentence
        sents_with_positions.append(_sent)
        sents_words.append(list(zip(*_sent))[2])
        _sent = []

But is there a simpler or cleaner way to achieve the same output? Maybe through regexes? Or some itertools trick?

Note that there are tricky tuples in some lines of the textfile, e.g.

  • (86, 87, )) # Sometimes the token/word is a bracket
  • (96, 97, ()
  • (87, 88, ,) # Sometimes the token/word is a comma
  • (29, 33, Café) # The token/word may contain non-ASCII (sometimes accented) characters, so [a-zA-Z] might be insufficient
  • (2, 3, 2) # Sometimes the token/word is a number
  • (47, 52, 3,000) # Sometimes the token/word is a number/word with comma
  • (23, 29, (e.g.)) # Sometimes the token/word contains brackets.
asked Dec 22 '15 by alvas


1 Answer

This is, in my opinion, a little more readable and clearer, though it may be slightly less performant, and it assumes the input file is correctly formatted (e.g. that empty lines are really empty, while your code works even if there is some stray whitespace on the "empty" lines). It leverages regex groups: they do all the work of parsing the lines, and we just convert start and end to integers.

import re

line_regex = re.compile(r'^\((\d+), (\d+), (.+)\)$', re.MULTILINE)
sents_with_positions = []
sents_words = []

# sentences are separated by blank lines, i.e. '\n\n'
for section in _input.split('\n\n'):
    words_with_positions = [
        (int(start), int(end), text)
        for start, end, text in line_regex.findall(section)
    ]
    words = tuple(t[2] for t in words_with_positions)
    sents_with_positions.append(words_with_positions)
    sents_words.append(words)
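As a quick sanity check (this snippet is illustrative, using a hand-built sample rather than the question's full input), the pattern also copes with the tricky tokens listed in the question, because `(.+)` is greedy and anchored to the closing parenthesis at the end of the line:

```python
import re

line_regex = re.compile(r'^\((\d+), (\d+), (.+)\)$', re.MULTILINE)

# Hand-picked tricky lines from the question: a bracket token,
# a comma token, and a number containing a comma.
sample = (
    "(0, 12, Tokenization)\n"
    "(86, 87, ))\n"
    "(87, 88, ,)\n"
    "(47, 52, 3,000)\n"
)

print(line_regex.findall(sample))
# [('0', '12', 'Tokenization'), ('86', '87', ')'), ('87', '88', ','), ('47', '52', '3,000')]
```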
answered Oct 05 '22 by LeartS
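
As for the itertools trick the question mentions, here is one hedged sketch (the function name parse_sents is made up for illustration) that uses itertools.groupby to split the lines into sentences on blank lines, and that also flushes the last sentence even when the file does not end with a blank line:

```python
from itertools import groupby

def parse_sents(lines):
    """Group token lines into sentences, splitting on blank lines."""
    sents_with_positions = []
    sents_words = []
    # groupby yields alternating runs of non-blank and blank lines;
    # each non-blank run is one sentence.
    for non_blank, group in groupby(lines, key=lambda l: bool(l.strip())):
        if not non_blank:
            continue
        sent = []
        for line in group:
            # '(start, end, text)' -> split on the first two commas only,
            # so tokens like '3,000' or ',' survive intact
            start, _, rest = line.strip()[1:-1].partition(',')
            end, _, text = rest.partition(',')
            sent.append((int(start), int(end), text.strip()))
        sents_with_positions.append(sent)
        sents_words.append(tuple(t[2] for t in sent))
    return sents_with_positions, sents_words
```

Called as parse_sents(_input.split('\n')), it returns the same two structures as above.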