Unpacking tuple-like textfile

Given a textfile of lines of 3-tuples:

(0, 12, Tokenization)
(13, 15, is)
(16, 22, widely)
(23, 31, regarded)
(32, 34, as)
(35, 36, a)
(37, 43, solved)
(44, 51, problem)
(52, 55, due)
(56, 58, to)
(59, 62, the)
(63, 67, high)
(68, 76, accuracy)
(77, 81, that)
(82, 91, rulebased)
(92, 102, tokenizers)
(103, 110, achieve)
(110, 111, .)

(0, 3, But)
(4, 14, rule-based)
(15, 25, tokenizers)
(26, 29, are)
(30, 34, hard)
(35, 37, to)
(38, 46, maintain)
(47, 50, and)
(51, 56, their)
(57, 62, rules)
(63, 71, language)
(72, 80, specific)
(80, 81, .)

(0, 2, We)
(3, 7, show)
(8, 12, that)
(13, 17, high)
(18, 26, accuracy)
(27, 31, word)
(32, 35, and)
(36, 44, sentence)
(45, 57, segmentation)
(58, 61, can)
(62, 64, be)
(65, 73, achieved)
(74, 76, by)
(77, 82, using)
(83, 93, supervised)
(94, 102, sequence)
(103, 111, labeling)
(112, 114, on)
(115, 118, the)
(119, 128, character)
(129, 134, level)
(135, 143, combined)
(144, 148, with)
(149, 161, unsupervised)
(162, 169, feature)
(170, 178, learning)
(178, 179, .)

(0, 2, We)
(3, 12, evaluated)
(13, 16, our)
(17, 23, method)
(24, 26, on)
(27, 32, three)
(33, 42, languages)
(43, 46, and)
(47, 55, obtained)
(56, 61, error)
(62, 67, rates)
(68, 70, of)
(71, 75, 0.27)
(76, 77, ‰)
(78, 79, ()
(79, 86, English)
(86, 87, ))
(87, 88, ,)
(89, 93, 0.35)
(94, 95, ‰)
(96, 97, ()
(97, 102, Dutch)
(102, 103, ))
(104, 107, and)
(108, 112, 0.76)
(113, 114, ‰)
(115, 116, ()
(116, 123, Italian)
(123, 124, ))
(125, 128, for)
(129, 132, our)
(133, 137, best)
(138, 144, models)
(144, 145, .)

The goal is to achieve two different data types:

  • sents_with_positions: a list of lists of tuples, where each tuple corresponds to one line of the textfile
  • sents_words: a list of tuples of strings, made up of only the third element of each tuple from the textfile

E.g. From the input textfile:

sents_words = [
    ('Tokenization', 'is', 'widely', 'regarded', 'as', 'a', 'solved',
     'problem', 'due', 'to', 'the', 'high', 'accuracy', 'that', 'rulebased',
     'tokenizers', 'achieve', '.'),
    ('But', 'rule-based', 'tokenizers', 'are', 'hard', 'to', 'maintain', 'and',
     'their', 'rules', 'language', 'specific', '.'),
    ('We', 'show', 'that', 'high', 'accuracy', 'word', 'and', 'sentence',
     'segmentation', 'can', 'be', 'achieved', 'by', 'using', 'supervised',
     'sequence', 'labeling', 'on', 'the', 'character', 'level', 'combined',
     'with', 'unsupervised', 'feature', 'learning', '.')
]

sents_with_positions = [
    [(0, 12, 'Tokenization'), (13, 15, 'is'), (16, 22, 'widely'),
     (23, 31, 'regarded'), (32, 34, 'as'), (35, 36, 'a'), (37, 43, 'solved'),
     (44, 51, 'problem'), (52, 55, 'due'), (56, 58, 'to'), (59, 62, 'the'),
     (63, 67, 'high'), (68, 76, 'accuracy'), (77, 81, 'that'),
     (82, 91, 'rulebased'), (92, 102, 'tokenizers'), (103, 110, 'achieve'),
     (110, 111, '.')],
    [(0, 3, 'But'), (4, 14, 'rule-based'), (15, 25, 'tokenizers'),
     (26, 29, 'are'), (30, 34, 'hard'), (35, 37, 'to'), (38, 46, 'maintain'),
     (47, 50, 'and'), (51, 56, 'their'), (57, 62, 'rules'),
     (63, 71, 'language'), (72, 80, 'specific'), (80, 81, '.')],
    [(0, 2, 'We'), (3, 7, 'show'), (8, 12, 'that'), (13, 17, 'high'),
     (18, 26, 'accuracy'), (27, 31, 'word'), (32, 35, 'and'),
     (36, 44, 'sentence'), (45, 57, 'segmentation'), (58, 61, 'can'),
     (62, 64, 'be'), (65, 73, 'achieved'), (74, 76, 'by'), (77, 82, 'using'),
     (83, 93, 'supervised'), (94, 102, 'sequence'), (103, 111, 'labeling'),
     (112, 114, 'on'), (115, 118, 'the'), (119, 128, 'character'),
     (129, 134, 'level'), (135, 143, 'combined'), (144, 148, 'with'),
     (149, 161, 'unsupervised'), (162, 169, 'feature'), (170, 178, 'learning'),
     (178, 179, '.')]
]

I have been doing it by:

  • iterating through each line of the textfile, processing the tuple, and appending it to a list to build sents_with_positions
  • and, while appending each processed sentence to sents_with_positions, appending the last elements of the tuples of each sentence to sents_words

Code:

sents_with_positions = []
sents_words = []
_sent = []
for line in _input.split('\n'):
    if len(line.strip()) > 0:
        line = line[1:-1]  # strip the surrounding parentheses
        start, _, rest = line.partition(',')  # renamed to avoid shadowing the built-in next()
        end, _, rest = rest.partition(',')
        text = rest.strip()
        _sent.append((int(start), int(end), text))
    else:
        # blank line: flush the accumulated sentence
        sents_with_positions.append(_sent)
        sents_words.append(list(zip(*_sent))[2])
        _sent = []

But is there a simpler or cleaner way to achieve the same output? Maybe through regexes? Or some itertools trick?

Note that there are tricky tuples in some lines of the textfile, e.g.

  • (86, 87, )) # Sometimes the token/word is a bracket
  • (96, 97, ()
  • (87, 88, ,) # Sometimes the token/word is a comma
  • (29, 33, Café) # The token/word may contain non-ASCII (sometimes accented) characters, so [a-zA-Z] might be insufficient
  • (2, 3, 2) # Sometimes the token/word is a number
  • (47, 52, 3,000) # Sometimes the token/word is a number/word with comma
  • (23, 29, (e.g.)) # Sometimes the token/word contains brackets.
asked Dec 22 '15 by alvas


1 Answer

This is, in my opinion, a little more readable and clearer, though it may be slightly less performant, and it assumes the input file is correctly formatted (e.g. that empty lines are really empty, while your code works even if there is some stray whitespace on the "empty" lines). It leverages regex groups: they do all the work of parsing the lines, and we just convert start and end to integers.

import re

line_regex = re.compile(r'^\((\d+), (\d+), (.+)\)$', re.MULTILINE)
sents_with_positions = []
sents_words = []

# sentences are separated by blank lines, i.e. '\n\n'
for section in _input.split('\n\n'):
    words_with_positions = [
        (int(start), int(end), text)
        for start, end, text in line_regex.findall(section)
    ]
    words = tuple(t[2] for t in words_with_positions)
    sents_with_positions.append(words_with_positions)
    sents_words.append(words)
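As a quick sanity check (this snippet is illustrative, using a hand-built sample rather than the question's full input), the pattern also copes with the tricky tokens listed in the question, because `(.+)` is greedy and anchored to the closing parenthesis at the end of the line:

```python
import re

line_regex = re.compile(r'^\((\d+), (\d+), (.+)\)$', re.MULTILINE)

# Hand-picked tricky lines from the question: a bracket token,
# a comma token, and a number containing a comma.
sample = (
    "(0, 12, Tokenization)\n"
    "(86, 87, ))\n"
    "(87, 88, ,)\n"
    "(47, 52, 3,000)\n"
)

print(line_regex.findall(sample))
# [('0', '12', 'Tokenization'), ('86', '87', ')'), ('87', '88', ','), ('47', '52', '3,000')]
```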
answered Oct 05 '22 by LeartS
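
As for the itertools trick the question mentions, here is one hedged sketch (the function name parse_sents is made up for illustration) that uses itertools.groupby to split the lines into sentences on blank lines, and that also flushes the last sentence even when the file does not end with a blank line:

```python
from itertools import groupby

def parse_sents(lines):
    """Group token lines into sentences, splitting on blank lines."""
    sents_with_positions = []
    sents_words = []
    # groupby yields alternating runs of non-blank and blank lines;
    # each non-blank run is one sentence.
    for non_blank, group in groupby(lines, key=lambda l: bool(l.strip())):
        if not non_blank:
            continue
        sent = []
        for line in group:
            # '(start, end, text)' -> split on the first two commas only,
            # so tokens like '3,000' or ',' survive intact
            start, _, rest = line.strip()[1:-1].partition(',')
            end, _, text = rest.partition(',')
            sent.append((int(start), int(end), text.strip()))
        sents_with_positions.append(sent)
        sents_words.append(tuple(t[2] for t in sent))
    return sents_with_positions, sents_words
```

Called as parse_sents(_input.split('\n')), it returns the same two structures as above.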