
How to slice a string input at a certain unknown index

Tags: python

A string is given as input (e.g. "What is your name?"). The input always contains a question, which I want to extract. The problem is that the input always comes with unneeded text around the question.

So the input could be (but is not limited to) any of the following:

1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"
2- "What is your\nlastname and email?\ndasf?lkjas"
3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"

(Notice that in the third input, the question starts with the word "Given" and ends with "yourself?")

The input examples above were generated by the pytesseract OCR library, which scans an image and converts it into text.

I only want to extract the question from the garbage input and nothing else.

I tried to use the string method find('?', 1) to get the index of the end of the question (assuming for now that the first question mark is always the end of the question and not part of the input that I don't want). But I can't figure out how to get the index of the first letter of the question. I tried looping in reverse to find the first \n before the question mark, but the question doesn't always have a \n right before its first letter.

def extractQuestion(q):
    index_end_q = q.find('?', 1)
    index_first_letter_of_q = 0 # TODO
    # Slice from the first letter up to and including the question mark
    question = q[index_first_letter_of_q:index_end_q + 1]
    return question

LinkCoder asked Jul 06 '19 09:07



1 Answer

One way to find the index of the question's first word is to search for the first word that has an actual meaning (I suppose you're interested in English words). You can do that with pyenchant:

#!/usr/bin/env python

import enchant

GLOSSARY = enchant.Dict("en_US")

def isWord(word):
    # check() already returns a bool
    return GLOSSARY.check(word)

sentences = [
"eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
"What is your\nlastname and email?\ndasf?lkjas",
"\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]

for sentence in sentences:
    for i,w in enumerate(sentence.split()):
        if isWord(w):
            print('index: {} => {}'.format(i, w))
            break

The code above prints the position of the first recognized word in each input (a word index from split(), not a character index):

index: 3 => What
index: 0 => What
index: 0 => Given
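
To actually slice the string, the word position has to become a character offset. Below is a minimal sketch of one way to do that, not a definitive implementation. It assumes, as the question does, that the first "?" after the start ends the question. The names extract_question, is_known_word and KNOWN_WORDS are hypothetical, and the small word set is only a stand-in for the dictionary so that the snippet runs without pyenchant installed.

```python
import re

# Hypothetical stand-in for the pyenchant dictionary, so the sketch runs on its own.
KNOWN_WORDS = {
    "what", "is", "your", "name", "lastname", "and", "email",
    "given", "skills", "how", "would", "you", "rate", "yourself",
}

def is_known_word(token):
    """Return True if the token, minus surrounding punctuation, is a known word."""
    return token.strip(".,?!").lower() in KNOWN_WORDS

def extract_question(text, is_word=is_known_word):
    """Return the text from the first recognized word to the first '?' after it.

    Returns None if no recognized word or no question mark is found.
    """
    # finditer keeps character offsets, which split() throws away.
    for match in re.finditer(r"\S+", text):
        if is_word(match.group()):
            start = match.start()
            end = text.find("?", start)
            if end == -1:
                return None
            # Collapse the OCR line breaks inside the question into single spaces.
            return " ".join(text[start:end + 1].split())
    return None

print(extract_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"))
```

Passing a pyenchant-based check instead, e.g. extract_question(text, is_word=isWord), should work along the same lines, although punctuation attached to a token may need to be stripped before the dictionary lookup.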
game0ver answered Oct 21 '22 23:10
