 

How to develop a program to minimize errors in human transcription of hand-written surveys

I need to develop custom software to process surveys. Questions are mostly multiple choice, with free text in a very few cases.

I was asked to design a subsystem that checks for errors in the manual data entry of the multiple-choice answers. We're trying to speed up the data entry process and to minimize differences between the digital forms and the original questionnaires. The surveys are filled in with handwritten marks and text by human interviewers, so there may be hard-to-read marks, and the data entry user could also accidentally select a different value for some question; we would like to avoid both.

The software must include some automatic control to detect likely transcription differences. Each answer of a multiple-choice question has the same probability of being selected.

This question has two parts:

  • The GUI.

The simplest thing I have in mind is to make the question display as usable as possible: large, readable fonts and generous spacing between the choices. Is there anything else? For faster input, I would like to use drop-down lists (favoring the keyboard over the mouse). Since the questions are grouped in sections, I would also like to show the answers already selected for the questions of that section, but this could slow down the process. Any other ideas?

  • The error checking subsystem.

What else can I do to minimize or detect human typos in the multiple-choice questions? Is this a solvable problem? Is there a statistical methodology to check that the values entered by the users match the hand-filled forms? For example, suppose the survey has 5 questions, each with 4 options, and I have n survey forms filled in on paper by interviewers, ready to be entered into the software. How can I minimize the accidental differences introduced by manually transcribing the n surveys, without having to double-check all 5 questions on every one of the n forms?

My first suggestion is that at the end of processing all the hand-filled forms, the software could choose some forms at random for a double check of the responses. But on what criteria should I make this selection? Would this validation be enough to cover everything in a statistically significant way?

The actual survey is at the national level and has 56 pages with over 200 questions in total, so there will be a lot of handwritten pages from many people, and the intention is to reduce the likelihood of errors and to optimize the speed of the data entry process. The surveys must be filled in on paper first, given the complications of sending laptops or handhelds out with the interviewers.

Alex. S., asked Jun 04 '10



3 Answers

Call me old-school, but I still think the most pragmatic way to do this is to use double entry. Two data entry clerks enter their surveys, then swap stacks and enter the other clerk's surveys. Whenever your system detects a difference between the two, it throws up a flag - then the two clerks put their heads together and decide on the correct answer (or maybe it gets reviewed by a more senior research staff member, etc.). Combined with some of the other suggestions here (I like mdma's suggestions for the GUI a lot), this would make for a low-error system.
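A minimal sketch of the comparison step described above, assuming each pass of entries is keyed by form ID and question ID (all names here are illustrative, not from any real system):

```python
# Hypothetical sketch: compare two independent entry passes of the same
# batch of surveys and flag every question where the clerks disagree.

def flag_discrepancies(entry_a, entry_b):
    """Return (form_id, question) pairs where the two passes differ.

    entry_a / entry_b map form_id -> {question: answer}.
    """
    flags = []
    for form_id, answers_a in entry_a.items():
        answers_b = entry_b.get(form_id, {})
        for question, value_a in answers_a.items():
            if answers_b.get(question) != value_a:
                flags.append((form_id, question))
    return flags

clerk1 = {"form-001": {"q1": "A", "q2": "C", "q3": "B"}}
clerk2 = {"form-001": {"q1": "A", "q2": "D", "q3": "B"}}
print(flag_discrepancies(clerk1, clerk2))  # [('form-001', 'q2')]
```

Each flagged pair then goes back to the clerks (or a senior reviewer) for adjudication against the paper form.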

Yes, this will (maybe) double your data entry time - but it's dead simple and will cut your errors way, way down. The OMR idea is a great one, but it doesn't sound to me like this project (a national, 56-page survey) is the best case for a lone hacker to try to implement it for the first time. What software do you need? What hardware is available to do that? There will still be a lot of human work involved in identifying the goofy stuff where an interviewer marks all four possible answers and then writes a note off to the side - you'll likely want to randomly sample surveys to get a sense of what the machine-read error rate is. Even then you still have only an estimate of the error rate, not corrected data.

Try a simpler method to give your employer quality results this time - then use those results as a pre-validated data set for experimenting with the OMR stuff for next time.

Matt Parker, answered Oct 22 '22


OCR/OMR is probably the best choice, since you rule out unpredictable human error and replace it with fairly predictable machine error. It may even be possible to filter out forms that the OCR may struggle with and have these amended to improve scan accuracy.

But, tackling the original question head on:

Error Checking

  • have questions correlated, so that essentially the same thing is asked more than once, or asked again in the negative. If the answers to correlated questions do not themselves correlate, this may indicate an input error.
  • watch for deviations from the norm: if there are patterns in the typical responses, then deviations from those typical responses could be considered potential input errors. E.g. if questions 2 and 3 are answered A, then question 4 is likely to be C or D. This is a generalization of the correlation above. The correlations can be computed dynamically from the data already entered.
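One way to make the "deviations from the norm" idea concrete is a sketch like the following, where the question IDs (`q2`, `q4`) and the 5% rarity threshold are purely illustrative assumptions:

```python
# Illustrative sketch: learn how often each answer to a "check" question
# follows each answer to a "predictor" question, then flag entries whose
# combination is rare relative to what has already been typed in.
from collections import Counter, defaultdict

def build_pair_counts(records, predictor, check):
    """Count check-question answers, grouped by the predictor-question answer."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[predictor]][rec[check]] += 1
    return counts

def is_suspicious(rec, counts, predictor, check, threshold=0.05):
    """Flag a record whose predictor/check combination is rare so far."""
    seen = counts[rec[predictor]]
    total = sum(seen.values())
    if total == 0:
        return False  # no history yet for this predictor answer
    return seen[rec[check]] / total < threshold

# Example: in 100 prior entries, q2 == "A" pairs with q4 in {"C", "D"} 99 times,
# so a new entry pairing "A" with "B" stands out for review.
history = ([{"q2": "A", "q4": "C"}] * 60
           + [{"q2": "A", "q4": "D"}] * 39
           + [{"q2": "A", "q4": "B"}])
counts = build_pair_counts(history, "q2", "q4")
print(is_suspicious({"q2": "A", "q4": "B"}, counts, "q2", "q4"))  # True: 1/100 < 5%
```

A flag here only means "worth a second look at the paper form", not that the entry is wrong, since genuine respondents can also give rare combinations.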

GUI

  • have the GUI mimic the paper form, so that what the entry clerks see on paper is reflected on the screen. Entering a paper response against the wrong question in the GUI is then less likely.
  • provide visual assistance to data entry clerks, such as using a slider to maintain the current question location on paper.
  • A custom entry device for inputting the data may be easier to use than keyboard navigation and listboxes. For example, a touch display with all options spelled out A B C D. The clerk only has to hit an option, and it is selected and the next question shown - after a brief pause. In the event the clerk makes an error, they can use the prev/next buttons next to each question.
  • provide audio feedback of entered data, so when the clerk enters "A" they hear "A".

EDIT: If you are considering dual entry of the data or an improved GUI, it may be worth running a pilot scheme to assess the effectiveness of the various approaches. Dual entry can be expensive (it roughly doubles the cost of the data entry task), which may or may not be justified by the improvement in accuracy. A pilot scheme will let you assess the effectiveness of dual entry quickly and relatively inexpensively. It will also give you an idea of the error level of a single data entry clerk without any UI changes, which can help determine whether UI changes or other error-reducing strategies are needed and how much cost is justified in implementing them.
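To put numbers on such a pilot, one could estimate the per-field error rate from the verified batch and attach a rough confidence interval. This is a sketch using the normal approximation; the figure of 12 errors in 1,000 verified fields is made up for illustration:

```python
# Hedged sketch: point estimate and ~95% normal-approximation confidence
# interval for a clerk's per-field error rate, from a verified pilot batch.
import math

def error_rate_ci(errors, fields_checked, z=1.96):
    """Return (rate, low, high), clamped to [0, 1]."""
    p = errors / fields_checked
    half = z * math.sqrt(p * (1 - p) / fields_checked)
    return p, max(0.0, p - half), min(1.0, p + half)

# e.g. 12 errors found among 1,000 verified fields in the pilot
rate, low, high = error_rate_ci(12, 1000)
print(f"{rate:.3f} (95% CI {low:.3f}..{high:.3f})")
```

Comparing this interval for single entry against the (near-zero) residual rate of dual entry gives a concrete basis for the cost/accuracy trade-off; for very small counts an exact or Wilson interval would be more appropriate than the normal approximation.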

Related links

  • A device that inputs data from multiple choice tests
  • Wikipedia: OMR - Optical Mark Recognition
  • ReadSoft - Automated Data Entry
  • Data capture hardware
mdma, answered Oct 22 '22


"My first suggestion is that at the end of the processing of all the hand filled forms, the software could choose some forms randomly to make a double check of the responses in a few instances"

I don't think this will actually produce a meaningful outcome. Presumably the errors are unintentional and random. Random checks would catch systematic problems, but you'll only find 10% of the random errors if you double-check 10% of the forms (and 20% of the errors if you check 20% of the forms, etc.).
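A quick simulation backs up that point. The numbers here are illustrative (1,000 forms, 50 with an error, a 10% random check):

```python
# Small simulation: if errors land on random forms, double-checking a
# fraction f of the forms catches, on average, about a fraction f of the errors.
import random

def found_fraction(n_forms, n_errors, check_fraction, trials=2000, seed=1):
    """Average fraction of erroneous forms caught by a random spot check."""
    rng = random.Random(seed)
    found = 0
    for _ in range(trials):
        error_forms = rng.sample(range(n_forms), n_errors)
        checked = set(rng.sample(range(n_forms), int(n_forms * check_fraction)))
        found += sum(1 for f in error_forms if f in checked)
    return found / (trials * n_errors)

# Checking 10% of forms catches close to 10% of the errors.
print(round(found_fraction(1000, 50, 0.10), 2))
```

That is why targeted checks (OCR disagreement, correlation flags) beat uniform random sampling: they concentrate the double-checking effort on forms that are more likely to contain an error.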

What do the paper surveys look like? If possible, I would guess that an OCR system which scans the hand-written forms and compares what the OCR detects the answer to be with what the data entry operator entered would be a better solution. You might still end up manually double-checking a fair number of surveys, but you'll have some confidence that the surveys you double-check are more likely to contain an error than if you just picked them out at random.

If you also control what the paper surveys look like, then that's even better: you can design them specifically so that OCR can be made as accurate as possible.

Dean Harding, answered Oct 22 '22