I need to develop custom software for surveys. Questions are multiple choice or, in a very few cases, free text.
I was asked to design a subsystem that checks for errors in the manual data entry of the multiple-choice answers. We're trying to speed up the data entry process and minimize differences between the digital forms and the original questionnaires. The surveys are filled in with handwritten marks and text by human interviewers, so some marks may be hard to read, and the operator could accidentally select the wrong value for a question; we would like to avoid that.
The software must include some automatic control to detect possible transcription differences. Each answer of a multiple-choice question has the same probability of being selected.
This question has two parts:
The simplest thing I have in mind is to make the question display as usable as possible: large, readable fonts and generous spacing between the choices. Is there anything else? For faster input, I would like to use drop-down lists (favoring the keyboard over the mouse). Since the questions are grouped in sections, I would also like to show the answers already selected for the questions in that section, but this could slow down the process. Any other ideas?
What else can I do to minimize or catch human typos in the multiple-choice questions? Is this a solvable problem? Is there some statistical methodology to check that the values entered by the users match the hand-filled forms? For example, suppose the survey has 5 questions, each with 4 options, and I have n survey forms filled in on paper by interviewers, ready to be entered into the software. How can I minimize the accidental differences introduced by the manual transcription of the n surveys, without having to double-check all 5 questions of every one of the n surveys?
My first idea is that, after all the hand-filled forms have been processed, the software could randomly choose some forms for a double check of the responses. But on what criteria should I base this selection? Would this validation be enough to cover everything in a statistically significant way?
The actual survey is national and has 56 pages with over 200 questions in total, so there will be many handwritten pages produced by many people, and the intention is to reduce the likelihood of errors and to speed up the data entry process. The surveys must be filled in on paper first, given the complications of sending laptops or handhelds out with the interviewers.
Mistakes can happen - entering correct data in the wrong column or field, or entering the same data multiple times - and they're more likely than you think. Entering additional, unnecessary information is also possible, and the software may be unable to sort that out.
The error rate is defined as the number of errors divided by the total number of data items. In practice, an estimate of this rate can be obtained by counting the number of errors and dividing by the total number of verified data items.
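As a quick sketch of the estimate described above (the counts here are hypothetical, just to show the arithmetic):

```python
def estimated_error_rate(errors_found: int, fields_verified: int) -> float:
    """Estimate the data-entry error rate from a verified sample."""
    if fields_verified == 0:
        raise ValueError("need at least one verified field")
    return errors_found / fields_verified

# Hypothetical example: 12 errors found while re-checking 4,000 fields
rate = estimated_error_rate(12, 4000)
print(f"estimated error rate: {rate:.4%}")  # → 0.3000%
```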
Call me old-school, but I still think the most pragmatic way to do this is to use double entry. Two data entry clerks enter their surveys, then swap stacks and enter the other clerk's surveys. Whenever your system detects a difference between the two, it throws up a flag - then the two clerks put their heads together and decide on the correct answer (or maybe it gets reviewed by a more senior research staff member, etc.). Combined with some of the other suggestions here (I like mdma's suggestions for the GUI a lot), this would make for a low-error system.
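The flagging step described above could be sketched like this (the field names and answer values are purely illustrative):

```python
def find_discrepancies(entry_a: dict, entry_b: dict) -> list:
    """Return the fields where the two data-entry passes disagree."""
    flagged = []
    for field in entry_a:
        if entry_a[field] != entry_b.get(field):
            flagged.append((field, entry_a[field], entry_b.get(field)))
    return flagged

# Hypothetical double-keyed survey: clerk 2 disagrees on q2
clerk1 = {"q1": "B", "q2": "D", "q3": "A"}
clerk2 = {"q1": "B", "q2": "C", "q3": "A"}

for field, a, b in find_discrepancies(clerk1, clerk2):
    print(f"Review {field}: first pass={a}, second pass={b}")
```

Every flagged field goes back to the clerks (or a senior reviewer) to resolve against the paper form.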
Yes, this will (maybe) double your data entry time - but it's dead simple and will cut your errors way, way down. The OMR idea is a great one, but it doesn't sound to me like this project (a national, 56-page survey) is the best case for a lone hacker to try to implement that for the first time. What software do you need? What hardware is available to do that? There will still be a lot of human work involved in identifying the goofy stuff where an interviewer marks all four possible answers and then writes a note off to the side - you'll likely want to randomly sample surveys to get a sense of what the machine-read error rate is. Even then you still just have an estimate of the error rate, not corrected data.
Try a simpler method to give your employer quality results this time - then use those results as a pre-validated data set for experimenting with the OMR stuff for next time.
OCR/OMR is probably the best choice, since you rule out unpredictable human error and replace it with fairly predictable machine error. It may even be possible to filter out forms that the OCR may struggle with and have these amended to improve scan accuracy.
But, tackling the original question head on:
Error Checking
GUI
EDIT: If you consider performing dual entry of data or implementing an improved GUI, it may be worth conducting a pilot scheme to assess the effectiveness of the various approaches. Dual entry can be expensive (doubling the cost of the data entry task), which may or may not be justified by the improvement in accuracy. A pilot scheme will allow you to assess the effectiveness of dual entry quickly and relatively inexpensively. It will also give you an idea of the level of error from a single data entry clerk without any UI changes, which can help determine whether UI changes or other error-reducing strategies are needed and how much cost can be justified in implementing them.
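When comparing pilot results, the observed error rate from a small sample carries real uncertainty. One rough way to quantify it is a normal-approximation confidence interval (the counts below are hypothetical, and this assumes errors are independent):

```python
import math

def error_rate_ci(errors: int, n: int, z: float = 1.96) -> tuple:
    """Approximate 95% confidence interval for an error rate
    using the normal approximation to the binomial."""
    p = errors / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), p + half

# Hypothetical pilot: 18 errors found in 2,000 single-entry fields
lo, hi = error_rate_ci(18, 2000)
print(f"pilot error rate 95% CI: {lo:.3%} .. {hi:.3%}")
```

If the intervals for single entry and dual entry don't overlap, the pilot gives a reasonably clear answer about whether the extra cost buys real accuracy.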
My first suggestion is that at the end of the processing of all the hand filled forms, the software could choose some forms randomly to make a double check of the responses in a few instances
I don't think this will actually produce a meaningful outcome. Presumably the errors are unintentional and random. Random checks would find systemic errors, but you'll only find 10% of random errors if you double-check 10% of the forms (and 20% of errors if you check 20% of forms, etc).
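A quick simulation illustrates the point (the error rate and check fraction are assumed, not measured):

```python
import random

random.seed(42)

NUM_FORMS = 10_000
ERROR_RATE = 0.02       # assumed fraction of forms containing a random error
CHECK_FRACTION = 0.10   # fraction of forms chosen for double-checking

# Each form independently either contains an error or not.
has_error = [random.random() < ERROR_RATE for _ in range(NUM_FORMS)]

# Randomly pick 10% of the forms to re-check.
checked = random.sample(range(NUM_FORMS), int(NUM_FORMS * CHECK_FRACTION))
errors_caught = sum(has_error[i] for i in checked)
total_errors = sum(has_error)

print(f"caught {errors_caught} of {total_errors} errors "
      f"({errors_caught / total_errors:.0%})")
```

Checking 10% of forms catches roughly 10% of the random errors, no matter how the sample is drawn, which is why random spot checks estimate the error rate but don't fix it.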
What do the paper surveys look like? If possible, I would guess that an OCR system which scans the hand-written tests and compares what the OCR detects the answer to be with what the data entry operator gave would be a better solution. You might still end up manually double-checking a fair number of surveys but you'll have some confidence that the surveys you double-check are more likely to contain an error than if you just picked them out at random.
If you also control what the paper surveys look like, then that's even better: you can design them specifically so that OCR can be made as accurate as possible.