I need to develop custom software for surveys. Questions are multiple choice or, in a very few cases, free text.
I was asked to design a subsystem that checks for errors in the manual data entry of the multiple-choice answers. We're trying to speed up the data entry process and minimize differences between the digital forms and the original questionnaires. The surveys are filled in with handwritten marks and text by human interviewers, so some marks may be hard to read, and the operator could accidentally select the wrong value for a question; we would like to avoid that.
The software must include some automatic control to detect possible transcription differences. Each answer of a multiple-choice question has the same probability of being selected.
This question has two parts:
The simplest thing I have in mind is to make the question display as usable as possible: large, readable fonts and generous spacing between the choices. Is there anything else? For faster input, I would like to use drop-down lists (favoring the keyboard over the mouse). Since the questions are grouped in sections, I would also like to show the answers already selected for the questions in that section, but this could slow down the process. Any other ideas?
What else can I do to minimize or catch human typos in the multiple-choice questions? Is this a solvable problem? Is there some statistical methodology to check that the values entered by the users match the hand-filled forms? For example, suppose the survey has 5 questions, each with 4 options, and I have n survey forms filled in on paper by interviewers, ready to be entered into the software. How can I minimize the accidental differences introduced by the manual transcription of the n surveys, without having to double-check all 5 questions of every one of the n surveys?
My first idea is that, after all the hand-filled forms have been processed, the software could randomly choose some forms for a double check of the responses. But on what criteria should I base this selection? Would this validation be enough to cover everything in a statistically significant way?
The actual survey is national and has 56 pages with over 200 questions in total, so there will be many handwritten pages produced by many people, and the intention is to reduce the likelihood of errors and to speed up the data entry process. The surveys must be filled in on paper first, given the complications of sending laptops or handhelds out with the interviewers.
Mistakes can happen - entering correct data in the wrong column or field, or entering the same data multiple times - and they're more likely than you think. Entering additional, unnecessary information is also possible, and the software may be unable to sort that out.
The error rate is defined as the number of errors divided by the total number of data items. In practice, an estimate of this rate can be obtained by counting the number of errors and dividing by the total number of verified data items.
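As a quick sketch of the estimate described above (the counts here are hypothetical, just to show the arithmetic):

```python
def estimated_error_rate(errors_found: int, fields_verified: int) -> float:
    """Estimate the data-entry error rate from a verified sample."""
    if fields_verified == 0:
        raise ValueError("need at least one verified field")
    return errors_found / fields_verified

# Hypothetical example: 12 errors found while re-checking 4,000 fields
rate = estimated_error_rate(12, 4000)
print(f"estimated error rate: {rate:.4%}")  # → 0.3000%
```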
Call me old-school, but I still think the most pragmatic way to do this is to use double entry. Two data entry clerks enter their surveys, then swap stacks and enter the other clerk's surveys. Whenever your system detects a difference between the two, it throws up a flag - then the two clerks put their heads together and decide on the correct answer (or maybe it gets reviewed by a more senior research staff member, etc.). Combined with some of the other suggestions here (I like mdma's suggestions for the GUI a lot), this would make for a low-error system.
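The flagging step described above could be sketched like this (the field names and answer values are purely illustrative):

```python
def find_discrepancies(entry_a: dict, entry_b: dict) -> list:
    """Return the fields where the two data-entry passes disagree."""
    flagged = []
    for field in entry_a:
        if entry_a[field] != entry_b.get(field):
            flagged.append((field, entry_a[field], entry_b.get(field)))
    return flagged

# Hypothetical double-keyed survey: clerk 2 disagrees on q2
clerk1 = {"q1": "B", "q2": "D", "q3": "A"}
clerk2 = {"q1": "B", "q2": "C", "q3": "A"}

for field, a, b in find_discrepancies(clerk1, clerk2):
    print(f"Review {field}: first pass={a}, second pass={b}")
```

Every flagged field goes back to the clerks (or a senior reviewer) to resolve against the paper form.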
Yes, this will (maybe) double your data entry time - but it's dead simple and will cut your errors way, way down. The OMR idea is a great one, but it doesn't sound to me like this project (a national, 56-page survey) is the best case for a lone hacker to try to implement that for the first time. What software do you need? What hardware is available to do that? There will still be a lot of human work involved in identifying the goofy stuff where an interviewer marks all four possible answers and then writes a note off to the side - you'll likely want to randomly sample surveys to get a sense of what the machine-read error rate is. Even then you still just have an estimate of the error rate, not corrected data.
Try a simpler method to give your employer quality results this time - then use those results as a pre-validated data set for experimenting with the OMR stuff for next time.
OCR/OMR is probably the best choice, since you rule out unpredictable human error and replace it with fairly predictable machine error. It may even be possible to filter out forms that the OCR may struggle with and have these amended to improve scan accuracy.
But, tackling the original question head on:
Error Checking
GUI
EDIT: If you consider performing dual entry of data or implementing an improved GUI, it may be worth conducting a pilot scheme to assess the effectiveness of the various approaches. Dual entry can be expensive (doubling the cost of the data entry task), which may or may not be justified by the improvement in accuracy. A pilot scheme will allow you to assess the effectiveness of dual entry quickly and relatively inexpensively. It will also give you an idea of the level of error from a single data entry clerk without any UI changes, which can help determine whether UI changes or other error-reducing strategies are needed and how much cost can be justified in implementing them.
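When comparing pilot results, the observed error rate from a small sample carries real uncertainty. One rough way to quantify it is a normal-approximation confidence interval (the counts below are hypothetical, and this assumes errors are independent):

```python
import math

def error_rate_ci(errors: int, n: int, z: float = 1.96) -> tuple:
    """Approximate 95% confidence interval for an error rate
    using the normal approximation to the binomial."""
    p = errors / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), p + half

# Hypothetical pilot: 18 errors found in 2,000 single-entry fields
lo, hi = error_rate_ci(18, 2000)
print(f"pilot error rate 95% CI: {lo:.3%} .. {hi:.3%}")
```

If the intervals for single entry and dual entry don't overlap, the pilot gives a reasonably clear answer about whether the extra cost buys real accuracy.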
My first suggestion is that at the end of the processing of all the hand filled forms, the software could choose some forms randomly to make a double check of the responses in a few instances
I don't think this will actually produce a meaningful outcome. Presumably the errors are unintentional and random. Random checks would find systemic errors, but you'll only find 10% of random errors if you double-check 10% of the forms (and 20% of errors if you check 20% of forms, etc).
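A quick simulation illustrates the point (the error rate and check fraction are assumed, not measured):

```python
import random

random.seed(42)

NUM_FORMS = 10_000
ERROR_RATE = 0.02       # assumed fraction of forms containing a random error
CHECK_FRACTION = 0.10   # fraction of forms chosen for double-checking

# Each form independently either contains an error or not.
has_error = [random.random() < ERROR_RATE for _ in range(NUM_FORMS)]

# Randomly pick 10% of the forms to re-check.
checked = random.sample(range(NUM_FORMS), int(NUM_FORMS * CHECK_FRACTION))
errors_caught = sum(has_error[i] for i in checked)
total_errors = sum(has_error)

print(f"caught {errors_caught} of {total_errors} errors "
      f"({errors_caught / total_errors:.0%})")
```

Checking 10% of forms catches roughly 10% of the random errors, no matter how the sample is drawn, which is why random spot checks estimate the error rate but don't fix it.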
What do the paper surveys look like? If possible, I would guess that an OCR system which scans the hand-written tests and compares what the OCR detects the answer to be with what the data entry operator gave would be a better solution. You might still end up manually double-checking a fair number of surveys but you'll have some confidence that the surveys you double-check are more likely to contain an error than if you just picked them out at random.
If you also control what the paper surveys look like, then that's even better: you can design them specifically so that OCR can be made as accurate as possible.