
How to efficiently find small typos in source code files?

I would like to recursively search a large code base (mostly Python, HTML and JavaScript) for typos in comments, strings and also variable/method/class names. I strongly prefer something that runs in a terminal.

The problem is that spell checkers like aspell or scspell report almost nothing but false positives (e.g. programming terms, camel-cased names), while I mainly want help finding simple typos such as scrambled or missing letters, e.g. maintenane vs. maintenance, resticted vs. restricted, dpeloyment vs. deployment.

What I have tried so far (the ** glob needs globstar enabled in bash):

for f in **/*.py ; do echo "$f" ; aspell list < "$f" | sort | uniq -c ; done

but it flags things like: assertEqual, MyTestCase, lifecycle

djangonaut asked Mar 17 '17 10:03




3 Answers

My own solution starts from the comments in Python files, but the typos it found there could then also be located in HTML and JS files. The result still needed manual sorting to remove false positives, but that took only a few minutes, and it identified about 150 typos in comments that could then also be found in the non-comment parts of the code.

Save this as an executable file, e.g. extractcomments:

#!/usr/bin/env python3
"""Print every comment in a Python source file, one per line."""
import argparse
import tokenize


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('filename')
    args = parser.parse_args()

    # tokenize.open() honors the file's encoding declaration (UTF-8 by default).
    with tokenize.open(args.filename) as sourcefile:
        for t in tokenize.generate_tokens(sourcefile.readline):
            if t.type == tokenize.COMMENT:
                # Drop the leading "#" and any surrounding whitespace.
                print(t.string.lstrip("#").strip())

Collect all comments for further processing:

for f in **/*.py ; do ~/extractcomments "$f" >> ~/comments.txt ; done

Run the collected comments through one or more aspell dictionaries (here English, then German), keep every word that none of them recognizes, and count the occurrences of each:

aspell <~/comments.txt --lang=en list|aspell --lang=de list | sort | uniq -c | sort -n > ~/typos.txt

Produces something like:

10 availabe
 8 assignement
 7 hardwird

Take the list without the leading counts and remove the false positives. Copy the result to a second file, correct.txt, and run aspell on that copy to pick the desired replacement for each typo: aspell -c correct.txt

Now paste the two files together to get one typo;correction pair per line: paste -d";" typos.txt correct.txt > known_typos.csv
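For anyone who prefers to do this pairing step in Python, here is a minimal sketch of the same idea. The function names are made up for illustration, and the sample corrections are typed in by hand rather than produced by aspell. It strips the counts that uniq -c adds, pairs each typo with its correction and renders the typo;correction lines.

```python
import csv
import io


def strip_counts(uniq_output):
    """Drop the leading count that `uniq -c` puts before each word."""
    words = []
    for line in uniq_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit():
            words.append(parts[1])
    return words


def pair_typos(typos, fixes):
    """Pair each typo with its correction, skipping words left unchanged."""
    if len(typos) != len(fixes):
        raise ValueError("typo and correction lists differ in length")
    return [(t, f) for t, f in zip(typos, fixes) if t != f]


def to_csv(pairs):
    """Render the pairs as `typo;correction` lines."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=";", lineterminator="\n").writerows(pairs)
    return buf.getvalue()


# Sample data: the counts come from the listing above, the fixes are hand-written.
sample = "     10 availabe\n      8 assignement\n      7 hardwird\n"
typos = strip_counts(sample)
fixes = ["available", "assignment", "hardwired"]
print(to_csv(pair_typos(typos, fixes)), end="")
```

The length check matters because paste silently pairs the wrong lines if one file has an entry deleted and the other does not.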

Now we want to recursively replace those in our codebase:

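#!/bin/bash
# Replace every known typo (whole words only) in tracked .py and .html files.
# Relies on GNU xargs (-r, --null) and GNU sed (-i, \b).

root_dir=$(git rev-parse --show-toplevel)

while IFS=";" read -r typo fix ; do
    # List the files containing the typo, NUL-separated, and fix them in place.
    git grep -l -z -w "${typo}" -- "*.py" "*.html" | xargs -r --null sed -i "s/\b${typo}\b/${fix}/g"
done < "$root_dir/known_typos.csv"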

My bash skills are poor, so there is certainly room for improvement.

Update: I could find more typos in method names by running this:

grep -r def --include \*.py . | cut -d ":" -f 2- |tr "_" " " | aspell --lang=en list | sort -u
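The tr "_" " " step only breaks up snake_case names, so camel-cased names such as assertEqual still reach aspell as a single word. A rough Python sketch of a splitter that handles both styles could look like this. The regular expression and the function names are my own assumptions, not part of any tool mentioned here. The resulting word list could be piped into aspell list in the same way.

```python
import re

# One "word" inside an identifier: an acronym run such as HTTP,
# or an optionally capitalized run of lower-case letters.
# Digits and underscores are treated as separators.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")

# A function definition at the start of a line, optionally async.
_DEF = re.compile(r"^\s*(?:async\s+)?def\s+(\w+)", re.MULTILINE)


def split_identifier(name):
    """Split a snake_case or camelCase identifier into lower-case words."""
    return [w.lower() for w in _WORD.findall(name)]


def words_in_defs(source):
    """Collect the distinct words used in function names of Python source."""
    words = set()
    for name in _DEF.findall(source):
        words.update(split_identifier(name))
    return sorted(words)


source = (
    "def setMaintenaneMode(self):\n"
    "    pass\n"
    "\n"
    "def run_dpeloyment():\n"
    "    pass\n"
)
# One word per line, ready to be fed to a spell checker.
print("\n".join(words_in_defs(source)))
```

Lower-casing the parts keeps the spell checker from treating capitalized fragments like Equal or Case as separate entries.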

Update 2: I managed to fix typos that sit inside underscored names or strings, where \b does not see a word boundary because "_" and digits count as word characters, e.g. i_am_a_typpo3:

#!/bin/bash

root_dir=$(git rev-parse --show-toplevel)
while IFS=";" read -r typo fix ; do
    echo "${typo}"
    # The lookarounds only require that no letter touches the typo; "_" and digits may.
    find "$root_dir" \( -name '*.py' -or -name '*.html' \) -print0 | xargs -0 perl -pi -e "s/(?<![a-zA-Z])${typo}(?![a-zA-Z])/${fix}/g"
done < "$root_dir/known_typos.csv"
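To see why the lookaround pattern catches what \b misses, here is a small Python sketch that compares the two on one line of text. It is only an illustration of the regex behavior, not a replacement for the script. The helper names are hypothetical, and unlike the shell version it escapes the typo before building the pattern.

```python
import re


def word_bounded(typo):
    """Pattern equivalent to the \\b...\\b form used with sed."""
    return re.compile(r"\b" + re.escape(typo) + r"\b")


def letter_bounded(typo):
    """Pattern equivalent to the lookaround form used with perl."""
    return re.compile(r"(?<![a-zA-Z])" + re.escape(typo) + r"(?![a-zA-Z])")


def fix_text(text, known_typos):
    """Apply every (typo, fix) pair using the letter-bounded pattern."""
    for typo, fix in known_typos:
        # A function replacement avoids backslash interpretation in `fix`.
        text = letter_bounded(typo).sub(lambda m, fix=fix: fix, text)
    return text


line = "def i_am_a_typpo3(): pass  # typpography"

# "_" and "3" are word characters, so \b finds no boundary around "typpo".
print(word_bounded("typpo").search(line))

# The lookarounds accept "_" and "3" as neighbors but reject the letter "g",
# so only the identifier is changed and "typpography" is left alone.
print(fix_text(line, [("typpo", "typo")]))
```

The trade-off is that a typo followed by a digit or underscore is now rewritten too, which is the intent here but is worth checking with git diff before committing.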
djangonaut answered Oct 12 '22 23:10


If you're using TypeScript, you could use the gulp plugin I created for spellchecking: https://www.npmjs.com/package/gulp-ts-spellcheck

srfrnk answered Oct 12 '22 23:10


If you are developing in JavaScript or TypeScript, you can use this spell-check plugin for ESLint:

https://www.npmjs.com/package/eslint-plugin-spellcheck

I found it to be very useful.

Another option is scspell:

https://github.com/myint/scspell

It is language-agnostic and claims to "usually catch many errors without an annoying false positive rate."

tedw answered Oct 12 '22 23:10