How to efficiently find small typos in source code files?

Question

I would like to recursively search a large code base (mostly Python, HTML and JavaScript) for typos in comments, strings and also variable/method/class names. I strongly prefer something that runs in a terminal.

The problem is that spell checkers like aspell or scspell report almost nothing but false positives (e.g. programming terms, camel-cased names), while I would be happy with something that primarily helps me find simple typos such as scrambled or missing letters, e.g. maintenane vs. maintenance, resticted vs. restricted, dpeloyment vs. deployment.

What I have tried so far:

for f in **/*.py ; do echo "$f" ; aspell list < "$f" | sort | uniq -c ; done

but it flags things like assertEqual, MyTestCase and lifecycle.

djangonaut · Accepted Answer

This is my own solution. It focuses on Python files, but in the end it also found typos in HTML and JS files. The results still needed manual sorting to remove false positives, but that took only a few minutes, and it identified about 150 typos in comments, which could then also be found in the non-comment parts.

Save this as an executable file, e.g. extractcomments:

#!/usr/bin/env python3
"""Print the text of every comment in a Python source file."""
import argparse
import tokenize


if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('filename')
    args = parser.parse_args()

    with open(args.filename, "r", encoding="utf-8") as sourcefile:
        # The tokenizer separates real comments from "#" characters inside strings.
        for t in tokenize.generate_tokens(sourcefile.readline):
            if t.type == tokenize.COMMENT:
                # Drop the leading "#" characters and surrounding whitespace.
                print(t.string.lstrip("#").strip())

Collect all comments for further processing:

for f in **/*.py ; do ~/extractcomments "$f" >> ~/comments.txt ; done
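
The script and the shell loop could also be folded into one Python helper that walks the tree itself. This is only a sketch of that idea: the function names `comments_in` and `collect_comments` are made up for illustration, and it assumes that files which cannot be decoded or tokenized should simply be skipped.

```python
import io
import tokenize
from pathlib import Path


def comments_in(source):
    """Yield the text of each comment in a string of Python source."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            yield tok.string.lstrip("#").strip()


def collect_comments(root):
    """Return the comments of every *.py file below root (hypothetical helper)."""
    collected = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            # Build the full list first so a failing file contributes nothing.
            file_comments = list(comments_in(path.read_text(encoding="utf-8")))
        except (tokenize.TokenError, SyntaxError, UnicodeDecodeError):
            continue  # assumption: unreadable files are skipped silently
        collected.extend(file_comments)
    return collected
```

Writing `"\n".join(collect_comments("."))` to a file would give roughly the same content as comments.txt.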

Run aspell over the collected comments with one or more dictionaries, collect every word it flags as a typo, and count the occurrences:

aspell --lang=en list < ~/comments.txt | aspell --lang=de list | sort | uniq -c | sort -rn > ~/typos.txt

Produces something like:

10 availabe
 8 assignement
 7 hardwird

Take the list without the leading numbers and clean out the false positives, so that typos.txt holds one typo per line. Copy it to a second file, correct.txt, and run aspell on that to choose the desired replacement for each typo: aspell -c correct.txt

Now paste the two files together to get lines of the form typo;correction: paste -d";" typos.txt correct.txt > known_typos.csv
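
paste pairs lines purely by position, so the two files must stay in step. A minimal sketch of the same pairing with a sanity check might look like this; `pair_typos` is a hypothetical name, and the sketch assumes one entry per line, tolerating a leading count left over from uniq -c.

```python
def pair_typos(typo_lines, fix_lines):
    """Pair each typo with its correction, like paste -d';' but with checks."""
    # Drop blank lines and any leading count left over from `uniq -c`.
    typos = [line.split()[-1] for line in typo_lines if line.strip()]
    fixes = [line.strip() for line in fix_lines if line.strip()]
    if len(typos) != len(fixes):
        raise ValueError(f"{len(typos)} typos but {len(fixes)} corrections")
    # Skip entries whose correction equals the typo (nothing to replace).
    return [f"{typo};{fix}" for typo, fix in zip(typos, fixes) if typo != fix]
```

The returned lines can be written to known_typos.csv as they are.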

Now we want to recursively replace those in our codebase:

#!/bin/bash

root_dir=$(git rev-parse --show-toplevel)

# For each typo;fix pair, rewrite every tracked file that contains the typo as a whole word.
while IFS=";" read -r typo fix ; do
    git grep -l -z -w "${typo}" -- "*.py" "*.html" | xargs -r --null sed -i "s/\b${typo}\b/${fix}/g"
done < "$root_dir/known_typos.csv"

My bash skills are poor, so there is certainly room for improvement.
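
For comparison, the whole-word replacement can be sketched in Python. The names `parse_pairs` and `fix_whole_words` are illustrative only. Unlike the sed call, this version escapes the typo with `re.escape`, so it is treated as literal text, not as a regular expression.

```python
import re


def parse_pairs(csv_lines):
    """Turn 'typo;fix' lines into (typo, fix) tuples, skipping malformed lines."""
    pairs = []
    for line in csv_lines:
        typo, sep, fix = line.strip().partition(";")
        if sep and typo and fix:
            pairs.append((typo, fix))
    return pairs


def fix_whole_words(text, pairs):
    """Replace each typo with its fix, but only where it stands as a whole word."""
    for typo, fix in pairs:
        pattern = re.compile(r"\b" + re.escape(typo) + r"\b")
        # A function replacement keeps any backslashes in `fix` literal.
        text = pattern.sub(lambda _match, fix=fix: fix, text)
    return text
```

Because the underscore counts as a word character, `\b` does not match inside names such as availabe_now, so those occurrences are left untouched.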

Update: I found more typos in method names by running this:

grep -r def --include \*.py . | cut -d ":" -f 2- |tr "_" " " | aspell --lang=en list | sort -u
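
The tr call splits snake_case names into words, but camel-cased names such as assertEqual still reach aspell as a single token. One possible refinement, which is not part of the pipeline above, is to split identifiers on case changes as well. The regular expression and the name `split_identifier` are assumptions made for this sketch.

```python
import re

# A run of capitals not followed by a lowercase letter (an acronym),
# or an optional capital followed by lowercase letters (an ordinary word).
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_identifier(name):
    """Split a snake_case or camelCase identifier into lowercase words."""
    return [word.lower() for word in _WORD.findall(name)]
```

Feeding the resulting words to aspell, one per line, should cut down on false positives of the MyTestCase kind, since each part is checked separately.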

Update 2: I managed to fix typos that sit inside underscored names or strings and therefore have no word boundaries as such, e.g. i_am_a_typpo3:

#!/bin/bash

root_dir=$(git rev-parse --show-toplevel)
while IFS=";" read -r typo fix ; do
    echo "${typo}"
    find "$root_dir" \( -name '*.py' -or -name '*.html' \) -print0 | xargs -0 perl -pi -e "s/(?<![a-zA-Z])${typo}(?![a-zA-Z])/${fix}/g"
done < "$root_dir/known_typos.csv"
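
The lookbehind/lookahead pair in the perl expression only requires that the typo is not directly surrounded by ASCII letters. The following sketch shows the same pattern with Python's re module, which supports the same lookaround syntax. The function name is hypothetical.

```python
import re


def replace_between_non_letters(text, typo, fix):
    """Replace typo wherever it is not directly adjacent to an ASCII letter."""
    pattern = re.compile(r"(?<![a-zA-Z])" + re.escape(typo) + r"(?![a-zA-Z])")
    # A function replacement keeps any backslashes in `fix` literal.
    return pattern.sub(lambda _match: fix, text)
```

With typo "typpo" and fix "typo", the name i_am_a_typpo3 becomes i_am_a_typo3, while a longer word such as typpos is left alone because a letter follows the match.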

srfrnk · Answer

If you're using TypeScript, you could use the gulp plugin I created for spellchecking: https://www.npmjs.com/package/gulp-ts-spellcheck

tedw · Answer

If you are developing in JavaScript or TypeScript, you can use this spell check plugin for ESLint:

https://www.npmjs.com/package/eslint-plugin-spellcheck

I found it to be very useful.

Another option is scspell:

https://github.com/myint/scspell

It is language-agnostic and claims to "usually catch many errors without an annoying false positive rate."

Tags: python, lint, spell-checking, aspell
