
How does Clang's "did you mean ...?" variable name correction algorithm work?

I am compiling C++ code with Clang (Apple clang version 12.0.5 (clang-1205.0.22.11)).

Clang can give tips in case you misspell a variable:

#include <iostream>

int main() {
    int my_int;
    std::cout << my_it << std::endl;
}

spellcheck-test.cpp:5:18: error: use of undeclared identifier 'my_it'; did you mean 'my_int'?
    std::cout << my_it << std::endl;
                 ^~~~~
                 my_int
spellcheck-test.cpp:4:9: note: 'my_int' declared here
    int my_int;
        ^
1 error generated.

My question is:

What is the criterion Clang uses to determine when to suggest another variable?

My experimentation suggests it is quite sophisticated:

  • If there is another similarly named variable that you might have meant (e.g. int my_in;), it does not give a suggestion
  • If the suggested variable has the wrong type for the operation (e.g. when trying to print my_it.size() instead), it does not give a suggestion
  • Whether or not it gives the suggestion depends on a non-trivial comparison of variable names: it allows for both deletions and insertions of characters, and longer variable names allow for more insertion/deletions to be considered "similar".
Mees de Vries asked Dec 31 '22 12:12


1 Answer

You are unlikely to find it documented, but as Clang is open source you can turn to the source code to try to figure it out.

Clangd?

The particular diagnostic (from DiagnosticSemaKinds.td):

def err_undeclared_var_use_suggest : Error<
  "use of undeclared identifier %0; did you mean %1?">;

is only ever referred to from clang-tools-extra/clangd/IncludeFixer.cpp:

  // Try to fix unresolved name caused by missing declaration.
  // E.g.
  //   clang::SourceManager SM;
  //          ~~~~~~~~~~~~~
  //          UnresolvedName
  //   or
  //   namespace clang {  SourceManager SM; }
  //                      ~~~~~~~~~~~~~
  //                      UnresolvedName
  // We only attempt to recover a diagnostic if it has the same location as
  // the last seen unresolved name.
  if (DiagLevel >= DiagnosticsEngine::Error &&
      LastUnresolvedName->Loc == Info.getLocation())
    return fixUnresolvedName();

Now, clangd is a language server and, to be honest, I don't know whether this is actually used by the Clang compiler frontend to yield certain diagnostics, but you're free to continue down the rabbit hole to tie these details together. The fixUnresolvedName call above eventually performs a fuzzy search:

if (llvm::Optional<const SymbolSlab *> Syms = fuzzyFindCached(Req))
    return fixesForSymbols(**Syms);

If you want to dig into the details, I would recommend starting with the fuzzyFindCached function:

llvm::Optional<const SymbolSlab *>
IncludeFixer::fuzzyFindCached(const FuzzyFindRequest &Req) const {
  auto ReqStr = llvm::formatv("{0}", toJSON(Req)).str();
  auto I = FuzzyFindCache.find(ReqStr);
  if (I != FuzzyFindCache.end())
    return &I->second;

  if (IndexRequestCount >= IndexRequestLimit)
    return llvm::None;
  IndexRequestCount++;

  SymbolSlab::Builder Matches;
  Index.fuzzyFind(Req, [&](const Symbol &Sym) {
    if (Sym.Name != Req.Query)
      return;
    if (!Sym.IncludeHeaders.empty())
      Matches.insert(Sym);
  });
  auto Syms = std::move(Matches).build();
  auto E = FuzzyFindCache.try_emplace(ReqStr, std::move(Syms));
  return &E.first->second;
}
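
The caching pattern in fuzzyFindCached can be sketched outside of LLVM as follows. This is a minimal illustration, not clangd code: CachedLookup, Lookup and the vector-of-strings result type are hypothetical stand-ins for IncludeFixer, the index and SymbolSlab. It shows the two ideas visible in the function above: results are memoized under a serialized request key, and only a limited number of uncached lookups is allowed. Note also that the callback in the quoted function drops every symbol whose name differs from the query, so despite the name, this path looks for exact names that have a known include header rather than for misspellings.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for an expensive index query.
using Lookup = std::function<std::vector<std::string>(const std::string &)>;

class CachedLookup {
public:
  CachedLookup(Lookup Fn, std::size_t Limit)
      : Fn(std::move(Fn)), Limit(Limit) {}

  // Returns the cached result for Key, or nullptr once the budget of
  // uncached lookups has been used up.
  const std::vector<std::string> *find(const std::string &Key) {
    auto It = Cache.find(Key);
    if (It != Cache.end())
      return &It->second; // cache hit: does not consume budget
    if (Count >= Limit)
      return nullptr; // budget exhausted: give up instead of querying
    ++Count;
    auto Inserted = Cache.try_emplace(Key, Fn(Key));
    return &Inserted.first->second;
  }

  // Number of uncached lookups performed so far.
  std::size_t lookups() const { return Count; }

private:
  Lookup Fn;
  std::size_t Limit;
  std::size_t Count = 0;
  std::unordered_map<std::string, std::vector<std::string>> Cache;
};
```

Repeated requests with the same key are served from the cache, so only distinct keys count against the limit.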

along with the type of its single function parameter, FuzzyFindRequest in clang-tools-extra/clangd/index/Index.h:

struct FuzzyFindRequest {
  /// A query string for the fuzzy find. This is matched against symbols'
  /// un-qualified identifiers and should not contain qualifiers like "::".
  std::string Query;
  /// If this is non-empty, symbols must be in at least one of the scopes
  /// (e.g. namespaces) excluding nested scopes. For example, if a scope "xyz::"
  /// is provided, the matched symbols must be defined in namespace xyz but not
  /// namespace xyz::abc.
  ///
  /// The global scope is "", a top level scope is "foo::", etc.
  std::vector<std::string> Scopes;
  /// If set to true, allow symbols from any scope. Scopes explicitly listed
  /// above will be ranked higher.
  bool AnyScope = false;
  /// The number of top candidates to return. The index may choose to
  /// return more than this, e.g. if it doesn't know which candidates are best.
  llvm::Optional<uint32_t> Limit;
  /// If set to true, only symbols for completion support will be considered.
  bool RestrictForCodeCompletion = false;
  /// Contextually relevant files (e.g. the file we're code-completing in).
  /// Paths should be absolute.
  std::vector<std::string> ProximityPaths;
  /// Preferred types of symbols. These are raw representation of `OpaqueType`.
  std::vector<std::string> PreferredTypes;

  bool operator==(const FuzzyFindRequest &Req) const {
    return std::tie(Query, Scopes, Limit, RestrictForCodeCompletion,
                    ProximityPaths, PreferredTypes) ==
           std::tie(Req.Query, Req.Scopes, Req.Limit,
                    Req.RestrictForCodeCompletion, Req.ProximityPaths,
                    Req.PreferredTypes);
  }
  bool operator!=(const FuzzyFindRequest &Req) const { return !(*this == Req); }
};
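
The scope rule described in the comments on Scopes and AnyScope can be read as a simple filter. The following is only my interpretation of those comments, not code from clangd; scopeAllowed is a hypothetical helper, and it assumes a symbol's scope is stored as a string such as "xyz::" or "xyz::abc::".

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: decides whether a symbol living in SymbolScope passes
// the scope restriction of a request.
bool scopeAllowed(const std::string &SymbolScope,
                  const std::vector<std::string> &Scopes, bool AnyScope) {
  // No restriction if any scope is allowed or no scopes are listed.
  if (AnyScope || Scopes.empty())
    return true;
  // Exact match only: "xyz::" accepts "xyz::" but not the nested "xyz::abc::".
  return std::find(Scopes.begin(), Scopes.end(), SymbolScope) != Scopes.end();
}
```

Under this reading, a request with Scopes = {"xyz::"} accepts a symbol in namespace xyz and rejects one in xyz::abc unless AnyScope is set.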

Other rabbit holes?

The following commit may be another place to start from:

[Frontend] Allow attaching an external sema source to compiler instance and extra diags to TypoCorrections

This can be used to append alternative typo corrections to an existing diag. include-fixer can use it to suggest includes to be added.

Differential Revision: https://reviews.llvm.org/D26745

from which we may end up in clang/include/clang/Sema/TypoCorrection.h. This sounds more likely to be what the compiler frontend actually uses than the code in clangd (which is a clang-tools-extra tool). E.g.:

  /// Gets the "edit distance" of the typo correction from the typo.
  /// If Normalized is true, scale the distance down by the CharDistanceWeight
  /// to return the edit distance in terms of single-character edits.
  unsigned getEditDistance(bool Normalized = true) const {
    if (CharDistance > MaximumDistance || QualifierDistance > MaximumDistance ||
        CallbackDistance > MaximumDistance)
      return InvalidDistance;
    unsigned ED =
        CharDistance * CharDistanceWeight +
        QualifierDistance * QualifierDistanceWeight +
        CallbackDistance * CallbackDistanceWeight;
    if (ED > MaximumDistance)
      return InvalidDistance;
    // Half the CharDistanceWeight is added to ED to simulate rounding since
    // integer division truncates the value (i.e. round-to-nearest-int instead
    // of round-to-zero).
    return Normalized ? NormalizeEditDistance(ED) : ED;
  }
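
To make the arithmetic in getEditDistance concrete, here is a standalone sketch of the same weighted sum and rounding step. The function names and all numeric constants below are assumptions chosen for illustration; I have not copied them from TypoCorrection.h. The rounding follows the comment in the quoted function: half of the character weight is added before the integer division so that the result rounds to nearest instead of truncating.

```cpp
#include <limits>

// Illustrative constants only; the real values live in TypoCorrection.h.
constexpr unsigned InvalidDistance = std::numeric_limits<unsigned>::max();
constexpr unsigned MaximumDistance = 10000;
constexpr unsigned CharWeight = 100;
constexpr unsigned QualifierWeight = 110;
constexpr unsigned CallbackWeight = 150;

// Scale a weighted distance back to "single-character edits", rounding to
// the nearest integer rather than towards zero.
unsigned normalizeDistance(unsigned ED) {
  if (ED > MaximumDistance)
    return InvalidDistance;
  return (ED + CharWeight / 2) / CharWeight;
}

// Combine the three component distances into one weighted score.
unsigned combinedDistance(unsigned Chars, unsigned Quals, unsigned Callback,
                          bool Normalized = true) {
  if (Chars > MaximumDistance || Quals > MaximumDistance ||
      Callback > MaximumDistance)
    return InvalidDistance;
  unsigned ED = Chars * CharWeight + Quals * QualifierWeight +
                Callback * CallbackWeight;
  if (ED > MaximumDistance)
    return InvalidDistance;
  return Normalized ? normalizeDistance(ED) : ED;
}
```

With these placeholder weights, a single character edit normalizes to 1, while a character edit plus one qualifier change (100 + 110 = 210) normalizes to 2.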

used in clang/lib/Sema/SemaDecl.cpp:

// Callback to only accept typo corrections that have a non-zero edit distance.
// Also only accept corrections that have the same parent decl.
class DifferentNameValidatorCCC final : public CorrectionCandidateCallback {
 public:
  DifferentNameValidatorCCC(ASTContext &Context, FunctionDecl *TypoFD,
                            CXXRecordDecl *Parent)
      : Context(Context), OriginalFD(TypoFD),
        ExpectedParent(Parent ? Parent->getCanonicalDecl() : nullptr) {}

  bool ValidateCandidate(const TypoCorrection &candidate) override {
    if (candidate.getEditDistance() == 0)
      return false;
     // ...
  }
  // ...
};
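
To connect this back to the behaviour observed in the question, here is a rough sketch of how an edit-distance based suggestion with a length-dependent cutoff could work. It is not Clang's implementation: editDistance and suggest are hypothetical names, the cutoff of (length + 2) / 3 edits is an assumption chosen for illustration, and type-based filtering (which Clang appears to do through callbacks such as the one above) is left out. Like the validator above, the sketch ignores candidates with an edit distance of zero, and it gives no suggestion when two candidates are equally close.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Classic Levenshtein distance: insertions, deletions and substitutions
// each cost 1. Uses a single rolling row of the dynamic-programming table.
std::size_t editDistance(const std::string &A, const std::string &B) {
  std::vector<std::size_t> Row(B.size() + 1);
  for (std::size_t J = 0; J <= B.size(); ++J)
    Row[J] = J;
  for (std::size_t I = 1; I <= A.size(); ++I) {
    std::size_t Diag = Row[0]; // value from the previous row, column J - 1
    Row[0] = I;
    for (std::size_t J = 1; J <= B.size(); ++J) {
      std::size_t Above = Row[J]; // value from the previous row, column J
      std::size_t Subst = Diag + (A[I - 1] == B[J - 1] ? 0 : 1);
      Row[J] = std::min({Above + 1, Row[J - 1] + 1, Subst});
      Diag = Above;
    }
  }
  return Row[B.size()];
}

// Returns the unique closest name within the cutoff, if there is one.
std::optional<std::string> suggest(const std::string &Typo,
                                   const std::vector<std::string> &Names) {
  // Assumed cutoff: longer identifiers tolerate more edits.
  const std::size_t MaxDist = (Typo.size() + 2) / 3;
  std::optional<std::string> Best;
  std::size_t BestDist = MaxDist + 1;
  bool Ambiguous = false;
  for (const std::string &Name : Names) {
    const std::size_t D = editDistance(Typo, Name);
    if (D == 0 || D > MaxDist)
      continue; // identical or too far away
    if (D < BestDist) {
      Best = Name;
      BestDist = D;
      Ambiguous = false;
    } else if (D == BestDist) {
      Ambiguous = true; // tie between two equally good candidates
    }
  }
  if (Ambiguous)
    return std::nullopt;
  return Best;
}
```

With only my_int in scope, suggest("my_it", ...) yields my_int (one insertion). Adding my_in, which is one substitution away, produces a tie and therefore no suggestion, which matches the first observation in the question.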
dfrib answered Jan 02 '23 02:01