Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Looking for a C++ implementation of the C4.5 algorithm

I've been looking for a C++ implementation of the C4.5 algorithm, but I haven't been able to find one yet. I found Quinlan's C4.5 Release 8, but it's written in C... has anybody seen any open source C++ implementations of the C4.5 algorithm?

I'm thinking about porting the J48 source code (or simply writing a wrapper around the C version) if I can't find an open source C++ implementation out there, but I hope I don't have to do that! Please let me know if you have come across a C++ implementation of the algorithm.

Update

I've been considering the option of writing a thin C++ wrapper around the C implementation of the C5.0 algorithm (C5.0 is the improved version of C4.5). I downloaded and compiled the C implementation of the C5.0 algorithm, but it doesn't look like it's easily portable to C++. The C implementation uses a lot of global variables and simply writing a thin C++ wrapper around the C functions will not result in an object oriented design because each class instance will be modifying the same global parameters. In other words: I will have no encapsulation and that's a pretty basic thing that I need.

In order to get encapsulation I will need to make a full blown port of the C code into C++, which is about the same as porting the Java version (J48) into C++.

Update 2.0

Here are some specific requirements:

  1. Each classifier instance must encapsulate its own data (i.e. no global variables aside from constant ones).
  2. Support the concurrent training of classifiers and the concurrent evaluation of the classifiers.

Here is a good scenario: suppose I'm doing 10-fold cross-validation, I would like to concurrently train 10 decision trees with their respective slice of the training set. If I just run the C program for each slice, I would have to run 10 processes, which is not horrible. However, if I need to classify thousands of data samples in real time, then I would have to start a new process for each sample I want to classify and that's not very efficient.

like image 553
Kiril Avatar asked May 25 '12 15:05

Kiril


People also ask

What is the C4 5 algorithm and how does it work?

The C4. 5 algorithm is used in Data Mining as a Decision Tree Classifier which can be employed to generate a decision, based on a certain sample of data (univariate or multivariate predictors). So, before we dive straight into C4. 5, let's discuss a little about Decision Trees and how they can be used as classifiers.

What does C4 5 uses to determine how much information is gained after a split?

C4. 5 uses the Gain Ratio as the goodness function to split the dataset, unlike ID3 which used the Information Gain. The Information Gain function tends to prefer the features with more categories as they tend to have lower entropy. This results in overfitting of the training data.

What is a C4 5 decision tree?

Trees.  C4.5 is a computer program for inducing classification rules. in the form of decision trees from a set of given instances.  C4.5 is a software extension of the basic ID3 algorithm. designed by Quinlan.

How is C4 5 useful in decision making?

Basically, the C4. 5 decision tree algorithm applies the concept of information entropy to construct a decision tree through learning data; it chooses the attribute with the highest information gain rate in the current sample set as the testing attribute to divide the sample set.


2 Answers

A C++ implementation for C4.5 called YaDT is available here, in the "Decision Trees" section:
http://www.di.unipi.it/~ruggieri/software.html

This is the source code for the last version:
http://www.di.unipi.it/~ruggieri/YaDT/YaDT1.2.5.zip

From the paper where the tool is described:

[...] In this paper, we describe a new from-scratch C++ implementation of a decision tree induction algorithm, which yields entropy-based decision trees in the style of C4.5. The implementation is called YaDT, an acronym for Yet another Decision Tree builder. The intended contribution of this paper is to present the design principles of the implementation that allowed for obtaining a highly efficient system. We discuss our choices on memory representation and modelling of data and metadata,on the algorithmic optimizations and their effect on memory and time performances, and on the trade-off between efficiency and accuracy of pruning heuristics. [...]

The paper is available here.

like image 126
gRizzlyGR Avatar answered Sep 22 '22 21:09

gRizzlyGR


I may have found a possible C++ "implementation" of C5.0 (See5.0), but I haven't been able to dig into the source code enough to determine if it really works as advertised.

To reiterate my original concerns, the author of the port states the following about the C5.0 algorithm:

Another drawback with See5Sam [C5.0] is the impossibility to have more than one application tree at the same time. An application is read from files each time the executable is run and is stored in global variables here and there.

I will update my answer as soon as I get some time to look into the source code.

Update

It's looking pretty good, here is the C++ interface:

class CMee5
{
  public:

    /**
      Create a See 5 engine from tree/rules files.
      \param pcFileStem The stem of the See 5 file system. The engine
             initialisation will look for the following files:
              - pcFileStem.names Vanilla See 5 names file (mandatory)
              - pcFileStem.tree or pcFileStem.rules Vanilla See 5 tree or rules
                file (mandatory)
              - pcFileStem.costs Vanilla See 5 costs file (mandatory)
    */
    inline CMee5(const char* pcFileStem, bool bUseRules);

    /**
      Release allocated memory for this engine.
    */
    inline ~CMee5();

    /**
      General classification routine accepting a data record.
    */
    inline unsigned int classifyDataRec(DataRec Case, float* pOutConfidence);

    /**
      Show rules that were used to classify the last case.
      Classify() will have set RulesUsed[] to
      number of active rules for trial 0,
      first active rule, second active rule, ..., last active rule,
      number of active rules for trial 1,
      first active rule, second active rule, ..., last active rule,
      and so on.
    */
    inline void showRules(int Spaces);

    /**
      Open file with given extension for read/write with the actual file stem.
    */
    inline FILE* GetFile(String Extension, String RW);

    /**
      Read a raw case from file Df.

      For each attribute, read the attribute value from the file.
      If it is a discrete valued attribute, find the associated no.
      of this attribute value (if the value is unknown this is 0).

      Returns the array of attribute values.
    */
    inline DataRec GetDataRec(FILE *Df, Boolean Train);
    inline DataRec GetDataRecFromVec(float* pfVals, Boolean Train);
    inline float TranslateStringField(int Att, const char* Name);

    inline void Error(int ErrNo, String S1, String S2);

    inline int getMaxClass() const;
    inline int getClassAtt() const;
    inline int getLabelAtt() const;
    inline int getCWtAtt() const;
    inline unsigned int getMaxAtt() const;
    inline const char* getClassName(int nClassNo) const;
    inline char* getIgnoredVals();

    inline void FreeLastCase(void* DVec);
}

I would say that this is the best alternative I've found so far.

like image 25
Kiril Avatar answered Sep 23 '22 21:09

Kiril