Introduction
GATree is a decision tree builder based on Genetic Algorithms (GAs). The idea behind
it is simple but powerful. Instead of using statistical metrics
that are biased towards specific trees, we use a more flexible, global
metric of tree quality that tries to optimize both accuracy and size. GATree
offers some unique features not found in other tree inducers,
while at the same time it can produce better results for many difficult problems.
A few unique characteristics of GATree are as follows:
- The user can control the
characteristics of the output (more accurate vs. smaller trees)
- There is no built-in upper bound on the quality of its results: the more
processing power and time we give the system, the longer it can keep searching
- The system evolves complete solutions to the problem. We can
stop evolution whenever the results are satisfactory
- The system evolves a set of possible solutions (i.e., decision
trees) that closely match the input data. This gives us alternative
hypotheses for the same data
For more information on the program internals you can read the research paper (here)
or watch the presentation (here).
In this small tutorial we will comment on a few screenshots from the
program's interface that uncover its basic functionality. We will also
present a simple step-by-step example with instructions on how to use
and decode the program's output effectively.
Evolve Tab
The main screen of the program (Figure 1) allows us to select an active
training dataset and evolve the decision tree. In the program's main
window we can watch the best decision tree as it evolves through time. The right panel
shows information about the current status of the evolution process.

Figure 1: The main evolution tab
By pressing the Visualize-Decision-Tree
button we can visualize and traverse the final decision tree (Figure 2).

Figure 2: Decision tree visualization
Statistics Tab
The statistics tab (Figure 3) provides
several graphs of the evolution process. These graphs allow us to
follow the evolution process in real time and discover potential
problems and trends. As an example, when the Average Fitness of the
population approaches the Fitness of the best Genome, there
is little room for further improvement. A solution here could be to try
more generations or a bigger population size.
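The check described above can be sketched in a few lines. This is only an illustration, not GATree's code: the function name `has_converged` and the relative tolerance `tol` are assumptions.

```python
def has_converged(avg_fitness, best_fitness, tol=0.01):
    """Return True when the population average is within a relative
    tolerance `tol` of the best genome's fitness."""
    if best_fitness == 0:
        # Avoid dividing by zero; only an all-zero population counts as converged.
        return avg_fitness == 0
    return abs(best_fitness - avg_fitness) / abs(best_fitness) <= tol
```

When such a check fires early in a run, the population has probably lost most of its diversity.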

Figure 3: The statistics tab
Settings Tab
The settings tab (Figure 4) allows us to control
every aspect of the evolution process. There are two types of settings,
Basic and Advanced, grouped by their usefulness and complexity. Below you
can find an explanation of the offered options.

Figure 4: The settings tab
Generations/Population:
These two options control the number of generations and the size of the initial
population of random decision trees. Together they determine the total
evolution time and also affect the quality of the output.
Small/Accurate Trees slider:
A unique characteristic of genetic evolution is that we can decide
on the general characteristics of the output. The slider here allows
us to decide whether we would like to search in a bigger or smaller
solution space. If it is set to the left, it imposes a penalty on all big
trees that excludes them from the next generation's best population. You can also
alter this preference dynamically by setting it for the beginning and the end of
evolution. This way the program will slowly move, during the evolution, from
a preference for big trees to a preference for smaller trees or vice
versa (advanced settings).
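One way to picture the slider is as a weight in the fitness function that trades accuracy against size, moving from a starting to an ending value over the run. The sketch below is an assumption made for illustration: the penalty form, the names `size_weight` and `fitness`, and the linear schedule are hypothetical and are not taken from GATree.

```python
def size_weight(generation, total_generations, start, end):
    """Linearly interpolate the size-penalty weight from `start` to `end`."""
    if total_generations <= 1:
        return end
    t = generation / (total_generations - 1)  # 0.0 at the first generation, 1.0 at the last
    return start + t * (end - start)


def fitness(accuracy, tree_size, weight):
    """Hypothetical fitness: accuracy in [0, 1] minus a penalty per tree node."""
    return accuracy - weight * tree_size
```

With a weight of zero the more accurate tree always wins; as the weight grows, a smaller tree with slightly lower accuracy can score higher.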
Crossover / Mutation Probability:
Here we control the genetic characteristics of the evolution. Crossover
probability is the probability that a random subtree will be
replaced with another subtree. Mutation probability is the
probability that a node will be randomly altered to hold a new value.
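The two operators can be sketched on a toy tree structure. This is a minimal illustration under stated assumptions: the `Node` class, the function names, and the way the probabilities are applied are hypothetical, and checks that keep the tree a valid decision tree are left out.

```python
import copy
import random


class Node:
    """A tree node: `label` is an attribute test or a class label."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []


def all_nodes(root):
    """Return every node of the tree in pre-order."""
    nodes = [root]
    for child in root.children:
        nodes.extend(all_nodes(child))
    return nodes


def crossover(a, b, rng, probability):
    """With the given probability, return a copy of `a` in which one random
    subtree is replaced by a copy of a random subtree of `b`."""
    child = copy.deepcopy(a)
    if rng.random() < probability:
        target = rng.choice(all_nodes(child))
        donor = copy.deepcopy(rng.choice(all_nodes(b)))
        target.label, target.children = donor.label, donor.children
    return child


def mutate(tree, labels, rng, probability):
    """Return a copy of `tree` where each node's label is redrawn from
    `labels` with the given probability."""
    mutant = copy.deepcopy(tree)
    for node in all_nodes(mutant):
        if rng.random() < probability:
            node.label = rng.choice(labels)
    return mutant
```

Both functions work on copies, so the parent trees stay available for the next selection step.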
Crossover / Mutation Heuristic:
These two options allow us to select among several variants of
the standard crossover/mutation that prefer nodes or good subtrees. The
idea here is that these variants may improve the speed and accuracy
of evolution.
Random Seed Initializer:
This is the seed of the random sequence generator. If we keep the initializer at the
same number, we can expect the same decision tree for the same
settings. Otherwise, the results will tend to be similar, but because of the
randomness of genetic evolution, identical results are not
guaranteed.
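The effect of a fixed seed can be seen with Python's standard `random` module. This is a generic illustration of seeded generators, not GATree's implementation; `random_run` is a hypothetical name standing in for any sequence of random choices made during evolution.

```python
import random


def random_run(seed, steps=5):
    """Draw a short sequence of random choices from a private, seeded generator."""
    rng = random.Random(seed)  # independent of the module-level generator
    return [rng.randint(0, 99) for _ in range(steps)]


# The same seed replays exactly the same sequence of choices.
same = random_run(7) == random_run(7)
```

A different seed starts a different sequence, which is why two runs with the same settings can end in different trees.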
Percent of genome replacement:
This parameter controls the number of bad trees that will be replaced
with new ones between generations.
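A replacement step of this kind can be sketched as follows. The names `next_generation`, `make_tree`, and `replace_fraction` are assumptions; in particular, how GATree produces the new trees is not specified here, so `make_tree` is just a caller-supplied hook.

```python
def next_generation(population, fitness, make_tree, replace_fraction):
    """Keep the fittest genomes and replace the worst `replace_fraction`
    of the population with newly created ones."""
    ranked = sorted(population, key=fitness, reverse=True)  # best first
    n_replace = int(len(ranked) * replace_fraction)
    survivors = ranked[: len(ranked) - n_replace]
    newcomers = [make_tree() for _ in range(n_replace)]
    return survivors + newcomers
```

The population size stays constant; only the share of newcomers changes with the parameter.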
Error rate:
This is another speed-up parameter. When the error rate of a decision
tree surpasses a limit, we stop classifying further testing instances to
save resources.
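The early stop can be sketched like this. It is an illustration under assumptions: `evaluate_with_cutoff` is a hypothetical name, instances are taken to be `(features, label)` pairs, and an abandoned tree is reported as `None`.

```python
def evaluate_with_cutoff(predict, instances, max_error_rate):
    """Return (accuracy, instances_seen). Evaluation is abandoned, with
    accuracy None, as soon as the error count can no longer stay within
    `max_error_rate` over the whole set."""
    if not instances:
        return None, 0
    max_errors = max_error_rate * len(instances)
    errors = 0
    for seen, (features, label) in enumerate(instances, start=1):
        if predict(features) != label:
            errors += 1
            if errors > max_errors:
                return None, seen  # limit exceeded: skip the remaining instances
    return 1 - errors / len(instances), len(instances)
```

A poor tree is rejected after a few instances, while a tree within the limit is scored on the full set.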
Interface Updater:
It controls how often the interface will update the statistics and the
current version of the decision tree.
Save Evolution Log / Cross Validation:
These options activate logging and/or cross-validation mode. Below we
explain both in detail.
Logging
The evolution log (Figure 5) keeps the
best decision trees across generations. It lets us
visually examine, at a later time, how our tree evolved.

Figure 5: Evolution log
Cross Validation
Cross validation is a widely used method for estimating the accuracy of
models produced by machine learning algorithms. GATree has
built-in support for cross validation (Figure 6).

Figure 6: Cross-Validation
Case Study
In this small example we will illustrate
the effect of different settings on the evolution process, as well as
how to interpret graphs and results.
We start by selecting a training data set
and a testing data set (Figure 7). The testing data set is used to
evaluate the performance of our model. Make sure that the training and
testing data sets use the same data format and that their data do not overlap.
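A quick way to check the no-overlap condition before loading the files is sketched below. It assumes each instance has already been read as a list of values; the function name `overlapping_rows` is hypothetical.

```python
def overlapping_rows(train_rows, test_rows):
    """Return the set of instances that appear in both data sets."""
    return set(map(tuple, train_rows)) & set(map(tuple, test_rows))
```

An empty result means no instance is shared between the two sets.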

Figure 7: Selecting a training and a testing data set
GATree uses ARFF as its standard source
format (Figure 8). An ARFF file is a simple text file that describes the
problem instances and their attributes. For more information on the ARFF format try
this url (search for ARFF Format).
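To give a feel for the format, here is a small reader for a simple subset of ARFF. It is a sketch under assumptions: `parse_arff` is a hypothetical helper, it ignores quoting, sparse rows, and missing values, and the sample content is invented for illustration rather than copied from the real "zoo" file.

```python
SAMPLE = """% invented sample, not the real zoo data
@relation zoo_sample
@attribute hair {true, false}
@attribute legs numeric
@attribute type {mammal, bird}
@data
true,4,mammal
false,2,bird
"""


def parse_arff(text):
    """Parse a tiny subset of ARFF into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or comment
            continue
        lower = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows
```

The header declares one attribute per line, and every line after `@data` is one instance with its values in the same order.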

Figure 8: The "zoo" dataset in ARFF format
The statistics tab provides useful insight
into the evolution process (Figure 9). The training and testing accuracy
should increase through time. Also, tree size should decrease in the
final steps of the evolution, where the accuracy stabilizes and there is
more room for tree structure improvements.

Figure 9: Evolution graphs through time
An interesting characteristic of GATree is
that we can control its bias towards more accurate or smaller decision
trees. To illustrate this effect we go to the settings tab and move the
slider to prefer very small decision trees (Figure 10).

Figure 10: Altering settings to bias evolution towards smaller decision trees
The evolution process with the new
settings produces a much smaller decision tree (Figure 11).

Figure 11: A decision tree that is biased towards smaller size
The statistics tab gives us an overview of
how the new decision tree fits the training and testing data (Figure 12).
The new decision tree fits the testing data equally well but the training
data less well. However, from just one experiment we cannot be sure
which decision tree is the best for the underlying model.

Figure 12: Evolution graphs for the altered settings
A much better way to check the quality of
the output under specific settings is a cross-validation
experiment. You can enable this mode from the settings tab, as illustrated in
Figure 13.

Figure 13: Enabling cross validation mode
When cross validation is enabled, the
program will use only the training data set. It will internally shuffle
the data, partition it into as many buckets as needed, and run evolution on each of
them. Figure 14 shows the cross validation results for the
small tree bias and Figure 15 shows the cross validation results for the
big tree bias. From the results it is clear that for this specific
problem the small tree bias produces less accurate but much smaller decision
trees.
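The shuffle-and-partition step can be sketched as a generic k-fold split. This shows the standard technique, not GATree's internal code; the function name `k_fold_splits`, the fold count `k`, and the fixed `seed` are assumptions.

```python
import random


def k_fold_splits(instances, k, seed=0):
    """Shuffle once, cut the data into k buckets, and yield one
    (train, test) pair per bucket."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k buckets of near-equal size
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each instance is used for testing exactly once, and the reported accuracy and tree size would then be averaged over the k runs.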

Figure 14: Cross validation for small trees bias

Figure 15: Cross validation for big trees bias