Platinum

SVM Classification and Prediction Overview

Overview

In GeneLinker™, SVM classification is the process of learning to separate samples into different classes. For example, a set of samples may be taken from biopsies of two different tumor types and their gene expression levels measured. GeneLinker™ can use this data to learn to distinguish the two tumor types so that it can later predict the tumor types of new biopsies. Because making predictions on unknown samples is often used as a means of testing the SVM classifier, we use the terms training samples and test samples: training samples are those whose classes GeneLinker™ knows, and test samples are those whose classes GeneLinker™ will predict.

Types of Learning

SVM Classification is an example of Supervised Learning. Known class labels indicate whether the system is performing correctly. This information can be used to indicate a desired response, to validate the accuracy of the system, or to help the system learn to behave correctly. The known class labels can be thought of as supervising the learning process; the term does not imply that you have to intervene in that process yourself.

Clustering is an example of Unsupervised Learning: the class labels are not presented to the system, which instead tries to discover the natural classes in a dataset. Clustering often fails to find known classes because the distinction between the classes can be obscured by the large number of features (genes) that are uncorrelated with the classes.

One step in SVM classification is identifying the genes that are closely associated with the known classes. This is called feature selection or feature extraction. Feature selection and SVM classification together are useful even when prediction of unknown samples is not necessary: they can be used to identify key genes involved in whatever processes distinguish the classes.

Manual Feature Selection

Manual feature selection is useful if you already have some hypothesis about which genes are key to a process. You can test that hypothesis by:

i. constructing a gene list of those genes,

ii. running an SVM classifier using those genes as features, and

iii. displaying a plot which shows whether the data can be successfully classified.
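The three steps above can be sketched outside GeneLinker™ as follows. This is only an illustration under stated assumptions: the gene names and expression values are invented, and a simple nearest-centroid rule with leave-one-out checking stands in for both the SVM classifier and the plot. It shows the idea of restricting the data to a hypothesized gene list and then checking whether the classes can be separated using only those genes.

```python
# Hypothetical expression data: gene name -> expression level per sample.
expression = {
    "GENE_A": [5.1, 4.8, 5.3, 1.2, 0.9, 1.1],
    "GENE_B": [0.4, 0.6, 0.5, 3.9, 4.2, 4.0],
    "GENE_C": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # roughly the same in both classes
}
labels = ["tumor1", "tumor1", "tumor1", "tumor2", "tumor2", "tumor2"]


def select_features(expression, gene_list):
    """Step i: build one feature vector per sample from the listed genes only."""
    n_samples = len(next(iter(expression.values())))
    return [[expression[gene][i] for gene in gene_list] for i in range(n_samples)]


def centroid(rows):
    """Mean feature vector of a group of samples."""
    return [sum(column) / len(column) for column in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def leave_one_out_accuracy(features, labels):
    """Steps ii and iii (stand-in): classify each sample from all the others.

    A nearest-centroid rule is used here instead of an SVM to keep the
    sketch short; it is NOT what GeneLinker does internally.
    """
    correct = 0
    for i, (x, true_label) in enumerate(zip(features, labels)):
        rest = [(f, l) for j, (f, l) in enumerate(zip(features, labels)) if j != i]
        classes = {l for _, l in rest}
        centroids = {c: centroid([f for f, l in rest if l == c]) for c in classes}
        predicted = min(centroids, key=lambda c: squared_distance(x, centroids[c]))
        if predicted == true_label:
            correct += 1
    return correct / len(labels)


gene_list = ["GENE_A", "GENE_B"]          # the hypothesis being tested
features = select_features(expression, gene_list)
accuracy = leave_one_out_accuracy(features, labels)
print(f"Leave-one-out accuracy using {gene_list}: {accuracy:.2f}")
```

If the hypothesized genes really do drive the difference between the classes, the accuracy should be high; a gene list that classifies no better than chance suggests the hypothesis needs revisiting.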

Feature Selection Using the SLAM™ Technology

Genes that appear frequently in associations are often good features for classification with artificial neural networks or support vector machines.

In GeneLinker™, SVM classification is done using a committee of support vector machines (SVMs). An SVM finds an optimal separating hyperplane between data points of different classes in a (possibly) high dimensional space. The support vectors are the training points that lie closest to this hyperplane; they determine the decision boundary between the classes. More details on support vector machines are available in Tutorial 9.

A committee of SVMs is used because an individual SVM may not be robust. That is, it may not make good predictions on new data (test data) despite excellent performance on the training data. Such a learner is referred to as being overtrained.
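To make the idea of a separating hyperplane concrete, here is a minimal sketch of a linear soft-margin SVM trained by subgradient descent on the regularized hinge loss. It is an assumption-laden illustration, not GeneLinker™'s implementation: the function names, the learning rate, the regularization strength, and the two-dimensional toy data are all invented for this example, and a production SVM would normally use a dedicated solver and possibly a nonlinear kernel.

```python
def decision_value(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(points, labels, lr=0.1, lam=0.01, epochs=50):
    """Subgradient descent on the L2-regularized hinge loss.

    labels must be +1 or -1. Returns the weight vector w and offset b.
    lr, lam and epochs are illustrative values, not tuned settings.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * decision_value(w, b, x)
            # The regularization term shrinks w a little on every step.
            w = [wi * (1 - lr * lam) for wi in w]
            # Points inside the margin (or misclassified) pull the hyperplane.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1


def nearest_to_boundary(w, b, points):
    """Training points ordered by distance from the hyperplane.

    The first few play the role of support vectors in this sketch.
    """
    return sorted(points, key=lambda x: abs(decision_value(w, b, x)))


# Toy data: two well-separated classes in two dimensions.
points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
print("weights:", w, "offset:", b)
print("closest training point:", nearest_to_boundary(w, b, points)[0])
print("prediction for (4, 4):", predict(w, b, (4, 4)))
```

Only the points nearest the boundary influence where the hyperplane ends up; moving a point that is far from the boundary (without crossing it) would leave the result essentially unchanged.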

Each learner (ANN or SVM) is by default trained on a different 90% of the training data and then validated on the remaining 10%. (These fractions can be set differently in the Create ANN Classifier dialog or in the Create SVM Classifier dialog by varying the number of learners.) This technique mitigates the risk of overtraining at the level of the individual learner.
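The relationship between the number of learners and the training and validation fractions can be illustrated with a short sketch. It assumes the split behaves like k-fold cross-validation, with each learner holding out a different 1/N of the training samples; this matches the 90%/10% default for ten learners, but the exact way GeneLinker™ assigns samples is not described here, so the function name and the assignment rule below are hypothetical.

```python
def committee_splits(n_samples, n_learners=10):
    """Return one (training, validation) pair of index lists per learner.

    Each learner validates on a different 1/n_learners slice of the data
    and trains on everything else (a k-fold style split, assumed here).
    """
    indices = list(range(n_samples))
    splits = []
    for k in range(n_learners):
        validation = indices[k::n_learners]   # every n_learners-th sample
        held_out = set(validation)
        training = [i for i in indices if i not in held_out]
        splits.append((training, validation))
    return splits


splits = committee_splits(n_samples=50, n_learners=10)
for k, (training, validation) in enumerate(splits):
    print(f"learner {k}: {len(training)} training, {len(validation)} validation")
```

With 50 samples and 10 learners, each learner trains on 45 samples and validates on 5. With 5 learners the same function gives an 80%/20% split, which is why changing the number of learners changes the fractions.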

The committee architecture further enhances robustness by combining the component predictions in a voting scheme. Finally, by examining a chart of the voting results, difficult-to-classify samples can often be identified for re-examination or further study.
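A voting scheme of this kind can be sketched as a simple majority vote. The sample names, the votes, and the 0.8 agreement threshold below are invented for illustration, and GeneLinker™'s actual voting rules may differ (for example, in how ties or abstentions are handled). The sketch shows how combining the component predictions yields both a committee prediction and a measure of agreement that flags difficult-to-classify samples.

```python
from collections import Counter


def committee_vote(predictions):
    """Majority vote over one sample's predictions, one per committee member.

    Returns the winning class and the fraction of members that voted for it.
    """
    counts = Counter(predictions)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(predictions)


# Hypothetical predictions from a committee of 10 learners.
votes_by_sample = {
    "sample_1": ["A"] * 10,
    "sample_2": ["A"] * 9 + ["B"],
    "sample_3": ["A"] * 6 + ["B"] * 4,
}

results = {name: committee_vote(votes) for name, votes in votes_by_sample.items()}

# Samples with weak agreement are candidates for re-examination.
AGREEMENT_THRESHOLD = 0.8  # illustrative cut-off, not a GeneLinker setting
difficult = [name for name, (_, agreement) in results.items()
             if agreement < AGREEMENT_THRESHOLD]

for name, (winner, agreement) in results.items():
    print(f"{name}: predicted {winner} with {agreement:.0%} of votes")
print("difficult to classify:", difficult)
```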