This tutorial introduces you to data mining and prediction. You will use the integrated SLAM™ technology to mine a dataset for sets of gene associations. A gene list will be created from the most interesting features (genes). You will create and evaluate an ANN classifier.
Skills You Will Learn:
How to import gene expression data from a file into the GeneLinker™ database.
How to import variable class data.
How to discretize expression data.
How to run SLAM.
How to use the SLAM association viewer.
How to create a gene list.
How to create, evaluate, and predict classes using an ANN classifier.
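To give a feel for what discretizing expression data means, here is a minimal Python sketch that maps one gene's continuous expression values to a small number of bins. It uses equal-width binning, which is only one common approach and an assumption here; the function name `discretize`, the `n_bins` parameter, and the sample values are hypothetical, and GeneLinker™ may discretize differently.

```python
def discretize(values, n_bins=3):
    """Map each value to a bin index in 0..n_bins-1 using equal-width bins.

    Illustrative only: not GeneLinker's discretization method.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything lands in a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value sits on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical expression values for one gene across five samples.
expression = [0.0, 1.0, 2.5, 4.0, 6.0]
bins = discretize(expression)
print(bins)
```

With three bins, the indices can be read as low, medium, and high expression for that gene.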
This tutorial is a reanalysis of the data reported by Khan, Wei, Ringnér et al. in Nature Medicine (2001) [Ref.1]. We refer to this paper simply as 'Khan' in this tutorial.
The goal of both the paper and this tutorial is to learn to distinguish, at the molecular level, between types of small round blue-cell tumors (SRBCTs) such as Ewing sarcoma (EWS), Burkitt lymphoma (BL), neuroblastoma (NB) and rhabdomyosarcoma (RMS). These tumors are difficult to tell apart by visual methods, yet they respond to different treatments.
The data is available on the World Wide Web as supplementary material, at http://www.thep.lu.se/pub/Preprints/01/lu_tp_01_06_supp.html. The authors pre-filtered the data for a minimal level of expression, leaving measurements for 2308 genes.
The purpose of the workflow covered by this tutorial is to select a small number of genes (called features) that as a set are able to predict the cancer type of a given tissue sample. Once this small set of genes has been selected by SLAM™, a committee of artificial neural networks (ANNs) is trained using the expression levels of only those genes.
Feature selection and ANN training take place on the same set of data, called the training dataset. The samples in this dataset have known classes, so the ANN training is supervised by this knowledge. Once the ANN committee has been trained, it can be applied to new data on the same phenomenon (SRBCTs) to predict the classes of the samples in that data. This new data is called the test dataset.
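The idea of a committee producing one prediction per sample can be sketched in a few lines of Python. This is an illustration under stated assumptions, not GeneLinker's implementation: the committee members below are hypothetical stand-in rules rather than trained ANNs, the gene names `gene_a`, `gene_b`, and `gene_c` are placeholders, and combining members by majority vote is an assumed scheme (a real committee might average network outputs instead).

```python
from collections import Counter


def committee_predict(members, sample):
    """Return the majority class and the fraction of members that voted for it.

    If two classes tie, the one that received its first vote earliest wins.
    """
    votes = Counter(member(sample) for member in members)
    label, count = votes.most_common(1)[0]
    return label, count / len(members)


# Hypothetical stand-ins for trained networks: each maps a sample to a class.
members = [
    lambda s: "EWS" if s["gene_a"] > 1.0 else "RMS",
    lambda s: "EWS" if s["gene_b"] > 0.5 else "NB",
    lambda s: "BL" if s["gene_c"] > 2.0 else "EWS",
]

# A made-up test sample described only by the selected genes.
sample = {"gene_a": 1.5, "gene_b": 0.2, "gene_c": 0.3}

label, agreement = committee_predict(members, sample)
print(label, round(agreement, 2))
```

The agreement fraction gives a rough sense of how confident the committee is: a sample on which the members disagree deserves a closer look.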
This tutorial demonstrates how a combination of SLAM™ and a committee of trained ANNs can be used to effectively classify difficult-to-distinguish cancers using as few as eight genes.
What You Will Learn:
1. How to run SLAM™ and use the results to create gene lists.
2. How to train artificial neural networks (ANNs).
3. How to use trained ANNs to distinguish and predict sample classes.
This tutorial should take about an hour, depending on how long you spend investigating the data, and how fast your machine is.
If you must stop partway through the tutorial, simply exit the program by selecting Exit from the File menu. GeneLinker™ automatically saves the data and the experiments you have performed up to that point. The next time you start GeneLinker™, you can continue with the next step in the tutorial.