Tutorial 4: Introduction

This tutorial introduces you to Self-Organizing Maps (SOMs) and uses leukemia data to demonstrate how they can be used. The results of a SOM clustering are viewed in a SOM plot. The SOM is a clustering method with its roots in Artificial Neural Networks [Kohonen2001]. SOMs have been used in the literature to explore several different gene expression datasets [for example, Golub1999; Tamayo1999; Toronen1999; and Hill2000].

Skills You Will Learn:

How to import gene expression data from a file into the GeneLinker database.

How to display summary statistics about a dataset.

How to remove values and genes with missing values.

How to normalize data.

How to perform a SOM clustering experiment.

How to view SOM experiment results in a SOM plot.

How SOMs Work

SOMs work somewhat like K-Means clustering but are a little richer. With K-Means, you choose the number of clusters to fit the data into. For a SOM, you choose the shape and size of a network of clusters to fit the data into. In a SOM, these clusters are called 'nodes'. In GeneLinker™, the nodes are arranged in a rectangular grid whose height and width you choose. As with K-Means clustering, you should choose an initial size based on what you suspect about the number of classes in your data.

Like K-Means, a SOM initially populates its nodes or clusters by randomly sampling the data (or randomly generating points in the data space, depending on the initialization option you choose), and then refines the nodes in a systematic fashion. Unlike K-Means clustering, however, a SOM will not force there to be exactly as many clusters as there are nodes, because it is possible for a node to end up without any associated cluster items when the map is complete. A further difference from K-Means clustering is that the SOM automatically provides some information on the similarity between nodes, i.e., how strongly certain nodes resemble each other.
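The ideas above can be illustrated with a minimal sketch in Python. This is not GeneLinker's implementation: the function names (train_som, best_node, assign), the learning rate, the Gaussian neighbourhood, and the linear decay schedule are all assumptions chosen for illustration. The sketch initializes a height-by-width grid of nodes by randomly sampling the data, then repeatedly pulls the best-matching node and its grid neighbours toward a randomly chosen data item.

```python
import math
import random


def best_node(nodes, x):
    """Return the grid position of the node whose vector is closest to x."""
    return min(nodes, key=lambda k: sum((a - b) ** 2 for a, b in zip(nodes[k], x)))


def train_som(data, height, width, n_iter=500, seed=0):
    """Fit a height x width grid of nodes to data (a list of equal-length vectors)."""
    rng = random.Random(seed)
    # Initialize each node by randomly sampling the data (copied, so data is not modified).
    nodes = {(r, c): list(rng.choice(data))
             for r in range(height) for c in range(width)}
    sigma0 = max(height, width) / 2.0  # assumed initial neighbourhood radius
    for t in range(n_iter):
        frac = t / n_iter
        rate = 0.5 * (1.0 - frac)                # assumed: learning rate decays linearly
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # assumed: radius shrinks, floor of 0.5
        x = rng.choice(data)
        win = best_node(nodes, x)
        for (r, c), w in nodes.items():
            # Nodes near the winner on the grid move more than distant ones.
            grid_d2 = (r - win[0]) ** 2 + (c - win[1]) ** 2
            h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
            for i in range(len(w)):
                w[i] += rate * h * (x[i] - w[i])
    return nodes


def assign(nodes, data):
    """Map each node position to the indices of the data items it wins."""
    members = {k: [] for k in nodes}
    for idx, x in enumerate(data):
        members[best_node(nodes, x)].append(idx)
    return members


if __name__ == "__main__":
    # Toy data: two well-separated groups of 2-D points (illustrative only).
    toy = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [9.9, 10.0], [10.1, 9.8], [10.0, 10.2]]
    som = train_som(toy, height=2, width=3)
    for pos, items in sorted(assign(som, toy).items()):
        print(pos, items)  # some nodes may have an empty list
```

Because neighbouring nodes on the grid are updated together, nodes that sit close to each other tend to end up with similar vectors, which is the source of the node-to-node similarity information mentioned above. The assign step also shows how a node can finish with no items: nothing forces every node to win at least one data point.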

Overview of the Tutorial Data

Golub et al. (1999) reported on a dataset of gene expression patterns from leukemia patients. The problem was to distinguish acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL). They additionally considered the question of whether the cell type (B-cell or T-cell) could be distinguished.

Gene expression levels for 72 patients were measured using Affymetrix™ equipment. This data is available from the website of the Whitehead Institute at MIT. A formatted version of the data is provided with GeneLinker™.

Tutorial Length

This tutorial should take about 30 minutes, depending on how long you spend investigating the data, and how fast your machine is.

If you must stop partway through the tutorial, simply exit the program by selecting Exit from the File menu. The data and experiments you have performed to that point are saved automatically by GeneLinker™. The next time you start GeneLinker™, you can continue with the next step in the tutorial.