Datasets Overview

Overview

GeneLinker™ imports three different kinds of data: expression data, variables, and gene lists. Of these three, only expression data is absolutely essential, which is why it is imported separately from the other two. However, variables and gene lists are very useful if they are available. Please see Variables Overview and Gene Lists Overview for more information.

The basic requirement for all of GeneLinker™’s analysis capabilities is a set of expression values for a number of genes over a number of samples. In GeneLinker™ we refer to this imported data as a root dataset because it lies at the root of a data family, a hierarchy or tree of datasets appearing in the Experiments navigator. (Like many trees in computer programs, these family trees of related datasets grow from the top left to the right and down.)

A root dataset can have any, or none, of the following characteristics associated with it:

Two-Color Data: Data from experiments involving paired dyes (red-green or Cy3-Cy5) can be treated specially by GeneLinker™. Please see Two-Color Data for more information.

Reliability Measures: Each spot or measurement may have associated with it a measure of its reliability or quality. Please see Reliability Measures for more information.

Variables: Each sample in a dataset may have associated with it a variety of phenotypes, experimental factors, treatments or conditions. Please see Variables Overview for more information.

Missing Values: Data may be missing for some genes in some samples, perhaps due to quality control filtering or due to minor version changes between different microarrays. For more information about the handling of missing values, please see Overview of Estimating Missing Values.

There are several mathematical distinctions among expression data which you should be aware of. Here are the most common mathematical classes of data and their significant characteristics.

Abundance Data

Synonyms: Count data, positive abundance data.

Example: Affymetrix data, CodeLink data.

Characteristics: All values are positive (or zero), since this type of data answers the question "How many of <something> are there?" The <something> might be molecules, but more likely it is some instrumental proxy, like phosphor intensity, which must also be non-negative. The histogram of count data for mRNA abundance is usually strongly peaked near the theoretical minimum of zero and tails off to the right.

Problems:

Zero values are theoretically possible (there may be none of a given thing there), but can cause numerical difficulties when doing various things like converting to ratios (division by zero is undefined) or taking logarithms (log zero is also undefined). Since instrumental measurements of very small values are usually unreliable in practice, it is often a good idea to eliminate zeroes in count data and replace them with some small positive value which lies near or below the instrumental detection limit.

Negative values may occur, but are generally symptomatic of a problem which ought to be fixed. For instance, much abundance data is computed by subtracting a background count from a foreground count. If the background exceeds the foreground, a negative value occurs which should be corrected. A common interpretation of this circumstance is "unknown value, probably small".
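
The two clean-up steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GeneLinker™'s own procedure: the function name clean_abundance, the floor parameter (standing in for a small positive value near or below the detection limit), and the choice to represent an unknown value as None are all hypothetical.

```python
import math

def clean_abundance(values, floor, negative_as_missing=True):
    """Return a cleaned copy of a list of abundance values.

    Zeros are replaced by `floor` so that ratios and logarithms are
    defined. Negative values (e.g. background exceeding foreground) are
    treated as unknown (None) or, optionally, also set to `floor`.
    Both choices are assumptions made for this sketch.
    """
    if floor <= 0:
        raise ValueError("floor must be a small positive value")
    cleaned = []
    for v in values:
        if v is None or math.isnan(v):
            cleaned.append(None)          # already missing
        elif v < 0:
            cleaned.append(None if negative_as_missing else floor)
        elif v == 0:
            cleaned.append(floor)         # avoid log(0) and division by zero
        else:
            cleaned.append(v)
    return cleaned

raw = [120.0, 0.0, -3.5, 48.2]
cleaned = clean_abundance(raw, floor=1.0)

# Logarithms are now defined for every non-missing value.
log2_values = [None if v is None else math.log2(v) for v in cleaned]
```

Marking negatives as missing keeps the "unknown" part of the interpretation explicit and leaves the decision about a replacement value to a later missing-value estimation step. Setting negative_as_missing=False instead follows the "probably small" part and substitutes the floor directly.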

Ratio Data

Example: Data from two-color experiments. GenePix, Genomic Solutions, Quantarray, ScanArray data.

Characteristics: All values are (theoretically) positive. Ratios are always defined with respect to some baseline or control sample. The histogram for mRNA ratios typically looks a lot like an abundance histogram, strongly tailed to the right. If the data were not too noisy and you could zoom in very tightly you might see that the histogram is peaked at 1.0 instead of near 0.

Data described as Two-Color Data by GeneLinker™ is displayed and processed as ratio data. All Two-Color Data is ratio data, but not all ratio data is Two-Color Data.

Problems:

Ratio data can have negative values just like abundance data, most frequently because they are derived from abundances which have the background-subtraction problems described above. Zeros can also occur, and infinities as well if a zero happens to occur in the denominator (control sample) of a given treatment/control pair.

Related to the problem of zeros and infinities is the problem of large unreliable values. If the control value for a given sample is not actually zero, but nonetheless very small and unreliable, then the ratio may be deceptively large, and even more unreliable than the control value itself. It is extremely difficult to diagnose this problem when one only has the ratios to work with, so users are advised to watch for it during data generation and upstream data processing.
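
Because these problems are hard to see once only the ratios remain, one option is to flag them upstream, while the treatment and control abundances are still available. The following Python sketch shows the idea. The function name safe_ratios and the min_control threshold are hypothetical, and this is not a description of how GeneLinker™ or any scanner software computes ratios.

```python
def safe_ratios(treatment, control, min_control):
    """Compute treatment/control ratios with a reliability flag.

    Returns a list of (ratio, reliable) pairs. The ratio is None when
    the control is zero or negative, or the treatment is negative, since
    the quotient would be infinite or meaningless. `reliable` is False
    whenever the control is below `min_control`, because a very small
    denominator can inflate the ratio.
    """
    if len(treatment) != len(control):
        raise ValueError("treatment and control must have the same length")
    results = []
    for t, c in zip(treatment, control):
        if c <= 0 or t < 0:
            results.append((None, False))        # undefined ratio
        else:
            results.append((t / c, c >= min_control))
    return results

treatment = [200.0, 50.0, 30.0, 10.0]
control = [100.0, 0.0, 0.5, -2.0]
pairs = safe_ratios(treatment, control, min_control=5.0)
```

In this example the third pair yields a ratio of 60, which looks like strong induction but is flagged as unreliable because its control value of 0.5 is below the assumed threshold of 5.0.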