Overview of Estimating Missing Values

Overview

Missing (null) values can lead to erroneous conclusions about data. Substituting values for them can, in turn, introduce inaccuracies and inconsistencies. Missing values can degrade discovery results, and errors or skews in the data can propagate through subsequent runs and compound into a larger cumulative error. In addition, most analysis methods cannot be performed on data that contains missing values.

Missing values may prevent proper classification, and a poor substitution scheme may cause classification errors. For example, if every missing value is replaced by the single most likely value, the substituted values are less likely to help define class (cluster) boundaries.

Actions

Two-Step Process for Resolving Missing Values:

1. Remove (filter out) genes that have at least a threshold number of missing values.

Eliminate genes with a high number of missing values, since estimating many missing values may introduce bias into further analysis. The criteria for eliminating genes with missing values may be situation-dependent.
If you set the elimination threshold value to 1, all genes with missing values are removed.

2. Replace the remaining missing values. GeneLinker™ offers three techniques for estimating missing values:

Estimating values by a measure of central tendency;
Estimating missing values by nearest neighbors;
Replacing missing values with an arbitrary value.
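
The two steps above can be sketched in Python with NumPy. This is an illustrative sketch only, not GeneLinker's implementation. It assumes that rows are genes, that missing values are stored as NaN, and that a gene is removed when its count of missing values is equal to or greater than the threshold (consistent with a threshold of 1 removing every gene that has a missing value). All function names are hypothetical. The mean and median stand in for "a measure of central tendency", and the nearest-neighbor variant uses a root-mean-square distance over the columns two genes share, which is one of several reasonable choices.

```python
import numpy as np


def filter_genes(data, threshold):
    """Step 1: keep only genes (rows) with fewer than `threshold` NaNs."""
    missing_per_gene = np.isnan(data).sum(axis=1)
    return data[missing_per_gene < threshold]


def impute_central(data, measure="mean"):
    """Step 2a: fill each NaN with the mean or median of its own row.

    Assumes every row has at least one observed value.
    """
    stat = np.nanmean if measure == "mean" else np.nanmedian
    filled = data.copy()
    row_stat = stat(filled, axis=1)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = row_stat[rows]
    return filled


def impute_knn(data, k=2):
    """Step 2b: fill each NaN with the average of that column over the
    k nearest genes that have the column observed."""
    filled = data.copy()
    observed = ~np.isnan(data)
    for i in np.where(~observed.all(axis=1))[0]:
        # Compare gene i with every gene on the columns both have observed.
        shared = observed & observed[i]
        diff = np.where(shared, data - data[i], 0.0)
        counts = shared.sum(axis=1)
        with np.errstate(invalid="ignore", divide="ignore"):
            dist = np.sqrt((diff ** 2).sum(axis=1) / counts)
        dist[counts == 0] = np.inf  # no shared columns: not comparable
        dist[i] = np.inf            # a gene is not its own neighbor
        for j in np.where(~observed[i])[0]:
            candidates = np.where(observed[:, j] & np.isfinite(dist))[0]
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            if nearest.size:        # leave NaN if no neighbor is usable
                filled[i, j] = data[nearest, j].mean()
    return filled


def impute_constant(data, value=0.0):
    """Step 2c: fill every NaN with an arbitrary constant."""
    return np.where(np.isnan(data), value, data)


# Small made-up example: 4 genes (rows) by 4 samples (columns).
data = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.0, np.nan, 3.0, 4.0],
    [np.nan, np.nan, np.nan, 4.0],
    [2.0, 3.0, 4.0, np.nan],
])

kept = filter_genes(data, threshold=2)   # drops the gene with 3 NaNs
by_mean = impute_central(kept, "mean")
by_knn = impute_knn(kept, k=1)
by_zero = impute_constant(kept, 0.0)
```

Filtering first matters in this sketch: the gene with three missing values would otherwise be estimated almost entirely from other genes, which is the kind of bias the first step is meant to avoid.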