Information Gain

From Species-ID

The following algorithm describes an application of information gain algorithms to character selection. It was adapted by Christian Reitwießner in collaboration with Gregor Hagedorn:

Each character "c" has a set of states s(c) = {s_1, s_2, s_3, ...}. Let "m" be the total number of taxa, "n_i" the number of taxa having state "s_i", and "u" the number of taxa for which no information on the character is recorded (missing data or an equivalent coding status value, e.g. "U" or "V" in DELTA, or in SDD any coding status with PresenceOfInformation "NotEvaluated", "DoesNotExist", or "Exists", but not "CannotExist").

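For each character c, each state s_i has an entropy H_i = log_2(u + n_i) and a probability p_i = n_i / (m - u). Here u + n_i is the number of taxa that remain possible after state s_i is selected, because taxa without recorded information cannot be excluded. The mean entropy of the entire character is then H(c) = \sum_i p_i * H_i, and the information gain is IG(c) = log_2(m) - H(c).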
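The formulas above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `information_gain`, its parameters, and the choice to return 0 when no taxon has recorded information are assumptions made here. It also assumes no polymorphism, so that the state counts sum to m - u.

```python
from math import log2


def information_gain(state_counts, total_taxa, unknown=0):
    """Sketch of IG(c) = log_2(m) - sum_i p_i * H_i.

    state_counts maps each state s_i to n_i (hypothetical input format),
    total_taxa is m, unknown is u. Assumes no polymorphism.
    """
    m, u = total_taxa, unknown
    if m <= 0 or not 0 <= u <= m:
        raise ValueError("need m > 0 and 0 <= u <= m")

    known = m - u
    if known == 0:
        # Assumption: a character with no recorded information gains nothing.
        return 0.0

    mean_entropy = 0.0
    for n_i in state_counts.values():
        if n_i == 0:
            continue  # p_i = 0, so the state contributes nothing
        p_i = n_i / known      # p_i = n_i / (m - u)
        h_i = log2(u + n_i)    # H_i = log_2(u + n_i)
        mean_entropy += p_i * h_i

    return log2(m) - mean_entropy
```

A character for which every taxon has its own state and nothing is unknown yields the maximum gain log_2(m), while a character with a single state shared by all taxa yields 0.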

Characters coded as "Not applicable"

The treatment of not-applicable (DELTA "-", SDD: coding status term with PresenceOfInformation = "CannotExist") is currently under discussion. G. Hagedorn argues that it contributes to information gain and that, unlike other coding status values, it should be treated identically to a state. Example: for the character leaf tip, some taxa have no leaves at all, so leaf tip is not-applicable (documented either through character inapplicability rules / character dependency, or through an explicit coding status). 20 taxa have a round leaf tip, 20 taxa have an acute leaf tip, and 20 are coded not-applicable. If the user selects "round" during identification, only 1/3 of the taxa should remain, because the taxa coded not-applicable cannot be observed as "round". In contrast, if the last 20 were "unknown/missing data" instead of not-applicable, they would remain potential results.
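The filtering behaviour in the leaf tip example can be sketched as below. The coding labels, the taxon names, and the helper functions are hypothetical and serve only to reproduce the 20/20/20 example; the rule shown is the one argued for above (not-applicable excludes a taxon, unknown does not).

```python
UNKNOWN = "unknown"                # hypothetical label for missing data
NOT_APPLICABLE = "not-applicable"  # hypothetical label for DELTA "-"


def remaining_taxa(codings, selected_state):
    """Return the taxa that are still possible after a state is selected.

    A taxon stays if it has the selected state or if nothing is known
    about the character; not-applicable taxa are excluded like any
    other non-matching state.
    """
    return [taxon for taxon, value in codings.items()
            if value == selected_state or value == UNKNOWN]


def leaf_tip_example(last_third):
    """Build 60 taxa: 20 round, 20 acute, 20 coded as `last_third`."""
    values = ("round", "acute", last_third)
    return {f"taxon_{i:02d}": values[i // 20] for i in range(60)}


print(len(remaining_taxa(leaf_tip_example(NOT_APPLICABLE), "round")))  # 20
print(len(remaining_taxa(leaf_tip_example(UNKNOWN), "round")))         # 40
```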

For the case of 60 taxa (1/3 in each of two states, 1/3 not-applicable, no polymorphism), the information gain with not-applicable treated as a state would be: log_2(60) - (3 × 1/3 × log_2(20)) = 1.585. If the last third were unknown instead, it would be: log_2(60) - (2 × 1/2 × log_2(20+20)) = 0.585.
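Written out step by step with the symbols defined earlier (m, u, n_i, p_i, H_i), the two results can be checked as follows. The block is only a restatement of the arithmetic, rounded to three decimals.

```latex
% Case 1: not-applicable treated as a third state (m = 60, u = 0, n_i = 20)
p_i = \frac{20}{60 - 0} = \frac{1}{3}, \qquad
H_i = \log_2(0 + 20) \approx 4.322
\\
H(c) = 3 \cdot \frac{1}{3} \cdot \log_2 20 \approx 4.322, \qquad
IG(c) = \log_2 60 - \log_2 20 = \log_2 3 \approx 1.585

% Case 2: last third unknown (m = 60, u = 20, n_i = 20 for two states)
p_i = \frac{20}{60 - 20} = \frac{1}{2}, \qquad
H_i = \log_2(20 + 20) \approx 5.322
\\
H(c) = 2 \cdot \frac{1}{2} \cdot \log_2 40 \approx 5.322, \qquad
IG(c) = \log_2 60 - \log_2 40 = \log_2 1.5 \approx 0.585
```

The difference of exactly one bit reflects that selecting a state leaves 20 candidate taxa in the first case but 40 in the second.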

Polymorphisms including coding status

A polymorphism is: "in the present taxon, the present character has state values 1 or 2". Combinations like "in the present taxon, the present character is unknown or has state value 1" initially appear nonsensical, but they do occur in data aggregation: when a character is aggregated to genus level, 2 species may have state 1 while 2 species are unknown.

Possible handling: For the case of not-applicable, keep the polymorphism. For the case of unknown or other coding status values, drop the normal states from further consideration.
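One way to read this rule is as a normalization step applied to each taxon's set of codings before counting. The sketch below is an interpretation, not a specification: the status labels and the function `normalize_coding` are hypothetical, and the behaviour when not-applicable and another coding status occur together (keep both, drop only the normal states) is an assumption the text does not settle.

```python
NOT_APPLICABLE = "not-applicable"
# Hypothetical labels; real data would use DELTA or SDD coding status values.
STATUS_VALUES = {"unknown", "not-evaluated", NOT_APPLICABLE}


def normalize_coding(values):
    """Apply the possible handling to one taxon's codings for a character.

    - Only normal states, or normal states plus not-applicable:
      keep the polymorphism unchanged.
    - Any other coding status present (e.g. unknown): drop the normal
      states and keep only the coding status values.
    """
    values = set(values)
    statuses = values & STATUS_VALUES
    if statuses - {NOT_APPLICABLE}:
        return statuses  # normal states dropped from further consideration
    return values


print(normalize_coding({"1", "2"}))             # polymorphism kept
print(normalize_coding({"1", NOT_APPLICABLE}))  # polymorphism kept
print(normalize_coding({"1", "unknown"}))       # {'unknown'}
```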