Difference between revisions of "Information Gain"

From Species-ID

Revision as of 20:44, 5 April 2011

The following algorithm describes an application of information gain algorithms to character selection. It was adapted by Christian Reitwießner in collaboration with Gregor Hagedorn:

The total number of taxa is "m"; a character "c" has "s" states, and "n_i" is the number of taxa having state "s_i". "u" is the number of taxa for which no information for a character is recorded: missing data or an equivalent coding status value (e.g. in DELTA "U" or "V"; in SDD any coding status with PresenceOfInformation "NotEvaluated", "DoesNotExist", or "Exists", but not "CannotExist").

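Each state s_i has an entropy H_i = log_2(u + n_i) and a probability p_i = n_i / (m - u). The u taxa without information are added to n_i because they remain candidates whichever state is selected. The mean entropy of the entire character is then H = \sum_i p_i * H_i, and the Information Gain is IG = log_2(m) - H.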
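The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `information_gain` and its parameters are hypothetical, m is derived as the sum of all n_i plus u (which assumes no polymorphism), and returning 0 when no taxon has a recorded state is an assumption made here.

```python
from math import log2

def information_gain(state_counts, unknown=0):
    """Information Gain of one character.

    state_counts: number of taxa per state (the n_i)
    unknown:      number of taxa without information (u)
    """
    # Assumption: no polymorphism, so every taxon is counted exactly once.
    m = sum(state_counts) + unknown
    known = m - unknown
    if known == 0:
        # Assumption: a character with no recorded states separates nothing.
        return 0.0
    # H = sum_i p_i * H_i with p_i = n_i / (m - u) and H_i = log_2(u + n_i)
    mean_entropy = sum(
        (n / known) * log2(unknown + n) for n in state_counts if n > 0
    )
    return log2(m) - mean_entropy

# Three equally frequent states among 60 taxa, nothing unknown:
print(round(information_gain([20, 20, 20]), 3))            # 1.585
# Two states with 20 taxa each, 20 taxa unknown:
print(round(information_gain([20, 20], unknown=20), 3))    # 0.585
```

A character that splits the taxa evenly and has few unknowns scores highest, which is why such characters would be offered first during identification.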

Characters coded as "Not applicable"

The treatment of not-applicable (DELTA "-", SDD: coding status term with PresenceOfInformation = "CannotExist") is currently under discussion. G. Hagedorn argues that it contributes to Information Gain and that, unlike other coding status values, it should be treated identically to a state. Example: for the character leaf tip, some taxa have no leaves at all, so leaf tip is not-applicable (documented either through character inapplicability rules / character dependency, or through explicit coding status). 20 taxa have a round leaf tip, 20 taxa have an acute leaf tip, and 20 are coded not-applicable. If the user selects "round" during identification, only 1/3 of the taxa should remain, because taxa coded not-applicable cannot be observed as "round". In contrast, if the last 20 were "unknown/missing data" instead of not-applicable, they would remain potential results.

For this case (60 taxa, 1/3 in each of the two states, 1/3 not-applicable, no polymorphism), the information gain would be: log_2(60) - (3 × 1/3 × log_2(20)) = 1.585. With the last third unknown instead, it would be: log_2(60) - (2 × 1/2 × log_2(20+20)) = 0.585.
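Written out step by step with the definitions from the algorithm above (m = 60), the two results can be checked as follows. This adds no new method, it only spells out the arithmetic.

```latex
\begin{align*}
\text{not-applicable treated as a state } (u = 0,\ n_i = 20,\ p_i = \tfrac{20}{60}):\quad
IG &= \log_2 60 - 3 \cdot \tfrac{1}{3} \cdot \log_2 20
    = \log_2 \tfrac{60}{20} = \log_2 3 \approx 1.585 \\[4pt]
\text{last third unknown } (u = 20,\ n_i = 20,\ p_i = \tfrac{20}{40}):\quad
IG &= \log_2 60 - 2 \cdot \tfrac{1}{2} \cdot \log_2 (20 + 20)
    = \log_2 \tfrac{60}{40} = \log_2 1.5 \approx 0.585
\end{align*}
```

The difference of exactly one bit reflects that the 20 taxa without information can never be excluded by this character.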

Polymorphisms including coding status

A polymorphism is: "in the present taxon, the present character has state values 1 or 2". Combinations like "in the present taxon, the present character is unknown or has state value 1" initially appear nonsensical, but they do occur in data aggregation: when a character is aggregated to genus level, 2 species may have state 1 while 2 species are unknown.

Possible handling: 1. For the case of not-applicable, keep the polymorphism. For the case of unknown or other coding status values: drop the normal states from further consideration.
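The handling proposed above could be sketched as a small normalization step applied to the set of values coded for one taxon and one character. This is only an illustration under stated assumptions: the function name `resolve_coding` is hypothetical, values are represented as strings, and the DELTA markers "-" (not-applicable) and "U"/"V" (no information) stand in for the corresponding coding status values.

```python
NOT_APPLICABLE = "-"           # DELTA marker for not-applicable
NO_INFORMATION = {"U", "V"}    # DELTA markers for unknown / missing data

def resolve_coding(values):
    """Normalize the values coded for one taxon and one character.

    - Not-applicable mixed with normal states: keep the polymorphism.
    - Unknown (or similar status) mixed with normal states: drop the
      normal states, so the taxon counts as having no information.
    """
    values = set(values)
    status = values & NO_INFORMATION
    if status:
        return status
    return values

print(sorted(resolve_coding({"1", "U"})))   # ['U']
print(sorted(resolve_coding({"1", "-"})))   # ['-', '1']
print(sorted(resolve_coding({"1", "2"})))   # ['1', '2']
```

How a retained polymorphism then enters the counts n_i is not specified above, so the sketch stops at normalization.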