Difference between revisions of "Information Gain"
Revision as of 20:44, 5 April 2011
The following algorithm describes an application of information gain algorithms to character selection. It was adapted by Christian Reitwießner in collaboration with Gregor Hagedorn:
The total number of taxa is "m"; a character "c" has "s" states, and "n_i" is the number of taxa having state "s_i". "u" is the number of taxa for which no information is recorded for the character: missing data or an equivalent coding status value (e.g. in DELTA "U" or "V"; in SDD any coding status with PresenceOfInformation "NotEvaluated", "DoesNotExist", or "Exists", but not "CannotExist").
Each state s_i has an entropy H_i = log_2(u + n_i) and a probability p_i = n_i / (m - u). The mean entropy of the entire character is then H = \sum_i p_i * H_i, summed over all s states, and the Information Gain is IG = log_2(m) - H.
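The formulas above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name information_gain and its parameters are assumptions made here, and the sketch assumes no polymorphism, so that every taxon is counted exactly once and the log_2(m) term uses the sum of all n_i plus u.

```python
import math


def information_gain(state_counts, unknown=0):
    """Information gain of one character, following the formulas above.

    state_counts: n_i, the number of taxa coded with each state s_i
    unknown:      u, the number of taxa with no information recorded
    Assumption: no polymorphism, so the total used in log_2(m)
    is sum(n_i) + u.
    """
    m = sum(state_counts) + unknown
    known = m - unknown
    if known == 0:
        return 0.0  # nothing is coded, so the character separates nothing

    mean_entropy = 0.0
    for n_i in state_counts:
        if n_i == 0:
            continue  # an unused state has p_i = 0 and contributes nothing
        p_i = n_i / known               # p_i = n_i / (m - u)
        h_i = math.log2(unknown + n_i)  # H_i = log_2(u + n_i)
        mean_entropy += p_i * h_i       # H = sum_i p_i * H_i
    return math.log2(m) - mean_entropy  # IG = log_2(m) - H


# Hypothetical character: 4 taxa in one state, 2 in another, 2 unknown.
print(information_gain([4, 2], unknown=2))
```

Taxa without information are added to every state's entropy term because they remain candidates whichever state the user selects.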
Characters coded as "Not applicable"
The treatment of not-applicable (DELTA "-", SDD: coding status term with PresenceOfInformation = "CannotExist") is currently under discussion. G. Hagedorn argues that it contributes to Information Gain and that, unlike other coding status values, it should be treated identically to a state. Example: for the character leaf tip, some taxa have no leaves at all, so leaf tip is not-applicable (documented either through character inapplicability rules / character dependency, or through explicit coding status). Suppose 20 taxa have a round leaf tip, 20 taxa have an acute leaf tip, and 20 are coded not-applicable. If the user selects "round" during identification, only 1/3 of the taxa should remain, because the not-applicable taxa cannot be observed as "round". In contrast, if the last 20 were "unknown/missing data" instead of not-applicable, they would remain potential results.
For this case (60 taxa, 1/3 in each of the two states, 1/3 not-applicable, no polymorphism), the information gain would be log_2(60) - (3 × 1/3 × log_2(20)) = 1.585. If the last third were unknown instead, it would be log_2(60) - (2 × 1/2 × log_2(20+20)) = 0.585.
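Written out step by step with the symbols defined earlier (m, u, n_i, p_i, H_i), the two calculations read as follows. The intermediate steps are filled in here for illustration; only the final expressions and values come from the text above.

```latex
\begin{align*}
\intertext{Not-applicable treated as a third state ($m = 60$, $u = 0$, $n_i = 20$):}
p_i &= \tfrac{20}{60 - 0} = \tfrac{1}{3}, \qquad H_i = \log_2(0 + 20) = \log_2 20 \\
IG  &= \log_2 60 - 3 \cdot \tfrac{1}{3} \cdot \log_2 20
     = \log_2 \tfrac{60}{20} = \log_2 3 \approx 1.585 \\
\intertext{Last third unknown ($m = 60$, $u = 20$, $n_i = 20$):}
p_i &= \tfrac{20}{60 - 20} = \tfrac{1}{2}, \qquad H_i = \log_2(20 + 20) = \log_2 40 \\
IG  &= \log_2 60 - 2 \cdot \tfrac{1}{2} \cdot \log_2 40
     = \log_2 \tfrac{60}{40} = \log_2 1.5 \approx 0.585
\end{align*}
```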
Polymorphisms including coding status
A polymorphism is: "in the present taxon, the present character has state values 1 or 2". Combinations like "in the present taxon, the present character is unknown or has state value 1" initially appear nonsensical, but they do occur in data aggregation: when a character is aggregated to genus level, 2 species may have state 1 while 2 species are unknown.
Possible handling: for the case of not-applicable, keep the polymorphism; for the case of unknown or other coding status values, drop the normal states from further consideration.
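One way to read this possible handling is as a small normalisation step applied to the set of values coded for a taxon before any counting takes place. The sketch below is only an interpretation: the labels "not-applicable" and "unknown", the function name resolve_polymorphism, and the choice to keep a not-applicable entry alongside an unknown one are all assumptions made here, not part of the proposal.

```python
NOT_APPLICABLE = "not-applicable"  # hypothetical label for DELTA "-" / SDD "CannotExist"
UNKNOWN_LIKE = frozenset({"unknown"})  # hypothetical labels for the other coding status values


def resolve_polymorphism(values, unknown_like=UNKNOWN_LIKE):
    """Normalise the set of values coded for one taxon and one character.

    If any unknown-like coding status is present, the normal states are
    dropped and only coding status entries remain. A polymorphism that
    combines states with not-applicable is kept unchanged.
    """
    values = set(values)
    if values & unknown_like:
        # Assumption: a not-applicable entry survives next to "unknown".
        return {v for v in values if v in unknown_like or v == NOT_APPLICABLE}
    return values


# Genus-level aggregation: some species have state "1", others are unknown.
print(resolve_polymorphism({"1", "unknown"}))
# State "1" combined with not-applicable is kept as a polymorphism.
print(resolve_polymorphism({"1", NOT_APPLICABLE}))
```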