Inference and aggregation algorithms

From Species-ID

Information from descriptions can be aggregated along various dimensions: the taxonomic assignment and hierarchy, but also various scopes (sex, geographic origin, published sources, etc.).

Within the taxonomic hierarchy, two directions can be distinguished:

1. Deduction (Wikipedia), in which information from higher taxa (i.e. more general classes) is transferred to lower taxa for which no specific information is available.

  • Example: The Tracheophyta (vascular plants) may be annotated as having "usually green leaves". If the leaf color of a given species is not specifically mentioned, it can be deduced (inferred by deduction) that it is "probably green".
  • Note: this is slightly different from inferring that, in the absence of specific information, the leaf is certainly green. Although it is likely in this example that the leaf color of every species with non-green leaves is recorded at the specific level (at the lower taxon), the inference cannot be guaranteed. It may be supported by additional information about default states for a character (DELTA: IMPLICIT VALUES).

2. Induction (Wikipedia) or generalization, in which information from lower taxa (i.e. less general classes) is transferred to descriptions of higher taxa.

  • Example: a missing genus description is automatically created from multiple existing species descriptions. The assumption is that the available species descriptions are either complete, or sufficiently representative.
  • For the purpose of specific searches, it is important not to simply aggregate all information on the character level. Example by Régine: "A simple generalization can include too many combinations of character states. For example:
    - species 1 of mushroom has white cap and brown stem
    - species 2 of mushroom has brown cap and white stem

With a simple aggregation-type generalization, the genus description will be the Cartesian product, i.e. “white or brown cap and white or brown stem”, if no additional character excludes “cap and stem with the same colour”. In this case, even if the species descriptions are perfectly distinct, the descriptions of the genera may overlap."

    • The problem above can be addressed by maintaining the character-state relations of the generalized description in multiple descriptions ("description containers"). For the purpose of natural language descriptions, these could then finally be (over-)generalized, whereas for identification purposes the description containers may provide "brackets" around the actual combinations. For a simple model of characters and states, no separate containers are needed: identification may even directly use the original species descriptions (as almost all multi-access identification tools do to avoid over-generalization). However, in the presence of frequency or other modifiers and state annotations, the creation of new description containers may be desirable.
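The over-generalization in the mushroom example, and the effect of keeping description containers, can be sketched in a few lines of Python. This is only an illustration under simple assumptions (one state per character, no modifiers or frequencies); the function names and the data layout are hypothetical and not part of any existing tool.

```python
from itertools import product

# Hypothetical species descriptions from the mushroom example: character -> state
species = {
    "species 1": {"cap": "white", "stem": "brown"},
    "species 2": {"cap": "brown", "stem": "white"},
}

def generalize_per_character(descriptions):
    """Naive aggregation: union of states, character by character."""
    genus = {}
    for desc in descriptions:
        for character, state in desc.items():
            genus.setdefault(character, set()).add(state)
    return genus

def implied_combinations(genus):
    """All state combinations that a per-character description admits."""
    characters = sorted(genus)
    return {
        tuple(zip(characters, states))
        for states in product(*(sorted(genus[c]) for c in characters))
    }

def container_combinations(descriptions):
    """Description container: keep only the combinations actually observed."""
    return {tuple(sorted(desc.items())) for desc in descriptions}

genus = generalize_per_character(species.values())
naive = implied_combinations(genus)                  # Cartesian product
observed = container_combinations(species.values())  # "brackets" around real data
spurious = naive - observed                          # combinations no species shows
```

Here `naive` admits four cap/stem combinations while `observed` holds only the two that occur; `spurious` contains the "cap and stem with the same colour" cases that would let genus descriptions overlap.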

For the purpose of discussion it is important to use clear terminology. Proposal:

  • The term inference (Wikipedia) is restricted to "inferences" in the sense of machine reasoning as on the semantic web. The Wikipedia definition restricts inference to deductive reasoning, excluding inductive reasoning. THIS NEEDS FURTHER CLARIFICATION - will use inference/aggregation below to keep issue open.
    • The term inference (Wikipedia) covers deductive, inductive and abductive reasoning. (Régine)
  • Should the term aggregation be used as a general term, avoiding the use of "inference"? THIS NEEDS FURTHER CLARIFICATION
    • Inference is the right term. Aggregation is related to a bottom-up process of putting information together. Inference refers to reasoning (bottom-up or top-down). (Régine)
  • The terms inductive and deductive are used to indicate the direction of inference/aggregation.
    • Yes, but there is also abductive reasoning (and the identification process is an abductive reasoning). (Régine)
  • In addition, inference/aggregation may be based on rules about characters that can be calculated (e.g. a length/width ratio).

The process of automated inference/aggregation should be separated from the question of information aggregation in general. Descriptive information is often available only in pre-aggregated or pre-calculated form:

  • A genus description may be entered by humans and is then not accessible to "algorithmic updating".
  • A length/width ratio or a mean value may be available, while the length, width, or original sample data are not.

As a result, descriptive information may in part be manually entered, and in part be automatically aggregated or calculated. In cases where inferred/aggregated information needs to be stored alongside original information (e.g. for downstream processing, like natural language output), it is desirable to use standard mechanisms (descriptions, characters, states, modifiers, notes) which differ only in an annotation.
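A minimal sketch of this idea, assuming a flat list of numeric measurements: the calculated length/width ratio is stored with exactly the same structure as the entered values and differs only in its origin annotation. The `Measurement` class, the `add_ratio` helper, and the labels "original" and "calculated" are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Measurement:
    character: str
    value: float
    origin: str  # hypothetical annotation: "original" or "calculated"

def add_ratio(description, numerator, denominator, name):
    """Return a copy of the description plus a calculated ratio, if possible."""
    by_char = {m.character: m for m in description}
    if numerator not in by_char or denominator not in by_char:
        return list(description)  # inputs unavailable: nothing to calculate
    if by_char[denominator].value == 0:
        return list(description)  # avoid division by zero
    ratio = by_char[numerator].value / by_char[denominator].value
    return list(description) + [Measurement(name, ratio, "calculated")]

leaf = [
    Measurement("leaf length", 12.0, "original"),
    Measurement("leaf width", 4.0, "original"),
]
leaf_with_ratio = add_ratio(
    leaf, "leaf length", "leaf width", "leaf length/width ratio"
)

# A ratio that was entered directly, without its inputs, stays "original".
ratio_only = [Measurement("leaf length/width ratio", 3.0, "original")]
unchanged = add_ratio(
    ratio_only, "leaf length", "leaf width", "leaf length/width ratio"
)
```

Downstream code, such as natural language output, can treat all entries alike and consult `origin` only when it matters. The origin labels here are placeholders for whatever annotation vocabulary is adopted.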

For example, SDD provides the following vocabulary for data origin ("DataOriginEnum"):

  • OriginalData: The data are directly entered by a machine or human agent. These are the original data all other cached data (Origin unequal 'OriginalData') are based upon.
  • Calculated: The data are calculated from other data using a calculation rule. Examples: a ratio calculated from other characters, a mean calculated from a sample that is available under SampleData/Sample (if a mean is calculated from data no longer available, it would be recorded as 'OriginalData').
  • Mapped: The data are calculated from other data based on a mapping definition (either from numeric to categorical, or from fine-grained categorical to coarse-grained categorical).
  • Aggregated: The data are derived from data in classes placed below the current class in the class hierarchy. This applies both to aggregating data from objects to classes and to generalizing lower classes to higher classes. Note: BioLink calls this 'Compile from below'.
  • Inherited: The data are derived from data in classes placed above the current class in the class hierarchy.
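As a rough illustration of how these origin values could tag the two directions of inference, the following Python sketch mirrors the five terms in an enumeration and applies 'Inherited' and 'Aggregated' on a toy hierarchy. Only the five string values come from the SDD vocabulary; the `DataOrigin` class, the `PARENT` table, the taxa, and the `inherit` and `aggregate` helpers are assumptions made for this example and do not represent the SDD schema.

```python
from enum import Enum

class DataOrigin(Enum):
    ORIGINAL_DATA = "OriginalData"
    CALCULATED = "Calculated"
    MAPPED = "Mapped"
    AGGREGATED = "Aggregated"
    INHERITED = "Inherited"

# Hypothetical hierarchy: child -> parent
PARENT = {"Species A": "Genus X", "Species B": "Genus X", "Genus X": "Tracheophyta"}

# Hypothetical data: taxon -> character -> (set of states, origin)
descriptions = {
    "Tracheophyta": {"leaf color": ({"green"}, DataOrigin.ORIGINAL_DATA)},
    "Species A": {"flower color": ({"red"}, DataOrigin.ORIGINAL_DATA)},
    "Species B": {
        "flower color": ({"yellow"}, DataOrigin.ORIGINAL_DATA),
        "leaf color": ({"purple"}, DataOrigin.ORIGINAL_DATA),
    },
}

def inherit(taxon, character):
    """Deduction: use own data, else the nearest ancestor that records it."""
    own = descriptions.get(taxon, {}).get(character)
    if own is not None:
        return own
    ancestor = PARENT.get(taxon)
    while ancestor is not None:
        found = descriptions.get(ancestor, {}).get(character)
        if found is not None:
            return (set(found[0]), DataOrigin.INHERITED)
        ancestor = PARENT.get(ancestor)
    return None  # no information anywhere up the hierarchy

def aggregate(taxon, character):
    """Induction: union the states recorded for the direct children."""
    states = set()
    for child, parent in PARENT.items():
        if parent == taxon:
            entry = descriptions.get(child, {}).get(character)
            if entry is not None:
                states |= entry[0]
    return (states, DataOrigin.AGGREGATED) if states else None

leaf_a = inherit("Species A", "leaf color")        # taken from Tracheophyta
leaf_b = inherit("Species B", "leaf color")        # own record wins
flowers_x = aggregate("Genus X", "flower color")   # compiled from below
```

Because the inherited leaf color of "Species A" carries the 'Inherited' tag, a consumer can still present it as "probably green" instead of treating it as an observation made on that species.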

Please criticize/discuss the vocabulary above - it may well be insufficient.

Related old SDD discussion document: