Agreement on categorical variables

Kappa is also used to compare performance in machine learning, but the directional version, known as informedness or Youden's J statistic, is argued to be more appropriate for supervised learning.[20] A sketch contrasting the two appears after the definition below.

In the example below, we calculate the agreement between the first three raters:

If statistical significance is not a useful guide, what magnitude of kappa reflects adequate agreement? Guidelines would be helpful, but factors other than agreement can influence its magnitude, which makes interpretation of a given magnitude problematic. As Sim and Wright noted, two important factors are prevalence (are the codes equiprobable, or do their probabilities vary?) and bias (are the marginal probabilities similar or different for the two observers?). Other things being equal, kappas are higher when codes are equiprobable. On the other hand, kappas are higher when codes are distributed asymmetrically by the two observers. In contrast to the effect of probability variations, the effect of bias is greater when kappa is small than when it is large.[11]:261-262

Readers are referred to the following papers, which contain measures of agreement:

κ = (p_o − p_e) / (1 − p_e), where p_o is the relative observed agreement among raters (identical to accuracy) and p_e is the hypothetical probability of chance agreement, using the observed data to calculate the probability of each observer randomly seeing each category. If the raters are in complete agreement, then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as given by p_e), then κ = 0. The statistic can also be negative,[6] which implies that there is no effective agreement between the two raters or that the agreement is worse than chance.
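The computation of p_o, p_e, and κ described above can be illustrated with a minimal sketch. The two-rater labels below are made-up illustration data (not the article's three-rater example), and scikit-learn's cohen_kappa_score is used only as a cross-check.

```python
# Minimal sketch of the kappa computation described above.
# The rater labels are invented illustration data.
import numpy as np
from sklearn.metrics import cohen_kappa_score  # cross-check only

rater_a = np.array(["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"])
rater_b = np.array(["yes", "no",  "no", "yes", "no", "yes", "yes", "no", "yes", "no"])

categories = np.union1d(rater_a, rater_b)

# p_o: relative observed agreement between the raters (identical to accuracy).
p_o = np.mean(rater_a == rater_b)

# p_e: hypothetical probability of chance agreement, computed from each
# rater's marginal probabilities for every category.
p_e = sum(np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories)

kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o = {p_o:.3f}, p_e = {p_e:.3f}, kappa = {kappa:.3f}")
print("sklearn cross-check:", cohen_kappa_score(rater_a, rater_b))
```

With these toy labels, p_o = 0.7 and p_e = 0.5, giving κ = 0.4, which matches the cross-check.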
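For the machine-learning comparison mentioned at the start of this section, the following is a hedged sketch of how the directional informedness (Youden's J) differs in construction from the symmetric kappa. The y_true and y_pred arrays are toy labels chosen for illustration, not results from any cited experiment.

```python
# Sketch contrasting kappa with informedness (Youden's J) for a binary task.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

# Informedness treats y_true as the reference, so it is directional;
# kappa is symmetric in the two label sets.
youdens_j = sensitivity + specificity - 1

print("kappa       :", cohen_kappa_score(y_true, y_pred))
print("informedness:", youdens_j)
```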
