León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                387                           Science of Gymnastics Journal 
 
AN EFFICIENT METHOD TO EVALUATE AND ENHANCE 
SPORT JUDGES’ PERFORMANCE DURING COMPETITION: 
A CASE STUDY IN ACROBATIC GYMNASTICS 
 
 
Juan Antonio León-Prados1 & Carmen Ramos2 
 
 
1 Physical Performance & Sport Research Center. Pablo de Olavide University, Seville, 
Spain.  
2 Statistic and Operational Research Department. Social and Communication Sciences 
Faculty, Cadiz University, Cadiz, Spain  
 
 
Original article                                                                           DOI: 10.52165/sgj.16.3.387-400 
 Abstract 
Who judges the gymnastics judges? How do we measure their accuracy and concordance? How 
do we know if the judging process is fair? The superior jury has this responsibility. However, they 
normally lack time to provide effective feedback during competitions. 
Using macro functions to process statistical and mathematical statements, we designed and 
validated an automated Excel-based tool called the Automatic Acrobatic Gymnastics Judges 
Individual Report Tool to evaluate judges’ performances quickly and easily during competition, 
automatically creating and exporting individual reports showing each judge’s accuracy and 
concordance performance on a daily basis, rather than after the competition is ended. 
We present empirical data for 76 experienced international judges evaluating acrobatic 
gymnastics routines in four major official events. A total of 1240 individual reports were analyzed 
and sent confidentially to the judges during the competition, and 952 were analyzed to evaluate 
whether this feedback was effective in improving judges’ performance during the competition. 
The tool provides efficient and easily understood evidence-based feedback on acrobatic 
gymnastics judges’ performance during competition, quickly and automatically creating, 
analyzing and sending individualized information to judges, thus helping with specific Technical 
Committee scoring control tasks during competitions. We suggest that judges' performances 
remain high or are enhanced after receiving daily evaluation during major competition events. 
 
Keywords: Excel-based tool, Acrobatic Gymnastic, Judges, Evaluation, ACROAJIR® 
 
INTRODUCTION 
 
Judges are often criticized and 
sometimes undervalued in sports. In 
subjectively assessed sports, judges may 
collude, giving higher scores to their own 
athletes and lower scores to others. To 
prevent this, federations implement various 
strategies, such as automatically eliminating 
the highest and lowest scores or involving a 
referee judge (Gambarelli et al., 2012). 
Gymnastics judges must observe and assess 
the quality of performances, often 
processing large amounts of information 
(Dosseville et al., 2014). Their scores can be 
influenced by factors such as their viewing 
position (Dallas et al., 2011; Plessner & 
Schallies, 2005), serial position bias 
(Plessner & Schallies, 2005; Fasold et al., 
2012; De Bruin, 2005), conformity bias 
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                388                           Science of Gymnastics Journal 
(Auweele et al., 2004; Boen et al., 2006, 
2008, 2013), or the performance of the 
preceding gymnast (Damisch et al., 2006; 
Kramer, 2017). 
Knowledge, experience, and 
psychological factors such as attention, 
emotion recognition, and possible 
interventions may reduce judges' biases or 
stress, helping to avoid scoring mistakes 
(Flessas et al., 2015; Ste-Marie, 2000; Van 
Bokhorst et al., 2016). These factors can 
influence the outcome in sports where 
scoring and ranking depend on subjective 
evaluations. Gymnastics judges' 
performance can vary widely, but who 
judges the judges? Typically, other judges 
form a superior jury, whose evaluations 
occur post-competition. Some research has 
examined judges’ overall performance in 
terms of reliability or concordance after 
events (Bučar et al., 2012; León-Prados & 
Jemni, 2022; Leskošek et al., 2018; Mercier 
& Heiniger, 2018; Premelč et al., 2019); 
however, none has focused on judges’ work 
during competitions on a day-to-day basis. 
Such evaluation requires careful 
monitoring of judges' accuracy and 
concordance, which demands significant 
time and effort at the end of each 
competition day. In real events, this can be 
challenging, as judges are often fatigued, 
and statistical and mathematical expertise is 
needed for these evaluations. 
The FIG, in collaboration with 
Longines and the Université de Neuchâtel, 
designed and implemented the Judge 
Evaluation Program (JEP) for five 
gymnastics disciplines: Artistic, Acrobatic, 
Aerobic, Rhythmic Gymnastics, and 
Trampoline. This program analyzed the 
marks given by execution judges at 
international competitions during the 2013–
2016 Olympic cycle (Heiniger & Mercier, 
2021; Heiniger & Mercier, 2018; Mercier & 
Heiniger, 2018; Mercier & Klahn, 2017). 
The authors claimed that the JEP helps to 
ensure judges’ objectivity during gymnastics 
competitions, allowing for post-competition 
analysis and an overall evaluation of judges 
by the respective Technical Committees 
(TCs). This post-competition control can be 
applied in competitions where the use of 
IRCOS (Instant Replay & Control System) 
is mandatory (FIG, 2020). Judges’ scores 
must demonstrate accuracy, precision, 
consistency, and the absence of bias. The 
JEP evaluates gymnastics judges’ 
performance compared to their peers, 
distinguishing between erratic and precise 
judges and detecting potential cheating or 
unintentional misjudging. 
Since its inception in 2006, the JEP has 
evolved iteratively, although earlier versions 
were criticized for using unsound and 
inaccurate mathematical tools that didn’t 
always evaluate what was intended. 
However, a new core statistical engine 
introduced in the 2013–2016 Olympic cycle 
provided more reliable feedback to judges 
and executive committees (Mercier & 
Heiniger, 2018). The FIG typically derives 
control scores using external judging panels 
and post-competition video reviews. This 
post-competition control establishes expert 
scores (considered “true scores") against 
which judges’ scores are compared, ensuring 
evaluation on the fairest possible basis. 
Expert scores are provided by TC members, 
who individually assess each exercise (FIG, 
2015). 
However, previous studies have not 
clarified whether judges received their 
individual results after competitions or if 
they were given specific feedback on their 
performance during each session within 
competition days. The FIG has encouraged 
continental committees and national 
federations to adopt a similar system for 
their own events, which inspired our 
research (FIG, 2020). 
How could we obtain this type of 
information about judges’ performance in 
Acrobatic Gymnastics (ACRO) during real 
competitions? Could rapid daily feedback 
improve judges' performance and lead to 
fairer, more precise judging? Currently, in 
the absence of more objective feedback, 
judges rely on the only available in-
competition feedback—the final trimmed 
mean execution and artistic score displayed 
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                389                           Science of Gymnastics Journal 
on the scoreboard. The scores from the in-
competition control panel remain unknown 
to the judges, even after the competition 
ends. 
This study developed and implemented 
the Automatic Acrobatic Gymnastics Judges 
Individual Report Tool (ACROAJIR®) 
(Leon-Prados & Rosales, 2019) as a 
pedagogical tool to evaluate ACRO judges' 
performance in real-time, providing 
objective feedback on their work and 
potential judging impacts. 
 
METHODS 
 
This study had two goals: a) The 
ACROAJIR® design; and b) its practical 
application with judges in real events. 
 
The ACROAJIR® Design 
 
All the official Execution (E) and 
Artistic (A) scores were collected 
confidentially and were provided by 
SmartScoring, the European Gymnastics 
exclusive results service provider (Bakú, 
Azerbaijan). 
 
Control scores validity. Looking for the true 
score 
In practice, true performance level is 
unknown and we must work with 
approximations. In our study, we assumed 
that the highest category judges in the 
Superior Jury who provide the Technical or 
Artistic Control Scores (E/A C-Score), 
represented the “truer score” when they 
judged a competitive routine, compared to 
lower-level judges. 
We proposed a model with two key 
considerations: 1) the Superior Jury’s scores 
are considered more representative of the 
performance, and individual judges’ 
deviations from the overall judging panel 
define their performance level; and 2) the 
model is based on the pre-defined tolerances 
established by the FIG for judges’ reference 
(FIG, 2017). If the scores for a routine fall 
outside this pre-defined deviation among the 
control judges, they must re-judge the 
routine using video recordings. This process 
could yield scores closer to the "true score" 
and provide better feedback on judges’ 
performance. Control scores can only be 
adjusted if the deviation between scores 
exceeds the allowed tolerance. 
The "true score" is determined by the 
E/A C-Score, averaging three E/A C-scores: 
two expert judges’ scores from the Superior 
Jury, plus the Chair Judge’s score from the 
judging panel. All three expert scores for 
each E- and A-C score must fall within the 
allowed deviation. To ensure this, we used 
the coefficient of variation (CV), where CV 
= (Standard Deviation / Mean) * 100. The 
CV takes into account the weighting 
variable, as judges are generally more 
accurate when assessing higher-quality 
performances than lower-quality ones 
(Mercier & Heiniger, 2018). Since judging 
variation increases as scores decrease, the 
allowed inter-judge deviation thresholds 
increase with the number of deductions. 
This means that the same absolute 
deviation between judges results in a higher 
dispersion when lower penalties are applied. 
In our model, when the average total 
deductions are less than 1 (resulting in a 
score of 9 or higher), a higher CV doesn’t 
necessarily indicate high variability, and a 
more accurate measure of score variability 
can be obtained from the classification rate, 
based on total deductions from a maximum 
of 10 points. We established different 
acceptable CVs for each 0.5 deduction from 
10 points, all within the allowed deviation 
for each score range (Table 1). 
In the ACROAJIR® Excel tool, a 
"control scores validity macro" was 
implemented to process all mathematical 
calculations and quickly detect significant 
differences between control judges’ scores. 
When the difference between control judges 
exceeds a specific threshold, a video review 
becomes compulsory to redefine the true 
score accurately. 
 
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                390                           Science of Gymnastics Journal 
 
 
Table 1. 
Examples of cases of control scores allowed (case A) and not allowed (case B), with the least differences between them, to check the acceptability 
of the control score as the "true score" for each range of scores. The grey boxes provide an example of a non-allowed control judge score, 
according to the allowed deviations for each range of scores. The same criteria could be applied to artistic scores. 
 
Routine range scores 10.0 to 9.5 9.499 to 9.0 8.999 to 8.5 8.499 to 8.0 7.999 to 7.5 7.499 to 6.5 
Maximum inter-score deviation 
allowed/range score 
0.1 0.2 0.3 0.4 0.5 0.6 
Case examples Case A Case B Case A Case B Case A Case B Case A Case B Case A Case B Case A Case B 
SJ-E1 score 9.8 9.8 9.3 9.3 8.9 8.9 8.5 8.5 7.9 7.9 7.4 7.4 
SJ-E2 score 9.8 9.8 9.3 9.3 8.9 8.9 8.5 8.5 7.9 7.9 7.4 7.4 
CJP-E3 score 9.7 9.6 9.1 9.0 8.6 8.5 8.1 8.0 7.4 7.3 6.9 6.8 
Average E control score 9.767 9.733 9.233 9.200 8.800 8.767 8.367 8.333 7.733 7.700 7.233 7.200 
Maximum inter-score deviation 0.1 0.2 0.2 0.3 0.3 0.4 0.4 0.5 0.5 0.6 0.5 0.6 
Total SJ-E1penalties 0.2 0.2 0.7 0.7 1.1 1.1 1.5 1.5 2.1 2.1 2.6 2.6 
Total SJ-E2 penalties 0.2 0.2 0.7 0.7 1.1 1.1 1.5 1.5 2.1 2.1 2.6 2.6 
Total CJP-E3 penalties 0.3 0.4 0.9 1.0 1.4 1.5 1.9 2 2.6 2.7 3.1 3.2 
Average control E penalty 0.233 0.267 0.767 0.800 1.200 1.233 1.633 1.667 2.267 2.300 2.767 2.800 
relative change with only 0.1 point 
differences according to previous Case 
A (see shaded scores) 
 12.5%  4.2%  2.7%  2.0%  1.4%  1.2% 
Penalties CV (%) 24.7 43.3 15.1 21.7 14.4 18.7 14.1 17.3 12.7 15.1 10.4 12.4 
Maximum CV allowed for the Inter-
judges' deductions for each range (%) 
25 16.5 15 14.5 13.5 12.5 
Action required with regard to scores  Check  Check  Check  Check  Check  Check 
 
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                391                           Science of Gymnastics Journal 
Each routine was judged on its 
execution (E) and artistic merit (A), 
evaluated by a randomized pool of judges. 
Accuracy was measured as the deviation of 
a judge's E- and A-scores from the respective 
E- and A-control scores. Bias (integrity) was 
assessed by examining the rankings assigned 
by a judge for the exercises in a single round 
and across the entire competition. 
Consistency was evaluated by identifying 
unusual changes in the standard of marks 
given for the exercises (FIG, 2020). Paired 
panel and control scores were used to assess 
score accuracy (quantitative) and association 
concordance (ranking), for quantitative and 
qualitative evaluation, respectively.Lin’s 
Concordance Correlation Coefficient 
(LCCC) was used to measure the accuracy 
or concordance between each judge's score 
(Y) and the “true score” provided by the 
Control score (X) to quantify the agreement 
between these two measures) for the same 
gymnastic routine (Akoglu, 2018; Lin, 1989; 
McBride, 2005). The LCCC formula was as 
follows: 
𝜌𝑐 =
2𝑠𝑥𝑦
𝑆𝑥2 + 𝑆𝑦2 + (?̅? − ?̅?)2
 
where 𝑆𝑥𝑦 is the covariance, 𝑆
2 is the 
variance and 𝑥 ̅and ?̅? are the means for x and 
y raters, 
 𝑆𝑥
2 =
1
𝑛
∑ (𝑥𝑖 − ?̅?)
2𝑛
𝑖=1 , 𝑆𝑦
2 =
1
𝑛
∑ (𝑦𝑖 −
𝑛
𝑖=1
?̅?)2 and 𝑆𝑥𝑦 =
1
𝑛
∑ (𝑥𝑖 − ?̅?)
𝑛
𝑖=1 (𝑦𝑖 − ?̅?) 
 
Strength-of-agreement criteria for Lin’s 
concordance correlations coefficient were 
proposed as follows: <0.99 Almost Perfect, 
0.95 to 0.99 Substantial, 0.90 to 0.95 
Moderate and <0.90 Poor (McBride, 2005). 
However, for real competition, an acceptable 
range of deviation between judges’ scores is 
defined in Table 2. 
This difference varies depending on the 
level of the competitive routine and is 
determined by the number of penalties 
awarded for technical and artistic errors. We 
assumed that the interpretation of correlation 
coefficients varies significantly across 
research areas. In gymnastics evaluation, 
due to potential inter-judge variability, 
particularly when penalties are greater, we 
proposed an interpretation closer to 
Altman's, suggesting that the strength-of-
agreement criteria for Lin's concordance 
should be aligned with other correlation 
coefficients, such as Pearson's, where < 0.2 
is considered poor and > 0.8 is excellent 
(Akoglu, 2018). 
For ACROAJIR®'s assessment of 
acrobatic gymnastics judges’ performance, 
we defined Lin's concordance qualitative 
ranking criteria as: < 0.95 Excellent; 0.8 to 
0.9499 Very Good; 0.7 to 0.7999 Good; 0.6 
to 0.6999 Satisfactory; 0.5 to 0.5999 Poor; 
and less than 0.5 Very Poor. Additionally, 
we needed to measure the extent to which 
judges rank gymnastics routines in the 
correct order. Concordance and accuracy are 
crucial, and while small inter-score 
differences may be acceptable, the most 
important factor is ensuring that the final 
ranking is fair. To calculate judges' integrity, 
we used the strength of association between 
the judge and control rankings for each 
routine, applying the Kendall Concordance 
Coefficient (W). 
Kendall’s W, which includes the 
presence of ties, was calculated as follows 
(Kendall & Babington-Smith, 1939; Wallis, 
1939): 
W =
12 ∙ S
m2(n3 − n) − m ∙ ∑ Tj
m
1
 
where m = number of raters, n = number of 
evaluated routines,  
𝑆 = ∑ (𝑅𝑖 − ?̅?)
2𝑛
𝑖=1 , 
being 𝑅𝑖 the sum of the ranges of the scores 
given by m evaluators to the ith subject and 
?̅? is the arithmetic mean of the Ri, i = 1,…, 
n. 
𝑇𝑗 = ∑ (𝑡𝑖
3 − 𝑡𝑖)
𝑔𝑗
𝑖=1
,  
assigns the average of the rankings to the tied 
observation, where t_i is the number of tied 
values in the i-th grouping of ties, and gj is 
the number of tie groups in the j-th set of 
hierarchies, j = 1, ..., m. 
Kendall’s W values lie between 0 and 1, 
where 0 indicates the absence of agreement, 
and 1 represents total agreement. A high 
Kendall’s W indicates that judges are likely 
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                392                           Science of Gymnastics Journal 
to apply the same standards when evaluating 
the same competitive routines. As all 
switched ranking positions don’t have the 
same relevance, the ranking swap costs vary. 
We proposed different Kendall’s W 
reduction coefficients, according to the 
relevance of Judge and Control ranking 
positions being switched. When the 
relevance of the changed position increases, 
the coefficient that multiplies the value of 
Kendall's W decreases, and thus decreases 
the degree of agreement between judge and 
control. The different Kendall’s W reduction 
coefficients are defined as follows: 0.7, 0.6, 
0.8, 0.65, 0.82 and 0.75 when switching the 
1vs3, 1vs4, 2vs3, 2vs4, 3vs4 and 3vs5 or 
more ranked positions, respectively. 
To evaluate the qualitative ranking 
criteria for judges’ performance, the final 
Kendall’s W values were classified as 
follows: <0.95 excellent; 0.9 to 0.9499 very 
good; 0.8 to 0.8999 good; 0.7 to 0.7999 
satisfactory; and 0.6 or less very poor. 
These formulas were integrated into the 
ACROAJIR® spreadsheet, and a second 
macro function called "AJIR-macro" was 
developed. This macro used all the 
previously collected official E- and A- 
individual judges' scores, along with the 
revised control scores, to automatically and 
individually check all the predefined 
statistical and mathematical criteria. It was 
implemented to automatically analyze, 
generate, and export all the information 
presented in each individual report. 
For each competitor and competition 
session, the report provides information 
about the execution or artistic score, the 
ranking assigned by each judge, and its 
relationship to the Control-and-Panel score 
and ranking, presented both numerically and 
graphically. If a judge's score deviation for a 
particular country exceeds the limit allowed 
by the FIG, a yellow alert is automatically 
displayed under the affected country in the 
score graph. If this difference impacts the 
rankings according to the criteria outlined in 
Table 1 and Table 2, the same yellow alert 
principle is applied. 
The bias score compares a judge’s score 
for their own country with the equivalent 
control score. If the judge-vs-control score 
or ranking deviation is more favorable to the 
judge's country than the defined allowable 
deviation, the score or ranking bias box will 
display a red alert. A quantitative and 
qualitative individual score and ranking 
evaluation was also included, using LCCC 
and Kendall's W values, to provide quick 
and understandable feedback on judges' 
performance. 
Finally, the report presents a summary 
of all four E/A judges' panel evaluations. 
The ACROAJIR® "AJIR-macro" processes 
all the statistical and mathematical data to 
create and name each individual report. All 
data analysis was performed using an Excel 
spreadsheet (Microsoft, version 365-2019, 
US). 
We collected data from 76 experienced 
international acrobatic gymnastics judges, 
who officiated at four official events during 
the 2017–2022 Olympic cycle: the 10th and 
11th European Age Group Acrobatic 
Gymnastics Competitions (EAGC) and the 
29th and 30th European Acrobatic 
Gymnastics Championships (ECh) held in 
2019 and 2021, respectively. 
To evaluate whether daily feedback on 
each judge’s results improved their 
subsequent performance (in terms of 
accuracy and agreement with control scores) 
as the event progressed, each competition 
was divided into two parts. The first part was 
completed when either all judges had judged 
at least once or half of the competitive 
session had been finished. The second part 
encompassed all remaining competitive 
sessions. Judges evaluated routines 
approximately 3.25 ± 0.7 times in the EAGC 
and 4.05 ± 0.8 times in the ECh during each 
part. Each judge was evaluated at least once 
in each part. 
 
 
 
 
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                393                           Science of Gymnastics Journal 
Table 2. 
Score and Ranking evaluation criteria defined between judges and control rankings. Higher 
differences generate a red or yellow highlighted alert. 
Score evaluation criteria  
Control score between Allowed deviation 
min max judge vs Control 
9.5 10.00 0.1 
8.7 9.499 0.2 
8.0 8.699 0.3 
7.0 7.999 0,4 
6.0 6.999 0.5 
5.0 5.999 0.7 
0.0 4.999 1.0 
 
Ranking evaluation criteria 
Ranking positions 
intervals  
Ranking differences between  
control and judge's rRanking 
1st and 2nd  0 or 1 
If the control score between 1st and 2nd place is greater than or equal to 
0.1 point, then the difference in ranking with the control scores can be 
1 place. 
3rd and 4th 1 
5 to 8th 2 
9 to 12th 3 
12th or more 4 
 
 
Only competitive sessions with 6 or 
more competitors were used to assess judges 
to avoid small differences in scores causing 
large disparities in rankings and potentially 
resulting in unfair evaluations. With 6 or 
more competitors, the validity of the judges' 
evaluations improves. Since the final 
competition in the second part of EAGC 
events could only be assessed by higher-
category judges, which might act as a 
confounding variable, we only included the 
qualification routines for EAGC. For the 
ECh event, both qualification and final 
competitive routines were included. 
The intervention was designed to 
minimize significant inconsistencies in 
judging from one day or group to the next. 
Such inconsistencies were largely reduced, 
except for individual finals at ECh (balance 
or dynamic exercises). In these cases, it 
would require that the same judge be 
selected for the same role after a random 
draw. It is impossible for a judge to act in the 
same role for the same routine they had 
judged in qualifications at the EAGC, and it 
is limited to a pool of a few high-category 
judges at the ECh. 
The independent variable was the 
performance in two parts of each 
competition event, while the dependent 
variables were changes in score accuracy 
and ranking concordance. Individual reports 
were sent after the completion of the 10th 
EAGC and 29th ECh events, without daily 
feedback conditions (NFBC). In contrast, for 
the 11th EAGC and 30th ECh events, 
individual reports were provided daily, 
within a maximum of 12 hours after the end 
of each competition day and before the next 
day’s session began, under daily feedback 
conditions (FBC). We compared a total of 
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                394                           Science of Gymnastics Journal 
953 reports: 272 from the EAGC and 680 
from the ECh competitions. 
Daily, after each competition, the 
control jury received all scores and validated 
their own accuracy in judging. The control 
scores validity macro quickly identified any 
significant differences between individual 
control judges' scores for all evaluated 
sessions. If significant differences were 
detected, the affected competitive routine 
was re-judged using video recordings at the 
end of each day’s last competitive session to 
provide a more reliable true score within the 
defined deviation. 
Once all control judges' scores were 
finalized, paired judge-and-control scores 
were obtained for accuracy (scores) and 
concordance (ranking) using the AJIR 
macro, which generated an individual report 
for each judge. A total of 1280 reports were 
created and sent confidentially. The 
computer used for this analysis was a 
Microsoft Surface Pro 7, 12.3" (Intel Core 
i5-1035G4, 8GB RAM, 256GB SSD, 
Microsoft, Redmond, USA). To analyze the 
effects of judging performance, we 
compared judges’ performances between the 
first and second parts of each event, noting 
that daily evaluation reports were provided 
only for the 11th EAGC and 30th ECh 
events. 
Standard statistical methods were used 
to calculate means and confidence intervals 
for accuracy and consistency, as previously 
defined. The Kolmogorov-Smirnov and 
Levene tests assessed normality and 
homogeneity of sample distributions. Data 
were analyzed using parametric or non-
parametric tests based on these results. 
Since each judging panel was drawn 
randomly, an unpaired t-test was used to 
evaluate the effects of prospective judging 
quality between the first and second parts of 
each event. Significance was set 
at P≤0.05P≤0.05. All analyses were 
conducted using SPSS software version 23.0 
(SPSS, Chicago, IL). 
 
 
 
RESULTS 
 
Figure 1 illustrates the effects of 
prospective judging performance between 
the first and second parts of each event. 
Inter-judge performance was significantly 
higher in the 2021 (FBC) compared to the 
2019 (NFBC) European ACRO events, with 
score accuracy improving from 0.75 ± 0.14 
to 0.78 ± 0.15 (p = 0.044) and ranking 
concordance improving from 0.80 ± 0.13 to 
0.82 ± 0.14 (p = 0.007). Judges' ranking 
concordance significantly improved when 
daily evaluations were provided (0.82 ± 0.13 
vs 0.77 ± 0.19; p = 0.000), while score 
accuracy improved but not significantly 
(0.75 ± 0.16 vs 0.76 ± 0.17; p = 0.305). 
Within events, judges' overall accuracy 
was significantly better in qualification 
competitions at the 11th EAGC compared to 
the 10th EAGC (0.76 ± 0.11 vs 0.80 ± 0.11; 
p = 0.007), with 160 vs 192 AJIRs, 
respectively. Judges' performance in ranking 
concordance significantly improved in the 
30th ECh compared to the 29th ECh (0.76 ± 
0.21 vs 0.82 ± 0.14; p = 0.007), with 368 and 
320 AJIRs, respectively. 
Comparing judging of execution and 
artistic performance in the first and second 
parts of competition events, the 10th EAGC 
(NFBC) showed a significant reduction in 
score accuracy differences for execution in 
the second part (0.80 ± 0.073 vs 0.72 ± 0.18; 
p = 0.013). Although judges’ ranking 
concordance was lower in the second part 
(0.84 ± 0.10 vs 0.82 ± 0.13; p = 0.446), the 
difference was not significant. For artistic 
performance, there was no significant 
reduction in score accuracy differences in 
the second part (0.74 ± 0.10 vs 0.70 ± 0.15; 
p = 0.221), but there was a significant 
reduction in ranking concordance 
differences (0.77 ± 0.13 vs 0.70 ± 0.06; p = 
0.039). 
For the 29th ECh (NFBC), no 
significant differences in judges' accuracy 
for execution and artistic performance were 
found between the first and second parts of 
the competition. However, there was a 
significant reduction in judges’ ranking 
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                395                           Science of Gymnastics Journal 
concordance in the second part for execution 
(0.83 ± 0.15 vs 0.75 ± 0.27; p = 0.027) and 
artistic performance (0.76 ± 0.12 vs 0.71 ± 
0.25; p = 0.043). 
In the FBC, at the 11th EAGC, judges’ 
concordance for both artistic and execution 
scores improved in the second part of the 
competition for accuracy and ranking, with 
significant differences observed only for 
execution accuracy (0.83 ± 0.13 vs 0.70 ± 
0.07; p = 0.017). At the 30th ECh (FBC), the 
only significant increase in judges’ score 
accuracy was for artistic performance (0.63 
± 0.23 vs 0.73 ± 0.18; p = 0.005). 
A total of 1280 individual reports were 
created and analyzed at the end of each 
competition day, but only 640 were sent 
confidentially for the 2021 events. For easier 
understanding by the judges, the accuracy 
and concordance values in the individual 
reports were multiplied by 100 (Figure 2). 
 
 
 
A 
272 reports for Qualifications sessions with 6 
or more competitive units 
B 
680 reports for Qualifications and Finals 
sessions with 6 or more competitive units 
 
Figure 1. Mean and 95% confidence intervals for score accuracy (bold line) and ranking (dashed 
line) judges’ evaluations for artistic performance or execution in the first and second parts of 
EAGC (A) and ECh (B) events (* p<0.05 score significant differences; # p<0.05 ranking 
significant differences). 
 
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                396                           Science of Gymnastics Journal 
 
Figure 2. An example of individual reports in one competitive session. Session and judge data are hidden to protect confidentiality. (Wide bright 
line: Panel score; Thin line with filled circles: Control score; Bold line with empty circles: Judge score  
9,5 
9,0 
8,5 
8,0 
7,5 
7,0 
 Warning for SCORE 
Bias for SCORE 
19 
17 
15 
13 
11 
9 
7 
5 
3 
1 
 Warning for RANKING 
 E PANEL Evaluation 
The accuracy and consistency of your judging were BIAS  The accuracy and consistency of your judging were 
SCORE 72% Good YES SCORE 77% Good 
RANKING 84% Good NO RANKING 78% Satisfactory 
Score 8,45 8,55 8,70 8,90 9,05 8,65 8,25 8,10 8,35 8,90 9,05 8,55 8,30 7,85 8,20 8,75 8,50 8,75 
Control-SCORE Execution 8,60 8,60 8,70 9,00 8,70 8,50 8,30 8,00 8,30 8,80 9,00 8,50 8,50 8,00 8,20 9,00 8,60 8,80 
E1 NOC Judge E Name 8,30 8,60 8,70 9,30 9,20 8,50 8,10 8,50 8,30 8,90 9,10 8,10 8,20 8,30 8,30 8,80 8,60 8,80 
Ranking E Score 12 10 7 3 1 8 15 17 13 3 1 9 14 18 16 5 11 5 
Control Execution Ranking 8 8 6 1 6 11 14 17 14 4 1 11 11 17 16 1 8 4 
E1 NOC Judge Name 12 8 7 1 2 10 17 10 12 4 3 17 16 12 12 5 8 5 
SCORE
RANKING
 E1 NOC Judge Name Evaluation 
MDA       ESP         GBR        NOC        ESP        POR2        FRA        GEO       SUI         POR1        RUS       AZE2        IRL         BUL        POL        AZE1      HUN        UKR
MDA       ESP         GBR        NOC        ESP        POR2        FRA        GEO       SUI         POR1        RUS       AZE2        IRL         BUL        POL        AZE1      HUN        UKR
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                397                           Science of Gymnastics Journal 
DISCUSSION 
 
The aim of this study was to design and 
apply the ACROAJIR® tool to control and 
evaluate ACRO judges’ performance and 
prospective judging effects at major 
competitive events. The tool had practical 
applications in three domains: a) a specific 
Technical Committee (TC) control task; b) 
feedback for individual judges; and c) 
assessment of prospective judging effects. 
Fulfilling a specific TC control task, the 
ACROAJIR® results provided the 
Technical Committee (TC) with an overview 
of results and specific, objective information 
about judges' performance in each 
competitive session. This information 
facilitated easy, quick, and accurate 
identification of individual judges’ mistakes. 
It offered strong evidence for managing 
these mistakes, supporting correct decision-
making, and alerting judges to potential 
future issues. TC members received 
objective data on the accuracy, concordance, 
and bias of judges' scores.  
Performance feedback is commonly 
used to influence behavior, and providing 
information about past performance is a 
widely adopted strategy in competitions. 
However, the effects of daily performance 
feedback have not been previously 
evaluated. This quantitative study analyzed 
how daily feedback, supported by the AJIRs-
based formative assessment process, 
affected judges’ accuracy and concordance. 
Each judge received a simple daily report on 
their performance in terms of accuracy and 
concordance at the 2021 competition events, 
and those with the best scores were 
congratulated. Knowledge of the control 
score enhanced judges’ self-confidence and 
consistency by providing an objective 
assessment of their accuracy, consistency, 
and concordance relative to the control 
score. None of the judges disagreed with the 
reports received, and all appreciated the 
daily feedback effort. No prior studies with 
similar designs were found. 
Knowing that they would be evaluated 
daily appeared to motivate judges to 
consistently perform their best. The effects 
were differentiated based on whether 
feedback was provided daily or not, 
particularly for characteristics such as score 
accuracy and concordance. 
Feedback included comparisons with 
benchmarks beyond the in-competition 
trimmed mean. Overall, judges’ 
performance significantly improved when 
they were aware of daily evaluations. 
Specifically, judges’ score accuracy was 
significantly better with daily reports at the 
11th EAGC, and ranking concordance was 
better at the 30th ECh. While judges knew 
they would be evaluated at the start of each 
competition, they did not anticipate 
receiving daily reports. 
Overall, judges’ performance was 
significantly worse in the second part of 
competitions where no feedback was given, 
compared to when they received daily 
feedback, which either maintained or 
improved performance. Significant declines 
were observed in execution accuracy scores 
and artistic ranking concordance at the 10th 
EAGC (NFBC) and in artistic ranking 
concordance at the 29th ECh (NFBC) during 
the second parts of these events. 
In contrast, with daily feedback 
conditions (FBC), both accuracy and 
ranking concordance improved in the second 
part of the competition for both artistic 
performance and execution at the 11th 
EAGC. At the 30th ECh, judges’ score 
accuracy for artistic performance was 
significantly higher. 
Previous studies examining judges in 
gymnastics, judo, rope climbing, and 
synchronized swimming found that when 
judges received open feedback (i.e., the 
ability to hear or see their colleagues' scores 
after each performance), the variation 
between scores was significantly lower. This 
suggests that conformity was influenced by 
informational factors (Auweele et al., 2004; 
Boen et al., 2006, 2008, 2013), which 
supports our findings. 
However, reference panel scores can 
sometimes be incorrect, potentially 
influencing a judge's decisions. This 
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                398                           Science of Gymnastics Journal 
normative conformity bias can be dangerous 
and lead to unfair results. Even if a judge’s 
score is accurate, consistently aligning with 
the panel's score when deviations occur can 
compromise judgment, leading to normative 
conformity bias. Daily feedback on control 
scores mitigates this risk by boosting judges' 
confidence in their own judgments, thereby 
motivating better performance in future 
sessions. 
In summary, judges' performance either 
remained stable or improved when they were 
consistently updated on their performance. 
Overall, higher score accuracy was 
associated with greater ranking 
concordance. However, when routines were 
at a similar level, small changes in score 
accuracy led to significant changes in 
ranking concordance. This assessment 
proved valuable for detecting instances 
where judges might exploit small but 
permissible scoring gaps to favor their own 
countries. It also provided feedback on 
judges' scoring patterns, which could be 
useful for training and accrediting judges. 
Updates and feedback can help propose 
corrective measures for judges who perform 
below expectations (Mercier & Heiniger, 
2018). 
Judges aim to perform at their best, and 
the results demonstrated a high level of 
quality overall. Although a consistently high 
performance might limit improvements as 
the event progresses, knowledge of daily 
evaluations during the 2021 events led to 
significantly more accurate scores as the 
competition continued. This article 
introduces a novel approach to evaluating 
judges' performance during live 
competitions. To our knowledge, providing 
individual written feedback reports during 
competitions has not been previously 
implemented. This method suggests new 
active methodologies and formative 
evaluations for future use. 
The current study had several 
limitations. First, the use of expert superior 
jury scores as ‘true’ scores introduces 
potential issues, as these expert scores might 
also be inaccurate or not align with the 
judging panel. This could affect the 
evaluation of judges during live 
competitions. The ranking swap costs 
defined in this study might be better 
represented by more sophisticated 
regression equations to explain all relevant 
ranking swap cases. Additionally, refining 
the definitions for the first and second 
periods and using the tool solely for 
pedagogical purposes, without sanctions for 
biased or incorrect judgments, could impact 
the number of significant differences 
observed in judges’ performance as the 
events progressed. Although post-feedback 
improvements in accuracy were noted, 
understanding the process behind this 
alignment would provide insights into the 
cause of discrepancies. Future research 
should include more examples to validate 
the findings of this study. With more 
comprehensive evidence, further actions can 
be taken to enhance the rating system for the 
discipline (Anderlucci et al., 2020). 
 
CONCLUSION 
 
The ACROAJIR® tool offered timely, 
valuable, and personalized feedback on 
accuracy and concordance scores for 
acrobatic gymnastics judges during 
competitions. It demonstrates that such 
feedback can be effectively delivered 
during, rather than only after, competition 
events. The tool facilitates specific TC 
scoring control tasks, provides judges with 
evidence-based feedback, and suggests 
targeted improvements for prospective 
judging. 
 
REFERENCES 
 
Akoglu, H. (2018). User’s guide to 
correlation coefficients. Turkish Journal of 
Emergency Medicine, 18(3), 91–93. 
https://doi.org/10.1016/j.tjem.2018.08.001 
Anderlucci, L., Lubisco, A., & Mignani, 
S. (2020). Investigating the Judges 
Performance in a National Competition of 
Sport Dance. Social Indicators Research. 
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                399                           Science of Gymnastics Journal 
https://doi.org/10.1007/s11205-019-02256-
z 
Auweele, Y. Vanden, Boen, F., De 
Geest, A., & Feys, J. (2004). Judging bias in 
synchronized swimming: Open feedback 
leads to nonperformance-based conformity. 
Journal of Sport and Exercise Psychology, 
26(4), 561–571. 
https://doi.org/10.1123/jsep.26.4.561 
Boen, F., Ginis, P., & Smits, T. (2013). 
Judges in judo conform to the referee 
because of the reactive feedback system. 
European Journal of Sport Science, 13(6), 
599–604. 
https://doi.org/10.1080/17461391.2012.756
070 
Boen, F., van Hoye, K., Auweele, Y. 
Vanden, Feys, J., & Smits, T. (2008). Open 
feedback in gymnastic judging causes 
conformity bias based on informational 
influencing. Journal of Sports Sciences, 
26(6), 621–628. 
https://doi.org/10.1080/0264041070167039
3 
Boen, F., Vanden Auweele, Y., Claes, 
E., Feys, J., & De Cuyper, B. (2006). The 
impact of open feedback on conformity 
among judges in rope skipping. Psychology 
of Sport and Exercise, 7(6), 577–590. 
https://doi.org/10.1016/j.psychsport.2005.1
2.001 
Bučar, M., Čuk, I., Pajek, J., Karacsony, 
I., & Leskošek, B. (2012). Reliability and 
validity of judging in women’s artistic 
gymnastics at University Games 2009. 
European Journal of Sport Science, 12(3), 
207–215. 
https://doi.org/10.1080/17461391.2010.551
416 
Dallas, G., Mavidis, A., & 
Chairopoulou, C. (2011). Influence of angle 
of view on judges’ evaluations of inverted 
cross in men’s rings. Perceptual and Motor 
Skills, 112(1), 109–121. 
https://doi.org/10.2466/05.22.24.27.PMS.11
2.1.109-121 
Damisch, L., Mussweiler, T., & 
Plessner, H. (2006). Olympic medals as 
fruits of comparison? Assimilation and 
contrast in sequential performance 
judgments. Journal of Experimental 
Psychology: Applied, 12(3), 166–178. 
https://doi.org/10.1037/1076-
898X.12.3.166 
De Bruin, W. B. (2005). Save the last 
dance for me: Unwanted serial position 
effects in jury evaluations. Acta 
psychologica, 118(3), 245-260. 
https://doi.org/10.1016/j.actpsy.2004.08.005 
Dosseville, F., Laborde, S., & 
Garncarzyk, C. (2014). Current research in 
sports officiating and decision-making. In C. 
Mohiyeddini (Ed.), Contemporary topics 
and trends in the psychology of sports 
(pp.13–38). Nova Publishers. 
Fasold, F., Memmert, D., & Unkelbach, 
C. (2012). Extreme judgments depend on the 
expectation of following judgments: A 
calibration analysis. Psychology of Sport 
and Exercise, 13(2), 197-200. 
https://doi.org/10.1016/j.psychsport.2011.11
.004 
FIG. (2015). Regulations for the judges’ 
evaluation programme (JEP) “former 
fairbrother system” and its’ application. 
Federation International de Gymnastique. 
FIG. (2020). 2017- 2020 Fig Judges’ 
Rules Specific Rules for Acrobatic. In 
Specific Rules for Acrobatic Gymnastics. 
Federation International de Gymnastique. 
FIG. (2017). Appendix to the Codes of 
Points (COP). Federation International de 
Gymnastique.  
Flessas, K., Mylonas, D., 
Panagiotaropoulou, G., Tsopani, D., Korda, 
A., Siettos, C., Di Cagno, A., Evdokimidis, 
I., & Smyrnis, N. (2015). Judging the judges’ 
performance in rhythmic gymnastics. 
Medicine and Science in Sports and 
Exercise, 47(3), 640–648. 
https://doi.org/10.1249/MSS.000000000000
0425 
Gambarelli, G., Iaquinta, G., & Piazza, 
M. (2012). Anti-collusion indices and 
averages for the evaluation of performances 
and judges. Journal of Sports Sciences, 
30(4), 411–417. 
https://doi.org/10.1080/02640414.2011.651
153 
León-Prados & Ramos: DAILY EVALUATION FOR ACROBATIC GYMNASTICS JUDGES          Vol. 16, Issue 3: 387-400 
Science of Gymnastics Journal                                400                           Science of Gymnastics Journal 
Heiniger, S., & Mercier, H. (2018). 
National Bias of International Gymnastics 
Judges during the 2013-2016 Olympic 
Cycle. http://arxiv.org/abs/1807.10033 
Heiniger, S., & Mercier, H. (2021). 
Judging the judges: evaluating the accuracy 
and national bias of international gymnastics 
judges. Journal of Quantitative Analysis in 
Sports, 17(4), 289-305. 
https://doi.org/10.1515/jqas-2019-0113 
Kramer RSS. (2017) Sequential effects 
in Olympic synchronized diving scores. 
Royal Society Open science. 4: 160812. 
http://dx.doi.org/10.1098/rsos.160812 
Kendall, M. G., & Babington-Smith, B. 
(1939). The Problem of m Rankings. The 
Annals of Mathematical Statistics, 10(3), 
275–287. 
https://doi.org/http://dx.doi.org/10.1214/ao
ms/1177732186 
León-Prados, J. A., & Jemni, M. (2022). 
Reliability and agreement in technical and 
artistic scores during real-time judging in 
two European acrobatic gymnastic events. 
International Journal of Performance 
Analysis in Sport, 22(1), 132–148. 
https://doi.org/10.1080/24748668.2021.199
6913 
León-Prados, J. A., & Rosales, A. 
(2019). ACROAJIR®; Automatic ACRO 
Judges Individual Report Tool. Universidad 
Pablo de Olavide. 
Leskošek, B., Čuk, I., & Peixoto, C. J. 
D. (2018). Inter-rater reliability and validity 
of scoring men’s individual trampoline 
routines at European championships 2014. 
Science of Gymnastics Journal, 10(1), 69–
79. 
Lin, L. I. (1989). A concordance 
correlation coefficient to evaluate 
reproducibility. Biometrics, 45(1), 255–265. 
McBride, G. (2005). A proposal for 
strength-of-agreement criteria for Lin’s 
Concordance Correlation Coefficient. NIWA 
Client Report, HAM2005-062 
https://www.medcalc.org/download/pdf/Mc
Bride2005.pdf 
Mercier, H., & Heiniger, S. (2018). 
Judging the Judges: Evaluating the 
Performance of International Gymnastics 
Judges. http://arxiv.org/abs/1807.10021 
Mercier, H., & Klahn, C. (2017). 
Judging the judges : Evaluating the 
performance of international gymnastics 
judges. MIT Sloan Sports Analytics. 
https://arxiv.org/pdf/1807.10021.pdf 
Plessner, H., & Schallies, E. (2005). 
Judging the cross on rings: A matter of 
achieving shape constancy. Applied 
Cognitive Psychology, 19(9), 1145–1156. 
https://doi.org/10.1002/acp.1136 
Premelč, J., Vučković, G., James, N., & 
Leskošek, B. (2019). Reliability of judging 
in DanceSport. Frontiers in Psychology, 10. 
1001. 
https://doi.org/10.3389/fpsyg.2019.01001 
Ste-Marie, D. M. (2000). Expertise in 
women’s gymnastics judging: an 
observational approach. Perceptual and 
Motor Skills, 90(2), 543–546. 
https://doi.org/10.2466/pms.2000.90.2.543 
Van Bokhorst, L. G., Knapová, L., 
Majoranc, K., Szebeni, Z. K., Táborský, A., 
Tomić, D., & Cañadas, E. (2016). “It’s 
always the judge’s fault”: Attention, emotion 
recognition, and expertise in rhythmic 
gymnastics assessment. Frontiers in 
Psychology, 7(JUL). 
https://doi.org/10.3389/fpsyg.2016.01008 
Wallis, W. A. (1939). The Correlation 
Ratio for Ranked Data. Journal of the 
American Statistical Association, 34(207), 
533–538. 
https://doi.org/10.1080/01621459.1939.105
03552 
 
Corresponding author: 
 
Juan Antonio León-Prados, 
Physical Performance & Sport Research 
Center, Pablo de Olavide University,  
Pablo de Olavide’ University, Carretera de 
Utrera km 1 
41013, Seville, Spain, 
phone: (+34) 606701338, 
e-mail: jaleopra@upo.es  
 
Article received: 3.12.2023 
Article accepted: 2.4.2024