REVIJA ZA ELEMENTARNO IZOBRAŽEVANJE 
JOURNAL OF ELEMENTARY EDUCATION 
Vol. 18, No. 1, pp. 107–124, March 2025 
 
 
 
 
 NEW POLYSTOCHASTIC STATISTICAL INFERENCE IN 
SOCIAL SCIENCES - 
DEFINING NEW RULES AND THRESHOLDS 
  
Potrjeno/Accepted  
25. 2. 2025 
 
Objavljeno/Published 
31. 3. 2025 
SINIŠA OPIĆ 
Faculty of Teacher Education, University of Zagreb, Zagreb, Croatia 
 
KORESPONDENČNI AVTOR/CORRESPONDING AUTHOR 
sinisa.opic@ufzg.hr 
 
Keywords:  
Bayesian, effect size, 
NHST, p-value, 
polystochastic, social 
science, statistical 
inference. 
 
 
 
Ključne besede:  
Bayesov sklep, velikost 
učinka, NHST, p-
vrednost, 
polistohastično, 
družboslovje, statistično 
sklepanje. 
 
UDK/UDC:  
303:311.21 
Abstract/Izvleček 
The Null Hypothesis Significance Testing (NHST) framework has sparked 
considerable debate within the scientific community, leading to numerous studies 
advocating for a re-evaluation of the current system. New polystochastic statistical 
inference defines methods of statistical inference that integrate rules and 
thresholds for both rejecting the null hypothesis and confirming the alternative 
hypothesis. This approach unifies the control of respondents' influence on 
statistical significance and introduces criteria such as effect size and Bayesian 
inference for confirming the alternative hypothesis. Unlike NHST, polystochastic 
statistical inference controls Type I error (p-value) and aims to optimize the 
confirmation of evidence without increasing the risk of Type II errors. 
 
Novo polistohastično statistično sklepanje v družboslovju –  
Določitev novih pravil in mejnih vrednosti 
Okvir testiranja pomembnosti ničelne hipoteze (angl. Null Hypothesis 
Significance Testing – NHST) je sprožil precejšnjo razpravo v znanstveni 
skupnosti. To je vodilo do številnih študij, ki zagovarjajo ponovno oceno 
sedanjega sistema. Novo polistohastično statistično sklepanje definira metode 
statističnega sklepanja, ki združujejo pravila in pragove tako za zavračanje ničelne 
hipoteze kot za potrditev alternativne hipoteze. Ta pristop poenoti nadzor nad 
vplivom anketirancev na statistično pomembnost in uvede merila, kot sta velikost 
učinka in Bayesov sklep za potrditev alternativne hipoteze. Za razliko od NHST 
polistohastično statistično sklepanje nadzoruje napako tipa I (p-vrednost) in želi 
optimizirati potrditev dokazov brez povečanja tveganja napak tipa II. 
 
DOI https://doi.org/10.18690/rei.4907 
Besedilo / Text © 2025 Avtor(ji) / The Author(s) 
To delo je objavljeno pod licenco Creative Commons CC BY Priznanje avtorstva 4.0 
Mednarodna. Uporabnikom je dovoljeno tako nekomercialno kot tudi komercialno 
reproduciranje, distribuiranje, dajanje v najem, javna priobčitev in predelava avtorskega dela, pod 
pogojem, da navedejo avtorja izvirnega dela. (https://creativecommons.org/licenses/by/4.0/). 
 
 
108 
REVIJA ZA ELEMENTARNO IZOBRAŽEVANJE 
JOURNAL OF ELEMENTARY EDUCATION 
 
 
 
Introduction: A 100-Year-Old Problem (Fisher 1925 - today) 
 
Statistical significance has been a topic of intense debate in many scientific disciplines 
for a long time, particularly regarding its proper use and potential misuse. According to 
Rovetta (2024), it is one of the most controversial issues in contemporary science. The 
binary choice between statistically significant and insignificant results not only reflects 
a mathematical error but also fails to capture the complexity of statistical methods 
needed to communicate findings to the public, especially in fields like healthcare. This 
issue is not limited to medical research; it also affects most other scientific fields. Social 
sciences face significant challenges in statistical inference, which are compounded by 
the complexity of the phenomena being studied. Factors such as latent variables, issues 
of causality, inappropriate scales  
for statistical analysis (parametric tests), implausibility, incoherence, hard-to-control 
extraneous factors, lack of objectivity, and reliability problems all contribute to these 
challenges. As one author notes, any scientific discipline that grapples with such 
challenges will achieve long-lasting relevance. To put this issue in historical context, the 
concept of statistical significance was first introduced by Ronald Fisher in 1925. 
In fact, the concept began earlier with the work of Francis Edgeworth (1845–1926), 
who created a procedure for testing two arithmetic means (subsamples) that was later 
extended by Pearson to the Chi-square test (Pearson, 1900).  Edgeworth’s pioneering 
contribution lies at the beginnings of the development of statistical inference in testing 
arithmetic means and specificities such as skewness and kurtosis. Ronald Fisher further 
advanced these ideas in 1925, laying the groundwork for hypothesis testing in inferential 
statistics. His influential work, Statistical Methods for Research Workers, helped define the 
concept of statistical significance (p-value) as we understand it today. 
The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this 
point as a limit in judging whether a deviation ought to be considered significant or not. 
Deviations exceeding twice the standard deviation are thus formally regarded as 
significant. Using this criterion, we should be led to follow up a false indication only 
once in 22 trials, even if the statistics were the only guide available. Small effects will 
still escape notice if the data are insufficiently numerous to bring them out, but no 
lowering of the standard of significance would meet this difficulty.  (Fisher, 1925, 45) 
Fisher’s work laid the foundation for inferential statistics and initiated the field of 
hypothesis testing. Later, Newman and Pearson built upon Fisher’s methods, 
introducing the concepts of Type I error (rejecting the null hypothesis, H0, when it is 
true) and Type II error (failing to reject H0 when it is false) (Perezgonzalez, 2015). 
S. Opićć New Polystochastic Statistical Inference in Social Sciences - 
Defining new Rules and Thresholds 
109
.
 
 
 
The contributions of Newman and Pearson are significant, particularly in the context 
of enhancing statistical power for repeated sampling while considering Type I and II 
errors, as well as effect size (Holmberg, 2024). However, a notable drawback of the 
Newman and Pearson approach is its rigidity; it relies on a series of predetermined steps 
and lacks the flexibility found in Fisher’s method. McShane et al. (2019) emphasize the 
need to abandon the NHST approach (null hypothesis significance testing) in all areas 
of scientific activity in the biomedical and social sciences, i.e. they offer a broader 
concept (but one that is unclear): “Results need not first have a p-value or some other 
purely statistical measure that attains some threshold before consideration is given to 
the currently subordinate factors. Instead, treated continuously, statistical measures 
should be considered along with the currently subordinate factors as just one among 
many pieces of evidence and should not take priority thereby yielding a more holistic 
view of the evidence” (p. 25). 
Although the p-value is considered the “scientific default” in inferential statistics, it is 
frequently misused and misinterpreted. Many papers in the literature emphasize the 
need to redefine p-values, supplement them with new methods, or even abolish them 
completely, leading to confusion across various scientific disciplines. Additionally, some 
scientific journals discourage the use of p-values. Considering the ongoing concerns 
surrounding statistical significance and p-values, the American Statistical Association 
(ASA) published the Statement on Statistical Significance and P-Values. This document 
includes several important statements, as noted by Wasserstein and Lazar (2016): 
1. P-values can indicate how incompatible the data are with a specified statistical 
model. 
2. P-values do not measure the probability that the studied hypothesis is true, or 
the probability that the data were produced by random chance alone. 
3. Scientific conclusions and business or policy decisions should not be based only 
on whether a p-value passes a specific threshold. 
4. Proper inference requires full reporting and transparency p-value; debate. 
5. A p-value, or statistical significance, does not measure the size of an effect or 
the importance of a result. 
6. By itself, a p-value does not provide a good measure of evidence regarding a 
model or hypothesis. 
Consequently, the ASA presents a significant challenge in the realm of inferential 
statistics and clearly defines the meaning of the p-value, offering a more comprehensive 
approach to statistical inference. 
110 
REVIJA ZA ELEMENTARNO IZOBRAŽEVANJE 
JOURNAL OF ELEMENTARY EDUCATION 
 
 
 
The conclusion suggests the incorporation of new methods “but they may more directly 
address the size of an effect (and its associated uncertainty) or whether the hypothesis 
is correct” (Ibid. p. 132). 
The limit of statistical significance, or the null hypothesis, has been a topic of 
considerable debate in the literature for many years. Opinions range from calls for its 
complete abolition to suggestions like those from Benjamin et al. (2017), who propose 
reducing the threshold from 0.05 to 0.005. According to the authors, this change would 
be a significant step forward that could enhance reproducibility in research. They 
emphasize that this redefinition of the p-value, with its new standard, pertains 
specifically to research records and not to scientific publications. The aim is to observe 
how scientists behave under stricter criteria. I believe such a dual approach is 
unnecessary since we often select levels of statistical significance (0.05 and 0.01) for 
hypothesis testing in the social sciences (inferential statistics). Of course, the 0.005 level 
is quite stringent, and the authors highlight this viewpoint in their paper. They justify 
their approach in relation to Bayesian analysis, as it corresponds to Bayes factors ranging 
from approximately 14 to 26 in favour of the alternative hypothesis (H1). 
Lakens et al. (2018) suggest that instead of lowering the significance threshold from 
0.05 to 0.0005, researchers should abandon the term “statistical significance” altogether. 
They recommend that scientists focus on controlling error rates with an alpha level that 
is determined by the researcher. Similarly, de Ruiter (2019), while critiquing the proposal 
to lower the significance level to 0.005, argues that setting an alpha level of p ≤ 0.005 
does not enhance replicability. He believes that the rationale for adopting a new alpha 
level of 0.005 is weak and that such a change could potentially harm scientific practice. 
I agree with the recommendations made by the ASA, emphasizing that it is the 
responsibility of educators to ensure that scientists understand the term “statistically 
significant.” However, without clear criteria, we risk entering a realm of scientific 
“sfumato”, a form of voluntarism lacking defined standards. The p-value criterion (p < 
0.05) is inadequate because it is influenced by sample size and fails to accurately reflect 
the true magnitude of differences or relationships (e. g. in subsamples). Moreover, 
rejecting the null hypothesis does not provide evidence for confirming the alternative 
hypothesis. Null Hypothesis Significance Testing (NHST) is a statistical procedure that 
involves establishing a null hypothesis, generating data related to it, and assessing how 
much the outcome disagrees with the null hypothesis, using statistical estimates. 
While it may be a questionable criterion, having some standard is certainly better than 
having none at all, or focusing solely on the individual scientist and their choice of 
methods and procedures. 
S. Opićć New Polystochastic Statistical Inference in Social Sciences - 
Defining new Rules and Thresholds 
111
.
 
 
 
It is important to define when something exists or does not exist, and the criteria used 
to reject the null hypothesis when confirming an alternative hypothesis. When Fisher 
established the p<0.05 threshold, he acknowledged that it was not necessarily the best 
option available. The absence of any criteria is indeed worse than having an arbitrary 
yet compromise-based standard. Now, 100 years after Fisher introduced this threshold, 
we still struggle to reach a scientific consensus on how to conduct statistical inference. 
Authors tend to concentrate too much on identifying what is wrong, what requires 
change, and proving the “pollution” of our current model instead of seeking an effective 
solution. This solution will not be perfect; after all, determinism is increasingly less 
relevant as a scientific postulate, making way for a focus on probabilism. 
 
Polystochastic Statistical Inference in the Social Sciences  
 
Polystochastic Statistical Inference in the Social Sciences is a concept that combines a 
revised form of the Null Hypothesis Significance Testing (NHST) system, dependent 
on the sample size (n), with Bayesian inference and effect size, within certain limits. The 
framework of polystochastic statistical inference consists of two main components:  
1. Rejection of the null hypothesis (H0).  
2. The confirmation or rejection of the alternative hypothesis (Hn). 
 
When the sample size is >120 (n >120): 
 
 Use a significance level of p ≤ 0.01 (or smaller). For large samples (n > 100, 
i.e., 120), the sample mean reliably approximates a normal distribution, 
particularly in populations with pronounced skewness. A stricter criterion is 
necessary because sample size has a significant effect on statistical significance, 
as noted by Opić and Rijavec (2022). This leads to an increased risk of Type I 
error. With larger samples, even minor differences can appear statistically 
significant. Additionally, as the sample size increases, the standard error 
decreases. This is particularly important when dealing with pronounced 
asymmetry (skewness) or variability (variance), since sample size greatly 
influences the normalisation of the distribution. One valuable and effective 
method is bootstrapping (resampling), where the sample is treated as a 
population. 
112 
REVIJA ZA ELEMENTARNO IZOBRAŽEVANJE 
JOURNAL OF ELEMENTARY EDUCATION 
 
 
 
In Bayesian inference, sample size is less critical than in the Frequentist 
approach; consequently, a smaller sample size is sufficient to achieve the same 
level of efficiency, as discussed by Ali, Waheed, Shah, and Raza (2023). 
Although it is generally considered that the Central Limit Theorem (CLT) begins to 
apply when the sample size (n) is greater than 30, this is conditional. The assumption is 
that the population does not exhibit significant skewness or kurtosis. For populations 
with pronounced skewness or heavy tails (platykurtic distributions, such as the t-
distribution with low degrees of freedom), a much larger sample size is required to meet 
the prerequisite of normal distribution. However, this does not apply to distributions 
like the Pareto distribution, which is not influenced by the CLT since it has unlimited 
variance. For sample sizes greater than 120, the t-distribution closely resembles a normal 
distribution. This is why a sample size limit of 120 is defined. 
So, if we have a category of large samples (n>120), CLT also works in the case when 
asymmetries (skewness) and kurtosis (heavy-tailed) are expressed, which indicates that 
we meet the main prerequisite for normal distribution, which is required for parametric 
statistics. However, we then need to reduce the level of statistical significance to p≤0.01 
because the size of the sample affects the statistical significance; i.e., we will reject the 
null hypothesis sooner on large samples than on small samples. The 0.01 criterion is not 
too strict, and it is already used in medical research (often a much stricter criterion), and 
in research with high stakes; accordingly, it should become the default for social sciences 
as well. Therefore, stronger evidence is needed to reject the null hypothesis, but the type 
2 error does not increase significantly (as in the case of proposals, it is reduced to a very 
strict criterion, e. g., 0.0005; (Benjamin et al. 2017). 
Rationale for application of the criteria - there are no restrictions on the application of the 
change of criteria. The advantage is the fact that stronger evidence against the null 
hypothesis will be needed, thus reducing the type one error. Reducing the p-value 
criterion to p≤0.01 controls the influence of the sample size (n) on statistical 
significance when it comes to rejecting the null hypothesis, but it still does not solve the 
confirmation of the alternative hypothesis. 
 
When the sample is <120 (n<120) 
 
 Use the significance level p≤0.05 (or smaller). In smaller samples, the 
influence on statistical significance is not so pronounced. 
Both basic conditions refer only to confirming/rejecting the null hypothesis.
S. Opićć New Polystochastic Statistical Inference in Social Sciences - 
Defining new Rules and Thresholds 
113
.
 
 
 
To confirm the alternative hypothesis, at least 2 out of 3 criteria must be met:  
 
1. **Statistical Significance**: Confirmed level of statistical significance (n>120; 
p≤0.01 (or smaller), i.e. n<120, p≤0.05 or smaller) to reject the null hypothesis. The 
null hypothesis must be rejected (Mandatory). 
 
2.    **Effect Size**: The effect size should be at least moderate. The effect size (d) 
provides insight into the actual magnitude of differences in differential designs (Sullivan 
and Feinn, 2012; Balow, 2017). While statistical significance indicates that the result is 
unlikely to occur by chance, effect size quantifies how substantial the differences are. 
The most commonly used effect size in differential designs is Cohen's d (Cohen, 1968), 
where 0.2 is considered small, 0.5 medium, and 0.8 large. Therefore, to provide evidence 
in favour of the alternative hypothesis (Hn), it is necessary to meet the criterion of a 
medium (moderate) effect size for a given test. 
 
Table 1  
Shows the most commonly used effect sizes with reference values. 
 
Effect size small Medium large 
Cohen`s d (t test) 0.2 0.5 0.8 
Eta squared η
2 
(ANOVA) 0.01 0.06 0.14 
Cohen’s f (one way ANOVA/ANCOVA) 0.1 0.25 0.4 
Omega squared ω2 (ANOVA) 0.01 0.06 0.14 
Multivariate Omega squared ω2  
(one way ANOVA, MANOVA) 
0.01 0.06 0.14 
F-Squared  f2 (multiple nad partial corr) 0.02 0.15 0.35 
r Pearson 0.1 0.3 0.5 
Odds Ratio (OR) close to 1 Around 2 3or more 
Odds ratio (2*2) 1.5 3.5 9.0 
η2 (multiple regression) 0.02 0.13 0.26 
 Cohen’s ω (chi square) 0.1 0.3 0.5 
Spearman rho (Friedman) 0.1 0.3 0.5 
Cramer V (r x c frequency tables) 
0.1 (Min(r-1,c-
1)=1), 0.07 (Min(r-
1,c-1)=2), 0.06 
(Min(r-1,c-1)=3)  
0.3 (Min(r-1,c-
1)=1), 0.21 
(Min(r-1,c-
1)=2), 0.17 
(Min(r-1,c-1)=3)  
0.5 (Min(r-1,c-
1)=1), 
0.35(Min(r-1,c-
1)=2), 0.29 
(Min(r-1,c-
1)=3) 
(Source; Vacha-Haase and Thomson, 2004; Cohen, 1992; Cohen, 2008); https://imaging.mrc-
cbu.cam.ac.uk/statswiki/FAQ/effectSize)
114 
REVIJA ZA ELEMENTARNO IZOBRAŽEVANJE 
JOURNAL OF ELEMENTARY EDUCATION 
 
 
 
Of course, as with any other statistical indicator, there are limitations. For example, 
McGrath and Meyer, 2006) show that rpb (point-biserial correlation) is sensitive to 
sample size, but when it comes to unequal variances, this is also the case with Cohen’s 
d. (compare Ruscio, 2008), and a correction was proposed; they suggested larger values 
to represent effects (small-medium-large) as the group sizes become more unequal. 
Calculating effect size is an arbitrary procedure, similar to what Fisher noted about 
p<0.05. Therefore, it is recommended for use only when no better basis for estimating 
the index is available (Cohen, 1988, p. 25). However, there is no ideal statistical 
procedure, and there are no certain limitations, but the Effect size is very little 
influenced by the sample size and shows the real relationship between the variables 
(shown in the empirical part of the paper) and a very useful indicator in favour of the 
alternative hypothesis (Hn). 
Rationale for the application of the criteria – the list of effect sizes is large, and the author 
selects a specific one that corresponds to the test used to test the hypotheses. Table 1 
shows the most used ones, but this does not mean that the list is not expanding. 
However, the author chooses a certain and calculated value that should have at least a 
medium effect to fulfil this criterion - in favor of the alternative hypothesis (Hn). 
 
3. **Bayes factor**  
 
It should be BF(10) > 3, that is, indicating Moderate evidence for H1. 
Bayes factor and Bayesian inference are highly useful statistical approaches, and many 
papers indicate the advantages of using them over p-values (Stern, 2016., Hoijtink, van 
Kooten, Hulsker, 2016., Jarosz and Wiley, 2014., Assaf and Tsionas, 2018, Goodman, 
2008, 2005., Lavine and Schervish, 1999., Morey, Romeijn and Rouder, 2016, 
Andraszewicz et al, 2015). 
 
Bayes factor defined;  
 ( | 𝐇𝟏 )
 ( | 𝐇𝟎 )
,  where is the posterior probability. 
 
Pr (H0∣Data) =  
  ( ) ∣ )⋅   ( )
 ( )
 , analogously  
 
Pr (H1∣Data) = 
 ( ) ∣ )⋅   ( )
 ( )
 (Bayes theorem) 
 
The Bayes factor is a significant step forward in statistical inference, especially because 
it allows insight into the probability of an alternative hypothesis (which is not the case 
with NHST), but like all approaches in statistics, it has limitations. 
S. Opićć New Polystochastic Statistical Inference in Social Sciences - 
Defining new Rules and Thresholds 
115
.
 
 
 
One of these is the Jeffreys-Lindley paradox (Bartlett, 1957; Lindley, 1957). This 
concerns the influence of sample size on the BF value. In the frequentist approach, 
large samples affect lower p values, i. e., in favour of the Alternative hypothesis (Hn - 
or rejecting an H0), while in the Bayesian approach, large samples affect higher values 
of BF(01), i.e., in favour of H0. So, we have a paradox because the sample size in the 
frequentist approach significantly affects the probability of rejecting H0, but at the same 
time in the Bayesian approach, it can affect a higher probability in favour of H0 
(Huisman, 2023). 
In the literature and machine learning, the interpretation of BF is confusing: the 
interpretation of BF can be B10, i. e., alternative vs null hypothesis, or BF01, null vs 
alternative. Most often, when the label is not used, it means BF01. 
 
BF10>1: Evidence favours H1. BF10<1: Evidence Favors H0.  
BF01>1: Evidence favours H0. BF01<1: Evidence Favors H1.  
Bayes interpretation table (Adjusted to BF10; From Jeffreys, 1961) 
 
Table 2 
Bayes factor (BF10) 
 
> 100 Extreme evidence for H1 
30 100 Very strong evidence for H1 
10 30 Strong evidence for H1 
3 10 Moderate evidence for H1 
1 3 Anecdotal evidence for H1 
 1 No evidence 
1/3 1 Anecdotal evidence for H0 
1/10 1/3 Moderate evidence for H0  
1/30 1/10 Strong evidence for H0 
1/100 1/30 Very strong evidence for H0 
< 1/100 Extreme evidence for H0 
 
This condition stipulates that the Bayes Factor (BF10) must be at least 3 - indicating 
moderate evidence for the alternative hypothesis (H1). While a stricter criterion could 
have been applied, it would likely have caused more issues than benefits, particularly 
when considering certain limitations of Bayesian inference, such as the Jeffreys-Lindley 
paradox. This is especially relevant in cases of specific definition prior probability (P(θ)), 
particularly when using a non-informative prior (uniform).
116 
REVIJA ZA ELEMENTARNO IZOBRAŽEVANJE 
JOURNAL OF ELEMENTARY EDUCATION 
 
 
 
Rationale for the application of the criterion - the application of this criterion may be limited 
because e.g. in multivariate tests, the application of Bayesian inference (BF) is limited 
and under development. Moreover, in complex models, there are a number of 
limitations (challenges) in the application of BF (Bollen, Harden, Ray, and Zavisca, 
2014), including the problem of using an ordinal scale, and the problem of using BF in 
non-parametric tests (Yuan, and Johnson, 2008). Of course, there are always challenges, 
but at the same time, most of the works in the univariate approach have BF calculations 
in statistical programs for data processing, and further development and application are 
expected. 
We can therefore show Polystochastic Statistical Inference in the Social Sciences 
graphically (Table 3): 
 
Table 3 
Polystochastic Statistical Inference in the Social Sciences 
 
 
Rejecting null hypothesis (H0) 
In case n<120 
In case n>120 
p≤0,05 (or smaller) 
p≤0,01 (or smaller) 
Proving the alternative 
hypothesis (Hn) 
Rejected H0 
when n<120; p≤0,05 or smaller  
when n>120; p≤0,01 or smaller  
Condition 1 (mandatory) 
Bayes factor BF10 > 3 Condition 2 
Effect size - medium Condition 3 
 
A total of 2 out of 3 
conditions must be met 
An empirical example 
 
For the simulations (scenarios X1, X2, and X3), a matrix was utilized with the 
independent variable being study type (undergraduate, graduate, integrated study; 
∑n=75) and the dependent variable measured on an ordinal scale using a Likert scale 
with 5 points. 
Differences between sub-groups were tested using ANOVA; 
S. Opićć New Polystochastic Statistical Inference in Social Sciences - 
Defining new Rules and Thresholds 
117
.
 
 
 
Scenario X1, n=75 
 
X1 
𝑛 1 = 25
𝑛 2 = 21
𝑛 3 = 29
  
 ⎯ ⎯ ⎯ F(2.72) =1.451, p=0.241;    
x̄1 = 3.40; σ1 =  1.258; stdErorr = 0,252
x̄2 = 3.71; σ2 =   1.056; stdErorr = 0,230
x̄3 = 3.90; σ3 =   0.900; stdErorr = 0,167
  
                               MSB=1.672; MSW=1,152 
 
Effect size; η
2
= 0.039 ; CI (95%) = 0.000 lower  to  0.138 upper 
                   ε
2
=0.012; CI (95%) =-0.028 lower to  0.114 upper 
 
BF(10) =0.052 (JZS) 
 
In a sample of n=75, the null hypothesis, which posits that there are no differences 
between the subsamples concerning the dependent variable, is confirmed. The effect 
size, measured by eta squared, indicates a very weak real difference. Moreover, Bayesian 
inference shows a Bayes Factor of BF(10) =0.052, which does not lend support to the 
alternative hypothesis (Hn).  
When the results are multiplied in a larger matrix sample of n=150, the findings are as 
follows (scenario X2): 
Scenario X2; n=150 
 
X2 
𝑛 1 = 50
𝑛 2 = 42
𝑛 3 = 58
 
 ⎯ ⎯ ⎯ F(2.147) =2.963, p=0.055  
x̄1 = 3.40; σ1 =  1.245; stdErorr = 0.176
x̄2 = 3.71; σ2 =   1.043; stdErorr = 0.161
x̄3 = 3.90; σ3 =   0.892; stdErorr = 0.117
 
                            MSB=3.345; MSW=1.129 
 
Effect size; η
2
= 0.039; CI(95%)=0.000 lower – 0.108 upper 
                   ε
2
=0.026; CI(95%)=-0.014 lower – 0.096 upper 
 
BF(10) =0.117 (JZS) 
 
By duplicating the results in the matrix, the arithmetic mean remained unchanged (n1, 
n2, n3). However, the standard errors decreased because the denominator includes √n. 
The results of the F ratio suggest that we are nearing the threshold for rejecting the null 
hypothesis at a statistical significance level of p < 0.05. Nonetheless, the effect size 
values remained the same (η² = 0.039). Additionally, the Bayesian inference results 
(BF10 = 0.117) do not support the alternative hypothesis. 
Then, multiplying the results in the matrix (N=300), the results are as follows (scenario 
X3): 
118 
REVIJA ZA ELEMENTARNO IZOBRAŽEVANJE 
JOURNAL OF ELEMENTARY EDUCATION 
 
 
 
Scenario X3; n=300 
 
X3
   𝑛 1 = 100
𝑛 2 = 84
𝑛 3 = 116
 ⎯ ⎯ ⎯ F(2.297)=5.896;p=0.003;  
x̄1 = 3.40; σ1 =  1.239; stdErorr = 0.124
x̄2 = 3.71; σ2 =   1.036; stdErorr = 0.113
x̄3 = 3.90; σ3 =   0.888; stdErorr = 0.082
 
                       MSB=6.689; MSW=1.118 
 
Effect size; η
2
= 0.039; CI (95%) = 0.005 lower to  0.086 upper 
                   ε
2
= 0.032; CI (95%) = - 0.002 lower  to 0.080 upper 
 
BF(10) =1.146 (JZS) 
 
For a sample of 300 respondents, the results indicate a rejection of the null hypothesis, 
with a p-value of 0.003. In scenario X3 (n=300), the arithmetic means remained 
unchanged since the data set was identical. However, the null hypothesis significance 
testing (NHST) still led to a rejection of the null hypothesis (p=0.003). The effect size, 
measured by η², was consistent at 0.039 across the x1, x2, and x3 models, indicating a 
very weak real difference. Additionally, the Bayesian inference showed a Bayes Factor 
(BF10) of 1.146, which does not support the alternative hypothesis (Hn).  
The Jeffreys-Lindley paradox is not evident in this case, despite using a non-informative 
prior (a uniform prior over the mean). This is because increasing the sample size (n) did 
not result in a decrease in the Bayes factor in support of the null hypothesis. Clearly, the 
sample size has a stronger impact on the p-value than on the Bayes factor. In this case, 
the null hypothesis is rejected in the X3 model. However, there is not enough evidence 
to confirm the differences between the subsamples, as neither Bayesian analysis nor 
effect size support such a conclusion. According to the polystochastic inference 
approach, to validate the alternative (affirmative) hypothesis, at least two out of three 
conditions must be met. In this instance, only one condition has been satisfied. 
Therefore, even though the null hypothesis was rejected in scenario 3, the alternative 
hypothesis (Hn) regarding the existence of differences between the subsamples is not 
confirmed. 
In the subsequent section, a new simulation is introduced (y1; y2; y3; y4). The 
differences between male students (n=32) and female students (n=40) regarding the 
dependent variable (The online classes were well organized) using a sample size of n=72 were 
tested. A T-test for independent samples was conducted. The results are as follows: 
S. Opićć New Polystochastic Statistical Inference in Social Sciences - 
Defining new Rules and Thresholds 
119
.
 
 
 
Scenario Y1; n = 72 
 
Y1
𝑛1 = 32
𝑛2 = 40
    
 ⎯ ⎯ ⎯ t(70) = -0.956, p=0.342     
 x ̄ 1 = 3.41    σ1 = 1.012; stdError = 0.179 
x̄2 = 3.65    σ2 = 1.122; stdError = 0.177
 
 
Although the disproportion of the subsample is partially expressed, the group is 
homogeneous; F (70;68.957) =0.126, p=0.724. Also, when sampling the distribution, 
there is no significant asymmetry, skew=-0.813, nor is it a pronounced significant 
leptokurtic distribution (Kurtosis=0.272). Analogously, in the case of subsamples, the 
sampling distribution is not markedly asymmetric, nor is significant kurtosis 
pronounced. 
Effect size; Cohen’s d = - 0.227; CI (95%) = -0.692 lower  to  0.240 upper 
                   Hedges´ correction = - 0.224; CI (95%) = - 0.685 lower  to 0.238 upper 
                   Glass delta = - 0.217; CI (95%) = - 0.683 lower  to 0.251 upper 
 
BF(01) =3.650, posterior distribution in intervals is shown in Figure 1 
 
 
 
Figure 1 - posterior distribution (Credible Intervals) 
 
Thus, on a sample of 72 subjects, the NHTS approach is confirmed by the null 
hypothesis of no difference between subsamples, Cohen’s d indicates a very low 
difference between subsamples, nor does and BF(01) favour the alternative (affirmative) 
hypothesis.  
 
In the further simulation (Y2), the results in the matrix were multiplied; (n=144), the 
results are as follows: 
120 
REVIJA ZA ELEMENTARNO IZOBRAŽEVANJE 
JOURNAL OF ELEMENTARY EDUCATION 
 
 
 
Scenario Y2; n = 144 
 
Y2  
𝑛 1 = 64
𝑛 2 = 80
 
 ⎯ ⎯ ⎯ t (142)= - 1,362; p= 0,175; 
 x ̄ 1 = 3.41    σ1 = 1.003; stdError = 0.125 
x̄2 = 3.65    σ2 = 1.115; stdError = 0.125
 
 
The set is homogeneous; F (142;139,994)=0.256, p=0.614 
Effect size; Cohen’s d = - 0.228; CI (95%) = - 0.558 lower  to  0.102 upper 
                   Hedges´ correction = -0.227; CI (95%) = - 0.555 lower  to  0.101 upper 
                   Glass delta = - 0.219; CI (95%)  = - 0.548 ower  to  0.113 upper 
 
BF(01) =3.173  
 
There was a decrease in the p-value (0.342 to 0.175), which still does not indicate the 
rejection of the null hypothesis, and at the same time, the BF and the effect size are not 
in favor of H1. In the further simulation (Y3), the results in the matrix are multiplied; 
(n=288), the results are as follows: 
 
Scenario Y3, n = 288 
 
Y3 
𝑛 1 = 128
𝑛 2 = 160
 ⎯ ⎯ ⎯  t(286) = -1,933; p= 0,054 
 x ̄ 1 = 3.41    σ1 = 1.000; stdError = 0.088 
x̄2 = 3.65    σ2 = 1.111; stdError = 0.088
 
 
The set is homogeneous; F (286;282,067)  = 0.515, p=0.474 and t value is used: equal variance assumed. 
Effect size; Cohen´s d = -0.229;  CI(95%)= - 0.462 lower  to  0.004 upper 
                   Hedges´ correction= -0.229; CI(95%)= -0.461 lower  to 0.004 upper 
                   Glass delta =-0.219; CI(95%)= -0.453 ower  to  0.015 upper 
 
BF(01) =1.745 
 
In the Y3 simulation, the impact of sample size on statistical significance is evident. At 
the p < 0.05 level, the null hypothesis (H0) can be rejected since it is at the threshold 
value. However, it is not rejected at the p ≤ 0.01 level. The arithmetic means, Cohen's 
d (effect size), Hedges’ g correction, and Glass delta all remain unchanged (very small 
differences) and indicate a small effect. Additionally, the Bayes Factor BF(01) does not 
support the alternative hypothesis. 
And finally, we have the Y4 simulation (n=576) 
 
S. Opićć New Polystochastic Statistical Inference in Social Sciences - 
Defining new Rules and Thresholds 
121
.
 
 
 
Scenario Y4, n=576 
 
Y4 
𝑛1 = 256
𝑛2 = 320
 ⎯ ⎯ ⎯  t (574)=-2.739; p = 0.006 
 x ̄ 1 = 3.41    σ1 = 0.998; stdError = 0.062 
x̄2 = 3.65    σ2 = 1.110; stdError = 0.062
 
 
Effect size; Cohen’s d = - 0.230;  CI (95%) = - 0.394 lower  to  - 0.065 upper 
                   Hedges´ correction = - 0.229; CI (95%)= - 0.394 lower  to  -0.065 upper 
                   Glass delta = - 0.220; CI (95%) = - 0.385 ower  to  - 0.054 upper 
 
BF(01) = 0.378, or BF(10) =1/BF(01) =2.64. The posterior distribution in the intervals is shown in Figure 
2 (prior is flat; noninformative) 
 
Figure 2- posterior mean difference (Mean diff Posterior) 
 
In this case, the null hypothesis is rejected at a significance level of p < 0.01 since p = 
0.006. However, there is still no evidence to support the alternative hypothesis. Cohen’s 
d is -0.230, indicating a low effect size, and the Bayes Factor (BF01) is 0.378, which 
means that BF10 is 1/BF01, resulting in BF10 = 2.64. Although the value of BF10 
(favouring the alternative hypothesis) increased with the sample size, moderate evidence 
for the alternative hypothesis (Hn) was still not achieved. 
In the simulations involving scenarios X1, X2, X3, X4, and Y1, Y2, Y3, Y4, Y5, the 
influence of the sample size of the respondents is evident. Additionally, the 
effectiveness of the Polystochastic Statistical Inference in the Social Sciences approach 
is highlighted, as it controls for type 1 errors (n > 120, p < 0.01) and the probabilities 
of confirming the alternative (affirmative) hypothesis (Hn). 
 
Conclusion 
 
Even after 100 years since the significant contributions of Sir Ronald Aylmer Fisher to 
the field of statistical inference, many papers published today continue to demonstrate 
that this approach has major flaws. It often leads to misconceptions, incorrect 
interpretations, wrong conclusions, and generalizations. 
122 
REVIJA ZA ELEMENTARNO IZOBRAŽEVANJE 
JOURNAL OF ELEMENTARY EDUCATION 
 
 
 
Furthermore, it is estimated that a substantial percentage of papers—approximately 
80%—in the social sciences arrive at erroneous conclusions based on the null 
hypothesis significance testing (NHST) approach. This increasingly resembles Gödel’s 
Incompleteness Theorem, which, when interpreted, relates to the idea of proving something 
that cannot be proven. However, as early as 1925, Fisher acknowledged that this 
approach was not the best solution. Today, numerous papers highlight the 
shortcomings of the existing NHST system and the limitations of other methodologies. 
Polystochastic statistical inference in the social sciences introduces a new approach that 
clearly defines the boundaries of statistical inference. By lowering the p-value threshold 
from 0.05 to 0.01 (or smaller) for large samples (n > 120), we can better control the 
influence of sample size on statistical significance, effectively reducing the risk of a Type 
I error. While some research suggests that an even stricter criterion may be necessary, 
this can lead to an increased risk of a Type II error. There is no universally ideal 
threshold. However, the significant advantage of the polystochastic statistical inference 
approach in the social sciences lies in its ability to support an alternative (affirmative) 
hypothesis when the null hypothesis is rejected. 
To confirm the alternative hypothesis, 2 of 3 conditions must be met; the compulsory 
condition is that the null hypothesis is rejected, then the Effect size is at least medium, 
and BF (10) > 3. We could see this as the need to introduce a stricter criterion (e.g., BF 
(10) >10 or more, indicate the limitations of Bayesian inference for complex models 
(which is correct), or indicate the operation of the Jeffreys-Lindley paradox, the 
problematic nature of the non-informative prior. However, Polystochastic Statistical 
Inference in the Social Sciences offers a framework that provides clear rules (thresholds) 
for statistical inference in the social sciences. The approach is set to allow the author to 
control the influence of sample size on the probability of rejection of the null 
hypothesis, but what is more important is that it has a framework for confirming the 
alternative hypothesis. The author has the option of choosing conditions (2/3) because 
it is assumed that for certain multivariate tests, statistical programs still do not offer 
Bayesian, or, for example, with certain non-parametric tests, Bayesian is not yet often 
being used (or is controversial). 
The new approach, Polystochastic Statistical Inference in the Social Sciences, represents 
a significant advance in statistical inference within this field, providing clear rules and 
thresholds. It maintains flexibility in its application, avoiding a substantial increase in 
Type II error, even if we were to pursue a further reduction in p-values. Additionally, it 
offers a balanced method for confirming alternative hypotheses. 
S. Opićć New Polystochastic Statistical Inference in Social Sciences - 
Defining new Rules and Thresholds 
123
.
 
 
 
Authors are encouraged to specify which approach they have chosen in their work, 
whether it be NHST or Polystochastic Statistical Inference (PSSI). Beyond this, the 
approach provides valuable statistical insights, such as confidence intervals and credible 
intervals, aimed at enhancing our understanding of the data. Ultimately, PSSI establishes 
a clear framework and threshold for statistical inference in the social sciences. 
Acknowledgment - I would like to thank my colleague Irena Tadic for the matrix used 
for the empirical part and the reviewers for their valuable suggestions. I am grateful to 
the Centre for Educational Measurement and Assessment (CEMA) for its support and 
to Fisher Library the University of Sydney for the space to work (100 years after Fisher).  
 
References 
 
Ali, S., Waheed, M., Shah, I., and Raza, S. M. M. (2023). Bayesian sample size determination for coefficient 
of variation of normal distribution. Journal of Applied Statistics, 51(7), 1271–1286. https://doi.o–
rg/10.1080/02664763.2023.2197571 
Andraszewicz, S., Scheibehenne, B., Rieskamp, J., Grasman, R., Verhagen, J., and Wagenmakers,
 
E. J. 
(2015). An Introduction to Bayesian Hypothesis Testing for Management Research. Journal of 
Management, 41(2), 521-54. https://doi.org/10.1177/0149206314560412 
Assaf, G. A., and Tsionas, M. (2018). Bayes factors vs. P-values. Tourism Management, 67,17-31. 
Balow, C. (2017). The “effect size” in educational research: What is it and how to use it? [sic] Illuminate 
Education. Retrieved from  
www.illuminateed.com/blog/2017/06/effect-size-educational-research-use/ on July 14, 2019. 
Bartlett, M. S. (1957). A comment on D. V. Lindley’s statistical paradox. Biometrika, 44, 533–553. 
Benjamin, D. J., Berger, J. O., Johannesson, M. et al. (2017). Redefine statistical significance. Nature Human 
Behaviour, 2(1), 6-10.doi: 10.1038/s41562-017-0189-z.  
Bollen, K. A., Harden, J. J., Ray, S., and Zavisca, J. (2014). BIC and Alternative Bayesian Information 
Criteria in the Selection of Structural Equation Models. Structural Equation Modeling: A 
Multidisciplinary Journal, 21,1–19. 
Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences (1st ed.). New York, NY: Academic Press. 
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum. 
Cohen, J. (1992). Power Primer. Psychological Bulletin [PsycARTICLES]; 112, 1; PsycARTICLES, pp. 155-
159. 
De Ruiter, J. (2019). Redefine or justify? Comments on the alpha debate. Psychonomic Bulletin & Review, 26, 
430–433, https://doi.org/10.3758/s13423-018-1523-9. 
Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh, UK: Oliver and Boyd.  
Goodman, S. (2005). Introduction to Bayesian methods I: Measuring the strength of evidence. Clinical 
Trials, 2 (4), 282-290. 
Goodman, S. (2008). A Dirty Dozen: Twelve P-Value Misconceptions. Seminars in Hematology, 45 (3), 135-
140. 
Hoijtink, H., van Kooten, P., and Hulsker, K. (2016). Why Bayesian psychologists should change the way 
they use the Bayes Factor. Multivariate Behavioral Research, 51, 2-10. doi: 10.1080/0027317–
1.2014.969364 
Holmberg, A. (2024). Toward a Better Understanding of Statistical Significance and p Values in Nursing. 
Nursing Forum, Article ID 7263781, 1-11 pageshttps://doi.org/10.1155/2024/7263781 
Huisman, L. (2023). Are P‐values and Bayes factors valid measures of evidential strength? Psychonomic 
Bulletin & Review, 30, 932–941. https://doi.org/10.3758/s13423-022-02205-x 
124 
REVIJA ZA ELEMENTARNO IZOBRAŽEVANJE 
JOURNAL OF ELEMENTARY EDUCATION 
Jarosz, A., and Wiley, J. (2014). What are the odds? A practical guide to computing and reporting Bayes 
factors. Journal of Problem Solving, 7, 2-9. doi: 10.7771/1932-6246.1167 
Jeffreys H. (1961). Theory of Probability (3rd ed.). New York: Oxford University Press. 
Lakens, D., Adolfi, F., Albers, C., Anvari, F., Apps, M., Argamon, S., ... Bradford, D. (2018). Justify your 
alpha. Nature Human Behavior, 2, 168-171. 
Lavine, M., and Schervish, M. J. (1999). Bayes factors: What they are and what they are not. The American 
Statistician, 53 (2), 119-122. 
Lindley, D. V. (1957). A statistical paradox. Biometrika, 44, 187–192. 
McGrath, R. E., and Meyer, G. J. (2006). When effect sizes disagree: the case of r and d. Psychological Methods, 
11(4), 386-401. doi: 10.1037/1082-989X.11.4.386. 
McShane, B. B., Gal, D., Gelman, A., Robert, C., and Tackett, J. L. (2019). Abandon Statistical Significance. 
The American Statistician, 73 (1), 235-245. 
Morey, R. D., Romeijn, J.W., and Rouder, J. N. (2016). The philosophy of Bayes factors and the 
quantification of statistical evidence. Journal of Mathematical Psychology, 72, 6–18. 
Opić, S, and Rijavec, M. (2022). Misconceptions of the p-value - let us use new approaches and procedures 
// 2. Međunarodna znanstvena i umjetnička konferencija Suvremene teme u odgoju i 
obrazovanju - STOO 2 In memoriam Prof. Emer. Dr sc. Milan Matijević u suradnji sa Zavodom 
za znanstvenoistraživački i umjetnički rad Hrvatske akademije znanosti i umjetnost / D. Velički 
and M. Dumančić, (eds.). Zagreb: Sveučilište u Zagrebu Učiteljski fakultet, 2022. pp. 1-21. 
Pearson, K. X. (1900). On the Criterion that a Given System of Deviations from the Probable in the Case 
of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen 
from Random Sampling.  London, Edinburgh and Dublin Philosophical Magazine and Journal of 
Science, 50 (302), 157–175, https://doi.org/10.1080/14786440009463897  
Perezgonzalez, J. D. (2015). Fisher, Neyman-Pearson or NHST? A Tutorial for Teaching Data Testing, 
Frontiers in Psychology 6. https://doi.org/10.3389/fpsyg.2015.00223- 
Rovetta, A. (2024). Abandon Statistical Significance: A Gentle Introduction to S-values and S-intervals.  
https://doi.org/10.31219/osf.io/ywhu9 (https://osf.io/preprints/osf/ywhu9) 
Ruscio, J. (2008). A Probability-Based Measure of Effect Size: Robustness to Base Rates and Other Factors. 
Psychological Methods, 13(1), 19 –30. 
Sullivan, G. M., and Feinn, R. (2012). Using effect size—Or why the p-value is not enough. Journal of 
Graduate Medical Education, 4(3), 279–282. https://doi.org/10.4300/JGME-D-12-00156.1 
Stern, H. S. (2016). A test by any other name: P-values, Bayes Factors, and statistical inference. Multivariate 
Behaviour Research, 51(1), 23-39. doi:10.1080/00273171.2015.1099032 
Vacha-Haase, T., and Thomson, B. (2004). How to Estimate and Interpret Various Effect Sizes. Journal of 
Counseling Psychology, 51(4), 473–481.  https://doi.org/10.1037/0022-0167.51.4.473 
Wasserstein, R. L., and Lazar, N. A. (2016). ASA Statement on Statistical Signiﬁcance and P-Values - 
Context, Process, and Purpose. The American Statistician, 70 (2), 129-133.  
Yuan, Y., Johnson, V. E. (2008). Bayesian Hypothesis Tests using Nonparametric Statistics. Statistica Sinica, 
18 (3), 1185-1200. 
Original quote: an erratum (page 110);
Similarly, de Ruiter (2019), while critiquing the proposal to lower the significance level to 0.005, 
argues that setting an alpha level of p ≤ 0.005 does not enhance replicability. He believes that the 
rationale for adopting a new alpha level of 0.005 is weak and that such a change could potentially 
harm scientific practice
Author 
Siniša Opić, PhD 
Full Professor, Universtiy of Zagreb, Faculty of Teacher Education, Savska 77, 10000 Zagreb, Croatia, e-
mail: sinisa.opic@ufzg.hr 
Redni profesor, Univerza v Zagrebu, Pedagoška fakulteta, Savska 77, 1000 Zagreb, Hrvaška, e-pošta: 
sinisa.opic@ufzg.hr