Informatica 42 (2018) 175–188 175
A Modiﬁcation of the Lasso Method by Using the Bahadur Representation for
the Genome-Wide Association Study
Lev V . Utkin
Peter the Great St.Petersburg Polytechnic University, St.Petersburg, Russia
E-mail: lev.utkin@gmail.com
Yulia A. Zhuk ITMO University, St.Petersburg, Russia
E-mail: zhuk_yua@mail.ru
Keywords: data analysis, feature selection, Lasso, Bahadur representation, genome-wide association study
Received: September 24, 2016
A modiﬁcation of the Lasso method as a powerful machine learning tool applied to a genome-wide associ-
ation study is proposed in the paper. From the machine learning point of view, a feature selection problem
is solved in the paper, where features are single nucleotide polymorphisms or DNA-markers whose asso-
ciation with a quantitative trait is established. The main idea underlying the modiﬁcation is to take into
account correlations between DNA-markers and peculiarities of phenotype values by using the Bahadur
representation of joint probabilities of binary random variables. Interactions of DNA-markers called the
epistasis are also considered in the framework of the proposed modiﬁcation. Various numerical experi-
ments with real datasets illustrate the proposed modiﬁcation.
Povzetek: Predstavljena je modiﬁkacija metode strojnega uˇ cenja, imenovana Lasso.
1 Introduction
One of the important area for a successful application of the
artiﬁcial intelligence, in particular, machine learning algo-
rithms, is the computational biology which can be regarded
as a basis for many engineering problems in biotechnology.
An interesting task clearly illustrating the application of ar-
tiﬁcial intelligence to biotechnology problems is a genome-
wide association study (GWAS). GWAS examines the as-
sociation between phenotypes or quantitative traits and ge-
netic variants or genotypes across the entire genome. In the
machine learning framework, it can be regarded as one of
the methods for a feature selection problem where features
are the so-called single nucleotide polymorphisms (SNPs)
or DNA-markers. As pointed out in [12, 15], there are some
difﬁculties of solving this feature selection problem. First
of all, the number of SNPsp is typically 10–100 times the
number of individualsn in the training sample. This is the
so calledp > n (or largep smalln) problem, which leads
to difﬁculty of an oversaturated model. Another difﬁculty
is that SNPs may affect phenotype in a complicated and
unknown manner. For example, some DNA-markers may
interact in their effects on phenotype. This interaction is
called the epistatic effect.
A huge amount of the statistical models and methods sol-
ving the SNP selection problem have been developed the
last decades. A part of methods can be referred to as ﬁlter
methods [1, 28] which use statistical properties of SNPs to
ﬁlter out poorly informative ones. A review of ﬁlter met-
hods in GWAS is proposed by Zhang et al. [53]. Thet-test,
Fisher criterion (F -statistics),   2
-statistics, ANOV A tests
are the well-known statistical methods for detecting diffe-
rential SNPs between two samples in training data.
Another part of methods called wrapper methods gene-
rally provides more accurate solutions than the ﬁlter met-
hods, but it is computationally demanding [24]. One of
the well-known wrapper methods proposed by Guyon et
al. [17] and called the Recursive Feature Elimination has
been applied to the gene selection problem for cancer clas-
siﬁcation.
Filter methods and their modiﬁcations as well as wrap-
per methods may be efﬁcient tools for solving the problems
of GWAS. At the same time, a lot of methods of the fea-
ture selection use regression models. One of the pioneering
and the most well-known papers devoted to the use of re-
gression models in SNP selection has been written by Lan-
der and Botstein [27]. Methods for constructing the corre-
sponding regression models can be referred as embedded
methods [25]. They performs feature selection in the pro-
cess of model building and cover a lot of well-known ap-
proaches, including the Ridge regression, Least Absolute
Shrinkage and Lasso techniques [41] which are the most
popular and efﬁcient tools in SNP selection problems. The
main advantage of using the Lasso method is that it per-
forms variable selection and classiﬁcation or regression si-
multaneously. A lot of approaches using the Lasso met-
hod and its modiﬁcations have been developed for solving
the SNP selection problem in the framework of the GWAS
[13, 31, 35, 36, 40, 43]. Hayes [18] provided a comprehen-
sive overview of statistical methods for GWAS in animals,
176 Informatica 42 (2018) 175–188 L.V . Utkin et al.
plants, and humans. Various approaches to SNP selection
with the Lasso algorithm and other methods can be also
found in papers [16, 22, 33, 46].
The main aim of GWAS is to identify SNPs that are di-
rectly associated with a trait, i.e., the standard GWAS ana-
lyzes each SNP separately in order to identify a set of signi-
ﬁcant SNPs showing genetic variations associated with the
trait. However, an important challenge in the analysis of
genome-wide data sets is taking into account the so-called
epistatic effect when different SNPs interact in their asso-
ciation with phenotype.
Campos et al. [12] explain some shortcomings of the
standard GWAS. They write that the currently identiﬁed
SNPs might not fully describe genetic diversity. For in-
stance, these SNPs may not capture some forms of genetic
variability that are due to copy number variation. Moreo-
ver, genetic mechanisms might involve complex interacti-
ons among genes and between genes and environmental
conditions, or epigenetic mechanisms which are not fully
captured by additive models. Many statistical approaches
make sense under the assumption that only a few genes af-
fect genetic predisposition. However, GWAS may be unsa-
tisfactory for many important traits which may be affected
by a large number of small-effect, possibly interacting, ge-
nes. Limitations and pitfalls of prediction analysis in the
framework of the GWAS have been discussed in detail by
Wray et al. [47] where it is shown how naive implementati-
ons can lead to severe bias and misinterpretation of results.
In fact, the epistatic effect can be viewed as gene-gene
interaction when the action of one locus depends on the
genotype of another locus. At the same time, there are dif-
ferent interpretations of the epistatic effect. A fundamental
critical review of different deﬁnitions and interpretations
of epistasis is provided by Cordell [11] where it is pointed
out that there are many conﬂicting deﬁnitions of epistasis,
which lead to certain problems in interpretations, namely,
the statistical interaction may not correspond to the biolo-
gical models of epistasis. As indicated by Wan et al. [44],
there are mainly three different deﬁnitions of gene-gene in-
teractions: functional, compositional and statistical epista-
sis. We consider only the statistical epistasis which can be
regarded as the statistical deviation from the joined effects
of two SNPs on the phenotype. At that, the individual SNPs
may exhibit no marginal effects.
A lot of methods dealing with epistasis effect have been
developed last decades [3, 30, 52, 49, 50, 51, 54]. Compre-
hensive and interesting reviews of methods detecting inte-
racting the epistatic effect were provided by several authors
[7, 45]).
Analyzing various modiﬁcations of the Lasso method
applied to the GWAS problems, we can point out that
many efﬁcient modiﬁcations are based on applying spe-
cial forms of the penalty function, which take into account
some additional information about SNP markers and the
corresponding phenotype values. Some interesting algo-
rithms [33, 42] devoted to various penalty functions will
be studied in the next section. The use of a speciﬁc addi-
tional information allows us to improve the GWAS and is
considered in the paper.
In the present study, we modify the Lasso method by ta-
king into account some peculiarities of the double haploid
(DH) lines of barley which are very important in the plant
biotechnology. According to the DH method, only two ty-
pes of genotypes occur for a pair of alleles. From a sta-
tistical point of view, we solve a linear regression problem
with binary explanatory variables. Our method is based
on the well-known adaptive Lasso [56] and takes into ac-
count additional information about the correlation between
SNPs, frequencies of alleles and expected phenotype va-
lues. We propose to use the Bahadur representation [2] by
partially applying the ideas provided by Lee and Jun [29]
where the authors propose to apply the Bahadur representa-
tion to classiﬁcation problems. The Bahadur representation
allows us to compute joint probabilities of SNPs by taking
into account the correlation between binary random varia-
bles. That is another reason why we analyze only DH lines
which produce the binary genotypes. In order to modify
the adaptive Lasso, we propose to assign penalty weights
in accordance with expected values of the phenotype with
respect to a probability mass function somehow deﬁned on
the genotype values. In other words, computing the ex-
pected values of the phenotype in a special way is a main
idea of the proposed method. We show that the proposed
modiﬁcation is directly extended on the case taking into
account the epistatic effect.
2 The Lasso method
We analyzen double haploid (DH) lines of barley or a po-
pulation from n individuals. From a statistical point of
view, marker genotypes can be treated as qualitative expla-
natory variables, i.e., X
j
= (x
1j
;:::;x
nj
)
T
is a predictor
representing thej-th SNP,j = 1;:::;p. Herex
ij
is a binary
variable, i.e.,x
ij
2f0;1g. A quantitative trait of interest
or a set of the phenotype valuesy
i
2R,i = 1;:::;n, can be
regarded as the response vectorY = (y
1
;:::;y
n
)
T
. We also
denote X = [X
1
;:::;X
p
] is a genotype matrix forn lines
or individuals or a predictor matrix in terms of statistics;
x
T
i
= (x
i1
;:::;x
ip
) is a vector of alleles corresponding to
thei-th line,i = 1;:::;n.
First, we focus on the standard linear regression model
y =
p
X
i=1
  i
X
i
+  0
+  = X  +  0
+ : (1)
Here  is a noise variable with the zero-valued expectation;
  i
is the SNP effect,  = (  1
;:::;  p
).
Without loss of generality, we assume the predictors and
the response are centered, and the predictors are standardi-
zed, that is
n
X
i=1
y
i
= 0;
n
X
i=1
x
ij
= 0;
n
X
i=1
x
2
ij
= 1; X
i
2R
p
:
A Modiﬁcation of the Lasso Method by. . . Informatica 42 (2018) 175–188 177
This implies that the intercept is not included in the regres-
sion function.
The Lasso is a regularization technique for simultaneous
estimation and variable selection [41]. The Lasso estima-
tes are deﬁned from the following quadratic programming
problem:
  = arg min
  2R
p
kY  X  k
2
;
subject to
p
X
j=1
j  j
j  s
for somes> 0. The Lagrange formulation is
  = arg min
  2R
p
kY  X  k
2
+  p
X
j=1
j  j
j;
where  is a nonnegative regularization parameter. The se-
cond term is theL
1
penalty which is crucial for the success
of the Lasso. The Lasso estimator is usually calculated at a
grid of tuning parameters of  , and a cross-validation pro-
cedure is subsequently used to select an appropriate value
of  .
The Lasso penalizes the regression coefﬁcients by their
L
1
norm. However, in order to improve the performance
of the Lasso, the regression coefﬁcients can be penalized
individually. As a result, we write the weighted Lasso esti-
mates as follows:
  = arg min
  2R
p
kY  X  k
2
+  p
X
j=1
w
j
j  j
j;
wherew
j
> 0,j = 1;:::;p, are weights determined a priori
in accordance with some rules. A larger weightw
j
corre-
sponds to a higher penalty and discourages thej-th predic-
tor from the model. Conversely, a smaller weightw
j
exerts
less penalty and encourages selection of the corresponding
predictor [55].
The penalized Lasso can be reformulated as the standard
Lasso problem [6]. If we introduce new covariates and re-
gression coefﬁcients as
e x
ij
=x
ij
=w
j
; i = 1;:::;n;
e
  j
=  j
w
j
;
then the weighted Lasso problem can be rewritten as fol-
lows:
  = arg min
  2R
p
      Y  e
X
e
        2
+  p
X
j=1
      e
  j
      ;
where
e
  and
e
X are the vector and the matrix with elements
e
  j
ande x
ij
, respectively.
Zou [56] proposed one of the methods for determining
the weightsw
j
such thatw
j
= 1=j  init;j
j, where  init;j
is
a prior estimator of  j
, for example, the least square esti-
mator. The corresponding Lasso problem is referred as the
adaptive Lasso, and it has many nice properties improving
the performance of the Lasso. Moreover, it can be a basis
for constructing the boosting Lasso [6].
The Lasso has many interesting properties which make
the method to be very popular. At the same time, Zou and
Hastie [57] point out that in spite of success of the Lasso
it has some limitations, in particular, if there is a group of
variables among which the pairwise correlations are very
high, then the lasso tends to select only one variable from
the group and does not care which one is selected. In other
words, the Lasso tends to put all the weight on the selected
variable. On the one hand, this is a shortcoming. Many
methods have been proposed to overcome this obstacle, for
example, the so-called the elastic net [57] where the esti-
mates are deﬁned by
  = arg min
  2R
p
      Y  e
X
e
        2
+  1
p
X
j=1
      e
  j
      +  2
      e
        2
:
However, the elastic net requires to assign an additional pa-
rameter  2
whose value is a priori unknown. On the other
hand, in contrast to the ridge regression which tends to se-
lect all of the correlated variables and make the correspon-
ding coefﬁcients to be equal, the Lasso selects a group of
correlated variables and “isolates” it.
A special choice of the penalty term on the basis of some
prior information about SNPs or about an exploited ge-
nome selection model itself may lead to a series of useful
or important properties of the regression or classiﬁcation
model corresponding to the Lasso. Liu et al. [33] tried
to apply the observed fact that there exists a natural grou-
ping structure in SNPs and, more importantly, such groups
are correlated. The authors proposed a new penalization
method for group variable selection which can properly ac-
commodate the correlation between adjacent groups. Their
method referred to as smoothed group Lasso is based on
a combination of the group Lasso penalty and a quadratic
penalty on the difference of regression coefﬁcients of adja-
cent groups. Liu et al. [33] assume that SNPs are divided
into J groups, each with size d
j
, j = 1;:::;J, according
to their physical locations and correlation patterns. As a
results, the vector   = (  1
;:::;  J
) is deﬁned for groups
of SNPs, but not for separate SNPs,  j
is the vector of pa-
rameters corresponding to SNPs from thej-th group. The
authors consider the quadratic loss function of the form:
l(  ) =
            Y  J
X
j=1
X
j
  j
            2
:
HereX
j
is ann  d
j
matrix corresponding to thej-th group
[33]. There are two main difﬁculties of using the above
considered method. First, it is rather hard from the compu-
tation point of view. Second, we have to know a priori the
grouping structure SNPs.
An interesting approach for dealing with correlated co-
variates was proposed by Tutz and Ulbricht [42]. Their
method utilizes the correlation between predictors expli-
citly in the penalty term. Coefﬁcients which correspond to
178 Informatica 42 (2018) 175–188 L.V . Utkin et al.
pairs of covariates are weighted according to their marginal
correlation. The correlation based penalty is given by
Q
  (  ) =  p  1
X
i=1
X
j>i
  (  i
    j
)
2
1    ij
+
(  i
+  j
)
2
1+  ij
  :
Here  ij
denotes the empirical correlation between thei-th
and thej-th predictors. If we have the positive correlation,
i.e.,  ij
! 1, then the ﬁrst term in the sum becomes domi-
nant. When  ij
!  1, then the second term becomes do-
minant. Both these cases lead to the approximate equality
  i
    j
. In case of uncorrelated predictors and  ij
! 0,
the corresponding model is reduced to the ridge regression.
Another model proposed by Park and Hastie [38] con-
structs sets of indicators representing all the available fac-
tors and all possible two-way interactions in order to ﬁt
gene-interaction models with the data consisting of ge-
notype measurements and a binary response. The obtained
grouped variables are used in the path-following algorithm
for the group-Lasso method.
In order to take into account different probabilities of fe-
ature values, in particular, to take into account the allele
frequency, Zhou et al. [55] proposed a weighted Lasso
penalty in the Lasso method such that the weights are as-
signed in accordance with the following sources of prior
knowledge. First, Zhou et al. [55] considered genotyping
errors such that the unreliable variants should be penali-
zed more. Second, they pointed out that the allele frequen-
cies can be used in accordance with an idea of Madsen and
Browning [34] where it was proposed to take the weight
w = 2
p
  (1    ) for a variant with population frequency
  by arguing that this scheme assigns smaller penalties to
rarer variants as suggested by classical population genetics
theory.
3 The proposed method
3.1 Motivation for a new penalized method
The considered in previous sections modiﬁcations of the
Lasso are efﬁcient tools for solving the GWAS and SNP se-
lection problems. Their performance has been experimen-
tally shown by many authors [13, 22, 23, 31, 33]. Howe-
ver, every real application problem has some peculiarities
whose accounting might improve the regression method.
Let us mention these peculiarities.
1. First of all, our aim is not to ﬁnd the “best” regression
model for the given information, but to select SNPs
which impact on the smallest (largest) values of the
phenotype, for example, on the heading date early ﬂo-
wering of barley in the studied applications. This does
not mean that the whole ﬁtted regression model is not
important for us. We have to combine two above aims.
This can be done by introducing the weighted Lasso
penalties of a special form. This form has to take into
account in the ﬁrst place the smallest values of phe-
notype. Smaller values of the phenotype produce lar-
ger weights, whereas larger values of the phenotype
should be also considered. It would seem that we can
assign the weights to phenotype values with respect to
their closeness to the minimal phenotype value. Ho-
wever, the phenotype values are random and depend
on environment conditions. Moreover, the smallest
phenotype value does not mean that its value is cau-
sed by the corresponding genotype. This implies that
we cannot assign weights to the available phenotype
values. The main idea underlying the method is to as-
sign weights to expected values of the phenotype with
respect to a probability mass function somehow deﬁ-
ned on the genotype values.
2. The genotype values corresponding to every SNP in
the studied application make up a binary vector. The
dependence of SNPs leads to dependence of the cor-
responding binary vectors which can be estimated.
3. The allele frequencies and correlations indirectly im-
pact on the smallest values of the phenotype.
3.2 A method for computing weights for the
Lasso
By extending the ideas proposed by aforementioned aut-
hors [34, 42, 55], we deﬁne the weighted Lasso penalty in
a new way. The main idea is the following. We deﬁne
the average contribution of every SNP to the mean phe-
notype value. These contributions or their function are no-
thing else but the weightsw
k
in the adaptive Lasso. They
have to take into account the probabilities of alleles, the
correlations between SNPs and the phenotype values. The
next question is how to determine the average contribution
of every SNP. It can be carried out as follows:
1. For every genotype vectorx
j
(thej-th line), we com-
pute joint probabilities   (x
kj
;x
ij
) of all pairs (k;i)
of SNPs by taking into account correlations between
pairs of random variables (SNPs).
2. For every pair(k;i), we compute the mean phenotype
value R
ki
as the expectation of phenotypes with re-
spect to the joint probabilities  (x
kj
;x
ij
) over all li-
nes or individuals.
3. The average contribution of every, sayk-th, SNP into
the phenotype is computed by averaging the mean
phenotype valuesR
ki
over alli = 1;:::;p.
4. The weights or their function for the adaptive Lasso
are deﬁned by the average contributions.
Below we consider every step in detail.
A Modiﬁcation of the Lasso Method by. . . Informatica 42 (2018) 175–188 179
3.3 Bahadur representation
The main idea for using the joint probability  (x
jk
;x
ji
) is
to take into account the correlation between SNPs with in-
dicesk andi. For every pair of SNPsX
k
andX
i
, we have
to determine the joint probability  (x
kj
;x
ij
),i = 1;:::;p,
i6= k, of thej-th individual. It can be computed by using
the so-called Bahadur representation proposed by Bahadur
[2]. The Bahadur representation takes into account the cor-
relation between binary variables, and it can be written in
the case of two binary variables with numbersk andi as
  (x
k
;x
i
) =  x
k
k
(1    k
)
1  x
k
    xi
i
(1    i
)
1  xi
  (1+  ki
u
k
u
i
): (2)
Here  k
is the probability of an allele for thek-th SNP
or its allele frequency, i.e.,   k
= Prfx
k
= 1g;   ki
is the
correlation coefﬁcient between thek-th and thei-th SNPs
which is deﬁned as   ki
= E[U
k
U
i
], where the random
standardized variableU
k
takes the valuesu
k
such that there
hold
U
k
=
X
k
    k
p
  k
(1    k
)
; u
k
=
x
k
    k
p
  k
(1    k
)
:
Note that the ﬁrst term in the right-hand side of the ex-
pression for  (x
k
;x
i
) represents the joint probability mass
function under condition that variablesX
k
andX
i
are sta-
tistically independent. The second term includes the inte-
raction from the ﬁrst order up to the second. Note also that
U
k
should be evaluated by estimating  k
.
The corresponding estimates of parameters   k
, u
k
,   ki
denoted asb   k
,b u
k
,b   ki
are computed by means of the follo-
wing expressions:
b   k
=
n
X
l=1
x
kl
=n; b   ki
=
n
X
l=1
b u
kl
b u
il
=n;
wheren is the number of individuals and
b u
kl
=
(x
kl
  b   k
)
p
b   k
(1  b   k
)
is thel-th observed value of variableU
k
.
It should be noted that the Bahadur representation can
be written also for joint probabilities of three, four, etc.
variables. [32] mention a property of the Bahadur repre-
sentation such that the joint probability distribution of any
subsetx
1
;x
2
;:::;x
t
can be written as follows:
  (x
1
;:::;x
t
) =
t
Y
i=1
  xi
i
(1    i
)
1  xi
  0
@
1+
X
Q f 1;:::;tg;jQj  2
  Q
Y
k2Q
u
k
1
A
:
Here   Q
represents   i1;:::;i
k
if Q = fi
1
;:::;i
k
g andjQj
denotes the number of elements inQ. The main disadvan-
tage of the Bahadur representation is the large number of
parameters and hard computations required for getting the
probabilities. Therefore, we restrict our study only by pro-
babilities of two variables.
It should be noted that the Bahadur representation has
been used in some classiﬁcation models. One of the inte-
resting models for discriminant analysis of binary data was
proposed by Lee and Jun [29]. The main contribution of
[29] is that they proposed to take into account the corre-
lation between variables or, more exactly, estimates of the
correlation by means of the Bahadur representation.
There are pros and cons of using this model when the
number of variables is larger than the number of observati-
ons. For example, Bickel and Levina [4] suppose that clas-
siﬁcation rules ignoring the correlation structure often per-
form better in this case. However, Lee and Jun [29] show
by means of various experimental studies that the correla-
tion should be taken into account in all cases.
In spite of arguments of [29] in defense of the correla-
tion analysis for high-dimensional data, there is a risk of
incorrect estimates of interactions of the large order. More-
over, it is practically impossible to compute the correspon-
ding joint probabilities when the number of SNPs is rather
large. Therefore, we propose an approach which partially
uses joint probabilities of variables and partially takes into
account the correlation between the variables.
3.4 Average contributions of SNPs
In order to determine the average contribution of thek-th
SNP into the mean value of the phenotype, we consider
all possible pairs of SNPs such that one of the SNPs in
every pair is thek-th SNP, i.e., we are interesting in consi-
deringp  1 pairs of SNPs with numbers(k;1);:::;(k;k  1);(k;k +1);:::;(k;p). Every pair, say (k;i), determines
a mean phenotype valueR
ki
corresponding to this pair of
SNPs as follows:
R
ki
=
P
n
j=1
  (x
kj
;x
ij
)y
j
P
n
j=1
  (x
kj
;x
ij
)
: (3)
In other words, we can compute the expected phenotype
value under condition that every phenotype valuey
j
is pro-
duced by the subset of the genotypes corresponding to the
k-th and thei-th SNPs. The measureR
ki
can be regarded
as a contribution of thek-th and thei-th SNPs to the mean
phenotype value.
Then the contribution of the k-th SNP denoted by
e
R
k
into the mean phenotype value can be determined through
averaging the measuresR
ki
, i.e., it is computed as
e
R
k
=
1
p  1
p
X
i=1;i6=k
R
ki
: (4)
It is obvious that the smaller values of the measure
e
R
k
give us signiﬁcant or top ranked SNPs and exert less pen-
altyw
k
, i.e., we can introduce an increasing functiong such
that
w
k
=g(1=j  init;k
j):
180 Informatica 42 (2018) 175–188 L.V . Utkin et al.
One of the possible functions which will be used in nu-
merical experiments is
w
k
=
 
e
R
k
  min
k=1;:::;p
e
R
k
max
k=1;:::;p
e
R
k
  min
k=1;:::;p
e
R
k
!
  q
: (5)
Here q is a positive real which deﬁnes how changes of
the difference between
e
R
k
and min
k=1;:::;p
e
R
k
impact on
changes of weightsw
k
. The numberq can be regarded as a
tuning parameter whose optimal value can be obtained by
means of the cross-validation procedure.
In sum, the obtained weights take into account the cor-
relation between SNPs, the allele frequencies, binary data
and the fact that the smallest (largest) values of the phe-
notype are more important in comparison with other va-
lues because we are looking for the SNPs which impact on
the values of some trait with predeﬁned properties, for ex-
ample, the heading date of barley should be as small as
possible. At the same time, we do not need to directly
use the obtained weights and to implement the adaptive
Lasso algorithm. It has been mentioned in the previous
section that the adaptive Lasso can be transformed to the
standard Lasso by means of introducing new covariates
e x
ij
=x
ij
=w
j
.
Finally, we write the following SNP selection algorithm.
Algorithm 1 The SNP selection algorithm.
Require: Y = (y
1
;:::;y
n
)
T
is the response vector (phe-
notype values),X = [X
1
;:::;X
p
] is the binary predictor
matrix (genotype values).
Ensure:   = (  1
;:::;  p
) is the vector of the regression
coefﬁcients (degrees of the SNP effect).
repeat
k 1
Compute joint probabilities  (x
jk
;x
ji
), i = 1;:::;p,
i6= k, for allj = 1;:::;n, by means of the Bahadur
representation (2)
Compute the mean phenotype valuesR
ki
, for alli =
1;:::;p,i6=k, by means of (3)
Compute the average mean phenotype value
e
R
k
by
means of (4)
Compute the weightsw
k
by means of (5)
Compute new variablese x
ik
=x
ik
=w
k
; i = 1;:::;n.
untilk>p
Compute
e
  opt
by using the standard Lasso with
e
  and
e
X
instead of
e
  andX.
Compute  k
=
e
  k
=
e
R
k
,k = 1;:::;p.
Let us indicate the main virtues of the proposed met-
hod. First of all, it does not require to develop special algo-
rithms for solving the optimization problem for computing
the vector of regression coefﬁcients  . The obtained pro-
blem is solved as the standard Lasso algorithm after refor-
mulating the penalized Lasso.
Second, the method is rather general because we could
change the weights in (5) in accordance with our goal. For
example, in one of the applications, we have aimed to mi-
nimize the mean heading date of barley as the mean phe-
notype value. However, we could aim to maximize, for ex-
ample, the amount of grain protein. In this case, we change
(5) by taking decreasing functiong as follows:
w
k
=
 
max
k=1;:::;p
e
R
k
  e
R
k
max
k=1;:::;p
e
R
k
  min
k=1;:::;p
e
R
k
!
  q
:
Here the larger values of the measure
e
R
k
give us more sig-
niﬁcant SNPs and exert less penaltyw
k
.
Third, we consider not only correlations between SNPs,
but also joint probabilities accounting for correlations. The
joint probabilities are more informative in comparison with
the correlation coefﬁcients.
Fourth, we have simpliﬁed procedures for computing the
joint probabilities. This substantially reduces the computa-
tion time.
3.5 The proposed method with epistatic
effect
A lot of studies devoted to the epistatic effect (see, for ex-
ample, [5]) consider extension of the so-called main effect
model (1) on the interaction model which can be written as
Y =
p
X
i=1
  i
X
i
+
X
i<j;i;j=1;:::;p
  ij
X
i
X
j
+  0
+ : (6)
The second term in (6) corresponds to pairwise interacti-
ons whose number isp(1  p)=2. Here  ij
is the parameter
characterizing the epistatic interaction effect of a pair SNPs
with indicesi andj. Now the weighted Lasso estimates can
be written as follows:
( ;    ) = arg min
  2R
p
kY  X    X
      k
2
+  p
X
j=1
w
j
j  j
j+  X
i<j;i;j=1;:::;p
w
ij
j  ij
j;
where    = (  12
;:::;  p  1;p
) is the additional vector cha-
racterizing the epistatic interaction effect of every pair of
SNPs; X
  = (X
1
X
2
;:::;X
p  1
X
p
) is the vector of cova-
riates corresponding to pairwise interactions;w
ij
> 0 are
weights penalizing the additional parameters   ij
, i < j,
i;j = 1;:::;p, in accordance with the rules of the adaptive
Lasso [55].
It can be seen from the previous section that the weight
or contribution of the pair of the k-th and the i-th SNPs
into the phenotype values can be determined by the mean
phenotype valueR
ki
obtained by means of (3). It is inte-
resting to note that, in contrast to the k-th SNP contribu-
tion
e
R
k
obtained in a heuristic way (5), the valueR
ki
is the
expectation of the phenotype with respect to the probabi-
lity mass function  (x
kj
;x
ij
)=
P
n
j=1
  (x
kj
;x
ij
). So, the
weightw
ij
can be directly computed as
w
ij
=
  R
ij
  min
ij
R
ij
max
ij
R
ij
  min
ij
R
ij
    q
:
A Modiﬁcation of the Lasso Method by. . . Informatica 42 (2018) 175–188 181
In order to take into account the interactions and to im-
plement the method for epistasis detection, we apply a two-
stage procedure (see the Screen and Clean method propo-
sed by Wu et al. [48] for example). The ﬁrst stage is
for constructing the main effect model and searching for
marginal signiﬁcant SNPs by using the proposed penalized
Lasso method with weightsw
k
determined from (5). Then
only top ranked SNPs and pairs of SNPs composed from
the signiﬁcant ones are used in the interaction penalized
Lasso model. The main idea here is to again use the Ba-
hadur representation, namely, the mean phenotype values
R
ki
computed by means of (3). This is a very important
place because we do not need to repeatedly compute the
mean phenotype values. They have been computed during
construction of the main effect model.
We do not provide here an algorithm for computing the
optimal vectors  and    because it is just a simple exten-
sion of the algorithm given above.
4 Numerical experiments
The Lasso method in numerical experiments is regarded as
a special case of a general problem solved by means of the
R-package “glmnet” developed by Friedman et al. [14].
The tuning parameter  is computed by using the function
cv.glmnet() with 10-fold cross validation.
Below we use indices of SNPs instead of their full titles
for short.
4.1 Data sets
Numerical experiments are carried out on three populations
of double haploid (DH) lines of barley:
1. The ﬁrst dataset consists of 93 DH lines of barley
described in [8] and [9]. Phenotyping and genoty-
ping data are available at Oregon Wolfe Barley Data
(OWBD) and GrainGenes Tools. The lines are ana-
lyzed with respect to seven phenotypic traits: spike
length (SL) in cm; grain number (GN); ﬂoret number
(FS); hundred grain weight (HGW) in g of 100 grains;
plant height (PH) in cm; spike number (SN); heading
date (HD) in days. The linkage map consists of 1328
markers (SNPs).
2. The second dataset consists of 92 DH lines of
barley obtained from the Dicktoo x Morex cross
and described by several authors [20, 19, 37].
Phenotyping and genotyping data are available at
http://wheat.pw.usda.gov/ggpages/DxM/ . We ana-
lyze the lines with respect to two phenotypic traits:
heading date with and without vernalization with an
8-h light/16-h dark photoperiod regime. The linkage
map consists of 117 markers.
3. The third population dataset includes 150 DH lines
of barley obtained from the Steptoe x Morex cross
[10, 21]. Phenotyping and genotyping data are avai-
lable at http://wheat.pw.usda.gov/ggpages/SxM. The
linkage map consists of 223 markers. The lines are
analyzed with respect to the heading date (HD) trait,
which is measured in 16 environments, and grain yield
(GY) trait, which is measured in 6 environments.
4.2 Missing data
Missing marker data in all the datasets are estimated by
means of the following heuristic procedure which can be
regarded as some modiﬁcation of the well-known method
ofK-nearest neighbors. Suppose the vectorX
i
correspon-
ding to thei-th SNP has a missing value at thek-th position,
i.e., X
ik
is missing. By using the speciﬁc Hamming dis-
tance between the vectorX
i
and all vectorsX
j
,j = 1;:::p,
j 6= i, we select K nearest neighbors X
i1
;:::;X
i
K
or K
closest vectors. In order to take into account the missing
values, they are excluded from computing the Hamming
distance. That is why we use the speciﬁc Hamming dis-
tance in order to compare vectors with different numbers
of missing elements, i.e., we compute the distance per one
element of X
i
. The imputed value is that represents the
maximum of the K values at the k-th position of all the
nearest neighborsX
i1k
;:::;X
i
K
k
.
4.3 Error measure
From each of the (synthetic or real) data sets we randomly
select two distinct subsets: a training data set ofn exam-
ples to learn the model, and a test data set ofn
test
instances
to evaluate the performance of the algorithms. The perfor-
mance is assessed by means of the mean square residual
(RMSR), which is deﬁned by
RMSR =
P
ntest
i=1
(y
i
  ^
f(x
i
))
2
n
test
;
where
^
f is the function estimated by the proposed method,
and
^
f(x
i
) is the predicted value of the phenotype valuey
i
for each i 2 f1;:::;n
test
g. The error measure RMSR
is computed from repeatedly random drawing training and
test data sets and by averaging over the runs. The smal-
ler the values of the average error measure are, the bet-
ter the corresponding method. We use the one-fold-cross-
validation, i.e., n
test
= 1. This is because the number of
lines is very small in comparison with the number of SNPs
and we cannot reduce them.
4.4 The ﬁrst dataset
First, we investigate DH lines of barley from OWBD. Va-
lues of the RMSR for the ﬁrst dataset are shown in Table
1, where the ﬁrst column corresponds to seven traits ana-
lyzed, columns 2-5 illustrate the RMSR by using only 40
top ranked SNPs. At that, we study cases when the accu-
racy is determined for all lines (All lines) and for the ﬁrst
10 lines with the smallest values of phenotypes (First 10
182 Informatica 42 (2018) 175–188 L.V . Utkin et al.
lines). Abbreviations S.L. and P.L. denote the standard and
new proposed Lasso methods, respectively. One can see
that the proposed method provides better accuracy for the
most traits. It does not mean that it can be successful in
all cases. It is seen from Table 1 that the proposed met-
hod by traits PH and HD does not outperform the standard
Lasso. Perhaps, another function determining the weights
w
k
fromR
k
could provide better results, but we did not ﬁnd
it. In addition, we can observe from Table 1 that use only
of top ranked SNPs gives outperforming results in compa-
rison with taking all SNPs for modelling GWAS. The same
can be said about considering all lines and the ﬁrst 10 lines.
Table 2 illustrates how the error measures depend on the
reduced number of top ranked SNPs which are used for
constructing the GWAS for the spike length trait. We take
the ﬁxed value of q = 0:25. It can be seen from Table 2
that the optimal number of top ranked SNPs is 40. It is
interesting to observe also that the standard Lasso weakly
depends on the SNP number.
Table 3 is similar to Table 1, but RMSRs in Table 3 are
obtained by taking into account the epistatic effect. By
comparing Tables 1 and 3, we can see that the consideration
of epistasis allows us to construct a more accurate model.
Moreover, the proposed method outperforms the standard
Lasso even for traits PH and HD which distinguished from
other traits and illustrated worse results with the proposed
method (see Table 1). This is a very important fact showing
that joint probabilities of pairs of SNPs as well as correla-
tions between SNPs may improve the GWAS.
Table 4 shows the top ranked SNPs or their pairs with
the largest 10 weights  obtained by means of the standard
Lasso and the proposed method. Moreover, Table 4 shows
the chromosomes where the corresponding SNPs are loca-
ted. One can see that the largest weight has a pair of SNPs
997  1279. This implies that impact of the epistatic effect
is very signiﬁcant. It is interesting to note that the both met-
hods select this pair of SNPs as the most signiﬁcant one.
4.5 The second dataset
Let us study the dataset consisting of 92 DH lines of bar-
ley obtained from the Dicktoo x Morex cross. Tables 5 and
6 contain error measures for the Dicktoo x Morex dataset
by considering two traits mentioned above. At that, Table
5 is obtained without taking into account the epistatic ef-
fect. In Table 6, the results are represented under condition
of epistasis. Comparison of the tables shows that the use
of condition of epistasis allows us to get outperforming re-
sults.
Table 7 shows the top ranked SNPs or their pairs with
the largest 10 weights  obtained by means of the standard
Lasso and the proposed method for the heading date wit-
hout vernalization.
4.6 The third dataset
The third dataset consists of 150 DH lines of barley obtai-
ned from the Steptoe x Morex cross. Values of the RMSR
for the third dataset are shown in Table 8. It can be seen
from the table that the proposed method provides outper-
forming results. Table 9 shows also reduced values of the
RMSR for the case of taking into account the epistatic ef-
fect. Comparing Tables 8 and 9, we can conclude that the
model taking into account the epistatic effect signiﬁcantly
improves the regressor accuracy when the model is con-
structed by using only 40 top ranked SNPs. Moreover, the
standard Lasso method also shows better results when the
epistatic effect is considered.
It should be noted that the results given in Tables 8 and
9 is obtained for a certain value ofq, namely, forq = 0:8.
However, it is interesting to analyze how the valueq impact
on numerical results by using the third dataset. Figs. 1-4
depict the difference D between RMSRs of the proposed
and standard Lasso methods for the HD trait. The larger
the values ofD are, the better the corresponding proposed
method. The positive values of D say that the proposed
method outperforms the standard Lasso for the correspon-
ding values ofq. It can be seen from Figs. 1-4 that there is
an optimal value ofq for every condition of the model use
such thatD achieves its maximum at thisq. For example,
it follows from Fig. 1 that the best results by using only
top ranked SNPs can be obtained by q = 0:8. If we use
all SNPs and analyze the ﬁrst examples, then the optimal
value ofq is0:5 (see Fig. 2). The same conclusions can be
inferred from pictures illustrating the methods taking into
account the epistatic effect (see Figs. 3-4).
Table 10 shows the top ranked SNPs and their pairs with
the largest 10 weights  obtained by means of the standard
Lasso and the proposed method for the grain yield trait.
It is interesting to note that the use oft-statistics for com-
puting weights  of SNPs by the same parameters for GY
trait gives the following 10 top ranked SNPs:
82 81 83 84 85 79
86 130 129 80 :
One can see that the most top ranked SNPs concentrated
around the SNP with index 82. This is the obvious inte-
raction of genes in a group of SNPs located at the same
chromosome.
5 Conclusion
The results of numerical experiments and the logic under-
lying the proposed method have demonstrated that the pro-
posed method outperforms the standard Lasso for many
real datasets. Moreover, it takes into account the epista-
tic effect or the SNP-SNP interaction. It should be noted
that the proposed method is very simple from a computa-
tion point of view. It does not require to develop a special
software. The standard software (package “glmnet” in R)
can be used for the method.
A Modiﬁcation of the Lasso Method by. . . Informatica 42 (2018) 175–188 183
Figure 1: Difference between RMSRs of the standard and
proposed Lasso methods for top SNPs.
It can be seen from the Bahadur representation that one
of the crucial elements of the proposed method is a set of
correlation coefﬁcients between SNPs. It should be noted
that they often use in GWAS as additional information. Ho-
wever, the correlation coefﬁcients do not contain all proba-
bilistic information about impacts of SNPs on values of a
phenotype. The joint probabilities taking into account the
correlation between SNPs can be viewed as a way for con-
structing association between SNPs and traits.
We have analyzed DH populations of barley. According
to the DH method, only two types of genotypes occur for a
pair of alleles, i.e., everyx
ij
takes only two values. At the
same time, in diploid method, three genotypes occur, i.e.,
everyx
ij
takes three values. In this case the Bahadur repre-
sentation cannot be applied, but the Sarmanov-Lancaster
expansion [26, 39] can be used
1
. This is a direction for
further research.
Of course, we have used a heuristic procedure by taking
pairs of SNPs for computing R
ki
by means of (3). We
could consider joint probabilities of three and more SNPs.
However, the increase of SNP numbers for computing the
joint probabilities is impossible when the total number of
SNPs is rather large. In this way, we can propose a multi-
step procedure when the large set of top ranked SNPs is
consequently determined by computing the joint probabi-
lities of SNP pairs at the ﬁrst step, then by computing the
joint probabilities of SNP triples but from the reduced set
obtained at the previous step. This procedure can be con-
tinued. At that, we could use the ridge regression in order
to avoid a situation when a very small number of SNPs are
obtained at some step. However, this is a direction for furt-
her research. The above modiﬁcation may be very useful
when the number of lines or individuals is small.
1
A rather simple presentation of the Sarmanov-Lancaster expansion
and its usage can be found in the paper I. Goodman and D.H. Johnson,
Multivariate dependence and the Sarmanov-Lancaster expansion, 2005,
http://www-ece.rice.edu/~igoodman/papers/goodman-johnson05.pdf
Figure 2: Difference between RMSRs of the standard and
proposed Lasso methods for all SNPs.
Figure 3: Difference between RMSRs of the standard and
proposed Lasso methods for top ranking SNPs with epista-
sis.
Figure 4: Difference between RMSRs of the standard and
proposed Lasso methods for all SNPs with epistasis.
184 Informatica 42 (2018) 175–188 L.V . Utkin et al.
Table 1: RMSRs for the standard and proposed Lasso for OWBD.
Top ranked SNPs All SNPs
All lines First 10 lines All lines First 10 lines
Trait S.L. P.L. S.L. P.L. S.L. P.L. S.L. P.L.
SL 1:760 0:583 1:596 0:515 3:280 4:429 2:594 3:845
GN 75:62 71:62 61:16 57:96 140:6 139:3 103:8 110:0
FS 77:86 79:59 50:39 40:78 162:0 158:9 75:68 74:60
HGW 0:134 0:121 0:094 0:085 0:209 0:186 0:147 0:137
PH 24:92 24:92 16:23 16:23 237:5 237:5 147:5 147:5
SN 17:96 16:53 8:847 8:285 27:71 27:92 12:73 12:66
HD 31:08 31:08 29:12 29:12 130:3 130:3 82:50 82:50
Table 2: RMSRs for the standard and proposed Lasso for OWBD by different numbers of top ranked SNPs.
All lines First 10 lines
SNP numbers S.L. P.L. S.L. P.L.
20 1:412 1:116 1:396 1:177
40 1:392 0:550 1:392 0:443
60 1:393 0:680 1:393 0:611
80 1:393 1:548 1:393 1:569
Table 3: RMSRs for the standard and proposed Lasso for OWBD with epistasis.
Top ranked SNPs All SNPs
All lines First 10 lines All lines First 10 lines
Trait S.L. P.L. S.L. P.L. S.L. P.L. S.L. P.L.
SL 0:815 0:724 0:697 0:652 2:866 3:638 2:622 2:630
GN 51:67 52:97 45:25 42:23 85:15 86:13 77:27 77:43
FS 44:99 70:42 33:19 27:53 72:00 102:1 51:17 50:86
HGW 0:077 0:056 0:069 0:052 0:156 0:107 0:118 0:098
PH 46:08 33:69 43:13 39:97 254:2 247:0 219:8 249:1
SN 12:21 10:42 6:842 5:274 24:26 23:84 13:34 10:71
HD 31:08 29:05 29:12 25:79 130:3 133:4 82:50 78:47
Table 4: Top ranked SNPs and their weights for the standard and proposed Lasso for OWBD HD with epistasis.
S.L. P.L.
SNP chromosome   SNP chromosome   997  1279 6  6 3:421 997  1279 6  6 3:401
138 1 3:314 903  325 5  2 3:193
896 5 3:176 1101 6   2:764
1101  1152 6  6 2:750 138 1 2:661
734 4   2:683 896 5 2:634
670  273 4  2   2:542 1101  1152 6  6 2:583
324  903 2  5 2:128 725 4   2:493
1096  976 6  6   1:877 670  273 4  2   2:012
997  526 6  3 1:826 734 4   1:629
903  325 5  2 1:706 1101  976 6  6   1:447
Table 5: RMSRs for the standard and proposed Lasso for Dicktoo-Morex without epistasis.
Top ranked SNPs All SNPs
All lines First 10 lines All lines First 10 lines
Trait S.L. P.L. S.L. P.L. S.L. P.L. S.L. P.L.
unvernalized 43:74 44:68 18:13 17:87 63:57 60:77 26:57 27:43
vernalized 79:56 78:37 27:95 26:49 108:3 106:7 26:46 25:81
A Modiﬁcation of the Lasso Method by. . . Informatica 42 (2018) 175–188 185
Table 6: RMSRs for the standard and proposed Lasso for Dicktoo-Morex with epistasis.
Top ranked SNPs All SNPs
All lines First 10 lines All lines First 10 lines
Trait S.L. P.L. S.L. P.L. S.L. P.L. S.L. P.L.
unvernalized 34:33 38:85 24:08 17:63 58:63 58:28 30:46 27:81
vernalized 38:41 62:26 17:28 17:11 124:0 119:8 33:32 32:14
Table 7: Top ranked SNPs and their weights for the standard and proposed Lasso for Dicktoo-Morex HD with epistasis.
S.L. P.L.
SNP chromosome   SNP chromosome   112 7   7:860 112 7   7:238
22 2 6:387 110 7   5:242
110 7   5:296 20 2 4:426
20 2 4:074 22 2 4:377
113 7   2:26 113 7   3:55
51 3   1:684 21 2 2:645
59  84 4  5   1:394 50 3   2:398
84 5   1:261 49  113 3  7 1:627
19 2 1:002 84 5   1:174
49 3   0:980 33  50 2  3   1:037
Table 8: RMSRs for the standard and proposed Lasso for Steptoe-Morex without epistasis.
Top ranked SNPs All SNPs
All lines First 10 lines All lines First 10 lines
Trait S.L. P.L. S.L. P.L. S.L. P.L. S.L. P.L.
HD 79:05 41:95 78:68 43:73 46:62 51:37 57:49 46:66
GY 104:0 78:99 95:81 74:24 137:3 140:4 151:3 163:9
Table 9: RMSRs for the standard and proposed Lasso for Steptoe-Morex without epistasis.
Top ranked SNPs All SNPs
All lines First 10 lines All lines First 10 lines
Trait S.L. P.L. S.L. P.L. S.L. P.L. S.L. P.L.
HD 77:15 40:31 78:52 39:14 44:12 51:52 50:15 47:26
GY 87:48 23:00 78:63 18:61 156:5 161:8 192:8 171:1
Table 10: Top ranked SNPs and their weights for the standard and proposed Lasso for the Steptoe-Morex GY with
epistasis.
S.L. P.L.
SNP chromosome   SNP chromosome   82 3 9:562 82 3 13:158
53 2   7:067 53 2   6:687
81 3 5:278 222  114 7  4 5:355
29 1   4:62 29 1   5:089
42  53 2  2 4:126 68 2 4:599
20 1   3:693 20 1   3:543
111 4 3:510 108 4 3:338
68 2 2:377 154  19 5  1   3:156
72 2 2:348 154  45 5  2 3:028
203 7   1:675 108  105 4  3 2:881
186 Informatica 42 (2018) 175–188 L.V . Utkin et al.
References
[1] W. Altidor, T.M. Khoshgoftaar, J. Van Hulse, and
A. Napolitano (2011) Ensemble feature ranking met-
hods for data intensive computing applications. In
B. Furht and A. Escalante, editors, Handbook of Data
Intensive Computing, pages 349–376. Springer, New
York.
[2] R.R. Bahadur (1961) A representation of the joint
distribution of response to n dichotomous items. In
H. Solomon, editor, Studies in Item Analysis and Pre-
diction, pages 158–168. Stanford University Press,
Palo Alto, CA.
[3] A.L. Beam, A. Motsinger-Reif, and J. Doyle (2014)
Bayesian neural networks for detecting epistasis
in genetic association studies. BMC Bioinformatics,
15(368):1–12.
[4] P.J. Bickel and E. Levina (2004) Some theory for ﬁs-
her’s linear discriminant function, ’naive bayes’, and
some alternatives when there are many more variables
than observations. Bernoulli, 10(6):989–1010.
[5] J. Bocianowski (2014) Estimation of epistasis in dou-
bled haploid barley populations considering interacti-
ons between all possible marker pairs. Euphytica,
196(1):105–115.
[6] P. Buhlmann and S. van de Geer (2011) Statistics for
High-Dimensional Data: Methods, Theory and Ap-
plications. Springer Series in Statistics. Springer, Ber-
lin Heidelberg.
[7] L. Chen, G. Yu, C.D. Langefeld, D.J. Miller, R.T.
Guy, J. Raghuram, X. Yuan, D.M. Herrington, and
Y . Wang (2011) Comparative analysis of methods for
detecting interacting loci. BMC Genomics, 12:344:1–
23.
[8] Y . Chutimanitsakun, R.W. Nipper, A. Cuesta-Marcos,
L. Cistue, A. Corey, T. Filichkina, E.A. Johnson, and
P.M. Hayes (2011) Construction and application for
qtl analysis of a restriction site associated dna (rad)
linkage map in barley. BMC Genomics, 12:4:1–13.
[9] L. Cistue, A. Cuesta-Marcos, S. Chao, B. Echa-
varri, Y . Chutimanitsakun, A. Corey, T. Filichkina,
N. Garcia-Marino, I. Romagosa, and P.M. Hayes
(2011) Comparative mapping of the oregon wolfe bar-
ley using doubled haploid lines derived from female
and male gametes. Theoretical and applied genetics,
122(7):1399–1410.
[10] T.J. Close, P.R. Bhat, S. Lonardi, Y . Wu, N. Ros-
toks, L. Ramsay, A. Druka, N. Stein, J.T. Svensson,
S. Wanamaker, S. Bozdag, M.L. Roose, M.J. Moscou,
S. Chao, R.K. Varshney, P. Szucs, K. Sato, P.M.
Hayes, D.E. Matthews, A. Kleinhofs, G.J. Muehl-
bauer, J. DeYoung, D.F. Marshall, K. Madishetty,
R.D. Fenton, P. Condamine, A. Graner, and R. Waugh
(2009) Development and implementation of high-
throughput SNP genotyping in barley. BMC Geno-
mics, 10:582:1–13.
[11] H.J. Cordell (2002) Epistasis: what it means, what it
doesn’t mean, and statistical methods to detect it in
humans. Human Molecular Genetics, 11(20):2463–
2468.
[12] G. de los Campos, D. Gianola, and D.B. Allison
(2010) Predicting genetic predisposition in humans:
the promise of whole-genome markers. Nature Re-
views Genetics, 11(12):880–886.
[13] Z. Feng, X. Yang, S. Subedi, and P.D. McNicholas
(2012) The LASSO and sparse least squares regres-
sion methods for SNP selection in predicting quanti-
tative traits. IEEE/ACM Trans. Comput. Biol. Bioin-
formatics, 9(2):629–636.
[14] J.H. Friedman, T. Hastie, and R. Tibshirani (2010)
Regularization paths for generalized linear models via
coordinate descent. Journal of Statistical Software,
33(1):1–22.
[15] M.E. Goddard, N.R. Wray, K. Verbyla, and P.M.
Visscher (2009) Estimating effects and making pre-
dictions from genome-wide marker data. Statistical
Science, 24(4):517–529.
[16] X. Gu, G. Yin, and J.J. Lee (2013) Bayesian two-step
Lasso strategy for biomarker selection in personalized
medicine development for time-to-event endpoints.
Contemporary Clinical Trials, 36(2):642 – 650.
[17] I. Guyon, J. Weston, S. Barnhill, and V . Vapnik (2002)
Gene selection for cancer classiﬁcation using support
vector machines. Machine Learning, 46:389–422.
[18] B. Hayes (2013) Overview of statistical methods for
genome-wide association studies (GWAS). Methods
in Molecular Biology, 1019:149–169.
[19] P. Hayes, F.Q. Chen, A. Corey, A. Pan, T.H.H. Chen,
E. Baird, W. Powell, W. Thomas, R. Waugh, Z. Bedo,
I. Karsai, T. Blake, and L. Oberthur (1997) The dick-
too x morex population. In PaulH. Li and TonyH.H.
Chen, editors, Plant Cold Hardiness, pages 77–87.
Springer US.
[20] P.M. Hayes, T. Blake, T.H.H. Chen, S. Tragoonrung,
F. Chen and.A. Pan, and B. Liu (1993) Quantitative
trait loci on barley (Hordeum vulgare L.) chromo-
some 7 associated with components of winterhardi-
ness. Genome, 36(1):66–71.
[21] P.M. Hayes and O. Jyambo (1993) Summary of QTL
effects in the steptoe x morex population. Barley ge-
netics newsletter, 23:98–143.
A Modiﬁcation of the Lasso Method by. . . Informatica 42 (2018) 175–188 187
[22] A. Huang, S. Xu, and X. Cai (2013) Empirical Bay-
esian LASSO-logistic regression for multiple binary
trait locus mapping. BMC Genetics, 14(5):1–14.
[23] O. Kohannim, D.P. Hibar, J.L. Stein, N. Jahanshad,
Xue Hua, P. Rajagopalan, A.W. Toga, C.R. Jack Jr.,
M.W. Weiner, G.I. de Zubicaray, K.L. McMahon,
N.K. Hansell, N.G. Martin, M.J. Wright, P.M.
Thompson, and The Alzheimer’s Disease Neuroima-
ging Initiative (2012) Discovery and replication of
gene inﬂuences on brain structure using LASSO re-
gression. Frontiers in Neuroscience, 6:1–13.
[24] R. Kohavi and G.H. John (1997) Wrappers for feature
subset selection. Artiﬁcial Intelligence, 97(1-2):273–
324.
[25] T.N. Lal, O. Chapelle, J. Weston, and A. Elisseeff
(2006) Embedded methods. In Feature extraction, vo-
lume 207 of Studies in Fuzziness and Soft Computing,
pages 137–165. Springer, Berlin Heidelberg.
[26] H.O. Lancaster (1958) The structure of bivariate dis-
tributions. The Annals of Mathematical Statistics,
29(3):719–736.
[27] E.S. Lander and D. Botstein (1989) Mapping mende-
lian factors underlying quantitative traits using RFLP
linkage maps. Genetics, 121(1):185–199.
[28] I.-H. Lee, G.H. Lushington, and M. Visvanathan
(2011) A ﬁlter-based feature selection approach for
identifying potential biomarkers for lung cancer.
Journal of Clinical Bioinformatics, 1(11):1–8.
[29] S.-H. Lee and C.-H. Jun (2011) Discriminant ana-
lysis of binary data following multivariate Ber-
noulli distribution. Expert Systems with Applications,
38(6):7795–7802.
[30] J. Li, B. Horstman, and Y . Chen (2011) Detecting epi-
static effects in association studies at a genomic le-
vel based on an ensemble approach. Bioinformatics,
27(13):i222–i229.
[31] Z. Li and M.J. Sillanpaa (2012) Overview of LASSO-
related penalized regression methods for quantitative
trait mapping and genomic selection. Theoretical and
Applied Genetics, 125(3):419–435.
[32] B.G. Lindsay, G.Y . Yi, and J. Sun (2011) Issues and
strategies in the selection of composite likelihoods.
Statistica Sinica, 21(1):71–105.
[33] J. Liu, J. Huang, S. Ma, and K. Wang (2013) Incorpo-
rating group correlations in genome-wide association
studies using smoothed group LASSO. Biostatistics,
14(2):205–219.
[34] B.E. Madsen and S.R. Browning (2009) A groupwise
association test for rare mutations using a weighted
sum statistic. PLoS genetics, 5(2):e1000384.
[35] C.M. Mutshinda and M.J. Sillanpaa (2010) Extended
bayesian LASSO for multiple quantitative trait loci
mapping and unobserved phenotype prediction. Ge-
netics, 186(3):1067–1075.
[36] J.O Ogutu, T. Schulz-Streeck, and H.-P. Piepho
(2012) Genomic selection using regularized linear re-
gression models: ridge regression, lasso, elastic net
and their extensions. BMC Proceedings, 6(2):1–10.
[37] A. Pan, P.M. Hayes, F. Chen, T.H.H. Chen, T. Blake,
S. Wright, I. Karsai, and Z. Bedo (1994) Genetic ana-
lysis of the components of winterhardiness in barley
(Hordeum vulgare L.). Theoretical and Applied Ge-
netics, 89(7-8):900–910.
[38] M.Y . Park and T. Hastie (2006) Regularization path
algorithms for detecting gene interactions. Technical
Report 2006-13, Department of Statistics, Stanford
University.
[39] O.V . Sarmanov (1958) The maximum correlation
coefﬁcient (symmetric case). Dokl. Akad. Nauk SSSR,
120:715–718.
[40] S. Subedi, Z. Feng, R. Deardon, and F.S. Schen-
kel (2013) SNP selection for predicting a quantitative
trait. Journal of Applied Statistics, 40(3):600–613.
[41] R. Tibshirani (1996) Regression shrinkage and se-
lection via the Lasso. Journal of the Royal Statistical
Society. Series B (Methodological), 58(1):267–288.
[42] G. Tutz and J. Ulbricht (2009) Penalized regression
with correlation-based penalty. Statistics and Compu-
ting, 19(3):239–253.
[43] M.G Usai, A. Carta, and S. Casu (2012) Alter-
native strategies for selecting subsets of predicting
SNPs by LASSO-LARS procedure. BMC Procee-
dings, 6(2):1–9.
[44] X. Wan, C. Yang, Q. Yang, H. Xue, X. Fan, N.L.S.
Tang, and W. Yu (2010) BOOST: A fast approach
to detecting gene-gene interactions in genome-wide
case-control studies. The American Journal of Human
Genetics, 87(3):325–340.
[45] Y . Wang, G. Liu, M. Feng, and L. Wong (2011)
An empirical comparison of several recent epista-
tic interaction detection methods. Bioinformatics,
27(21):2936–2943.
[46] H. Warren, J.-P. Casas, A. Hingorani, F. Dudbridge,
and J. Whittaker (2014) Genetic prediction of quan-
titative lipid traits: Comparing shrinkage models to
gene scores. Genetic Epidemiology, 38(1):72–83.
[47] N.R. Wray, Jian Yang, B.J. Hayes, A.L. Price, M.E.
Goddard, and P.M. Visscher (2013) Pitfalls of pre-
dicting complex traits from SNPs. Nature Reviews
Genetics, 14:507–515.
188 Informatica 42 (2018) 175–188 L.V . Utkin et al.
[48] J. Wu, B. Devlin, S. Ringquist, M. Trucco, and
K. Roeder (2010) Screen and clean: a tool for iden-
tifying interactions in genome-wide association stu-
dies. Genetic Epidemiol., 34(3):275–285.
[49] J. Xia, S. Visweswaran, and R.E. Neapolitan (2012)
Mining epistatic interactions from high-dimensional
data sets. In D.E. Holmes and L.C. Jain, editors, Data
Mining: Foundations and Intelligent Paradigms, pa-
ges 187–209. Springer, Berlin Heidelberg.
[50] P. Yang, J. Ho, A. Zomaya, and B. Zhou (2010) A
genetic ensemble approach for gene-gene interaction
identiﬁcation. BMC Bioinformatics, 11(1):524.
[51] C. Yao, D.M. Spurlock, L.E. Armentano, C.D. Page
Jr., M.J. VandeHaar, D.M. Bickhart, and K.A. Weigel
(2013) Random forests approach for identifying ad-
ditive and epistatic single nucleotide polymorphisms
associated with residual feed intake in dairy cattle.
Journal of Dairy Science, 96(10):6716–6729.
[52] N. Yi, B.S. Yandell, G.A. Churchill, D.B. Allison,
E.J. Eisen, and D. Pomp (2005) Bayesian model se-
lection for genome-wide epistatic quantitative trait
loci analysis. Genetics, 170(3):1333–1344.
[53] X. Zhang, S. Huang, Z. Zhang, and W. Wang (2012)
Chapter 10: Mining genome-wide genetic markers.
PLoS Computational Biology, 8(12):e1002828.
[54] Y . Zhang, B. Jiang, J. Zhu, and J.S. Liu (2011) Bay-
esian models for detecting epistatic interactions from
genetic data. Annals of Human Genetics, 75(1):183–
193.
[55] H. Zhou, D.H. Alexander, M.E. Sehl, J.S. Sinshei-
mer, and K. Lange (2011) Penalized regression for
genome-wide association screening of sequence data.
In Paciﬁc Symposium on Biocomputing, pages 106–
117. World Scientiﬁc Publishing.
[56] H. Zou (2006) The adaptive Lasso and its oracle pro-
perties. Journal of the American Statistical Associa-
tion, 101(476):1418–1429.
[57] H. Zou and T. Hastie (2005) Regularization and va-
riable selection via the elastic net. Journal of the
Royal Statistical Society: Series B (Statistical Met-
hodology), 67(2):301–320.