Algorithm 1: DG ( Duality Gap )
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  Current β
  λ ∈ ℝ  Grid element
1: ϵ ← Xβ − Y
2: f_β ← ||ϵ||₂² + λ||β||₁    Primal objective function
3: α ← λ ∕ ( 2||X^Tϵ||_∞ )    Feasibility bound on the dual scaling
4: α₀ ← −( Y^Tϵ ) ∕ ||ϵ||₂²    Unconstrained maximizer of the dual scaling
5: s ← min{ max{ α₀, −α }, α }    Clamp to the feasible interval [ −α, α ]
6: ν ← ( 2∕λ )( sϵ + Y )    Dual point
7: d_ν ← ( λ²∕4 )||ν||₂² − ||Y||₂²    Dual objective function
return f_β + d_ν
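The gap computation can be sketched in NumPy as follows. The scaling constants here follow one standard derivation of the lasso dual for the objective ||Xβ − Y||₂² + λ||β||₁ and are illustrative, not a definitive reading of the source:

```python
import numpy as np

def duality_gap(X, Y, beta, lam):
    """Duality gap for ||X b - Y||^2 + lam*||b||_1 (a sketch of one
    common scaling of the lasso dual)."""
    eps = X @ beta - Y                                # residual
    f_beta = eps @ eps + lam * np.abs(beta).sum()     # primal objective
    alpha = lam / (2.0 * np.abs(X.T @ eps).max())     # feasibility bound
    alpha0 = -(Y @ eps) / (eps @ eps)                 # unconstrained optimum
    s = min(max(alpha0, -alpha), alpha)               # clamp to [-alpha, alpha]
    nu = (2.0 / lam) * (s * eps + Y)                  # dual point
    d_nu = 0.25 * lam**2 * (nu @ nu) - Y @ Y          # dual objective term
    return f_beta + d_nu
```

Because the clamped scaling keeps the underlying dual point feasible, weak duality makes this quantity non-negative for any β.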
Algorithm 2: DGT ( Duality Gap Target )
Input:
  γ ∈ ℝ  Duality gap target scaling constant
  C ∈ ℝ  Statistical constant
  r_statsIt ∈ ℝ  The current grid element ( for outer loop iteration statsIt )
  n ∈ ℕ  Number of rows in the design matrix X ∈ ℝ^{n×p}
1: dgt ← γ ( C · r_statsIt )² n
return dgt
Algorithm 3: f_β
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  Current β
1: f ← Xβ − Y
return ||f||₂²
Algorithm 4: f̂
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  The k'th β vector
  β′ ∈ ℝ^p  The k − 1'th β vector
  L ∈ ℝ  The current Lipschitz constant, as computed by backtracking line search
1: f ← Xβ′ − Y    Residual at the previous point β′
2: t₀ ← ||f||₂²
3: ∇f ← 2X^Tf
4: Δβ ← β − β′
5: t₁ ← ∇f^TΔβ
6: t₂ ← ||Δβ||₂²
return t₀ + t₁ + ( L∕2 ) t₂
Algorithm 5: τ ( Soft-Thresholding )
Input:
  X ∈ ℝ^{n×m}  An arbitrary matrix
  λ ∈ ℝ  The thresholding parameter
1: X̃ ← X    Make a copy of X.
2: for i,j ∈ X̃ do
3:   X̃_{i,j} ← sign( X̃_{i,j} )( |X̃_{i,j}| − λ )₊
4: end for
return X̃
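Elementwise soft thresholding is short enough to state directly in NumPy; this sketch applies to vectors and matrices alike:

```python
import numpy as np

def soft_threshold(X, lam):
    """Elementwise soft-thresholding: sign(x) * max(|x| - lam, 0)."""
    return np.sign(X) * np.maximum(np.abs(X) - lam, 0.0)
```

Entries with magnitude at most λ are mapped to exactly zero, which is what produces sparse iterates in the proximal algorithms below.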
Algorithm 6: ISTA with backtracking line search and duality gap convergence criteria
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  Starting vector
  L₀ ∈ ℝ  Initial Lipschitz constant, used by backtracking line search
  λ ∈ ℝ  Grid element
  η ∈ ℝ  Step size when updating the Lipschitz constant
  δ ∈ ℝ  Duality gap target
1: β̂ ← β    Make a copy of β that will be updated during backtracking.
2: L ← L₀
3: do
4:   β̂ ← τ_{λ∕L}( β − ( 1∕L )∇f( X, Y, β ) )    Here ∇f( X, Y, β ) = 2X^T( Xβ − Y )
5:   while f_β( X, Y, β̂ ) > f̂( X, Y, β̂, β, L ) do
6:     L ← ηL
7:     β̂ ← τ_{λ∕L}( β − ( 1∕L )∇f( X, Y, β ) )
8:   end while
9:   β ← β̂    Update β once L is sufficiently large.
10: while DG( X, Y, β, λ ) > δ
return β
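ISTA with backtracking can be sketched as below. The helper names (`f_smooth`, `quad_model`), the `max_iter` safety cap, and the exact duality-gap scaling are illustrative assumptions, not part of the pseudocode:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def f_smooth(X, Y, b):
    r = X @ b - Y
    return r @ r

def quad_model(X, Y, b_new, b_old, L):
    # Quadratic majorization of f_smooth around b_old.
    r = X @ b_old - Y
    d = b_new - b_old
    return r @ r + (2.0 * X.T @ r) @ d + 0.5 * L * (d @ d)

def duality_gap(X, Y, beta, lam):
    eps = X @ beta - Y
    f_beta = eps @ eps + lam * np.abs(beta).sum()
    alpha = lam / (2.0 * np.abs(X.T @ eps).max())
    s = min(max(-(Y @ eps) / (eps @ eps), -alpha), alpha)
    return f_beta + (s * eps + Y) @ (s * eps + Y) - Y @ Y

def ista(X, Y, beta, L0, lam, eta, gap_target, max_iter=20000):
    """ISTA with backtracking line search and a duality-gap stopping rule."""
    L = L0
    beta = beta.copy()
    for _ in range(max_iter):
        grad = 2.0 * X.T @ (X @ beta - Y)
        beta_hat = soft_threshold(beta - grad / L, lam / L)
        # Grow L until the quadratic model majorizes the smooth loss.
        while f_smooth(X, Y, beta_hat) > quad_model(X, Y, beta_hat, beta, L):
            L *= eta
            beta_hat = soft_threshold(beta - grad / L, lam / L)
        beta = beta_hat
        if duality_gap(X, Y, beta, lam) <= gap_target:
            break
    return beta
```

Note that `L` is never reset between outer iterations, so the backtracking loop runs rarely once a workable Lipschitz estimate has been found.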
Algorithm 7: Coordinate Descent with duality gap convergence criteria
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  Starting vector
  λ ∈ ℝ  Grid element
  δ ∈ ℝ  Duality gap target
1: β̂ ← β    Make a copy of β
2: do
3:   for i ∈ 1,2,…,p do
4:     t ← λ ∕ ( 2||X_i||₂² )    Scale grid element by the norm of the i'th column of the design matrix
5:     X₋ᵢ ← X_{∀j≠i}    Take all columns of the design matrix except the i'th
6:     β̂₋ᵢ ← β̂_{∀j≠i}    Take all elements of β̂ except the i'th
7:     r ← X_i^T( Y − X₋ᵢβ̂₋ᵢ ) ∕ ||X_i||₂²    Compute the scaled residual
8:     β̂ᵢ ← τ_t( r )    Update the i'th element of β̂
9:   end for
10: while DG( X, Y, β̂, λ ) > δ
return β̂
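A cyclic coordinate descent sweep can be sketched as follows. For simplicity this sketch runs a fixed number of sweeps rather than the duality-gap test; the `n_sweeps` parameter is an assumption for illustration:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def coordinate_descent(X, Y, beta, lam, n_sweeps=500):
    """Cyclic coordinate minimization of ||Y - X b||^2 + lam*||b||_1."""
    b = beta.copy()
    p = X.shape[1]
    col_sq = np.einsum('ij,ij->j', X, X)       # ||X_i||_2^2 for each column
    for _ in range(n_sweeps):
        for i in range(p):
            t = lam / (2.0 * col_sq[i])        # scaled threshold for column i
            b_minus = b.copy()
            b_minus[i] = 0.0                   # drop column i's contribution
            r = X[:, i] @ (Y - X @ b_minus) / col_sq[i]   # scaled residual
            b[i] = soft_threshold(r, t)
    return b
```

Each inner step exactly minimizes the objective over the single coordinate b_i with the others held fixed.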
Algorithm 8: Coordinate Descent with Lazy Evaluation
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  Starting vector
  λ ∈ ℝ  Grid element
  δ ∈ ℝ  Duality gap target
1: β̂ ← β    Make a copy of β
2: R ← Y − Xβ̂    Initialize intermediary residual
3: do
4:   for i ∈ 1,2,…,p do
5:     t ← λ ∕ ( 2||X_i||₂² )    Scale grid element by the norm of the i'th column of the design matrix
6:     if β̂ᵢ ≠ 0 then
7:       R ← R + X_iβ̂ᵢ    Add column i's contribution back into the residual
8:     end if
9:     β̂ᵢ ← τ_t( X_i^TR ∕ ||X_i||₂² )    Update the i'th element of β̂
10:    if β̂ᵢ ≠ 0 then
11:      R ← R − X_iβ̂ᵢ    Remove the updated contribution
12:    end if
13:  end for
14: while DG( X, Y, β̂, λ ) > δ
return β̂
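The lazy variant keeps the residual R = Y − Xβ̂ up to date incrementally instead of recomputing Y − X₋ᵢβ̂₋ᵢ from scratch for every coordinate. A sketch (again with a fixed sweep count in place of the duality-gap test):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def coordinate_descent_lazy(X, Y, beta, lam, n_sweeps=500):
    """Coordinate descent maintaining the residual R = Y - X b, touching it
    only when the current coordinate is nonzero."""
    b = beta.copy()
    p = X.shape[1]
    col_sq = np.einsum('ij,ij->j', X, X)
    R = Y - X @ b                       # intermediary residual
    for _ in range(n_sweeps):
        for i in range(p):
            t = lam / (2.0 * col_sq[i])
            if b[i] != 0.0:
                R += X[:, i] * b[i]     # add column i's contribution back
            b[i] = soft_threshold(X[:, i] @ R / col_sq[i], t)
            if b[i] != 0.0:
                R -= X[:, i] * b[i]     # remove the updated contribution
    return b
```

Once iterates become sparse, most coordinates skip both residual updates, so a full sweep costs far less than the naive version.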
Algorithm 9: Coordinate Descent with standardized data
Input:
  X ∈ ℝ^{n×p}  The standardized ( mean( X_i ) = 0, σ( X_i ) = 1 ) design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  Starting vector
  λ ∈ ℝ  Grid element
  δ ∈ ℝ  Duality gap target
Since every column is standardized, ||X_i||₂² = n for all i, so the per-column norms need not be recomputed.
1: β̂ ← β    Make a copy of β
2: do
3:   for i ∈ 1,2,…,p do
4:     t ← λ ∕ ( 2n )    Scale grid element by the ( constant ) column norm
5:     X₋ᵢ ← X_{∀j≠i}    Take all columns of the design matrix except the i'th
6:     β̂₋ᵢ ← β̂_{∀j≠i}    Take all elements of β̂ except the i'th
7:     r ← X_i^T( Y − X₋ᵢβ̂₋ᵢ ) ∕ n    Compute the scaled residual
8:     β̂ᵢ ← τ_t( r )    Update the i'th element of β̂
9:   end for
10: while DG( X, Y, β̂, λ ) > δ
return β̂
The next algorithm is a modified version of Coordinate Descent that seeks to eliminate redundant
computation as far as possible. It relies on the fact that many of the computations required by
Coordinate Descent can be split into constant and non-constant parts. The constant parts can be
computed ahead of time and stored for later use.
Of particular note is the scaled residual computation, which when written naively
reads:

r ← X_i^T( Y − X₋ᵢβ̂₋ᵢ ) ∕ ||X_i||₂²,

which we can rewrite as

r ← ( X_i^TY ) ∕ ||X_i||₂² − ( ( X_i^TX ) ∕ ||X_i||₂² ) β̃,

where β̃ denotes β̂ with its i'th element set to 0. Note that since the design matrix X and the
response vector Y are fixed, the terms ( X_i^TY ) ∕ ||X_i||₂² and ( X_i^TX ) ∕ ||X_i||₂²
do not change as the values of β̂ are updated. Our strategy will be to compute these values for
each column of X and store them in arrays of size p, from which they will be accessed as β̂ is
updated.
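The decomposition is easy to check numerically. In this sketch, `p1` and `p2` are the precomputable constant parts for one column `i`, and `b_tilde` is the coefficient vector with its i'th entry zeroed:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((15, 4))
Y = rng.standard_normal(15)
b = rng.standard_normal(4)
i = 2

inv = 1.0 / (X[:, i] @ X[:, i])      # the constant 1 / ||X_i||^2
p1 = inv * (X[:, i] @ Y)             # precomputable: X_i^T Y / ||X_i||^2
p2 = inv * (X[:, i] @ X)             # precomputable row: X_i^T X / ||X_i||^2

b_tilde = b.copy()
b_tilde[i] = 0.0                     # beta with the i'th entry zeroed

naive = inv * (X[:, i] @ (Y - X @ b_tilde))   # direct scaled residual
cached = p1 - p2 @ b_tilde                    # same value from stored parts
```

Both expressions evaluate the same scalar, but the cached form costs only a length-p dot product per coordinate update.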
Note that for this algorithm we establish the convention that an array of a given data type will be
declared as follows:

( type )[ # elements in array ]

As an example, an array of real numbers of size n ∈ ℕ would be written as:

( ℝ )[ n ]
Algorithm 10: Coordinate Descent with Minimal Data Copying
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  Starting vector
  λ ∈ ℝ  Grid element
  δ ∈ ℝ  Duality gap target
1: ℷ ← ( ℝ )[ p ]    Initialize array of size p to hold threshold parameters
2: p1 ← ( ℝ )[ p ]    Blank array for part of the residual computation
3: p2 ← ( ℝ^{1×p} )[ p ]    Blank array of row vectors of size p, used for part of the residual computation
4: for i ∈ 1,2,…,p do
5:   ℶ ← 1 ∕ ||X_i||₂²
6:   ℷ[ i ] ← ℶ ∕ 2
7:   p1[ i ] ← ℶ X_i^TY
8:   p2[ i ] ← ℶ X_i^TX
9: end for
10: β̂ ← β    Make a copy of β
11: do
12:   for i ∈ 1,2,…,p do
13:     β̃ⱼ ← β̂ⱼ if j ≠ i else 0    Copy all of β̂ except the i'th element, which is set to 0
14:     r ← p1[ i ] − p2[ i ] β̃    Compute the scaled residual
15:     t ← λ ℷ[ i ]    Compute threshold parameter
16:     β̂ᵢ ← τ_t( r )    Update the i'th element of β̂
17:   end for
18: while DG( X, Y, β̂, λ ) > δ
return β̂
Algorithm 11: FISTA with backtracking line search and duality gap convergence criteria
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  Starting vector
  L₀ ∈ ℝ  Initial Lipschitz constant, used by backtracking line search
  λ ∈ ℝ  Grid element
  η ∈ ℝ  Step size when updating the Lipschitz constant
  δ ∈ ℝ  Duality gap target
  y_{k−1} ∈ ℝ^p  Beta vector from the previous iteration of FISTA
  x_{k−1} ∈ ℝ^p  Intermediary vector from the previous iteration of FISTA
1: y_k ← β
2: x_k ← β
3: t_k ← 1
4: L ← L₀
5: do
6:   y_{k−1} ← y_k
7:   β̂ ← τ_{λ∕L}( y_k − ( 1∕L )∇f( X, Y, y_k ) )
8:   while f_β( X, Y, β̂ ) > f̂( X, Y, β̂, y_k, L ) do
9:     L ← ηL
10:    β̂ ← τ_{λ∕L}( y_k − ( 1∕L )∇f( X, Y, y_k ) )
11:  end while
12:  x_{k−1} ← x_k
13:  x_k ← β̂
14:  t_{k+1} ← ( 1 + √( 1 + 4t_k² ) ) ∕ 2
15:  y_k ← x_k + ( ( t_k − 1 ) ∕ t_{k+1} )( x_k − x_{k−1} )
16:  t_k ← t_{k+1}
17: while DG( X, Y, x_k, λ ) > δ
return y_k, y_{k−1}, x_{k−1}, t_{k+1}
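FISTA adds a momentum extrapolation on top of the ISTA step. The sketch below uses a fixed iteration count in place of the duality-gap test and returns only the final iterate; the helper names and `n_iter` parameter are illustrative assumptions:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def smooth(X, Y, b):
    r = X @ b - Y
    return r @ r

def model(X, Y, b_new, b_old, L):
    # Quadratic majorization of the smooth loss around b_old.
    r = X @ b_old - Y
    d = b_new - b_old
    return r @ r + (2.0 * X.T @ r) @ d + 0.5 * L * (d @ d)

def fista(X, Y, beta, L0, lam, eta, n_iter=3000):
    """FISTA with backtracking line search for ||Y - X b||^2 + lam*||b||_1."""
    L = L0
    x_old = beta.copy()
    y = beta.copy()
    t_k = 1.0
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ y - Y)
        x_new = soft_threshold(y - grad / L, lam / L)
        # Backtracking on the Lipschitz estimate L.
        while smooth(X, Y, x_new) > model(X, Y, x_new, y, L):
            L *= eta
            x_new = soft_threshold(y - grad / L, lam / L)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_k * t_k))
        y = x_new + ((t_k - 1.0) / t_next) * (x_new - x_old)   # momentum step
        x_old, t_k = x_new, t_next
    return x_new
```

The momentum weight (t_k − 1)/t_{k+1} is what lifts the convergence rate from O(1/k) for ISTA to O(1/k²).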
Algorithm 12: λGRID
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  M ∈ ℕ  The number of grid elements required
1: r_max ← 2||X^TY||_∞
2: r_min ← r_max ∕ κ    ( κ > 1 a fixed constant ratio )
3: Δr ← ( log₁₀( r_max ) − log₁₀( r_min ) ) ∕ M
4: Let Λ ∈ ℝ^M    Initialize empty array of size M
5: for i ∈ [ 1,2,…,M ] do
6:   δ_l ← i · Δr + log₁₀( r_min )    Compute linear step
7:   Λ[ i ] ← 10^{δ_l}    Convert to logarithmic step
8: end for
return Λ
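A log-spaced grid of this kind can be sketched with `numpy.logspace`. The ratio r_min/r_max is an assumption here (the constant is not recoverable from the source), exposed as a parameter:

```python
import numpy as np

def lambda_grid(X, Y, M, ratio=1e-3):
    """M regularization levels, log-spaced from ratio*r_max up to
    r_max = 2*||X^T Y||_inf.  The ratio is an illustrative default."""
    r_max = 2.0 * np.abs(X.T @ Y).max()
    r_min = ratio * r_max
    return np.logspace(np.log10(r_min), np.log10(r_max), M)
```

r_max is the smallest penalty for which the all-zero vector is optimal for the objective ||Xβ − Y||₂² + λ||β||₁, which is why the grid is anchored there.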
Algorithm 13: SCC ( Statistics Continuation Condition )
Input:
  C ∈ ℝ  Statistical constant
  statsIt ∈ ℕ  Index of the current grid element ( outer loop iteration number )
  r_statsIt ∈ ℝ  Current grid element
  Λ ∈ ℝ^M  Vector of grid elements
  X ∈ ℝ^{n×p}  The design matrix
  βs ∈ ℝ^{p×M}  Betas matrix
1: condition ← true
2: for i ∈ [ 1,2,…,statsIt ] do
3:   r_i ← Λ_i
4:   Δβ ← β_statsIt − β_i    ( columns of the Betas matrix )
5:   check ← ||Δβ||_∞ ∕ ( r_i + r_statsIt )
6:   condition ← condition & ( check ≤ C )
7: end for
return condition
Algorithm 14: FOS
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  Starting vector
  L₀ ∈ ℝ  Initial Lipschitz constant, used by backtracking line search
  M ∈ ℕ  Number of grid elements
  η ∈ ℝ  Step size when updating the Lipschitz constant
  C ∈ ℝ  Statistical constant
  γ ∈ ℝ  Duality gap target scaling constant
1: X̃ ← ( X − X̄ ) ∕ σ_X    Normalize the columns of X to mean 0 and standard deviation 1.
2: Ỹ ← ( Y − Ȳ ) ∕ σ_Y    Normalize Y.
3: Λ ← λGRID( X̃, Ỹ, M )    Initialize grid elements
4: βs ∈ ℝ^{p×M} ← 0_{p,M}    Initialize matrix of Betas to the zero matrix
5: statsCont ← true
6: statsIt ← 1
7: while statsCont & ( statsIt < M ) do
8:   statsIt ← statsIt + 1
9:   β̂ ← β_{statsIt−1}    Initialize old beta vector with the ( statsIt − 1 )'th column of the Betas matrix.
10:  r_statsIt ← Λ_statsIt    Extract the statsIt'th grid element.
11:  gap ← DGT( γ, C, r_statsIt, n )    Duality gap target
12:  if DG( X̃, Ỹ, β̂, r_statsIt ) ≤ gap then
13:    β_statsIt ← β_{statsIt−1}
14:  else
15:    β_statsIt ← ISTA( X̃, Ỹ, β_{statsIt−1}, L₀, r_statsIt, η, gap )
16:  end if
17:  statsCont ← SCC( C, statsIt, r_statsIt, Λ, X̃, βs )
18: end while
return β_{statsIt−1}, Λ_statsIt, statsIt
Algorithm 15: DP ( Dual Point )
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  Current β
  λ ∈ ℝ  Grid element
1: R ← Y − Xβ
2: α ← λ ∕ ( 2||X^TR||_∞ )
3: s ← min{ max{ ( Y^TR ) ∕ ||R||₂², −α }, α }
4: ν ← sR
return ν
Algorithm 16: DG2 ( Duality Gap from a Dual Point )
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  Current primal point
  ν ∈ ℝ^n  Current dual point
  λ ∈ ℝ  Grid element
1: f_β ← ||Y − Xβ||₂² + λ||β||₁    Primal objective function
2: d_ν ← ||Y||₂² − ||ν − Y||₂²    Dual objective function
return f_β − d_ν
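The pair of routines can be sketched together. As above, the clamping constants follow one standard scaling of the lasso dual for ||Y − Xβ||₂² + λ||β||₁ and are illustrative:

```python
import numpy as np

def dual_point(X, Y, beta, lam):
    """A scaled residual as dual point, with the scaling s clamped so the
    associated dual variable stays feasible."""
    R = Y - X @ beta
    alpha = lam / (2.0 * np.abs(X.T @ R).max())     # feasibility bound
    s = min(max((Y @ R) / (R @ R), -alpha), alpha)  # clamp to [-alpha, alpha]
    return s * R

def duality_gap_2(X, Y, beta, nu, lam):
    """Gap between the primal objective and the dual value attached to nu."""
    r = Y - X @ beta
    f_beta = r @ r + lam * np.abs(beta).sum()
    d_nu = Y @ Y - (nu - Y) @ (nu - Y)
    return f_beta - d_nu
```

Computing the dual point once and reusing it for both the gap test and the screening rule below avoids recomputing the residual.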
Algorithm 17: SAS ( Safe Active Set )
Input:
  X ∈ ℝ^{n×p}  The design matrix
  c ∈ ℝ^n  Center of the ball
  r ≥ 0  Radius of the ball
1: 𝒜 ← ∅    Initialize active set with the empty set
2: for j ∈ { 1,…,p } do
3:   if |X_j^Tc| + r||X_j||₂ ≥ 1 then
4:     𝒜 ← 𝒜 ∪ { j }
5:   end if
6: end for
return 𝒜
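The screening test vectorizes naturally; this sketch returns the indices kept active:

```python
import numpy as np

def safe_active_set(X, c, r):
    """Keep feature j unless the ball of radius r centred at c certifies
    that j's dual constraint is strictly inactive."""
    scores = np.abs(X.T @ c) + r * np.linalg.norm(X, axis=0)
    return {j for j in range(X.shape[1]) if scores[j] >= 1.0}
```

A small radius (a tight duality gap) discards many columns, so later coordinate sweeps only touch the surviving features; with a huge radius nothing can be certified inactive and every column is kept.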
Algorithm 18: CDSR ( Coordinate Descent with Lazy Evaluation and Screening Rule )
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  β ∈ ℝ^p  Starting vector
  λ ∈ ℝ  Grid element
  δ ∈ ℝ  Duality gap target
1: β̂ ← β    Make a copy of β
2: R ← Y − Xβ̂    Initialize intermediary residual
3: 𝒜 ← { 1,…,p }    Initialize active set
4: optimCont ← true
5: while optimCont do
6:   ν ← DP( X, Y, β̂, λ )    Dual point
7:   ĝ ← DG2( X, Y, β̂, ν, λ )    Duality gap
8:   𝒜 ← SAS( X, ν, r )    Safe active set, for a ball radius r derived from the duality gap ĝ
9:   if ĝ ≤ δ then
10:    optimCont ← false
11:  else
12:    for i ∈ 𝒜 do
13:      t ← λ ∕ ( 2||X_i||₂² )    Scale grid element by the norm of the i'th column of the design matrix
14:      if β̂ᵢ ≠ 0 then
15:        R ← R + X_iβ̂ᵢ
16:      end if
17:      β̂ᵢ ← τ_t( X_i^TR ∕ ||X_i||₂² )    Update the i'th element of β̂
18:      if β̂ᵢ ≠ 0 then
19:        R ← R − X_iβ̂ᵢ
20:      end if
21:    end for
22:  end if
23:  β̂_{𝒜^c} ← 0    Set to 0 the coefficients not in 𝒜
24: end while
return β̂
Algorithm 19: FOS with CDSR
Input:
  X ∈ ℝ^{n×p}  The design matrix
  Y ∈ ℝ^n  The response vector
  M ∈ ℕ  Number of grid elements
  C > 0  Statistical constant
  γ > 0  Duality gap target scaling constant
1: X̃ ← ( X − X̄ ) ∕ σ_X    Normalize the columns of X to mean 0 and standard deviation 1.
2: Ỹ ← ( Y − Ȳ ) ∕ σ_Y    Normalize Y.
3: Λ ← λGRID( X̃, Ỹ, M )    Initialize grid elements
4: βs ∈ ℝ^{p×M} ← 0_{p,M}    Initialize matrix of Betas to the zero matrix
5: statsCont ← true
6: statsIt ← 1
7: while statsCont & ( statsIt < M ) do
8:   statsIt ← statsIt + 1
9:   δ ← DGT( γ, C, Λ_statsIt, n )    Duality gap target
10:  β_statsIt ← CDSR( X̃, Ỹ, β_{statsIt−1}, Λ_statsIt ∕ 2, δ ∕ 2 )
11:  statsCont ← SCC( C, statsIt, Λ_statsIt, Λ, X̃, βs )
12: end while