tceic.com

学霸学习网 这下你爽了

学霸学习网 这下你爽了

- Online Dictionary Learning for Sparse Coding
- Submodular Dictionary Learning for Sparse Coding
- 判别字典学习Learning A Discriminative Dictionary for Sparse Representation
- latent dictionary learning for sparse representation based classification
- %Yang CVPR2014 - Latent Dictionary Learning for Sparse Representation based Classification
- Fisher Discrimination Dictionary Learning for Sparse Representation-ICCV_final
- Supervised Class-Specific Dictionary Learning for Sparse Modeling in Action Recognition
- dictionary learning and sparse representation based single-image super-resolution reconstruction
- Online Dictionary Learning for Sparse Coding(ICML 2009)
- Sparse Representation for Color Image Restoration

Published as: Neural Comput. 2003 February ; 15(2): 349–396.

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

Dictionary Learning Algorithms for Sparse Representation

Kenneth Kreutz-Delgado, Electrical and Computer Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, California 92093-0407, U.S.A Joseph F. Murray, Electrical and Computer Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, California 92093-0407, U.S.A Bhaskar D. Rao, Electrical and Computer Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, California 92093-0407, U.S.A Kjersti Engan, Stavanger University College, School of Science and Technology Ullandhaug, N-4091 Stavanger, Norway Te-Won Lee, and Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute, La Jolla, California 92037, U.S.A Terrence J. Sejnowski Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute, La Jolla, California 92037, U.S.A

Kenneth Kreutz-Delgado: kreutz@ece.ucsd.edu; Joseph F. Murray: jfmurray@ucsd.edu; Bhaskar D. Rao: brao@ece.ucsd.edu; Kjersti Engan: kjersti.engan@tn.his.no; Te-Won Lee: tewon@salk.edu; Terrence J. Sejnowski: terry@salk.edu

Abstract

Algorithms for data-driven learning of domain-specific overcomplete dictionaries are developed to obtain maximum likelihood and maximum a posteriori dictionary estimates based on the use of Bayesian models with concave/Schur-concave (CSC) negative log priors. Such priors are appropriate for obtaining sparse representations of environmental signals within an appropriately chosen (environmentally matched) dictionary. The elements of the dictionary can be interpreted as concepts, features, or words capable of succinct expression of events encountered in the environment (the source of the measured signals). This is a generalization of vector quantization in that one is interested in a description involving a few dictionary entries (the proverbial “25 words or less”), but not necessarily as succinct as one entry. To learn an environmentally adapted dictionary capable of concise expression of signals generated by the environment, we develop algorithms that iterate between a representative set of sparse representations found by variants of FOCUSS and an update of the dictionary using these sparse representations. Experiments were performed using synthetic data and natural images. For complete dictionaries, we demonstrate that our algorithms have improved performance over other independent component analysis (ICA) methods, measured in terms of signal-to-noise ratios of separated sources. In the overcomplete case, we show that the true underlying dictionary and sparse sources can be accurately recovered. In tests with natural images, learned overcomplete dictionaries are shown to have higher coding efficiency than complete dictionaries; that is, images encoded with an over-complete

Kreutz-Delgado et al.

Page 2

dictionary have both higher compression (fewer bits per pixel) and higher accuracy (lower mean square error).

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

1 Introduction

FOCUSS, which stands for FOCal Underdetermined System Solver, is an algorithm designed to obtain suboptimally (and, at times, maximally) sparse solutions to the following m × n, underdetermined linear inverse problem1 (Gorodnitsky, George, & Rao, 1995;Rao & Gorodnitsky, 1997;Gorodnitsky & Rao, 1997;Adler, Rao, & Kreutz-Delgado, 1996;Rao & Kreutz-Delgado, 1997;Rao, 1997,1998),

(1.1)

for known A. The sparsity of a vector is the number of zero-valued elements (Donoho, 1994), and is related to the diversity, the number of nonzero elements,

Since our initial investigations into the properties of FOCUSS as an algorithm for providing sparse solutions to linear inverse problems in relatively noise-free environments (Gorodnitsky et al., 1995; Rao, 1997; Rao & Gorodnitsky, 1997; Gorodnitsky & Rao, 1997; Adler et al., 1996; Rao & Kreutz-Delgado, 1997), we now better understand the behavior of FOCUSS in noisy environments (Rao & Kreutz-Delgado, 1998a, 1998b) and as an interior point-like optimization algorithm for optimizing concave functionals subject to linear constraints (Rao & Kreutz-Delgado, 1999; Kreutz-Delgado & Rao, 1997, 1998a, 1998b, 1998c, 1999; KreutzDelgado, Rao, Engan, Lee, & Sejnowski, 1999a; Engan, Rao, & Kreutz-Delgado, 2000; Rao, Engan, Cotter, & Kreutz-Delgado, 2002). In this article, we consider the use of the FOCUSS algorithm in the case where the matrix A is unknown and must be learned. Toward this end, we first briefly discuss how the use of concave (and Schur concave) functionals enforces sparse solutions to equation 1.1. We also discuss the choice of the matrix, A, in equation 1.1 and its relationship to the set of signal vectors y for which we hope to obtain sparse representations. Finally, we present algorithms capable of learning an environmentally adapted dictionary, A, given a sufficiently large and statistically representative sample of signal vectors, y, building on ideas originally presented in Kreutz-Delgado, Rao, Engan, Lee, and Sejnowski (1999b), Kreutz-Delgado, Rao, and Engan (1999), and Engan, Rao, and Kreutz-Delgado (1999). We refer to the columns of the full row-rank m × n matrix A,

(1.2)

as a dictionary, and they are assumed to be a set of vectors capable of providing a highly succinct representation for most (and, ideally, all) statistically representative signal vectors y ∈ ?m. Note that with the assumption that rank(A) = m, every vector y has a representation; the

1For notational simplicity, we consider the real case only. Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 3

question at hand is whether this representation is likely to be sparse. We call the statistical generating mechanism for signals, y, the environment and a dictionary, A, within which such signals can be sparsely represented an environmentally adapted dictionary. Environmentally generated signals typically have significant statistical structure and can be represented by a set of basis vectors spanning a lower-dimensional submanifold of meaningful signals (Field, 1994; Ruderman, 1994). These environmentally meaningful representation vectors can be obtained by maximizing the mutual information between the set of these vectors (the dictionary) and the signals generated by the environment (Comon, 1994; Bell & Sejnowski, 1995; Deco & Obradovic, 1996; Olshausen & Field, 1996; Zhu, Wu, & Mumford, 1997; Wang, Lee, & Juang, 1997). This procedure can be viewed as a natural generalization of independent component analysis (ICA) (Comon, 1994; Deco & Obradovic, 1996). As initially developed, this procedure usually results in obtaining a minimal spanning set of linearly independent vectors (i.e., a true basis). More recently, the desirability of obtaining “overcomplete” sets of vectors (or “dictionaries”) has been noted (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000; Coifman & Wickerhauser, 1992; Mallat & Zhang, 1993; Donoho, 1994; Rao & KreutzDelgado, 1997). For example, projecting measured noisy signals onto the signal submanifold spanned by a set of dictionary vectors results in noise reduction and data compression (Donoho, 1994, 1995). These dictionaries can be structured as a set of bases from which a single basis is to be selected to represent the measured signal(s) of interest (Coifman & Wickerhauser, 1992) or as a single, overcomplete set of individual vectors from within which a vector, y, is to be sparsely represented (Mallat & Zhang, 1993; Olshausen & Field, 1996; Lewicki & Sejnowski, 2000; Rao & Kreutz-Delgado, 1997). The problem of determining a representation from a full row-rank over-complete dictionary, A = [a1, …, an], n ? m, for a specific signal measurement, y, is equivalent to solving an underdetermined inverse problem, Ax = y, which is nonuniquely solvable for any y. The standard least-squares solution to this problem has the (at times) undesirable feature of involving all the dictionary vectors in the solution2 (the “spurious artifact” problem) and does not generally allow for the extraction of a categorically or physically meaningful solution. That is, it is not generally the case that a least-squares solution yields a concise representation allowing for a precise semantic meaning.3 If the dictionary is large and rich enough in representational power, a measured signal can be matched to a very few (perhaps even just one) dictionary words. In this manner, we can obtain concise semantic content about objects or situations encountered in natural environments (Field, 1994). Thus, there has been significant interest in finding sparse solutions, x (solutions having a minimum number of nonzero elements), to the signal representation problem. Interestingly, matching a specific signal to a sparse set of dictionary words or vectors can be related to entropy minimization as a means of elucidating statistical structure (Watanabe, 1981). Finding a sparse representation (based on the use of a “few” code or dictionary words) can also be viewed as a generalization of vector quantization where a match to a single “code vector” (word) is always sought (taking “code book” = “dictionary”).4 Indeed, we can refer to a sparse solution, x, as a sparse coding of the signal instantiation, y. 1.1 Stochastic Models It is well known (Basilevsky, 1994) that the stochastic generative model

2This fact comes as no surprise when the solution is interpreted within a Bayesian framework using a gaussian (maximum entropy) prior. 3Taking “semantic” here to mean categorically or physically interpretable. 4For example, n = 100 corresponds to 100 features encoded via vector quantization (“one column = one concept”). If we are allowed to represent features using up to four columns, we can encode showing a combinatorial boost in expressive power. Neural Comput. Author manuscript; available in PMC 2010 September 23. concepts

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

Kreutz-Delgado et al.

Page 4

(1.3)

can be used to develop algorithms enabling coding of y ∈ ?m via solving the inverse problem for a sparse solution x ∈ ?n for the undercomplete (n < m) and complete (n = m) cases. In recent years, there has been a great deal of interest in obtaining sparse codings of y with this procedure for the overcomplete (n > m) case (Mallat & Zhang, 1993; Field, 1994). In our earlier work, we have shown that given an overcomplete dictionary, A (with the columns of A comprising the dictionary vectors), a maximum a posteriori (MAP) estimate of the source vector, x, will yield a sparse coding of y in the low-noise limit if the negative log prior, ?log (P(x)), is concave/Schur-concave (CSC) (Rao, 1998; Kreutz-Delgado & Rao, 1999), as discussed below. For P(x) factorizable into a product of marginal probabilities, the resulting code is also known to provide an independent component analysis (ICA) representation of y. More generally, a CSC prior results in a sparse representation even in the nonfactorizable case (with x then forming a dependent component analysis, or DCA, representation). Given independently and identically distributed (i.i.d.) data, Y = YN = (y1, …, yN), assumed to be generated by the model 1.3, a maximum likelihood estimate, ?ML, of the unknown (but nonrandom) dictionary A can be determined as (Olshausen & Field, 1996;Lewicki & Sejnowski, 2000)

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

This requires integrating out the unobservable i.i.d. source vectors, X = XN = (x1, …, xN), in order to compute P(Y; A) from the (assumed) known probabilities P(x) and P(ν). In essence, X is formally treated as a set of nuisance parameters that in principle can be removed by integration. However, because the prior P(x) is generally taken to be supergaussian, this integration is intractable or computationally unreasonable. Thus, approximations to this integration are performed that result in an approximation to P(Y; A), which is then maximized with respect to Y. A new, better approximation to the integration can then be made, and this process is iterated until the estimate of the dictionary A has (hopefully) converged (Olshausen & Field, 1996). We refer to the resulting estimate as an approximate maximum likelihood (AML) estimate of the dictionary A (denoted here by ?AML). No formal proof of the convergence of this algorithm to the true maximum likelihood estimate, ?ML, has been given in the prior literature, but it appears to perform well in various test cases (Olshausen & Field, 1996). Below, we discuss the problem of dictionary learning within the framework of our recently developed log-prior model-based sparse source vector learning approach that for a known overcomplete dictionary can be used to obtain sparse codes (Rao, 1998; Kreutz-Delgado & Rao, 1997, 1998b, 1998c, 1999; Rao & Kreutz-Delgado, 1999). Such sparse codes can be found using FOCUSS, an affine scaling transformation (AST)–like iterative algorithm that finds a sparse locally optimal MAP estimate of the source vector x for an observation y. Using these results, we can develop dictionary learning algorithms within the AML framework and for obtaining a MAP-like estimate, ?MAP, of the (now assumed random) dictionary, A, assuming in the latter case that the dictionary belongs to a compact submanifold corresponding to unit Frobenius norm. Under certain conditions, convergence to a local minimum of a MAPloss function that combines functions of the discrepancy e = (y ? Ax) and the degree of sparsity in x can be rigorously proved.

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 5

1.2 Related Work Previous work includes efforts to solve equation 1.3 in the overcomplete case within the maximum likelihood (ML) framework. An algorithm for finding sparse codes was developed in Olshausen and Field (1997) and tested on small patches of natural images, resulting in Gaborlike receptive fields. In Lewicki and Sejnowski (2000) another ML algorithm is presented, which uses the Laplacian prior to enforce sparsity. The values of the elements of x are found with a modified conjugate gradient optimization (which has a rather complicated implementation) as opposed to the standard ICA (square mixing matrix) case where the coefficients are found by inverting the A matrix. The difficulty that arises when using ML is that finding the estimate of the dictionary A requires integrating over all possible values of the joint density P(y, x; A) as a function of x. In Olshausen and Field (1997), this is handled by assuming the prior density of x is a delta function, while in Lewicki and Sejnowski (2000), it is approximated by a gaussian. The fixed-point FastICA (Hyv?rinen, Cristescu, & Oja, 1999) has also been extended to generate overcomplete representations. The FastICA algorithm can find the basis functions (columns of the dictionary A) one at a time by imposing a quasiorthogonality condition and can be thought of as a “greedy” algorithm. It also can be run in parallel, meaning that all columns of A are updated together. Other methods to solve equation 1.3 in the overcomplete case have been developed using a combination of the expectation-maximization (EM) algorithm and variational approximation techniques. Independent factor analysis (Attias, 1999) uses a mixture of gaussians to approximate the prior density of the sources, which avoids the difficulty of integrating out the parameters X and allows different sources to have different densities. In another method (Girolami, 2001), the source priors are assumed to be supergaussian (heavy-tailed), and a variational lower bound is developed that is used in the EM estimation of the parameters A and X. It is noted in Girolami (2001) that the mixtures used in independent factor analysis are more general than may be needed for the sparse overcomplete case, and they can be computationally expensive as the dimension of the data vector and number of mixtures increases. In Zibulevsky and Pearlmutter (2001), the blind source separation problem is formulated in terms of a sparse source underlying each unmixed signal. These sparse sources are expanded into the unmixed signal with a predefined wavelet dictionary, which may be overcomplete. The unmixed signals are linearly combined by a different mixing matrix to create the observed sensor signals. The method is shown to give better separation performance than ICA techniques. The use of learned dictionaries (instead of being chosen a priori) is suggested.

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

2 FOCUSS: Sparse Solutions for Known Dictionaries

2.1 Known Dictionary Model A Bayesian interpretation is obtained from the generative signal model, equation 1.3, by assuming that x has the parameterized (generally nongaussian) probability density function (pdf),

(2.1)

with parameter vector p. Similarly, the noise ν is assumed to have a parameterized (possibly nongaussian) density Pq(ν) of the same form as equation 2.1 with parameter vector q. It is assumed that x and ν have zero means and that their densities obey the property d(x) = d(|x|), for | · | defined component-wise. This is equivalent to assuming that the densities are symmetric with respect to sign changes in the components x, x[i] ← ?x[i] and therefore that the skews of these densities are zero. We also assume that d(0) = 0. With a slight abuse of notation, we allow

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 6

the differing subscripts q and p to indicate that dq and dp may be functionally different as well as parametrically different. We refer to densities like equation 2.1 for suitable additional constraints on dp(x), as hypergeneralized gaussian distributions (Kreutz-Delgado & Rao, 1999; Kreutz-Delgado et al., 1999).

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

If we treat A, p, and q as known parameters, then x and y are jointly distributed as

Bayes’ rule yields

(2.2)

(2.3)

Usually the dependence on p and q is notationally suppressed, for example, β = P(y; A). Given an observation, y, maximizing equation 2.2 with respect to x yields a MAP estimate x?. This ideally results in a sparse coding of the observation, a requirement that places functional constraints on the pdfs, and particularly on dp. Note that β is independent of x and can be ignored when optimizing equation 2.2 with respect to the unknown source vector x. The MAP estimate equivalently is obtained from minimizing the negative logarithm of P(x | y), which is

(2.4)

where λ = γp/γq and dq(y?Ax) = dq(Ax?y) by our assumption of symmetry. The quantity is interpretable as a signal-to-noise ratio (SNR). Furthermore, assuming that both dq and dp are concave/Schur–concave (CSC) as defined in section 2.4, then the term dq(y ? Ax) in equation 2.4 will encourage sparse residuals, e = y ? Ax?, while the term dp(x) encourages sparse source vector estimates, x?. A given value of λ then determines a trade-off between residual and source vector sparseness. This most general formulation will not be used here. Although we are interested in obtaining sparse source vector estimates, we will not enforce sparsity on the residuals but instead, to simplify the development, will assume the q = 2 i.i.d. gaussian measurement noise case (ν gaussian with known covariance σ2 · I), which corresponds to taking,

(2.5)

In this case, problem 2.4 becomes

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 7

(2.6)

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

In either case, we note that λ → 0 as γp → 0 which (consistent with the generative model, 1.3) we refer to as the low noise limit. Because the mapping A is assumed to be onto, in the lownoise limit, the optimization, equation 2.4, is equivalent to the linearly constrained problem,

(2.7)

In the low-noise limit, no sparseness constraint need be placed on the residuals, e = y ? Ax?, which are assumed to be zero. It is evident that the structure of dp(·) is critical for obtaining a sparse coding, x?, of the observation y (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999). Throughtout this article, the quantity dp(x) is always assumed to be CSC (enforcing sparse solutions to the inverse problem 1.3). As noted, and as will be evident during the development of dictionary learning algorithms below, we do not impose a sparsity constraint on the residuals; instead, the measurement noise ν will be assumed to be gaussian (q = 2). 2.2 Independent Component Analysis and Sparsity Inducing Priors An important class of densities is given by the generalized gaussians for which

(2.8)

for p > 0 (Kassam, 1982). This is a special case of the larger ?p class (the p-class) of functions, which allows p to be negative in value (Rao & Kreutz-Delgado, 1999; Kreutz-Delgado & Rao, 1997). Note that this function has the special property of separability,

which corresponds to factorizability of the density Pp(x),

and hence to independence of the components of x. The assumption of independent components allows the problem of solving the generative model, equation 1.3, for x to be interpreted as an ICA problem (Comon, 1994; Pham, 1996; Olshausen & Field, 1996; Roberts, 1998). It is of interest, then, to consider the development of a large class of parameterizable separable functions dp(x) consistent with the ICA assumption (Rao & Kreutz-Delgado, 1999; KreutzDelgado & Rao, 1997). Given such a class, it is natural to examine the issue of finding a best fit within this class to the “true” underlying prior density of x. This is a problem of parametric density estimation of the true prior, where one attempts to find an optimal choice of the model

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 8

density Pp(x) by an optimization over the parameters p that define the choice of a prior from within the class. This is, in general, a difficult problem, which may require the use of Monte Carlo, evolutionary programming, or stochastic search techniques. Can the belief that supergaussian priors, Pp(x), are appropriate for finding sparse solutions to equation 1.3 (Field, 1994; Olshausen & Field, 1996) be clarified or made rigorous? It is well known that the generalized gaussian distribution arising from the use of equation 2.8 yields supergaussian distributions (positive kurtosis) for p < 2 and subgaussian (negative kurtosis) for p > 2. However, one can argue (see section 2.5 below) that the condition for obtaining sparse solutions in the low-noise limit is the stronger requirement that p ≤ 1, in which case the separable function dp(x) is CSC. This indicates that supergaussianness (positive kurtosis) alone is necessary but not sufficient for inducing sparse solutions. Rather, sufficiency is given by the requirement that ?log Pp(x) ≈ dp(x) be CSC. We have seen that the function dp(x) has an interpretation as a (negative logarithm of) a Bayesian prior or as a penalty function enforcing sparsity in equation 2.4, where dp(x) should serve as a “relaxed counting function” on the nonzero elements of x. Our perspective emphasizes that dp(x) serves both of these goals simultaneously. Thus, good regularizing functions, dp(x), should be flexibly parameterizable so that Pp(x) can be optimized over the parameter vector p to provide a good parametric fit to the underlying environmental pdf, and the functions should also have analytical properties consistent with the goal of enforcing sparse solutions. Such properties are discussed in the next section. 2.3 Majorization and Schur-Concavity In this section, we discuss functions that are both concave and Schur-concave (CSC functions; Marshall & Olkin, 1979). We will call functions, dp(·), which are CSC, diversity functions, anticoncentration functions or antisparsity functions. The larger the value of the CSC function dp(x), the more diverse (i.e., the less concentrated or sparse) the elements of the vector x are. Thus, minimizing dp(x) with respect to x results in less diverse (more concentrated or sparse) vectors x. 2.3.1 Schur-Concave Functions—A measure of the sparsity of the elements of a solution vector x (or the lack thereof, which we refer to as the diversity of x) is given by a partial ordering on vectors known as the Lorentz order. For any vector in the positive orthant, , define the decreasing rearrangement

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

and the partial sums (Marshall & Olkin, 1979; Wickerhauser, 1994),

We say that y majorizes x, y > x, iff for k = 1, …, n,

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 9

and the vector y is said to be more concentrated, or less diverse, than x. This partial order defined by majorization then defines the Lorentz order. We are interested in scalar-valued functions of x that are consistent with majorization. Such functions are known as Schur-concave functions, . They are defined to be precisely the class of functions consistent with the Lorentz order,

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

In words, if y is less diverse than x (according to the Lorentz order) then d(y) is less than d(x) for d(·) Schur-concave. We assume that Schur-concavity is a necessary condition for d(·) to be a good measure of diversity (antisparsity). 2.3.2 Concavity Yields Sparse Solutions—Recall that a function d(·) is concave on the positive orthant iff (Rockafellar, 1970)

?x, , ?γ, 0 ≤ γ ≤ 1. In addition, a scalar function is said to be permutation invariant if its value is independent of rearrangements of its components. An important fact is that for permutation invariant functions, concavity is a sufficient condition for Schur-concavity:

Now consider the low-noise sparse inverse problem, 2.7. It is well known that subject to linear takes its minima on the boundary of (Rockafellar, constraints, a concave function on 1970), and as a consequence these minima are therefore sparse. We take concavity to be a sufficient condition for a permutation invariant d(·) to be a measure of diversity and obtain sparsity as constrained minima of d(·). More generally, a diversity measure should be somewhere between Schur-concave and concave. In this spirit, one can define almost concave functions (Kreutz-Delgado & Rao, 1997), which are Schur-concave and (locally) concave in all n directions but one, which also are good measures of diversity. 2.3.3 Separability, Schur-Concavity, and ICA—The simplest way to ensure that d(x) be permutation invariant (a necessary condition for Schur-concavity) is to use functions that are separable. Recall that separability of dp(x) corresponds to factorizability of Pp(x). Thus, separability of d(x) corresponds to the assumption of independent components of x under the model 1.3. We see that from a Bayesian perspective, separability of d(x) corresponds to a generative model for y that assumes a source, x, with independent components. With this assumption, we are working within the framework of ICA (Nadal & Parga, 1994;Pham, 1996;Roberts, 1998). We have developed effective algorithms for solving the optimization problem 2.7 for sparse solutions when dp(x) is separable and concave (Kreutz-Delgado & Rao, 1997;Rao & Kreutz-Delgado, 1999). It is now evident that relaxing the restriction of separability generalizes the generative model to the case where the source vector, x, has dependent components. We can reasonably call an approach based on a nonseparable diversity measure d(x) a dependent component analysis (DCA). Unless care is taken, this relaxation can significantly complicate the analysis and

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 10

development of optimization algorithms. However, one can solve the low-noise DCA problem, at least in principle, provided appropriate choices of nonseparable diversity functions are made. 2.4 Supergaussian Priors and Sparse Coding

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

The P-class of diversity measures for 0 < p ≤ 1 result in sparse solutions to the low-noise coding problem, 2.7. These separable and concave (and thus Schur-concave) diversity measures correspond to supergaussian priors, consistent with the folk theorem that supergaussian priors are sparsity-enforcing priors. However, taking 1 ≤ p < 2 results in supergaussian priors that are not sparsity enforcing. Taking p to be between 1 and 2 yields a dp(x) that is convex and therefore not concave. This is consistent with the well-known fact that for this range of p, the pth-root of dp(x) is a norm. Minimizing dp(x) in this case drives x toward the origin, favoring concentrated rather than sparse solutions. We see that if a sparse coding is to be found based on obtaining a MAP estimate to the low-noise generative model, 1.3, then, in a sense, supergaussianness is a necessary but not sufficient condition for a prior to be sparsity enforcing. A sufficient condition for obtaining a sparse MAP coding is that the negative log prior be CSC. 2.5 The FOCUSS Algorithm Locally optimal solutions to the known-dictionary sparse inverse problems in gaussian noise, equations 2.6 and 2.7, are given by the FOCUSS algorithm. This is an affine-scaling transformation (AST)-like (interior point) algorithm originally proposed for the low-noise case 2.7 in Rao and Kreutz-Delgado (1997, 1999) and Kreutz-Delgado and Rao (1997); and extended by regularization to the nontrivial noise case, equation 2.6, in Rao and KreutzDelgado (1998a), Engan et al. (2000), and Rao et al. (2002). In these references, it is shown that the FOCUSS algorithm has excellent behavior for concave functions (which includes the the CSC concentration functions) dp(·). For such functions, FOCUSS quickly converges to a local minimum, yielding a sparse solution to problems 2.7 and 2.6. One can quickly motivate the development of the FOCUSS algorithm appropriate for solving the optimization problem 2.6 by considering the problem of obtaining the stationary points of the objective function. These are given as solutions, x*, to

(2.9)

In general, equation 2.9 is nonlinear and cannot be explicitly solved for a solution x*. However, we proceed by assuming the existence of a gradient factorization,

(2.10)

where α(x) is a positive scalar function and Π(x) is symmetric, positive-definite, and diagonal. As discussed in Kreutz-Delgado and Rao (1997, 1998c) and Rao and Kreutz-Delgado (1999), this assumption is generally true for CSC sparsity functions dp(·) and is key to understanding FOCUSS as a sparsity-inducing interior-point (AST-like) optimization algorithm.5 With the gradient factorization 2.10, the stationary points of equation 2.9 are readily shown to be solutions to the (equally nonlinear and implicit) system,

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 11

(2.11)

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

(2.12)

where β(x) = λα(x) and the second equation follows from identity A.18. Although equation 2.12 is also not generally solvable in closed form, it does suggest the following relaxation algorithm,

(2.13)

which is to be repeatedly reiterated until convergence. Taking β ≡ 0 in equation 2.13 yields the FOCUSS algorithm proved in Kreutz-Delgado and Rao (1997, 1998c) and Rao and Kreutz-Delgado (1999) to converge to a sparse solution of equation 2.7 for CSC sparsity functions dp(·). The case β ≠ 0 yields the regularized FOCUSS algorithm that will converge to a sparse solution of equation 2.6 (Rao, 1998; Engan et al., 2000; Rao et al., 2002). More computationally robust variants of equation 2.13 are discussed elsewhere (Gorodnitsky & Rao, 1997; Rao & Kreutz-Delgado, 1998a). Note that for the general regularized FOCUSS algorithm, 2.13, we have β(x?) = λα(x?), where λ is the regularization parameter in equation 2.4. The function β(x) is usually generalized to be a function of x?, y and the iteration number. Methods for choosing λ include the quality-of-fit criteria, the sparsity critera, and the L-curve (Engan, 2000;Engan et al., 2000;Rao et al., 2002). The quality-of-fit criterion attempts to minimize the residual error y ? Ax (Rao, 1997), which can be shown to converge to a sparse solution (Rao & Kreutz-Delgado, 1999). The sparsity criterion requires that a certain number of elements of each xk be nonzero. The L-curve method adjusts λ to optimize the trade-off between the residual and sparsity of x. The plot of dp(x) versus dq(y ? Ax) has an L shape, the corner of which provides the best trade-off. The corner of the L-curve is the point of maximum curvature and can be found by a one-dimensional maximization of the curvature function (Hansen & O’Leary, 1993). A hybrid approach known as the modified L-curve method combines the L-curve method on a linear scale and the quality-of-fit criterion, which is used to place limits on the range of λ that can be chosen by the L-curve (Engan, 2000). The modified L-curve method was shown to have good performance, but it requires a one-dimensional numerical optimization step for each xk at each iteration, which can be computationally expensive for large vectors.

5This interpretation, which is not elaborated on here, follows from defining a diagonal positive-definite affine scaling transformation matrix W(x) by the relation

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 12

3 Dictionary Learning

3.1 Unknown, Nonrandom Dictionaries The MLE framework treats parameters to be estimated as unknown but deterministic (nonrandom). In this spirit, we take the dictionary, A, to be the set of unknown but deterministic parameters to be estimated from the observation set Y = YN. In particular, given YN, the maximum likelihood estimate ?ML is found from maximizing the likelihood function L(A | YN) = P(YN; A). Under the assumption that the observations are i.i.d., this corresponds to the optimization

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

(3.1)

(3.2)

Defining the sample average of a function f (y) over the sample set YN = (y1, …, yN) by

the optimization 3.1 can be equivalently written as

(3.3)

Note that P(yk; A) is equal to the normalization factor β already encountered, but now with the dependence of β on A and the particular sample, yk, made explicit. The integration in equation 3.2 in general is intractable, and various approximations have been proposed to obtain an approximate maximum likelihood estimate, ?AML (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000). In particular, the following approximation has been proposed (Olshausen & Field, 1996),

(3.4)

where

(3.5)

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 13

for k = 1, …, N, assuming a current estimate, ?, for A. This approximation corresponds to assuming that the source vector xk for which yk = Axk is known and equal to x?k(?). With this approximation, the optimization 3.3 becomes

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

(3.6)

which is an optimization over the sample average ? · ?N of the functional 2.4 encountered earlier. Updating our estimate for the dictionary,

(3.7)

we can iterate the procedure (3.5)–(3.6) until ?AML has converged, hopefully (at least in the limit of large N) to ?ML = ?ML(YN) as the maximum likelihood estimate ?ML(YN) has wellknown desirable asymptotic properties in the limit N → ∞. Performing the optimization in equation 3.6 for the q = 2 i.i.d. gaussian measurement noise case (ν gaussian with known covariance σ2 · I) corresponds to taking

(3.8)

in equation 3.6. In appendix A, it is shown that we can readily obtain the unique batch solution,

(3.9)

(3.10)

Appendix A derives the maximum likelihood estimate of A for the ideal case of known source vectors X = (x1, …, xN),

which is, of course, actually not computable since the actual source vectors are assumed to be hidden. As an alternative to using the explicit solution 3.9, which requires an often prohibitive n × n inversion, we can obtain AAML iteratively by gradient descent on equations 3.6 and 3.8,

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 14

(3.11)

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

for an appropriate choice of the (possibly adaptive) positive step-size parameter μ. The iteration 3.11 can be initialized as ?AML = ?. A general iterative dictionary learning procedure is obtained by nesting the iteration 3.11 entirely within the iteration defined by repeatedly solving equation 3.5 every time a new estimate, ?AML, of the dictionary becomes available. However, performing the optimization required in equation 3.5 is generally nontrivial (Olshausen & Field, 1996;Lewicki & Sejnowski, 2000). Recently, we have shown how the use of the FOCUSS algorithm results in an effective algorithm for performing the optimization required in equation 3.5 for the case when ν is gaussian (Rao & Kreutz-Delgado, 1998a;Engan et al., 1999). This approach solves equation 3.5 using the affine-scaling transformation (AST)-like algorithms recently proposed for the low-noise case (Rao & Kreutz-Delgado, 1997,1999;Kreutz-Delgado & Rao, 1997) and extended by regularization to the nontrivial noise case (Rao & Kreutz-Delgado, 1998a;Engan et al., 1999). As discussed in section 2.5, for the current dictionary estimate, ?, a solution to the optimization problem, 3.5, is provided by the repeated iteration,

(3.12)

k = 1, …, N, with Π(x) defined as in equation 3.18, given below. This is the regularized FOCUSS algorithm (Rao, 1998; Engan et al., 1999), which has an interpretation as an AST-like concave function minimization algorithm. The proposed dictionary learning algorithm alternates between iteration 3.12 and iteration 3.11 (or the direct batch solution given by equation 3.9 if the inversion is tractable). Extensive simulations show the ability of the AST-based algorithm to recover an unknown 20 × 30 dictionary matrix A completely (Engan et al., 1999). 3.2 Unknown, Random Dictionaries We now generalize to the case where the dictionary A and the source vector set X = XN = (x1, …, xN) are jointly random and unknown. We add the requirement that the dictionary is known to obey the constraint

A compact submanifold of ?m×n is necessarily closed and bounded. On the constraint submanifold, the dictionary A has the prior probability density function P(A), which in the sequel we assume has the simple (uniform on ) form,

(3.13)

where

(·) is the indicator function and c is a positive constant chosen to ensure that

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 15

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

The dictionary A and the elements of the set X are also all assumed to be mutually independent,

With the set of i.i.d. noise vectors, (ν1, …, νN) also taken to be jointly random with, and independent of, A and X, the observation set Y = YN = (y1, …, yN) is assumed to be generated by the model 1.3. With these assumptions, we have

(3.14)

using the facts that the observations are conditionally independent and P(yk | A, X) = P(yk | A, xk). The jointly MAP estimates

are found by maximizing a posteriori probability density P(A, X | Y) simultaneously with respect to A ∈ and X. This is equivalent to minimizing the negative logarithm of P(A, X | Y), yielding the optimization problem,

(3.15)

Note that this is a joint minimization of the sample average of the functional 2.4, and as such is a natural generalization of the single (with respect to the set of source vectors) optimization previously encountered in equation 3.6. By finding joint MAP estimates of A and X, we obtain a problem that is much more tractable than the one of finding the single MAP estimate of A (which involves maximizing the marginal posterior density P(A | Y)). The requirement that A ∈ , where is a compact and hence bounded subset of ?m×n, is sufficient for the optimization problem 3.15 to avoid the degenerate solution,6

6||A|| is any suitable matrix norm on A. Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 16

(3.16)

This solution is possible for unbounded A because y = Ax is almost always solvable for x since learned overcomplete A’s are (generically) onto, and for any solution pair (A, x) the pair ( , αx) is also a solution. This fact shows that the inverse problem of finding a solution pair (A, x) is generally ill posed unless A is constrained to be bounded (as we have explicitly done here) or the cost functional is chosen to ensure that bounded A’s are learned (e.g., by adding a term monotonic in the matrix norm ||A|| to the cost function in equation 3.15). A variety of choices for the compact set are available. Obviously, since different choices of correspond to different a priori assumptions on the set of admissible matrices, A, the choice of this set can be expected to affect the performance of the resulting dictionary learning algorithm. We will consider two relatively simple forms for . 3.3 Unit Frobenius–Norm Dictionary Prior For the i.i.d. q = 2 gaussian measurement noise case of equation 3.8, algorithms that provably converge (in the low step-size limit) to a local minimum of equation 3.15 can be readily developed for the very simple choice,

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

(3.17)

where ||A||F denotes the Frobenius norm of the matrix A,

and it is assumed that the prior P(A) is uniformly distributed on as per condition 3.13. As discussed in appendix A, is simply connected, and there exists a path in between any two matrices in . Following the gradient factorization procedure (Kreutz-Delgado & Rao, 1997; Rao & KreutzDelgado, 1999), we factor the gradient of d(x) as

(3.18)

where it is assumed that Π(x)is diagonal and positive definite for all nonzero x. For example, in the case where ,

(3.19)

Factorizations for other diversity measures d(x) are given in Kreutz-Delgado and Rao (1997). We also define β(x) = λα(x). As derived and proved in appendix A, a learning law that provably converges to a minimum of equation 3.15 on the manifold 3.17 is then given by

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 17

(3.20)

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

for k = 1, …, N, where ? is initialized to ||?||F = 1, Ωk are n × n positive definite matrices, and the “error” δ?is

(3.21)

which can be rewritten in the perhaps a more illuminating form (cf. equations 3.9 and 3.10),

(3.22)

A formal convergence proof of equation 3.20 is given in appendix A, where it is also shown that the right-hand side of the second equation in 3.20 corresponds to projecting the error term δ? onto the tangent space of , thereby ensuring that the derivative of ? lies in the tangent space. Convergence of the algorithm to a local optimum of equation 3.15 is formally proved by interpreting the loss functional as a Lyapunov function whose time derivative along the trajectories of the adapted parameters (?, X?) is guaranteed to be negative definite by the choice of parameter time derivatives shown in equation 3.20. As a consequence of the La Salle invariance principle, the loss functional will decrease in value, and the parameters will converge to the largest invariant set for which the time derivative of the loss functional is identically zero (Khalil, 1996). Equation 3.20 is a set of coupled (between ? and the vectors x?k) nonlinear differential equations that correspond to simultaneous, parallel updating of the estimates ? and x?k. This should be compared to the alternated separate (nonparallel) update rules 3.11 and 3.12 used in the AML algorithm described in section 3.1. Note also that (except for the trace term) the right-hand side of the dictionary learning update in equation 3.20 is of the same form as for the AML update law given in equation 3.11 (see also the discretized version of equation 3.20 given in equation 3.28 below). The key difference is the additional trace term in equation 3.20. This difference corresponds to a projection of the update onto the tangent space of the manifold 3.17, thereby ensuring a unit Frobenius norm (and hence boundedness) of the dictionary estimate at all times and avoiding the ill-posedness problem indicated in equation 3.16. It is also of interest to note that choosing Ωk to be the positive-definite matrix,

(3.23)

in equation 3.20, followed by some matrix manipulations (see equation A.18 in appendix A), yields the alternative algorithm,

(3.24)

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 18

with ηk > 0. In any event (regardless of the specific choice of the positive definite matrices Ωk as shown in appendix A), the proposed algorithm out-lined here converges to a solution (x?k, ?), which satisfies the implicit and nonlinear relationships,

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

(3.25)

for scalar c = tr(?Tδ?). To implement the algorithm 3.20 (or the variant using equation 3.24) in discrete time, a firstorder forward difference approximation at time t = tl can be used,

(3.26)

Applied to equation 3.24, this yields

(3.27)

Similarly, discretizing the ?-update equation yields the A-learning rule, equation 3.28, given below. More generally, taking μl to have a value between zero and one, 0 ≤ μl ≤ 1 yields an updated value x?k[l + 1], which is a linear interpolation between the previous value x?k[l] and . When implemented in discrete time, and setting μl = 1, the resulting Bayesian learning algorithm has the form of a combined iteration where we loop over the operations,

(3.28)

We call this FOCUSS-based, Frobenius-normalized dictionary-learning algorithm the FOCUSS-FDL algorithm. Again, this merged procedure should be compared to the separate iterations involved in the maximum likelihood approach given in equations 3.11 and 3.12. Equation 3.28, with δ? given by equation 3.21, corresponds to performing a finite step-size gradient descent on the manifold . This projection in equation 3.28 of the dictionary update onto the tangent plane of (see the discussion in appendix A) ensures the well-behavedness of the MAP algorithm.7 The specific step-size choice μl = 1, which results in the first equation in equation 3.28, is discussed at length for the low-noise case in Rao and Kreutz-Delgado (1999).

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 19

3.4 Column-Normalized Dictionary Prior Although mathematically very tractable, the unit-Frobenius norm prior, equation 3.17, appears to be somewhat too loose, judging from simulation results given below. In simulations with the Frobenius norm constraint , some columns of A can tend toward zero, a phenomenon that occurs more often in highly overcomplete A. This problem can be understood by remembering that we are using the dp(x), p ≠ 0 diversity measure, which penalizes columns associated with terms in x with large magnitudes. If a column ai has a small relative magnitude, the weight of its coefficient x[i] can be large, and it will be penalized more than a column with a larger norm. This leads to certain columns being underused, which is especially problematic in the overcomplete case. An alternative, and more restrictive, form of the constraint set is obtained by enforcing the requirement that the columns ai of A each be normalized (with respect to the Euclidean 2-norm) to the same constant value (Murray & Kreutz-Delgado, 2001). This constraint can be justified by noting that Ax can be written as the nonunique weighted sum of the columns ai,

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

showing that there is a column-wise ambiguity that remains even after the overall unitFrobenius norm normalization has occurred, as one can now Frobenius-normalize the new matrix A′. Therefore, consider the set of matrices on which has been imposed the column-wise constraint that

(3.29)

The set is an mn ? n = n(m?1)–dimensional submanifold of ?m×n. Note that every column of a matrix in has been normalized to the value . In fact, any constant value for the column normalization can be used (including the unit normalization), but, as shown in appendix B, the particular normalization of results in being a proper submanifold of the mn ? 1 dimensional unit Frobenius manifold ,

7Because of the discrete-time approximation in equation 3.28, and even more generally because of numerical round-off effects in equation 3.20, a renormalization,

is usually performed at regular intervals. Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 20

indicating that a tighter constraint on the matrix A is being imposed. Again, it is assumed that the prior P(A) is uniformly distributed on in the manner of equation 3.13. As shown in appendix B, is simply connected. A learning algorithm is derived for the constraint in appendix B, following much the same approach as in appendix A. Because the derivation of the x?k update to find sparse solutions does not depend on the form of the constraint , only the ? update in algorithm 3.28 needs to be modified. Each column ai is now updated independently (see equation B.17),

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

(3.30)

where δai is the ith column of δ? in equation 3.21. We call the resulting column-normalized dictionary-learning algorithm the FOCUSS-CNDL algorithm. The implementation details of the FOCUSS-CNDL algorithm are presented in section 4.2.

4 Algorithm Implementation

The dictionary learning algorithms derived above are an extension of the FOCUSS algorithm used for obtaining sparse solutions to the linear inverse problem y = Ax to the case where dictionary learning is now required. We refer to these algorithms generally as FOCUSS-DL algorithms, with the unit Frobenius-norm prior-based algorithm denoted by FOCUSS-FDL and the column-normalized prior-base algorithm by FOCUSS-CNDL. In this section, the algorithms are stated in the forms implemented in the experimental tests, where it is shown that the column normalization-based algorithm achieves higher performance in the overcomplete dictionary case. 4.1 Unit Frobenius-Norm Dictionary Learning Algorithm We now summarize the FOCUSS-FDL algorithm derived in section 3.3. For each of the data vectors yk, we update the sparse source vectors xk using the FOCUSS algorithm:

(4.1)

where λk = β(x?k) is the regularization parameter. After updating the N source vectors xk, k = 1, …, n, the dictionary ? is reestimated,

(4.2)

where γ controls the learning rate. For the experiments in section 5, the data block size is N = 100. During each iteration all training vectors are updated using equation 4.1, with a corresponding number of dictionary updates using equation 4.2. After each update of the dictionary ?, it is renormalized to have unit Frobenius norm, ||?||F = 1.

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 21

The learning algorithm is a combined iteration, meaning that the FOCUSS algorithm is allowed to run for only one iteration (not until full convergence) before the A update step. This means that during early iterations, the x?k are in general not sparse. To facilitate learning A, the covariances Σyx? and Σx?x? are calculated with sparsified x?k that have all but the r? largest elements set to zero. The value of r? is usually set to the largest desired number of nonzero elements, but this choice does not appear to be critical. The regularization parameter λk is taken to be a monotonically increasing function of the iteration number,

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

(4.3)

While this choice of λk does not have the optimality properties of the modified L-curve method (see section 2.5), it does not require a one-dimensional optimization for each x?k and so is much less computationally expensive. This is further discussed below. 4.2 Column Normalized Dictionary Learning Algorithm The improved version of the algorithm called FOCUSS-CNDL, which provides increased accuracy especially in the overcomplete case, was proposed in Murray and Kreutz-Delgado (2001). The three key improvements are column normalization that restricts the learned ?, an efficient way of adjusting the regularization parameter λk, and reinitialization to escape from local optima. The column-normalized learning algorithm discussed in section 3.4 and derived in appendix B is used. Because the x?k update does not depend on the constraint set , the FOCUSS algorithm in equation 4.1 is used to update N vectors, as discussed in section 4.1. After every N source vectors are updated, each column of the dictionary is then updated as

(4.4)

where δai are the columns of δ?, which is found using equation 4.2. After updating each ai, it is renormalized to ||ai||2 = 1/n by

(4.5)

which also ensures that ||?||F = 1 as shown in section B.1. The regularization parameter λk may be set independently for each vector in the training set, and a number of methods have been suggested, including quality of fit (which requires a certain level of reconstruction accuracy), sparsity (requiring a certain number of nonzero elements), and the L-curve which attempts to find an optimal trade-off (Engan, 2000). The L-curve method works well, but it requires solving a one-dimensional optimization for each λk, which becomes computationally expensive for large problems. Alternatively, we use a heuristic method that allows the trade-off between error and sparsity to be tuned for each application, while letting each training vector yk have its own regularization parameter λk to improve the quality of the solution,

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 22

(4.6)

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

For data vectors that are represented accurately, λk will be large, driving the algorithm to find sparser solutions. If the SNR can be estimated, we can set λmax = (SNR)?1. The optimization problem, 3.15, is concave when p ≤ 1, so there will be multiple local minima. The FOCUSS algorithm is guaranteed to converge only to one of these local minima, but in some cases, it is possible to determine when that has happened by noticing if the sparsity is too low. Periodically (after a large number of iterations), the sparsity of the solutions x?k is checked, and if found too low, x?k is reinitialized randomly. The algorithm is also sensitive to initial conditions, and prior information may be incorporated into the initialization to help convergence to the global solution.

5 Experimental Results

Experiments were performed using complete dictionaries (n = m) and over-complete dictionaries (n > m) on both synthetically generated data and natural images. Performance was measured in a number of ways. With synthetic data, performance measures include the SNR of the recovered sources xk compared to the true generating source and comparing the learned dictionary with the true dictionary. For images of natural scenes, the true underlying sources are not known, so the accuracy and efficiency of the image coding are found. 5.1 Complete Dictionaries: Comparison with ICA To test the FOCUSS-FDL and FOCUSS-CNDL algorithms, simulated data were created following the method of Engan et al. (1999) and Engan (2000). The dictionary A of size 20×20 was created by drawing each element aij from a normal distribution with μ = 0, σ2 = 1 (written as (0, 1)) followed by a normalization to ensure that ||A||F = 1. Sparse source vectors xk, k = 1, …, 1000 were created with r = 4 nonzero elements, where the r nonzero locations are selected at random (uniformly) from the 20 possible locations. The magnitudes of each nonzero element were also drawn from (0, 1) and limited so that |xkl| > 0.1. The input data yk were generated using y = Ax (no noise was added). For the first iteration of the algorithm, the columns of the initialization estimate, ?init, were taken to be the first n = 20 training vectors yk. The initial xk estimates were then set to the pseudoinverse solution . The constant parameters of the algorithm were set as follows: p = 1.0, γ = 1.0, and λmax = 2 × 10?3 (low noise, assumed SNR ≈ 27 dB). The algorithms were run for 200 iterations through the entire data set, and during each iteration, ? was updated after updating 100 data vectors x?k. To measure performance, the SNR between the recovered sources x?k and the the true sources xk were calculated. Each element xk[i] for fixed i was considered as a time-series vector with 1000 elements, and SNRi for each was found using

(5.1)

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 23

The final SNR is found by averaging SNRi over the i = 1, …, 20 vectors and 20 trials of the algorithms. Because the dictionary A is learned only to within a scaling factor and column permutations, the learned sources must be matched with corresponding true sources and scaled to unit norm before the SNR calculation is done.

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

The FOCUSS-FDL and FOCUSS-CNDL algorithms were compared with extended ICA (Lee, Girolami, & Sejnowski, 1999) and Fastica8 (Hyv?rinen et al., 1999). Figure 1 shows the SNR for the tested algorithms. The SNR for FOCUSS-FDL is 27.7 dB, which is a 4.7 dB improvement over extended ICA, and for FOCUSS-CNDL, the SNR is 28.3 dB. The average run time for FOCUSS-FDL/CNDL was 4.3 minutes, for FastICA 0.10 minute, and for extended ICA 0.19 minute on a 1.0 GHz Pentium III Xeon computer. 5.2 Overcomplete Dictionaries To test the ability to recover the true A and xk solutions in the overcomplete case, dictionaries of size 20 × 30 and 64 × 128 were generated. Diversity r was set to fixed values (4 and 7) and uniformly randomly (5–10 and 10–15). The elements of A and the sources xk were created as in section 5.1. The parameters were set as follows: p = 1.0, γ = 1.0, λmax = 2 × 10?3. The algorithms were run for 500 iterations through the entire data set, and during each iteration, ? was updated after updating 100 data vectors x?k. As a measure of performance, we find the number of columns of A that were matched during learning. Because A can be learned only to within column permutations and sign and scale changes, the columns are normalized so that ||?i|| = ||aj|| = 1, and ? is rearranged columnwise so that ?j is given the index of the closest match in A (in the minimum two-norm sense). A match is counted if

(5.2)

Similarly, the number of matching x?k are counted (after rearranging the elements in accordance with the indices of the rearranged ?),

(5.3)

If the data are generated by an A that is not column normalized, other measures of performance need to be used to compare xk and x?k. The performance is summarized in Table 1, which compares the FOCUSS-FDL with the column-normalized algorithm (FOCUSS-CNDL). For the 20 × 30 dictionary, 1000 training vectors were used, and for the 64×128 dictionary 10,000 were used. Results are averaged over four or more trials. For the 64 × 128 matrix and r = 10–15, FOCUSS-CNDL is able to recover 99.5% (127.4/128) of the columns of A and 94.6% (9463/10,000) of the solutions xk to within the tolerance given above. This shows a clear improvement over FOCUSS-FDL, which learns only 80.3% of the A columns and 40.1% of the solutions xk.

8Matlab and C versions of extended ICA can be found on-line at http://www.cnl.salk.edu/~tewon/ICA/code.html. Matlab code for FastICA can be found at: http://www.cis.hut.fi/projects/ica/fastica/. Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 24

Learning curves for one of the trials of this experiment (see Figure 2) show that most of the columns of A are learned quickly within the first 100 iterations, and that the diversity of the solutions drops to the desired level. Figure 2b shows that it takes somewhat longer to learn the xk correctly and that reinitialization of the low sparsity solutions (at iterations 175 and 350) helps to learn additional solutions. Figure 2c shows the diversity at each iteration, measured as the average number of elements of each x?k that are larger than 1 × 10?4. 5.3 Image Data Experiments Previous work has shown that learned basis functions can be used to code data more efficiently than traditional Fourier or wavelet bases (Lewicki & Olshausen, 1999). The algorithm for finding overcomplete bases in Lewicki and Olshausen (1999) is also designed to solve problem 1.1 but differs from our method in a number of ways, including using only the Laplacian prior (p = 1), and using conjugate gradient optimization for finding sparse solutions (whereas we use the FOCUSS algorithm). It is widely believed that overcomplete representations are more efficient than complete bases, but in Lewicki and Olshausen (1999), the overcomplete code was less efficient for image data (measured in bits per pixel entropy), and it was suggested that different priors could be used to improve the efficiency. Here, we show that our algorithm is able to learn more efficient overcomplete codes for priors with p < 1. The training data consisted of 10,000 8 × 8 image patches drawn at random from black and white images of natural scenes. The parameter p was varied from 0.5–1.0, and the FOCUSSCNDL algorithm was trained for 150 iterations. The complete dictionary (64 × 64) was compared with the 2× overcomplete dictionary (64 × 128). Other parameters were set: γ = 0.01, λmax = 2 × 10?3. The coding efficiency was measured using the entropy (bits per pixel) method described in Lewicki and Olshausen (1999). Figure 3 plots the entropy versus reconstruction error (root mean square error, RMSE) and shows that when p < 0.9, the entropy is less for the overcomplete representation at the same RMSE. An example of coding an entire image is shown in Figure 4. The original test image (see Figure 4a) of size 256 × 256 was encoded using the learned dictionaries. Patches from the test image were not used during training. Table 2 gives results for low- and high-compression cases. In both cases, coding with the overcomplete dictionary (64×128) gives higher compression (lower bits per pixel) and lower error (RMSE). For the high-compression case (see Figures 4b and 4c), the 64 × 128 overcomplete dictionary gives compression of 0.777 bits per pixel at error 0.328, compared to the 64 × 64 complete dictionary at 0.826 bits per pixel at error 0.329. The amount of compression was selected by adjusting λmax (the upper limit of the regularization parameter). For high compression, λmax = 0.02, and for low compression, λmax = 0.002.

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

6 Discussion and Conclusions

We have applied a variety of tools and perspectives (including ideas drawn from Bayesian estimation theory, nonlinear regularized optimization, and the theory of majorization and convex analysis) to the problem of developing algorithms capable of simultaneously learning overcomplete dictionaries and solving sparse source-vector inverse problems. The test experiment described in section 5.2 is a difficult problem designed to determine if the proposed learning algorithm can solve for the known true solutions for A and the spare source vectors xk to the underdetermined inverse problem yk = Axk. Such testing, which does not appear to be regularly done in the literature, shows how well an algorithm can extract stable and categorically meaningful solutions from synthetic data. The ability to perform well on such test inverse problems would appear to be at least a necessary condition for an algorithm to be trustworthy in domains where a physically or biologically meaningful sparse solution is sought, such as occurs in biomedical imaging, geophysical seismic sounding, and multitarget tracking.

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 25

The experimental results presented in section 5.2 show that the FOCUSS-DL algorithms can recover the dictionary and the sparse sources vectors. This is particularly gratifying when one considers that little or no optimization of the algorithm parameters has been done. Furthermore, the convergence proofs given in the appendixes show convergence only to local optima, whereas one expects that there will be many local optima in the cost function because of the concave prior and the generally multimodal form of the cost function. One should note that algorithm 3.28 was constructed precisely with the goal of solving inverse problems of the type considered here, and therefore one must be careful when comparing the results given here with other algorithms reported in the literature. For instance, the mixtureof-gaussians prior used in Attias (1999) does not necessarily enforce sparsity. While other algorithms in the literature might perform well on this test experiment, to the best of our knowledge, possible comparably performing algorithms such as Attias (1999),Girolami (2001),Hyv?rinen et al. (1999), and Lewicki and Olshausen (1999) have not been tested on large, overcomplete matrices to determine their accuracy in recovering A, and so any comparison along these lines would be premature. In section 5.1, the FOCUSS-DL algorithms were compared to the well-known extended ICA and FastICA algorithms in a more conventional test with complete dictionaries. Performance was measured in terms of the accuracy (SNR) of the recovered sources xk, and both FOCUSS-DL algorithms were found to have significantly better performance (albeit with longer run times). We have also shown that the FOCUSS-CNDL algorithm can learn an over-complete representation, which can encode natural images more efficiently than complete bases learned from data (which in turn are more efficient than standard nonadaptive bases, such as Fourier or wavelet bases; Lewicki & Olshausen, 1999). Studies of the human visual cortex have shown a higher degree of overrepresentation of the fovea compared to the other mammals, which suggests an interesting connection between overcomplete representations and visual acuity and recognition abilities (Popovic & Sj?strand, 2001). Because the coupled dictionary learning and sparse-inverse solving algorithms are merged and run in parallel, it should be possible to run the algorithms in real time to track dictionary evolution in quasistationary environments once the algorithm has essentially converged. One way to do this would be to constantly present randomly encountered new signals, yk, to the algorithm at each iteration instead of the original training set. One also has to ensure that the dictionary learning algorithm is sensitive to the new data so that dictionary tracking can occur. This would be done by an appropriate adaptive filtering of the current dictionary estimate driven by the new-data derived corrections, similar to techniques used in the adaptive filtering literature (Kalouptsidis & Theodoridis, 1993).

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

Acknowledgments

This research was partially supported by NSF grant CCR-9902961. K. K.-D. also acknowledges the support provided by the Computational Neurobiology Laboratory, Salk Institute, during a 1998–1999 sabbatical visit.

References

Adler, J.; Rao, B.; Kreutz-Delgado, K. Comparison of basis selection methods. Proceedings of the 30th Asilomar Conference on Signals, Systems, and Computers; Pacific Grove, CA: IEEE; 1996. Attias H. Independent factor analysis. Neural Computation 1999;11(4):803–851. [PubMed: 10226184] Basilevsky, A. Statistical factor analysis and related methods: Theory and applications. New York: Wiley; 1994. Bell A, Sejnowski T. An information-maximization approach to blind separation and blind deconvolution. Neural Computation 1995;7:1129–1159. [PubMed: 7584893]

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 26

Coifman R, Wickerhauser M. Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory 1992;IT-38(2):713–718. Comon P. Independent component analysis: A new concept? Signal Processing 1994;36:287–314. Deco, G.; Obradovic, D. An information-theoretic approach to neural computing. Berlin: SpringerVerlag; 1996. Dhrymes, P. Mathematics for econometrics. 2. New York: Springer-Verlag; 1984. Dirac, PAM. The principles of quantum mechanics. Oxford: Clarendon Press; 1958. Donoho, D. On minimum entropy segmentation. In: Chui, C.; Montefusco, L.; Puccio, L., editors. Wavelets: Theory, algorithms, and applications. San Diego, CA: Academic Press; 1994. p. 233-269. Donoho D. De-noising by soft-thresholding. IEEE Trans on Information Theory 1995;41:613–627. Engan, K. Unpublished doctoral dissertation. Stavanger Universty College; Norway: 2000. Frame based signal representation and compression. Engan, K.; Rao, B.; Kreutz-Delgado, K. Frame design using FOCUSS with method of optimal directions (MOD). Proc. NORSIG-99; Tronheim, Norway: IEEE; 1999. Engan, K.; Rao, BD.; Kreutz-Delgado, K. Regularized FOCUSS for subset selection in noise. Proc. NORSIG 2000; Sweden. June, 2000; Kolmarden, Sweden: IEEE; 2000. p. 247-250. Field D. What is the goal of sensory coding. Neural Computation 1994;6:559–599. Girolami M. A variational method for learning sparse and overcomplete representations. Neural Computation 2001;13:2517–2532. [PubMed: 11674849] Gorodnitsky I, George J, Rao B. Neuromagnetic source imaging with FOCUSS: A recursive weighted minimum norm algorithm. Journal of Electroencephalography and Clinical Neurophysiology 1995;95(4):231–251. Gorodnitsky I, Rao B. Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm. IEEE Trans Signal Processing 1997;45(3):600–616. Hansen PC, O’Leary DP. The use of the L-curve in the regularization of discrete ill-posed problems. SIAM J Sct Comput 1993;14:1487–1503. Hyv?rinen, A.; Cristescu, R.; Oja, E. A fast algorithm for estimating overcomplete ICA bases for image windows. Proc. Int. Joint Conf. on Neural Networks (IJCNN’99); Washington, DC: 1999. p. 894-899. Kalouptsidis, T.; Theodoridis, S. Adaptive system identification and signal processing algorithms. Upper Saddle River, NJ: Prentice Hall; 1993. Kassam, S. Signal detection in non-gaussian noise. New York: Springer-Verlag; 1982. Khalil, H. Nonlinear systems. 2. Upper Saddle River, NJ: Prentice Hall; 1996. Kreutz-Delgado, K.; Rao, B. A general approach to sparse basis selection: Majorization, concavity, and affine Scaling (Full Report with Proofs) (Center for Information Engineering Report No. UCSDCIE-97-7-1). San Diego: University of California; 1997. Available on-line at: http://dsp.ucsd.edu/~kreutz Kreutz-Delgado, K.; Rao, B. A new algorithm and entropy-like measures for sparse coding. Proc. 5th Joint Symp. Neural Computation; La Jolla, CA. 1998a. Kreutz-Delgado, K.; Rao, B. Gradient factorization-based algorithm for best-basis selection. Proceedings of the 8th IEEE Digital Signal Processing Workshop; New York: IEEE; 1998b. Kreutz-Delgado, K.; Rao, B. Measures and algorithms for best basis selection. Proceedings of the 1998 ICASSP; New York: IEEE; 1998c. Kreutz-Delgado, K.; Rao, B. Sparse basis selection, ICA, and majorization: Towards a unified perspective. Proc. 1999 ICASSP; Phoenix, AZ. 1999. Kreutz-Delgado, K.; Rao, B.; Engan, K. Novel algorithms for learning overcomplete Dictionaries. Proc. 1999 Asilomar Conference; Pacific Grove, CA: IEEE; 1999. Kreutz-Delgado, K.; Rao, B.; Engan, K.; Lee, T-W.; Sejnowski, T. Convex/Schur–convex (CSC) logpriors and sparse coding. Proc. 6th Joint Symposium on Neural Computation; Pasadena, CA. 1999a. Kreutz-Delgado, K.; Rao, B.; Engan, K.; Lee, T-W.; Sejnowski, T. Learning overcomplete dictionaries: Convex/Schur-convex (CSC) log-priors, factorial codes, and independent/dependent component analysis (I/DCA). Proc. 6th Joint Symposium on Neural Computation; Pasadena, CA. 1999b.

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 27

Lee TW, Girolami M, Sejnowski T. Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation 1999;11(2):417– 441. [PubMed: 9950738] Lewicki MS, Olshausen BA. A probabilistic framework for the adaptation and comparison of image codes. J Opt Soc Am A 1999;16(7):1587–1601. Lewicki MS, Sejnowski TJ. Learning overcomplete representations. Neural Computation 2000;12(2): 337–365. [PubMed: 10636946] Mallat S, Zhang Z. Matching pursuits with time-frequency dictionaries. IEEE Trans ASSP 1993;41(12): 3397–3415. Marshall, A.; Olkin, I. Inequalities: Theory of majorization and its applications. San Diego, CA: Academic Press; 1979. Murray, JF.; Kreutz-Delgado, K. An improved FOCUSS-based learning algorithm for solving sparse linear inverse problems. Conference Record of the 35th Asilomar Conference on Signals, Systems and Computers; Pacific Grove, CA: IEEE; 2001. Nadal JP, Parga N. Nonlinear neurons in the low noise limit: A factorial code maximizes information transfer. Network 1994;5:565–581. Olshausen B, Field D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 1996;381:607–609. [PubMed: 8637596] Olshausen B, Field D. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research 1997;37:3311–3325. [PubMed: 9425546] Pham D. Blind separation of instantaneous mixture of sources via an independent component analysis. Signal Processing 1996;44(11):2768–2779. Popovic Z, Sj?strand J. Resolution, separation of retinal ganglion cells, and cortical magnification in humans. Vision Research 2001;41:1313–1319. [PubMed: 11322976] Rao, B. Analysis and extensions of the FOCUSS algorithm. Proceedings of the 1996 Asilomar Conference on Circuits, Systems, and Computers; New York: IEEE; 1997. p. 1218-1223. Rao, B. Signal processing with the sparseness constraint. Proc. 1998 ICASSP; New York: IEEE; 1998. Rao, BD.; Engan, K.; Cotter, SF.; Kreutz-Delgado, K. Subset selection in noise based on diversity measure minimization. 2002. Manuscript submitted for publication Rao, B.; Gorodnitsky, I. Affine scaling transformation based methods for computing low complexity sparse solutions. Proc. 1996 ICASSP; New York: IEEE; 1997. p. 1783-1786. Rao, B.; Kreutz-Delgado, K. In: Neyman, J., editor. Deriving algorithms for computing sparse solutions to linear inverse problems; Proceedings of the 1997 Asilomar Conference on Circuits, Systems, and Computers; New York: IEEE; 1997. Rao, B.; Kreutz-Delgado, K. Basis selection in the presence of noise. Proceedings of the 1998 Asilomar Conference; New York: IEEE; 1998a. Rao, B.; Kreutz-Delgado, K. Sparse solutions to linear inverse problems with multiple measurement vectors. Proceedings of the 8th IEEE Digital Signal Processing Workshop; New York: IEEE; 1998b. Rao B, Kreutz-Delgado K. An affine scaling methodology for best basis selection. IEEE Trans Signal Processing 1999;1:187–202. Roberts S. Independent component analysis: Source assessment and separation, a Bayesian approach. IEE (Brit) Proc Vis Image Signal Processing 1998;145(3):149–154. Rockafellar, R. Convex analysis. Princeton, NJ: Princeton University Press; 1970. Ruderman D. The statistics of natural images. Network: Computation in Neural Systems 1994;5:517– 548. Wang K, Lee C, Juang B. Selective feature extraction via signal decomposition. IEEE Signal Processing Letters 1997;4(1):8–11. Watanabe S. Pattern recognition as a quest for minimum entropy. Pattern Recognition 1981;13(5):381– 387. Wickerhauser, M. Adapted wavelet analysis from theory to software. Wellesley, MA: A. K. Peters; 1994. Zhu S, Wu Y, Mumford D. Minimax entropy principle and its application to texture modeling. Neural Computation 1997;9:1627–1660.

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 28

Zibulevsky M, Pearlmutter BA. Blind source separation by sparse decomposition in a signal dictionary. Neural Computation 2001;13(4):863–882. [PubMed: 11255573]

Appendix A: The Frobenius-Normalized Prior Learning Algorithm

Here we provide a derivation of the algorithm 3.20–3.21 and prove that it converges to a local minimum of equation 3.15 on the manifold = {A | ||A||F = 1} ? ?m×n defined in equation 3.17. Although we focus on the development of the learning algorithm on , the derivations in sections A.2 and A.3, and the beginning of subsection are done for a general constraint manifold .

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

A.1 Admissible Matrix Derivatives

A.1.1 The Constraint Manifold In order to determine the structural form of admissible derivatives, for matrices belonging to ,9 it is useful to view as embedded in the finite-dimensional Hilbert space of matrices, ?m×n, with inner product

The corresponding matrix norm is the Frobenius norm,

We will call this space the Frobenius space and the associated inner product the Frobenius inner product. It is useful to note the isometry,

where A is the mn-vector formed by stacking the columns of A. Henceforth, bold type represents the stacked version of a matrix (e.g., B = vec(B)). The stacked vector A belongs to the standard Hilbert space ?mn, which we shall henceforth refer to as the stacked space. This space has the standard Euclidean inner product and norm,

It is straightforward to show that

In particular, we have

9Equivalently, we want to determine the structure of elements ? of the tangent space, T

, to the smooth manifold

at the point A.

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 29

Thus, the manifold in equation 3.17 corresponds to the (mn?1)–dimensional unit sphere in the stacked space, ?mn (which, with a slight abuse of notation, we will continue to denote by ). It is evident that is simply connected so that a path exists between any two elements of and, in particular, a path exists between any initial value for a dictionary, Ainit ∈ , used to initialize a learning algorithm, and a desired target value, Afinal ∈ .10 A.1.2 Derivatives on : The Tangent Space T

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

Determining the form of admissible derivatives on equation 3.17 is equivalent to determining the form of admissible derivatives on the unit ?mn–sphere. On the unit sphere, we have the well-known fact that

This shows that the general form of ? is ? = ΛQ, where Q is arbitrary and

(A.1)

is the stacked space projection operator onto the tangent space of the unit ?mn–sphere at the point A (note that we used the fact that ||A|| = 1). The projection operator Λ is necessarily idempotent, Λ = Λ2. Λ is also self-adjoint, Λ = Λ*, where the adjoint operator Λ* is defined by the requirement that

showing that Λ is an orthogonal projection operator. In this case, Λ* = ΛT, so that selfadjointness corresponds to Λ being symmetric. One can readily show that an idempotent, selfadjoint operator is nonnegative, which in this case corresponds to the symmetric, idempotent operator Λ being a positive semidefinite matrix. This projection can be easily rewritten in the Frobenius space,

10For example, for 0 ≤ t ≤ 1, take the t–parameterized path,

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 30

(A.2)

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

Of course, this result can be derived directly in the Frobenius space using the fact that

from which it is directly evident that

(A.3)

and therefore ? must be of the form11

(A.4)

One can verify that Λ is idempotent and self-adjoint and is therefore a nonnegative, orthogonal projection operator. It is the orthogonal projection operator from ?m×n onto the tangent space T . In the stacked space (with some additional abuse of notation), we represent the quadratic form for positive semidefinite symmetric matrices W as

Note that this defines a weighted norm if, and only if, W is positive definite, which might not be the case by definition. In particular, when W = Λ, the quadratic form semidefinite. Finally, note from equation A.4 that ?A ∈ , is only positive

(A.5)

A.2 Minimizing the Loss Function over a General Manifold

Consider the Lyapunov function,

11It must be the case that

, using the physicist’s “bra-ket” notation (Dirac, 1958).

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 31

(A.6)

where is some arbitrary but otherwise appropriately defined constraint manifold associated with the prior, equation 3.13. Note that this is precisely the loss function to be minimized in equation 3.15. If we can determine smooth parameter trajectories (i.e., a parameter-vector adaptation rule) (?, ?) such that along these trajectories V?(X, A) ≤ 0, then as a consequence of the La Salle invariance principle (Khalil, 1996), the parameter values will converge to the largest invariant set (of the adaptation rule viewed as a nonlinear dynamical system) contained in the set

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

(A.7)

The set Γ contains the local minima of VN. With some additional technical assumptions (generally dependent on the choice of adaptation rule), the elements of Γ will contain only local minima of VN. Assuming the i.i.d. q = 2 gaussian measurement noise case of equation 3.8,12 the loss (Lyapunov) function to be minimized is then

(A.8)

which is essentially the loss function to be minimized in equation 3.15.13 Suppose for the moment, as in equations 3.4 through 3.9, that X is assumed to be known, and note that then (ignoring constant terms depending on X and Y) VN can be rewritten as

for Σxx and defined as in equation 3.10. Using standard results from matrix calculus (Dhrymes, 1984), we can show that VN(A) is minimized by the solution 3.9. This is done by setting

12Note that in the appendix, unlike the notation used in equations 3.8 et seq., the “hat” notation has been dropped. Nonetheless, it should be understood that the quantities A and X are unknown parameters to be optimized over in equation 3.15, while the measured signal vectors Y are known. 13The factor is added for notational convenience and does not materially affect the derivation. Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 32

and using the identities (valid for W symmetric)

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

This yields (assuming that Σxx is invertible),

which is equation 3.9, as claimed. For Σxx nonsingular, the solution is unique and globally optimal. This is, of course, a well-known result. Now return to the general case, equation A.8, where both X and A are unknown. For the data indexed by k = 1, …, N, define the quantities

The loss function and its time derivative can be written,

Then, to determine an appropriate adaptation rule, note that

(A.9)

where

(A.10)

and

(A.11)

Enforcing the separate conditions

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 33

(A.12)

(as well as the additional condition that A ∈ ) will be sufficient to ensure that V?N ≤ 0 on In this case, the solution-containing set Γ of equation A.7 is given by

.

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

(A.13)

Note that if A is known and fixed, then T2 ≡ 0, and only the first condition of equation A.12 (which enforces learning of the source vectors, xk) is of concern. Contrawise, if source vectors xk, which ensure that e(xk) = 0 are fixed and known, then T1 ≡ 0, and the second condition of equation A.12 (which enforces learning of the dictionary matrix, A) is at issue.

A.3 Obtaining the xk Solutions with the FOCUSS Algorithm

We now develop the gradient factorization-based derivation of the FOCUSS algorithm, which provides estimates of xk while satisfying the first convergence condition of equation A.12. The constraint manifold is still assumed to be arbitrary. To enforce the condition T1 ≤ 0 and derive the first adaptation rule given in equation 3.20, we note that we can factor ?dk = ?d (xk) as (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999)

with αk = αxk > 0 and Πxk = Πk positive definite and diagonal for all nonzero xk. Then, defining βk = λαk > 0 and selecting an arbitrary set of (adaptable) symmetric positive-definite matrices Ωk, we choose the learning rule

(A.14)

which is the adaptation rule for the state estimates xk = x?k given in the first line of equation 3.20. With this choice, we obtain

(A.15)

as desired. Assuming convergence to the set A.13 (which will be seen to be the case after we show below how to ensure that we also have T2 ≤ 0), we will asymptotically obtain (reintroducing the “hat” notation to now denote converged parameter estimates)

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 34

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

which is equivalent to

(A.16)

at convergence. This is also equivalent to the condition given in the first line of equation 3.25, as shown below. Exploiting the fact that Ωk in equation A.14 are arbitrary (subject to the symmetry and positivedefiniteness constraint), let us make the specific choice shown in equation 3.23,

(A.17)

Also note the (trivial) identity,

which can be recast nontrivially as

(A.18)

With equations A.17 and A.18, the learning rule, A.14, can be recast as

(A.19)

which is the alternative learning algorithm, 3.24. At convergence (when T1 ≡ 0) we have the condition shown in the first line of equation 3.25,

(A.20)

This also follows from the convergence condition A.16 and the identity A.18, showing that the result, A.20, is independent of the specific choice of Ωk > 0. Note from equations A.14 and A. 15 that T1 ≡ 0 also results in ?k ≡ 0 for k = 1, …, N, so that we will have converged to constant values, x?k, which satisfy equation A.20. For the case of known, fixed A, the learning rule derived here will converge to sparse solutions, xk, and when discretized as in section 3.3, yields equation 3.27, which is the known dictionary FOCUSS algorithm (Rao & Kreutz-Delgado, 1998a, 1999).

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 35

Note that the derivation of the sparse source-vector learning algorithm here, which enforces the condition T1 ≤ 0, is entirely independent of any constraints placed on A (such as, for example, the unit Frobenius-norm and column-norm constraints considered in this article) or of the form of the A-learning rule. Thus alternative choices of constraints placed on A, as considered in Murray and Kreutz-Delgado (2001) and described in appendix B, will not change the form of the xk-learning rule derived here. Of course, because the xk learning rule is strongly coupled to the A-learning rule, algorithmic performance and speed of convergence may well be highly sensitive to conditions placed on A and the specific A learning algorithm used.

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

A.4 Learning the Dictionary A

A.4.1 General Results We now turn to the enforcement of the second convergence condition, T2 ≤ 0 and the development of the dictionary adaptation rule shown in equation 3.20. First, as in equation 3.21, we define the error term δA as

(A.21)

using the fact that e(x) = Ax ? y. Then, from equation A.11, we have

(A.22)

So far, these steps are independent of any specific constraints that may be placed on A, other than that the manifold be smooth and compact. With A constrained to lie on a specified smooth, compact manifold, to ensure correct learning behavior, it is sufficient to impose the constraint that ? lies in the tangent space to the manifold and the condition that T2 ≤ 0. A.4.2 Learning on the Unit Frobenius Sphere, To ensure that T2 ≤ 0 and that ? is in the tangent space of the unit sphere in the Frobenius space ?mn, we take

(A.23)

which is the adaptation rule given in equation 3.20. With this choice, and using the positive semidefiniteness of Λ, we have

as required. Note that at convergence, the condition T2 ≡ 0, yields ? ≡ 0, so that we will have converged to constant values for the dictionary elements, and

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 36

(A.24)

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

from equation A.5, where c = tr(?Tδ?). Thus, the steady-state solution is

(A.25)

Note that equations A.20 and A.25 are the steady-state values given earlier in equation 3.25.

Appendix B: Convergence of the Column–Normalized Learning Algorithm

The derivation and proof of convergence of the column-normalized learning algorithm, applicable to learning members of , is accomplished by appropriately modifying key steps of the development given in appendix A. As in appendix A, the standard Euclidean norm and inner product apply to column vectors, while the Frobenius norm and inner product apply to m × n matrix elements of ?m×n.

B.1 Admissible Matrix Derivatives

B.1.1 The Constraint Manifold Let ei = (0 · · · 0 1 0 · · ·0)T ∈ ?n be the canonical unit vector whose components are all zero except for the value 1 in the ith location. Then

Note that

where

(B.1)

Note that only the ith column of Mi is nonzero and is equal to ai = ith column of A. We therefore have that

(B.2)

Also, for i, j = 1, …, n,

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 37

(B.3)

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

where δi,j is the Kronecker delta. Note, in particular, that ||ai|| = ||Mi||. Let A ∈ , where is the set of column-normalized matrices, as defined by equation 3.29. is an mn ? n = n(m?1)–dimensional submanifold of the Frobenius space ?m×n as each of the n columns of A is normalized as

and

using linearity of the trace function and property, B.3. It is evident, therefore, that

for

defined by equation 3.17.

We can readily show that is simply connected, so that a continuous path exists between any matrix A = Ainit ∈ to any other column-normalized matrix A′ = AFinal ∈ . Indeed, let A and A′ be such that

There is obviously a continuous path entirely in

from A ∈

to the intermediate matrix from [ ]

. Similarly, there is a continuous path entirely in

to [ ], and so on to . To summarize, is a simply connected (nm ? n)–dimensional submanifold of the simply connected (nm ? 1)–dimensional manifold and they are both submanifolds of the (nm)–dimensional Frobenius space ?m×n. B.1.2 Derivatives on : The Tangent Space T

For convenience, for i = 1, …, n define

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 38

and

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

Note that equation B.3 yields

(B.4)

For any A ∈

we have for each i = 1, …, n,

or

(B.5)

In fact,

(B.6)

Note from equations B.2 and B.5 that

showing that (see equation A.3) T

?T

, as expected from the fact that

?

.

An important and readily proven fact is that for each i = 1, …, n,

(B.7)

for some matrix Qi and projection operator, Pi, defined by

(B.8)

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 39

From equation B.4, it can be shown that the projection operators commute,

(B.9)

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

and are idempotent,

(B.10)

Indeed, it is straightforward to show from equation B.4 that for all Q,

(B.11)

and

(B.12)

It can also be shown that

showing that Pi is self–adjoint, Define the operator,

.

(B.13)

Note that because of the commutativity of the Pi, the order of multiplication on the right-hand side of equation B.11 is immaterial, and it is easily determined that P is idempotent,

By induction on equation B.11, it is readily shown that

(B.14)

Either from the self-adjointness and idempotency of each Pi, or directly from equation B.14, it can be shown that P itself is self–adjoint, P = P*,

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 40

Thus, P is the orthogonal projection operator from ?m×n onto T As a consequence, we have the key result that

.

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

(B.15)

for some matrix Q ∈ ?m×n. This follows from the fact that given Q, then the right-hand side of equation B.6 is true for ? = PQ. On the other hand, if the right-hand side of equation B.6 is true for ?, we can take Q = ? in equation B.15. Note that for the T –projection operator Λ given by equation A.2, we have that

consistent with the fact that T

?T

.

Let qi, i = 1, …, n be the columns of Q = [q1, …, qn]. The operation Q′ = PjQ, corresponds to

while the operation Q′= PQ, corresponds to

B.2 Learning on the Manifold

The development of equations A.6 through A.22 is independent of the precise nature of the constraint manifold and can be applied here to the specific case of = . To ensure that T2 ≤ 0 in equation A.22 and that ? ∈ T , we can use the learning rule,

(B.16)

for μ > 0 and δA given by equation A.21. With the ith column of δA denoted by δai, this corresponds to the learning rule,

(B.17)

With rule B.16, we have

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 41

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

where we have explicitly shown that the idempotency and self-adjointness of P correspond to its being a nonnegative operator. Thus, we will have convergence to the largest invariant set for which ? ≡ 0, which from equation B.16 is equivalent to the set for which

(B.18)

This, in turn, is equivalent to

with

An equivalent statement to equation B.19 is

Defining the diagonal matrix

and recalling the definitions A.21 and B.1, we obtain from equation B.19 that

which can be solved as

(B.20)

This is the general form of the solution found by the

–learning algorithm.

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 42

HHMI Author Manuscript

Figure 1.

Comparison between FOCUSS-FDL, FOCUSS-CNDL, extended ICA, and FastICA on synthetically generated data with a complete dictionary A, size 20 × 20. The SNR was computed between the recovered sources x?k and the true sources xk. The mean SNR and standard deviation were computed over 20 trials.

HHMI Author Manuscript HHMI Author Manuscript

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Kreutz-Delgado et al.

Page 43

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Figure 2.

Performance of the FOCUSS-CNDL algorithm with overcomplete dictionary A, size 64 × 128. (a) Number of correctly learned columns of A at each iteration. (b) Number of sources xk learned. (c) Average diversity (n-sparsity) of the xk. The spikes in b and c indicate where some solutions x?k were reinitialized because they were not sparse enough.

Kreutz-Delgado et al.

Page 44

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Figure 3.

Comparing the coding efficiency of complete and 2× overcomplete representations on 8×8 pixel patches drawn from natural images. The points on the curve are the results from different values of p, at the bottom right, p = 1.0, and at the top left, p = 0.5. For smaller p, the overcomplete case is more efficient at the same level of reconstruction error (RMSE).

Kreutz-Delgado et al.

Page 45

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

Figure 4.

Image compression using complete and overcomplete dictionaries. Coding with an overcomplete dictionary is more efficient (fewer bits per pixel) and more accurate (lower RMSE). (a) Original image of size 256 × 256 pixels. (b) Compressed with 64×64 complete dictionary to 0.826 bits per pixel at RMSE = 0.329. (c) Compressed with 64 × 128 overcomplete dictionary to 0.777 bits per pixel at RMSE = 0.328.

Neural Comput. Author manuscript; available in PMC 2010 September 23.

Table 1

Synthetic Data Results.

Learned A Columns Size of A 20 × 30 20 × 30 64 × 128 64 × 128 64 × 128 64 × 128 10–15 127.4 1.3 99.5 9463.4 330.3 94.6 10–15 102.8 4.5 80.3 4009.6 499.6 40.1 5–10 126.3 1.3 98.6 9505.5 263.8 95.1 7 125.3 2.1 97.9 9414.0 406.5 94.1 7 28.9 1.6 96.2 846.8 97.6 84.7 7 25.3 3.4 84.2 675.9 141.0 67.6 Diversity, r Average SD % Average SD % Learned x

Algorithm

Kreutz-Delgado et al.

FOCUSS-FDL

FOCUSS-CNDL

FOCUSS-CNDL

FOCUSS-CNDL

FOCUSS-FDL

Neural Comput. Author manuscript; available in PMC 2010 September 23.

FOCUSS-CNDL

HHMI Author Manuscript

Page 46

HHMI Author Manuscript

HHMI Author Manuscript

Kreutz-Delgado et al.

Page 47

Table 2

Image Compression Results.

HHMI Author Manuscript HHMI Author Manuscript HHMI Author Manuscript

Dictionary Size 64 × 64 64 × 128 64 × 64 64 × 128

p 0.5 0.6 0.5 0.6

Compression (bits/pixel) 2.450 2.410 0.826 0.777

RMSE 0.148 0.141 0.329 0.328

Average Diversity 17.3 15.4 4.5 4.0

Neural Comput. Author manuscript; available in PMC 2010 September 23.