1. Introduction
In recent years, finite-length analyses for coding problems have been attracting considerable attention [1]. This paper focuses on finite-length analyses for two representative coding problems: One is source coding with side-information for Markov sources, i.e., the Markov–Slepian–Wolf problem on the systemXnwith full side-informationYnat the decoder, where only the decoder observes the side-information and the source and the side-information are a joint Markov chain. The other is channel coding for channels with Markovian conditional additive noise. Although the main purpose of this paper is finite-length analyses, we also present a unified approach we developed to investigate these topics including asymptotic analyses. Since this discussion is spread across a number of subtopics, we explain them separately in the Introduction.
1.1. Two Aspects of Finite-Length Analysis
We explain the motivations of this research by starting with two aspects of finite-length analysis that must be argued when finite-length bounds are proposed. For concreteness, we consider channel coding here even though the problems treated in this paper are not restricted to channel coding. To date, many types of finite-length achievability bounds have been proposed. For example, Verdú and Han derived a finite-length bound by using the information-spectrum approach in order to derive the general formula [2] (see also [3]), which we term as the information-spectrum bound. One of the authors and Nagaoka derived a bound (for the classical-quantum channel) by relating the error probability to binary hypothesis testing [4] (Remark 15) (see also [5]), which we refer to as the hypothesis-testing bound. Polyanskiy et. al. derived the random coding union (RCU) bound and the dependence testing (DT) bound [1] (a bound slightly looser (coefficients are worse) than the DT bound can be derived from the hypothesis-testing bound of [4]). Moreover, Gallager’s bound [6] is known as an efficient bound to derive the exponentially decreasing rate.
Here, we focus on two important aspects of finite-length analysis:
(A1) Computational complexity for the bound and
(A2) Asymptotic tightness for the bound.
Both aspects are required for the bound in finite-length analysis as follows. As the first aspect, we consider the computational complexity for the bound. For the BSC (binary symmetric channel), the computational complexity of the RCU bound isO(n2), and that of the DT bound isO(n) [7]. However, the computational complexities of these bounds are much larger for general DMCs (discrete memoryless channels) or channels with memory. It is known that the hypothesis testing bound can be described as a linear programming problem (e.g., see [8,9] (in the the case of a quantum channel, the bound is described as a semi-definite programming problem)) and can be efficiently computed under certain symmetry. However, the number of variables in the linear programming problem grows exponentially with the block length, and it is difficult to compute in general. The computation of the information-spectrum bound depends on the evaluation of the tail probability. The hypothesis testing bound gives a tighter bound than the information-spectrum bound, as pointed out by [8], and the computational complexity of the former is much smaller than that of the latter. However, the computation of the tail probability continues to remain challenging unless the channel is a DMC. For DMCs, the computational complexity of Gallager’s bound isO(1) since the Gallager function is an additive quantity for DMCs. However, this is not the case if there is a memory (the Gallager bound for finite-state channels was considered in [10] (Section 5.9), but a closed form expression for the exponent was not derived). Consequently, no efficiently computable bound currently exists for channel coding with Markov additive noise. The situation is the same for source coding with side-information.
Since the actual computation time may depend on the computational resource we can use for numerical experiment, it is not possible to provide a concrete requirement of computational complexity. However, in order to conduct a numerical experiment for a meaningful blocklength, it is reasonable to require the computational complexity to be, at most, a polynomial order of the blocklength n.
Next, let us consider the second aspect, i.e., asymptotic tightness. Thus far, three kinds of asymptotic regimes have been studied in information theory [1,11,12,13,14,15,16]:
-
A large deviation regime in which the error probabilityεasymptotically behaves ase−nrfor somer>0;
-
A moderate deviation regime in whichεasymptotically behaves ase−n1−2trfor somer>0andt∈(0,1/2); and
-
A second-order regime in whichεis a constant.
We shall claim that a good finite-length bound should be asymptotically optimal for at least one of the above-mentioned three regimes. In fact, the information-spectrum bound, the hypothesis-testing bound, and the DT bound are asymptotically optimal in both the moderate deviation and second-order regimes, whereas the Gallager bound is asymptotically optimal in the large deviation regime and the RCU bound asymptotically optimal in all the regimes (Both the Gallager and RCU bounds are asymptotically optimal in the large deviation regime only up to the critical rate). Recently, for DMCs, Yang and Meng derived an efficiently computable bound for low-density parity check (LDPC) codes [17], which is asymptotically optimal in both the moderate deviation and second-order regimes.
1.2. Main Contribution for Finite-Length Analysis
We derive the finite-length achievability bounds on the problems by basically using the exponential-type bounds (for channel coding, it corresponds to the Gallager bound.). In source coding with side-information, the exponential-type upper bounds on error probabilityP¯e(Mn)for a given message sizeMnare described by using the conditional Rényi entropies as follows (cf. Lemmas 14 and 15):
P¯e(Mn)≤inf−12≤θ≤0Mnθ1+θ e−θ1+θH1+θ↑(Xn|Yn)
and:
P¯e(Mn)≤inf−1≤θ≤0Mnθ e−θH1+θ↓(Xn|Yn).
Here,Xnis the information to be compressed andYnis the side-information that can be accessed only by the decoder.H1+θ↑(Xn|Yn) is the conditional Rényi entropy introduced by Arimoto [18], which we shall refer to as the upper conditional Rényi entropy (cf. (12)). On the other hand,H1+θ↓(Xn|Yn) is the conditional Rényi entropy introduced in [19], which we shall refer to as the lower conditional Rényi entropy (cf. (7)). Although there are several other definitions of conditional Rényi entropies, we only use these two in this paper; see [20,21] for an extensive review on conditional Rényi entropies.
Although the above-mentioned conditional Rényi entropies are additive for i.i.d. random variables, they are not additive for joint Markov chains overXnandYn , for which the derivation of finite-length bounds for Markov chains are challenging. Because it is generally not easy to evaluate the conditional Rényi entropies for Markov chains, we consider two assumptions in relation to transition matrices: the first assumption, which we refer to as non-hidden, is that the Y-marginal process is a Markov chain, which enables us to derive the single-letter expression of the conditional entropy rate and the lower conditional Rényi entropy rate; the second assumption, which we refer to as strongly non-hidden, enables us to derive the single-letter expression of the upper conditional Rényi entropy rate; see Assumptions 1 and 2 of Section 2 for more detail (Indeed, as explained later, our result on the data compression can be converted to a result on the channel coding for a specific class of channels. Under this conversion, we obtain certain assumptions for channels. As explained later, these assumptions for channels are more meaningful from a practical point of view.). Under Assumption 1, we introduce the lower conditional Rényi entropy for transition matricesH1+θ↓,W(X|Y) (cf. (47)). Then, we evaluate the lower conditional Rényi entropy for the Markov chain in terms of its transition matrix counterpart. More specifically, we derive an approximation:
H1+θ↓(Xn|Yn)=nH1+θ↓,W(X|Y)+O(1),
where an explicit form of theO(1) term is also derived. Using the evaluation (2) with this evaluation, we obtain finite-length bounds under Assumption 1. Under a more restrictive assumption, i.e., Assumption 2, we also introduce the upper conditional Rényi entropy for a transition matrixH1+θ↑,W(X|Y) (cf. (55)). Then, we evaluate the upper Rényi entropy for the Markov chain in terms of its transition matrix counterpart. More specifically, we derive an approximation:
H1+θ↑(Xn|Yn)=nH1+θ↑,W(X|Y)+O(1),
where an explicit form of theO(1) term is also derived. Using the evaluation (1) with this evaluation, we obtain finite-length bounds that are tighter than those obtained under Assumption 1. It should be noted that, without Assumption 1, even the conditional entropy rate is challenging to evaluate. For evaluation of the conditional entropy rate of the X process given the Y process, the assumption of the X process being Markov seems to be not helpful. This is the reason why we consider the Y process being Markov instead of the X process being Markov in this paper.
We also derive converse bounds by using the change of measure argument for Markov chains developed by the authors in the accompanying paper on information geometry [22,23]. For this purpose, we further introduce two-parameter conditional Rényi entropy and its transition matrix counterpart (cf. (18) and (59)). This novel information measure includes the lower conditional Rényi entropy and the upper conditional Rényi entropy as special cases. We clarify the relation among bounds based on these quantities by numerically calculating the upper and lower bounds for the optimal coding rate in source coding with a Markov source in Section 3.7. Owing to the second aspect (A2), this calculation shows that our finite-length bounds are very close to the optimal value. Although this numerical calculation contains a case with a very large sizen=1×105, its calculation is not as difficult because the calculation complexity behaves asO(1). That is, this calculation shows the advantage of the first aspect (A1).
Here, we would like to remark about the terminologies because there are a few ways to express exponential-type bounds. In statistics or large deviation theory, we usually use the cumulant generating function (CGF) to describe exponents. In information theory, we employ the Gallager function or the Rényi entropies. Although these three terminologies are essentially the same quantity and are related by the change of variables, the CGF and the Gallager function are convenient for some calculations because of their desirable properties such as convexity. On the other hand, the minimum entropy and collision entropy are often used as alternative information measures of Shannon entropy in the community of cryptography. Since the Rényi entropies are a generalization of the minimum entropy and collision entropy, we can regard the Rényi entropies as information measures. The information theoretic meaning of the CGF and the Gallager function are less clear. Thus, the Rényi entropies are intuitively familiar to the readers’ of this journal. The Rényi entropies have an additional advantage in that two types of bounds (e.g., (152) and (161)) can be expressed in a unified manner. Therefore, we state our main results in terms of the Rényi entropies, whereas we use the CGF and the Gallager function in the proofs. For the readers’ convenience, the relation between the Rényi entropies and corresponding CGFs are summarized in Appendix A and Appendix B.
1.3. Main Contribution for Channel Coding
An intimate relationship is known to exist between channel coding and source coding with side-information (e.g., [24,25,26]). In particular, for an additive channel, the error probability of channel coding by a linear code can be related to the corresponding source coding problem with side-information [24]. Chen et. al. also showed that the error probability of source coding with side-information by a linear encoder can be related to the error probability of a dual channel coding problem and vice versa [27] (see also [28]). Since these dual channels can be regarded as additive channels conditioned on state information, we refer to these channels as conditional additive channels (In [28], we termed these channels general additive channels, but we think “conditional” more suitably describes the situation.). In this paper, we mainly discuss a conditional additive channel, in which the additive noise is operated subject to a distribution conditioned on additional output information. Then, we convert our obtained results of source coding with side-information to the analysis on conditional additive channels. That is, using the aforementioned duality between channel coding and source coding with side-information enables us to evaluate the error probability of channel coding for additive channels. Then, we derive several finite-length analyses on additive channels.
For the same reason as source coding with side-information, we make two assumptions, Assumptions 1 and 2, on the noise process of a conditional additive channel. In this context, Assumption 1 means that the marginal systemYndeciding the behavior of the additive noiseXn is a Markov chain. It should be noted that the Gilbert–Elliott channel [29,30] with state information available at the receiver can be regarded as a conditional additive channel such that the noise process is a Markov chain satisfying both Assumptions 1 and 2 (see Example 6). Thus, we believe that Assumptions 1 and 2 are quite reasonable assumptions.
In fact, our analysis is applicable for a broader class of channels known as regular channels [31]. The class of regular channels includes conditional additive channels as a special case, and it is known as a class of channels that are similarly symmetrical. To show it, we propose a method to convert a regular channel into a conditional additive channel such that our treatment covers regular channels. Additionally, we show that the BPSK (binary phase shift keying)-AWGN (additive white Gaussian noise) channel is included in conditional additive channels.
1.4. Asymptotic Bounds and Asymptotic Tightness for Finite-Length Bounds
We present asymptotic analyses of the large and moderate deviation regimes by deriving the characterizations (for the large deviation regime, we only derive the characterizations up to the critical rate) with the use of our finite-length achievability and converse bounds, which implies that our finite-length bounds are tight in both of these deviation regimes. We also derive the second-order rate. Although this rate can be derived by the application of the central limit theorem to the information-spectrum bound, the variance involves the limit with respect to the block length because of memory. In this paper, we derive a single-letter form of the variance by using the conditional Rényi entropy for transition matrices (An alternative way to derive a single-letter characterization of the variance for the Markov chain was shown in [32] (Lemma 20). It should also be noted that a single-letter characterization can be derived by using the fundamental matrix [33]. The single-letter characterization of the variance in [12] (Section VII) and [11] (Section III) contains an error, which is corrected in this paper.).
As we will see in Theorems 11–14 and 22–25, our asymptotic results have the same forms as the counterparts of the i.i.d. case (cf. [1,6,11,12,13,14]) when the information measures for distributions in the i.i.d. case are replaced by the information measures for the transition matrices introduced in this paper.
We determine the asymptotic tightness for finite-length bounds by summarizing the relation between the asymptotic results and the finite-length bounds in Table 1. The table also describes the computational complexity of the finite-length bounds. “Solved*” indicates that those problems are solved up to the critical rates. “Ass. 1” and “Ass. 2” indicate that those problems are solved either under Assumption 1 or Assumption 2. “O(1)” indicates that both the achievability and converse parts of those asymptotic results are derived from our finite-length achievability bounds and converse bounds whose computational complexities areO(1). “Tail” indicates that both the achievability and converse parts of those asymptotic results are derived from the information-spectrum-type achievability bounds and converse bounds of which the computational complexities depend on the computational complexities of the tail probabilities.
In general, the exact computations of tail probabilities are difficult, although they may be feasible for a simple case such as an i.i.d. case. One way to compute tail probabilities approximately is to use the Berry–Esséen theorem [34] (Theorem 16.5.1) or its variant [35]. This direction of research is still ongoing [36,37], and an evaluation of the constant was conducted [37], although its tightness has not been clarified. If we can derive a tight Berry–Esséen-type bound for the Markov chain, this would enable us to derive a finite-length bound that is asymptotically tight in the second-order regime. However, the approximation errors of Berry–Esséen-type bounds converge only in the order of1/nand cannot be applied whenεis rather small. Even in cases in which the exact computations of tail probabilities are possible, the information-spectrum-type bounds are looser than the exponential type bounds whenεis rather small, and we need to use appropriate bounds depending on the size ofε . In fact, this observation was explicitly clarified in [38] for random number generation with side-information. Consequently, we believe that our exponential-type finite-length bounds are very useful. It should be also noted that, for source coding with side-information and channel coding for regular channels, even the first-order results have not been revealed as far as the authors know, and they are clarified in this paper (General formulae for those problems were known [2,3], but single-letter expressions for Markov sources or channels were not clarified in the literature. For the source coding without side-information, the single-letter expression for entropy rate of Markov source is well known (e.g., see [39]).).
1.5. Related Work on Markov Chains
Since related work concerning the finite-length analysis is reviewed in Section 1.1, we only review work related to the asymptotic analysis here. Some studies on Markov chains for the large deviation regime have been reported [40,41,42]. The derivation in [40] used the Markov-type method. A drawback of this method is that it involves a term that stems from the number of types, which does not affect the asymptotic analysis, but does hurt the finite-length analysis. Our achievability is derived by following a similar approach as in [41,42], i.e., the Perron–Frobenius theorem, but our derivation separates the single-shot part and the evaluation of the Rényi entropy, and thus is more transparent. Furthermore, the converse part of [41,42] is based on the Shannon–McMillan–Breiman limiting theorem and does not yield finite-length bounds.
For the second-order regime, Polyanskiy et. al. studied the second-order rate (dispersion) of the Gilbert–Elliott channel [43]. Tomamichel and Tan studied the second-order rate of channel coding with state information such that the state information may be a general source and derived a formula for the Markov chain as a special case [32]. Kontoyiannis studied the second-order variable length source coding for the Markov chain [44]. In [45], Kontoyiannis and Verdú derived the second-order rate of lossless source coding under the overflow probability criterion.
For channel coding of the i.i.d. case, Scarlett et al. derived a saddle-point approximation, which unifies all three regimes [46,47].
1.6. Organization of the Paper
In Section 2, we introduce the information measures and their properties that will be used in Section 3 and Section 4. Then, source coding with side-information and channel coding is discussed in Section 3 and Section 4, respectively. As we mentioned above, we state our main result in terms of the Rényi entropies, and we use the CGFs and the Gallager function in the proofs. We explain how to cover the continuous case in Remarks 1 and 5. In Appendix A and Appendix B, the relation between the Rényi entropies and corresponding CGFs are summarized. The relation between the Rényi entropies and the Gallager function are explained as necessary. Proofs of some technical results are also provided in the remaining Appendices.
1.7. Notations
For a setX, the set of all distributions onXis denoted byP(X). The set of all sub-normalized non-negative functions onXis denoted byP¯(X). The cumulative distribution function of the standard Gaussian random variable is denoted by:
Φ(t)=∫−∞t12πexp−x22dx.
Throughout the paper, the base of the logarithm is the natural base e. 2. Information Measures
Since this paper discusses the second-order tightness, we need to discuss the central limit theorem for the Markov process. For this purpose, we usually employ advanced mathematical methods from probability theory. For example, the paper [48] (Theorem 4) showed the Markov version of the central limit theorem by using a martingale stopping technique. Lalley [49] employed the regular perturbation theory of operators on the infinite-dimensional space [50] (Chapter 7, #1, Chapter 4, #3, and Chapter 3, #5). The papers [51,52] and [53] (Lemma 1.5 of Chapter 1) employed the spectral measure, while it is hard to calculate the spectral measure in general even in the finite-state case. Further, the papers [36,51,54,55] showed the central limit theorem by using the asymptotic variance, but they did not give any computable expression of the asymptotic variance without the infinite sum. In summary, to derive the central limit theorem with the variance of a computable form, these papers needed to use very advanced mathematics beyond calculus and linear algebra.
To overcome the difficulty of the Markov version of the central limit theorem, we employed the method used in our recent paper [23]. The paper [23] employed the method based on the cumulant generating function for transition matrices, which is defined by the Perron eigenvalue of a specific non-negative-entry matrix. Since a Perron eigenvalue can be explained in the framework of linear algebra, the method can be described with elementary mathematics. To employ this method, we need to define the information measure in a way similar to the cumulant generating function for transition matrices. That is, we define the information measures for transition matrices, e.g., the conditional Rényi entropy for transition matrices, etc, by using Perron eigenvalues.
Fortunately, these information measures for transition matrices are very useful even for large deviation-type evaluation and finite-length bounds. For example, our recent paper [23] derived finite-length bounds for simple hypothesis testing for the Markov chain by using the cumulant generating function for transition matrices. Therefore, using these information measures for transition matrices, this paper derives finite-length bounds for source coding and channel coding with Markov chains and discusses their asymptotic bounds with large deviation, moderate deviation, and the second-order type.
Since they are natural extensions of information measures for single-shot setting, we first review information measures for the single-shot setting in Section 2.1. Next, we introduce information measures for transition matrices in Section 2.2. Then, we show that information measures for Markov chains can be approximated by information measures for transition matrices generating those Markov chains in Section 2.3.
2.1. Information Measures for the Single-Shot Setting
In this section, we introduce conditional Rényi entropies for the single-shot setting. For more a detailed review of conditional Rényi entropies, see [21]. For a correlated random variable(X,Y)onX×Ywith probability distributionPXYand a marginal distributionQYonY, we introduce the conditional Rényi entropy of order1+θrelative toQYas:
H1+θ(PXY|QY):=−1θlog∑x,yPXY (x,y)1+θ QY (y)−θ,
whereθ∈(−1,0)∪(0,∞). The conditional Rényi entropy of order zero relative toQYis defined by the limit with respect toθ. When X has no side-information, it is nothing but the ordinary Rényi entropy, and it is denoted byH1+θ(X)=H1+θ(PX)throughout the paper.
One of the important special cases ofH1+θ(PXY|QY)is the case withQY=PY, wherePYis the marginal ofPXY. We shall call this special case the lower conditional Rényi entropy of order1+θ and denote ( this notation was first introduced in [56]):
H1+θ↓(X|Y):=H1+θ(PXY|PY)
=−1θlog∑x,yPXY (x,y)1+θ PY (y)−θ.
When we consider the second-order analysis, the variance of the entropy density plays an important role:
V(X|Y):=Varlog1PX|Y(X|Y).
We have the following property, which follows from the correspondence between the conditional Rényi entropy and the cumulant generating function (cf. Appendix B).
Lemma 1.
We have:
limθ→0H1+θ↓(X|Y)=H(X|Y)
and (as seen in the proof (cf. (A26)), the left-hand side of (11) corresponds to the second derivative of the cumulant generating function):
limθ→02H(X|Y)−H1+θ↓(X|Y)θ=V(X|Y).
Proof.
(10) follows from the relation in (A25) and the fact that the first-order derivative of the cumulant generating function is the expectation. (11) follows from (A25), (10) and (A26). □
The other important special case ofH1+θ(PXY|QY)is the measure maximized overQY. We shall call this special case the upper conditional Rényi entropy of order1+θand denote (Equation (13) for−1<θ<0follows from the Hölder inequality, and Equation (13) for0<θ follows from the reverse Hölder inequality [57] (Lemma 8). Similar optimization has appeared in the context of Rényi mutual information in [58] (see also [59]).):
H1+θ↑(X|Y):=maxQY∈P(Y)H1+θ(PXY|QY)
=H1+θ(PXY|PY(1+θ))
=−1+θθlog∑yPY(y)∑xPX|Y (x|y)1+θ11+θ,
where:
PY(1+θ)(y):=∑x PXY (x,y)1+θ11+θ∑y′ ∑x PXY (x,y′)1+θ11+θ.
For this measure, we also have the same properties as Lemma 1. This lemma will be proven in Appendix C.
Lemma 2.
We have:
limθ→0H1+θ↑(X|Y)=H(X|Y)
and:
limθ→02H(X|Y)−H1+θ↑(X|Y)θ=V(X|Y).
When we derive converse bounds, we need to consider the case such that the order of the Rényi entropy is different from the order of conditioning distribution defined in (15). For this purpose, we introduce two-parameter conditional Rényi entropy, which connects the two kinds of conditional Rényi entropiesH1+θ↓(X|Y)andH1+θ↑(X|Y)in the way as Statements 10 and 11 of Lemma 3:
H1+θ,1+θ′(X|Y)
:=H1+θ(PXY|PY(1+θ′))
=−1θlog∑yPY(y)∑xPX|Y (x|y)1+θ∑xPX|Y (x|y)1+θ′−θ1+θ′+θ′1+θ′H1+θ′↑(X|Y).
Next, we investigate some properties of the measures defined above, which will be proven in Appendix D.
Lemma 3.
1.
For fixedQY,θH1+θ(PXY|QY)is a concave function of θ, and it is strict concave iffVarlogQY(Y)PXY(X,Y)>0.
2.
For fixedQY,H1+θ(PXY|QY)is a monotonically decreasing (Technically,H1+θ(PXY|QY)is always non-increasing, and it is monotonically decreasing iff strict concavity holds in Statement 1. Similar remarks are also applied for other information measures throughout the paper.) function of θ.
3.
The functionθH1+θ↓(X|Y)is a concave function of θ, and it is strict concave iffV(X|Y)>0.
4.
H1+θ↓(X|Y)is a monotonically decreasing function of θ.
5.
The functionθH1+θ↑(X|Y)is a concave function of θ, and it is strict concave iffV(X|Y)>0.
6.
H1+θ↑(X|Y)is a monotonically decreasing function of θ.
7.
For everyθ∈(−1,0)∪(0,∞), we haveH1+θ↓(X|Y)≤H1+θ↑(X|Y).
8.
For fixedθ′, the functionθH1+θ,1+θ′(X|Y)is a concave function of θ, and it is strict concave iffV(X|Y)>0.
9.
For fixedθ′,H1+θ,1+θ′(X|Y)is a monotonically decreasing function of θ.
10.
We have:
H1+θ,1(X|Y)=H1+θ↓(X|Y).
11.
We have:
H1+θ,1+θ(X|Y)=H1+θ↑(X|Y).
12.
For everyθ∈(−1,0)∪(0,∞),H1+θ,1+θ′(X|Y)is maximized atθ′=θ.
The following lemma expresses explicit forms of the conditional Rényi entropies of order zero.
Lemma 4.
We have:
limθ→−1H1+θ(PXY|QY)=H0(PXY|QY)
:=log∑yQY(y)|supp(PX|Y(·|y))|,
limθ→−1H1+θ↑(X|Y)=H0↑(X|Y)
:=logmaxy∈supp(PY)|supp(PX|Y(·|y))|,
limθ→−1H1+θ↓(X|Y)=H0↓(X|Y)
:=log∑yPY(y)|supp(PX|Y(·|y))|.
Proof.
See Appendix E. □
The definition (6) guarantees the existence of the derivative ofd[θH1+θ(PXY|QY)]dθ. From Statement 1 of Lemma 3,d[θH1+θ(PXY|QY)]/dθis monotonically decreasing. Thus, the inverse function (Throughout the paper, the notationsθ(a)anda(R)are reused for several inverse functions. Although the meanings of those notations are obvious from the context, we occasionally put superscript Q, ↓ or ↑ to emphasize that those inverse functions are induced from corresponding conditional Rényi entropies. This definition is related to the Legendre transform of the concave functionθ↦θH1+θ↓(X|Y).) ofθ↦d[θH1+θ(PXY|QY)]/dθexists so that the functionθ(a)=θQ(a)is defined as:
d[θH1+θ(PXY|QY)]dθ|θ=θ(a)=a
fora̲<a≤a¯, wherea̲=a̲Q:=limθ→∞d[θH1+θ(PXY|QY)]/dθanda¯=a¯Q:=limθ→−1d[θH1+θ(PXY|QY)]/dθ. Let:
R(a)=RQ(a):=(1+θ(a))a−θ(a)H1+θ(a)(PXY|QY).
Since:
R′(a)=dR′(a)da=dθ(a)daa+1+θ(a)−d(θH1+θ(PXY|QY))dθdθ(a)da=dθ(a)daa+1+θ(a)−adθ(a)da=1+θ(a),
R(a)is a monotonic increasing function ofa̲<a≤R(a¯). Thus, we can define the inverse functiona(R)=aQ(R)ofR(a)by:
(1+θ(a(R)))a(R)−θ(a(R))H1+θ(a(R))(PXY|QY)=R
forR(a̲)<R≤H0(PXY|QY).
ForθH1+θ↓(X|Y), by the same reason as above, we can define the inverse functionsθ(a)=θ↓(a)anda(R)=a↓(R)by:
d[θH1+θ↓(X|Y)]dθ|θ=θ(a)=a
and:
(1+θ(a(R)))a(R)−θ(a(R))H1+θ(a(R))↓(X|Y)=R,
forR(a̲)<R≤H0↓(X|Y). ForθH1+θ↑(X|Y), we also introduce the inverse functionsθ(a)=θ↑(a)anda(R)=a↑(R)by:
dθH1+θ↑(X|Y)dθ|θ=θ(a)=a
and:
(1+θ(a(R)))a(R)−θ(a(R))H1+θ(a(R))↑(X|Y)=R
forR(a̲)<R≤H0↑(X|Y).
Remark 1.
Here, we discuss the possibility for extension to the continuous case. Since the entropy in the continuous case diverges, we cannot extend the information quantities to the case whenXis continuous. However, it is possible to extend these quantities to the case whenYis continuous, butXis a discrete finite set. In this case, we prepare a general measure μ (like the Lebesgue measure) onYand probability density functionpYandqYsuch that the distributionsPYandQYare given aspY(y)μ(dy)andqY(y)μ(dy), respectively. Then, it is sufficient to replace ∑,Q(y), andPXY(x,y)by∫Yμ(dy),PX|Y(x|y)pY(y), andqY(y), respectively. Hence, in the n-independent and identically distributed case, these information measures are given as n times the original information measures.
One might consider the information quantities for transition matrices given in the next subsection for this continuous case. However, this is not so easy because it needs a continuous extension of the Perron eigenvalue.
2.2. Information Measures for the Transition Matrix
Let{W(x,y|x′,y′)}((x,y),(x′,y′))∈(X×Y)2 be an ergodic and irreducible transition matrix. The purpose of this section is to introduce transition matrix counterparts of those measures in Section 2.1. For this purpose, we first need to introduce some assumptions on transition matrices:
Assumption 1
(Non-hidden). We say that a transition matrix W is non-hidden (with respect to Y) if the Y-marginal process is a Markov process, i.e., (The reason for the name “non-hidden” is the following. In general, the Y-marginal process is a hidden Markov process. However, when the condition (37) holds, the Y-marginal process is a Markov process. Hence, we call the condition (37) non-hidden.):
∑xW(x,y|x′,y′)=W(y|y′)
for everyx′∈Xandy,y′∈Y. This condition is equivalent to the existence of the following decomposition ofW(x,y|x′,y′):
W(x,y|x′,y′)=W(y|y′)W(x|x′,y′,y).
Assumption 2
(Strongly non-hidden). We say that a transition matrix W is strongly non-hidden (with respect to Y) if, for everyθ∈(−1,∞)andy,y′∈Y (The reason for the name “strongly non-hidden” is the following. When we compute the upper conditional Rényi entropy rate of the Markov source, the effect of the Y process may propagate infinitely even if it is non-hidden. When (39) holds, the effect of the Y process in the computation of the upper conditional Rényi entropy rate is only one step.):
Wθ(y|y′):=∑xW(x,y|x′,y′)1+θ
is well defined, i.e., the right-hand side of (39) is independent ofx′.
Assumption 1 requires (39) to hold only forθ=0, and thus, Assumption 2 implies Assumption 1. However, Assumption 2 is a strictly stronger condition than Assumption 1. For example, let us consider the case such that the transition matrix is a product form, i.e.,W(x,y|x′,y′)=W(x|x′)W(y|y′). In this case, Assumption 1 is obviously satisfied. However, Assumption 2 is not satisfied in general.
Assumption 2 has another expression as follows.
Lemma 5.
Assumption 2 holds if and only if, for everyx′≠x˜′, there exists a permutationπx′;x˜′onXsuch thatW(x|x′,y′,y)=W(πx′;x˜′(x)|x˜′,y′,y).
Proof.
Since the part “if” is trivial, we show the part “only if” as follows. By noting (38), Assumption 2 can be rephrased as:
∑xW(x|x′,y′,y)1+θ
does not depend onx′for everyθ∈(−1,∞). Furthermore, this condition can be rephrased as follows. Forx′≠x˜′, if the largest values of{W(x|x′,y′)}x∈Xand{W(x|x˜′,y′)}x∈Xare different, say the former is larger, then∑xW(x|x′,y′)1+θ>∑xW(x|x˜′,y′)1+θfor sufficiently largeθ , which contradicts the fact that (40) does not depend onx′. Thus, the largest values of{W(x|x′,y′)}x∈Xand{W(x|x˜′,y′)}x∈Xmust coincide. By repeating this argument for the second largest value of{W(x|x′,y′)}x∈Xand{W(x|x˜′,y′)}x∈X, and so on, we find that Assumption 2 implies that for everyx′≠x˜′, there exists a permutationπx′;x˜′onXsuch thatW(x|x′,y′,y)=W(πx′;x˜′(x)|x˜′,y′,y). □
Now, we fix an elementx0∈Xand transform a sequence of random numbers(X1,Y1,X2,Y2,…,Xn,Yn)to the sequence of random numbers(X1′,Y1′,X2′,Y2′,…,Xn′,Yn′):=(X1,Y1,πx0;X1−1(X2),Y2,…,πx0;X1−1(Xn),Yn). Then, lettingW′(x|y′,y):=W(x|x0,y′,y), we havePXi′,Yi′|Xi−1′,Yi−1′=W′(yi′|yi−1′)W(xi′|yi′,yi−1′). That is, essentially, the transition matrix of this case can be written by the transition matrixW(yi′|yi−1′)W′(xi′|yi′,yi−1′). Therefore, the transition matrix can be written by using the positive-entry matrixWxi′ (yi′|yi−1′):=W(yi′|yi−1′)W′(xi′|yi′,yi−1′).
The following are non-trivial examples satisfying Assumptions 1 and 2.
Example 1.
Suppose thatX=Yis a module (an additive group). Let P and Q be transition matrices onX. Then, the transition matrix given by:
W(x,y|x′,y′)=Q(y|y′)P(x−y|x′−y′)
satisfies Assumption 1. Furthermore, if transition matrixP(z|z′)can be written as:
P(z|z′)=PZ(πz′ (z))
for permutationπz′ and a distributionPZonX , then transition matrix W defined by (41) satisfies Assumption 2 as well.
Example 2.
Suppose thatXis a module and W is (strongly) non-hidden with respect toY. Let Q be a transition matrix onZ=X. Then, the transition matrix given by:
V(x,y,z|x′,y′,z′)=W(x−z,y|x′−z′,y)Q(z|z′)
is (strongly) non-hidden with respect toY×Z.
The following is also an example satisfying Assumption 2, which describes a noise process of an important class of channels with memory (cf. the Gilbert-Elliot channel in Example 6).
Example 3.
LetX=Y={0,1}. Then, let:
W(y|y′)=1−qy′ ify=y′qy′ ify≠y′
for some0<q0,q1<1, and let:
W(x|x′,y′,y)=1−pyifx=0pyifx=1
for some0<p0,p1<1. By choosingπx′;x˜′to be the identity, this transition matrix satisfies the condition given in Remark 5, which is equivalent to Assumption 2.
First, we introduce information measures under Assumption 1. In order to define a transition matrix counterpart of (7), let us introduce the following tilted matrix:
W˜θ(x,y|x′,y′):=W(x,y|x′,y′)1+θW(y|y′)−θ.
Here, we should notice that the tilted matrixW˜θis not normalized, i.e., is not a transition matrix. Letλθbe the Perron–Frobenius eigenvalue ofW˜θandP˜θ,XYbe its normalized eigenvector. Then, we define the lower conditional Rényi entropy for W by:
H1+θ↓,W(X|Y):=−1θlogλθ,
whereθ∈(−1,0)∪(0,∞). Forθ=0, we define the lower conditional Rényi entropy for W by:
HW(X|Y)=H1↓,W(X|Y)
:=limθ→0H1+θ↓,W(X|Y),
and we just call it the conditional entropy for W. In fact, the definition ofHW(X|Y)above coincides with:
−∑x′,y′P0,XY(x′,y′)∑x,yW(x,y|x′,y′)logW(x,y|x′,y′)W(y|y′),
whereP0,XY is the stationary distribution of W (cf. [60] (Equation (30))). Forθ=−1,H0↓,W(X|Y)is also defined by taking the limit. When X has no side-information, the Rényi entropyH1+θW(X)for W is defined as a special case ofH1+θ↓,W(X|Y).
As a counterpart of (11), we also define (Since the limiting expression in (51) coincides with the second derivative of the CGF (cf. (A30)) and since the second derivative of the CGF exists (cf. [22] (Appendix D)), the variance in (51) is well defined. While the definition (51) contains the limitθ→0 , it can be calculated without this type of limit by using the fundamental matrix [61] (Theorem 4.3.1), [23] (Theorem 7.7 and Remark 7.8).):
VW(X|Y):=limθ→02HW(X|Y)−H1+θ↓,W(X|Y)θ.
Remark 2.
When transition matrix W satisfies Assumption 2,H1+θ↓,W(X|Y)can be written as:
H1+θ↓,W(X|Y)=−1θlogλθ′,
whereλθ′is the Perron–Frobenius eigenvalue ofWθ(y|y′)W(y|y′)−θ. In fact, for the left Perron–Frobenius eigenvectorQ^θofWθ(y|y′)W(y|y′)−θ, we have:
∑x,yQ^θ(y)W(x,y|x′,y′)1+θW(y|y′)−θ=λθ′ Qθ(y′),
which implies thatλθ′is the Perron–Frobenius eigenvalue ofW˜θ. Consequently, we can evaluateH1+θ↓,W(X|Y)by calculating the Perron–Frobenius eigenvalue of the|Y|×|Y|matrix instead of the|X||Y|×|X||Y|matrix when W satisfies Assumption 2.
Next, we introduce information measures under Assumption 2. In order to define a transition matrix counterpart of (12), let us introduce the following|Y|×|Y|matrix:
Kθ(y|y′):=Wθ (y|y′)11+θ,
whereWθ is defined by (39). Letκθbe the Perron–Frobenius eigenvalue ofKθ. Then, we define the upper conditional Rényi entropy for W by:
H1+θ↑,W(X|Y):=−1+θθlogκθ,
whereθ∈(−1,0)∪(0,∞). Forθ=−1andθ=0,H1+θ↑,W(X|Y) is defined by taking the limit. We have the following properties, which will be proven in Appendix F.
Lemma 6.
We have:
limθ→0H1+θ↑,W(X|Y)=HW(X|Y)
and:
limθ→02HW(X|Y)−H1+θ↑,W(X|Y)θ=VW(X|Y).
Now, let us introduce a transition matrix counterpart of (18). For this purpose, we introduce the following|Y|×|Y|matrix:
Nθ,θ′(y|y′):=Wθ(y|y′)Wθ′ (y|y′)−θ1+θ′.
Letνθ,θ′be the Perron–Frobenius eigenvalue ofNθ,θ′. Then, we define the two-parameter conditional Rényi entropy by:
H1+θ,1+θ′W(X|Y):=−1θlogνθ,θ′+θ′1+θ′H1+θ′↑,W(X|Y).
Remark 3.
Although we definedH1+θ↓,W(X|Y)andH1+θ↑,W(X|Y) by (47) and (55), respectively, we can alternatively define these measures in the same spirit as the single-shot setting by introducing a transition matrix counterpart ofH1+θ(PXY|QY)as follows. For the marginalW(y|y′)ofW(x,y|x′,y′), letYW2:={(y,y′):W(y|y′)>0}. For another transition matrix V onY, we defineYV2in a similar manner. For V satisfyingYW2⊂YV2, we define (although we can also defineH1+θW|V(X|Y)even ifYW2⊂YV2 is not satisfied (see [22] for the detail), for our purpose of definingH1+θ↓,W(X|Y)andH1+θ↑,W(X|Y), other cases are irrelevant):
H1+θW|V(X|Y):=−1θlogλθW|V
forθ∈(−1,0)∪(0,∞), whereλθW|Vis the Perron–Frobenius eigenvalue of:
W(x,y|x′,y′)1+θV(y|y′)−θ.
By using this measure, we obviously have:
H1+θ↓,W(X|Y)=H1+θW|W(X|Y).
Furthermore, under Assumption 2, the relation:
H1+θ↑,W(X|Y)=maxVH1+θW|V(X|Y)
holds (see Appendix G for the proof), where the maximum is taken over all transition matrices satisfyingYW2⊂YV2.
Next, we investigate some properties of the information measures introduced in this section. The following lemma is proven in Appendix H.
Lemma 7.
1.
The functionθH1+θ↓,W(X|Y)is a concave function of θ, and it is strict concave iffVW(X|Y)>0.
2.
H1+θ↓,W(X|Y)is a monotonically decreasing function of θ.
3.
The functionθH1+θ↑,W(X|Y)is a concave function of θ, and it is strict concave iffVW(X|Y)>0.
4.
H1+θ↑,W(X|Y)is a monotonically decreasing function of θ.
5.
For everyθ∈(−1,0)∪(0,∞), we haveH1+θ↓,W(X|Y)≤H1+θ↑,W(X|Y).
6.
For fixedθ′, the functionθH1+θ,1+θ′W(X|Y)is a concave function of θ, and it is strict concave iffVW(X|Y)>0.
7.
For fixedθ′,H1+θ,1+θ′W(X|Y)is a monotonically decreasing function of θ.
8.
We have:
H1+θ,1W(X|Y)=H1+θ↓,W(X|Y).
9.
We have:
H1+θ,1+θW(X|Y)=H1+θ↑,W(X|Y).
10.
For everyθ∈(−1,0)∪(0,∞),H1+θ,1+θ′W(X|Y)is maximized atθ′=θ, i.e.,
d[H1+θ,1+θ′W(X|Y)]dθ′|θ′=θ=0.
From Statement 1 of Lemma 7,d[θH1+θ↓,W(X|Y)]/dθis monotonically decreasing. Thus, we can define the inverse functionθW(a)=θ↓,W(a)ofd[θH1+θ↓,W(X|Y)]/dθby:
d[θH1+θ↓,W(X|Y)]dθ|θ=θW(a)=a
fora̲<a≤a¯, wherea̲:=limθ→∞d[θH1+θ↓,W(X|Y)]/dθanda¯:=limθ→−1d[θH1+θ↓,W(X|Y)]/dθ. Let:
RW(a):=(1+θ(a))a−θ(a)H1+θ(a)↓,W(X|Y).
Since
RW′(a)=(1+θ(a)),
RW(a)is a monotonic increasing function ofa̲<a<RW(a¯). Thus, we can define the inverse functionaW(R)=a↓,W(R)ofRW(a)by:
(1+θ(aW(R)))aW(R)−θW(aW(R))H1+θW(aW(R))↓,W(X|Y)=R
forRW(a̲)<R<H0↓,W(X|Y), whereH0↓,W(X|Y):=limθ→−1 H1+θ↓,W(X|Y).
ForθH1+θ↑,W(X|Y), by the same reason, we can define the inverse functionθW(a)=θ↑,W(a)by:
d[θH1+θ,1+θW(a)W(X|Y)]dθ|θ=θW(a)=d[θH1+θ↑,W(X|Y)]dθ|θ=θW(a)=a,
and the inverse functionaW(R)=a↑,W(R)of:
RW(a):=(1+θW(a))a−θW(a)H1+θW(a)↑,W(X|Y)
by:
(1+θW(aW(R)))aW(R)−θW(aW(R))H1+θW(aW(R))↑,W(X|Y)=R,
forR(a̲)<R<H0↑,W(X|Y), whereH0↑,W(X|Y):=limθ→−1 H1+θ↑,W(X|Y) . Here, the first equality in (71) follows from (66).
Sinceθ↦θH1+θ↓,W(X|Y)is concave, the supremum of[−θR+θH1+θ↓,W(X|Y)]is attained at the stationary point. Furthermore, note that−1≤θ↓,W(R)≤0forHW(X|Y)≤R≤H0↓,W(X|Y). Thus, we have the following property.
Lemma 8.
The functionθW(R) defined in (67) satisfies:
sup−1≤θ≤0[−θR+θH1+θ↓,W(X|Y)]=−θW(R)R+θW(R)H1+θW(R)↓,W(X|Y)
forHW(X|Y)≤R≤H0↓,W(X|Y).
Furthermore, we have the following characterization for another type of maximization.
Lemma 9.
The functionθW(aW(R)) defined by (70) satisfies:
sup−1≤θ≤0−θR+θH1+θ↓,W(X|Y)1+θ=−θW(aW(R))aW(R)+θW(aW(R))H1+θW(aW(R))↓,W(X|Y)
forHW(X|Y)≤R≤H0↓,W(X|Y), and the functionθ(a(R)) defined in (73) satisfies:
sup−1≤θ≤0−θR+θH1+θ↑,W(X|Y)1+θ=−θW(aW(R))aW(R)+θW(aW(R))H1+θW(aW(R))↑,W(X|Y)
forHW(X|Y)≤R≤H0↑,W(X|Y).
Proof.
See Appendix I. □
Remark 4.
The combination of (49), (51), and Lemma 6 guarantees that both the conditional Rényi entropies expand as:
H1+θ↓,W(X|Y)=HW(X|Y)−12VW(X|Y)θ+o(θ),
H1+θ↑,W(X|Y)=HW(X|Y)−12VW(X|Y)θ+o(θ)
aroundθ=0. Thus, the difference of these measures significantly appears only when|θ|is rather large. For the transition matrix of Example 3 withq0=q1=0.1,p0=0.1, andp1=0.4 , we plotted the values of the information measures in Figure 1. Although the values atθ=−1 coincide in Figure 1, note that the values atθ=−1may differ in general.
In Example 1, we mentioned that the transition matrix W in (41) satisfies Assumption 2 when transition matrix P is given by (42). By computing the conditional Rényi entropies for this special case, we have:
H1+θ↑,W(X|Y)=H1+θ↓,W(X|Y)
=H1+θ(PZ),
i.e., the two kinds of conditional Rényi entropies coincide.
Now, let us consider the asymptotic behavior ofH1+θ↓,W(X|Y)aroundθ=0. Whenθ(a)is close to zero, we have:
θW(a)H1+θW(a)↓,W(X|Y)=θW(a)HW(X|Y)−12VW(X|Y)θW (a)2+o(θW (a)2).
Taking the derivative, (67) implies that:
a=HW(X|Y)−VW(X|Y)θW(a)+o(θW(a)).
Hence, when R is close toHW(X|Y), we have:
R=(1+θW(aW(R))aW(R)−θW(aW(R))H1+θW(aW(R))↓,W(X|Y)
=HW(X|Y)−1+θW(aW(R))2θW(aW(R))VW(X|Y)+o(θW(aW(R)),
i.e.,
θW(aW(R))=−R+HW(X|Y)VW(X|Y)+oR−HW(X|Y)VW(X|Y).
Furthermore, (81) and (82) imply:
−θW(aW(R))aW(R)+θW(aW(R))H1+θW(aW(R))↓,W(X|Y)
=VW(X|Y)θW (aW(R))22+o(θW (aW(R))2)
=VW(X|Y)2R−HW(X|Y)VW(X|Y)2+oR−HW(X|Y)VW(X|Y)2.
2.3. Information Measures for the Markov Chain
Let(X,Y)be the Markov chain induced by transition matrix W and some initial distributionPX1 Y1 . Now, we show how information measures introduced in Section 2.2 are related to the conditional Rényi entropy rates. First, we introduce the following lemma, which gives finite upper and lower bounds on the lower conditional Rényi entropy.
Lemma 10.
Suppose that transition matrix W satisfies Assumption 1. Letvθbe the eigenvector ofWθTwith respect to the Perron–Frobenius eigenvalueλθsuch thatminx,y vθ(x,y)=1 (since the eigenvector corresponding to the Perron–Frobenius eigenvalue for an irreducible non-negative matrix has always strictly positive entries [62] (Theorem 8.4.4, p. 508), we can choose the eigenvectorvθsatisfying this condition). Letwθ(x,y):=PX1 Y1 (x,y)1+θ PY1 (y)−θ. Then, for everyn≥1, we have:
(n−1)θH1+θ↓,W(X|Y)+δ̲(θ)≤θH1+θ↓(Xn|Yn)≤(n−1)θH1+θ↓,W(X|Y)+δ¯(θ),
where:
δ¯(θ):=−log〈vθ|wθ〉+logmaxx,yvθ(x,y),
δ̲(θ):=−log〈vθ|wθ〉,
and〈vθ|wθ〉is defined as∑x,y vθ(x,y)wθ(x,y).
Proof.
This follows from (A29) and Lemma A2. □
From Lemma 10, we have the following.
Theorem 1.
Suppose that transition matrix W satisfies Assumption 1. For any initial distribution, we have (When there is no side-information, (93) reduces to the well-known expression of the entropy rate of the Markov process [39]. Without Assumption 1, it is not clear if (93) holds or not.):
limn→∞1nH1+θ↓(Xn|Yn)=H1+θ↓,W(X|Y),
limn→∞1nH(Xn|Yn)=HW(X|Y).
We also have the following asymptotic evaluation of the variance, which follows from Lemma A3 in Appendix A.
Theorem 2.
Suppose that transition matrix W satisfies Assumption 1. For any initial distribution, we have:
limn→∞1nV(Xn|Yn)=VW(X|Y).
Theorem 2 is practically important since the limit of the variance can be described by a single-letter characterized quantity. A method to calculateVW(X|Y) can be found in [23].
Next, we show the lemma that gives the finite upper and lower bounds on the upper conditional Rényi entropy in terms of the upper conditional Rényi entropy for the transition matrix.
Lemma 11.
Suppose that transition matrix W satisfies Assumption 2. Letvθbe the eigenvector ofKθTwith respect to the Perron–Frobenius eigenvalueκθsuch thatminy vθ(y)=1. Letwθbe the|Y|-dimensional vector defined by:
wθ(y):=∑xPX1 Y1 (x,y)1+θ11+θ.
Then, we have:
(n−1)θ1+θH1+θ↑,W(X|Y)+ξ̲(θ)≤θ1+θH1+θ↑(Xn|Yn)≤(n−1)θ1+θH1+θ↑,W(X|Y)+ξ¯(θ),
where:
ξ¯(θ):=−log〈vθ|wθ〉+logmaxyvθ(y),
ξ̲(θ):=−log〈vθ|wθ〉.
Proof.
See Appendix J. □
From Lemma 11, we have the following.
Theorem 3.
Suppose that transition matrix W satisfies Assumption 2. For any initial distribution, we have:
limn→∞1nH1+θ↑(Xn|Yn)=H1+θ↑,W(X|Y).
Finally, we show the lemma that gives the finite upper and lower bounds on the two-parameter conditional Rényi entropy in terms of the two-parameter conditional Rényi entropy for the transition matrix.
Lemma 12.
Suppose that transition matrix W satisfies Assumption 2. Letvθ,θ′be the eigenvector ofNθ,θ′Twith respect to the Perron–Frobenius eigenvalueνθ,θ′such thatminy vθ,θ′(y)=1. Letwθ,θ′be the|Y|-dimensional vector defined by:
wθ,θ′(y):=∑xPX1 Y1 (x,y)1+θ∑xPX1 Y1 (x,y)1+θ′−θ1+θ′.
Then, we have:
(n−1)θH1+θ,1+θ′W(X|Y)+ζ̲(θ,θ′)≤θH1+θ,1+θ′(Xn|Yn)≤(n−1)θH1+θ,1+θ′W(X|Y)+ζ¯(θ,θ′),
where:
ζ¯(θ,θ′):=−log〈vθ,θ′|wθ,θ′〉+logmaxyvθ,θ′(y)+θξ¯(θ′),
ζ̲(θ,θ′):=−log〈vθ,θ′|wθ,θ′〉+θξ̲(θ′)
forθ>0and:
ζ¯(θ,θ′):=−log〈vθ,θ′|wθ,θ′〉+logmaxyvθ,θ′(y)+θξ̲(θ′),
ζ̲(θ,θ′):=−log〈vθ,θ′|wθ,θ′〉+θξ¯(θ′)
forθ<0
Proof.
By multiplyingθin the definition ofH1+θ,1+θ′(Xn|Yn), we have:
θH1+θ,1+θ′(Xn|Yn)
=−log∑yn∑xnPXn Yn (xn,yn)1+θ∑xnPXn Yn (xn,yn)1+θ′−θ1+θ′+θθ′1+θ′H1+θ′↑(Xn|Yn).
The second term is evaluated by Lemma 11. The first term can be evaluated almost in the same manner as Lemma 11. □
From Lemma 12, we have the following.
Theorem 4.
Suppose that transition matrix W satisfies Assumption 2. For any initial distribution, we have:
limn→∞1nH1+θ,1+θ′(Xn|Yn)=H1+θ,1+θ′W(X|Y).
3. Source Coding with Full Side-Information
In this section, we investigate source coding with side-information. We start this section by showing the problem setting in Section 3.1. Then, we review and introduce some single-shot bounds in Section 3.2. We derive finite-length bounds for the Markov chain in Section 3.3. Then, in Section 3.5 and Section 3.6, we show the asymptotic characterization for the large deviation regime and the moderate deviation regime by using those finite-length bounds. We also derive the second-order rate in Section 3.4.
3.1. Problem Formulation
A codeΨ=(e,d)consists of one encodere:X→{1,…,M}and one decoderd:{1,…,M}×Y→X. The decoding error probability is defined by:
Ps[Ψ]=Ps[Ψ|PXY]
:=Pr{X≠d(e(X),Y)}.
For notational convenience, we introduce the infimum of error probabilities under the condition that the message size is M:
Ps(M)=Ps(M|PXY)
:=infΨPs[Ψ].
For theoretical simplicity, we focus on a randomized choice of our encoder. For this purpose, we employ a randomized hash function F fromXto{1,…,M}. A randomized hash function F is called a two-universal hash whenPr{F(x)=F(x′)}≤1Mfor any distinctive x andx′ [63]; the so-called bin coding [39] is an example of the two-universal hash function. In the following, we denote the set of two-universal hash functions byF. Given an encoder f as a function fromXto{1,…,M}, we define the decoderdfas the optimal decoder byargmindPs[(f,d)]. Then, we denote the code(f,df)byΨ(f). Then, we bound the error probabilityPs[Ψ(F)]averaged over the random function F by only using the property of two-universality. In order to consider the worst case of such schemes, we introduce the following quantity:
P¯s(M)=P¯s(M|PXY)
:=supF∈FEF[Ps[Ψ(F)]],.
When we consider n-fold extension, the source code and related quantities are denoted with the superscript(n) . For example, the quantities in (112) and (114) are written asPs(n)(M)andP¯s(n)(M), respectively. Instead of evaluating them, we are often interested in evaluating:
M(n,ε):=inf{Mn:Ps(n)(Mn)≤ε},
M¯(n,ε):=inf{Mn:P¯s(n)(Mn)≤ε}
for given0≤ε<1.
3.2. Single-Shot Bounds
In this section, we review existing single-shot bounds and also show novel converse bounds. For the information measures used below, see Section 2.
By using the standard argument on information-spectrum approach, we have the following achievability bound.
Lemma 13
(Lemma 7.2.1 of [3]). The following bound holds:
P¯s(M)≤infγ≥0PXYlog1PX|Y(x|y)>γ+eγM.
Although Lemma 13 is useful for the second-order regime, it is known to be not tight in the large deviation regime. By using the large deviation technique of Gallager, we have the following exponential-type achievability bound.
Lemma 14
([64]). The following bound holds: (note that the Gallager function and the upper conditional Rényi entropy are related by (A45)):
P¯s(M)≤inf−12≤θ≤0Mθ1+θ e−θ1+θH1+θ↑(X|Y).
Although Lemma 14 is known to be tight in the large deviation regime for i.i.d. sources,H1+θ↑(X|Y)for Markov chains can only be evaluated under the strongly non-hidden assumption. For this reason, even though the following bound is looser than Lemma 14, it is useful to have another bound in terms ofH1+θ↓(X|Y), which can be evaluated for Markov chains under the non-hidden assumption.
Lemma 15.
The following bound holds:
P¯s(M)≤inf−1≤θ≤0Mθ e−θH1+θ↓(X|Y).
Proof.
To derive this bound, we change the variable in (118) asθ=θ′1−θ′. Then,−1≤θ′≤0, and we have:
Mθ′ e−θ′ H11−θ′↑(X|Y)≤Mθ′ e−θ′ H1+θ′↓(X|Y),
where we use Lemma A4 in Appendix C. □
For the source coding without side-information, i.e., when X has no side-information, we have the following bound, which is tighter than Lemma 14.
Lemma 16
((2.39) [65]). The following bound holds:
Ps(M)≤inf−1<θ≤0Mθ1+θ e−θ1+θH1+θ(X).
For the converse part, we first have the following bound, which is very close to the operational definition of source coding with side-information.
Lemma 17
([66]). Let{Ωy}y∈Ybe a family of subsetsΩy⊂X, and letΩ=∪y∈Y Ωy×{y}. Then, for anyQY∈P(Y), the following bound holds:
Ps(M)≥min{Ωy}PXY(Ωc):∑yQY(y)|Ωy|≤M.
Since Lemma 17 is close to the operational definition, it is not easy to evaluate Lemma 17. Thus, we derive another bound by loosening Lemma 17, which is more tractable for evaluation. Slightly weakening Lemma 17, we have the following.
Lemma 18
([3,4]). For anyQY∈P(Y), we have (In fact, a special case forQY=PY corresponds to Lemma 7.2.2 of [3]. A bound that involvesQY was introduced in [4] for channel coding, and it can be regarded as a source coding counterpart of that result.):
Ps(M)≥supγ≥0PXYlogQY(y)PXY(x,y)>γ−Meγ.
By using the change-of-measure argument, we also obtain the following converse bound.
Theorem 5.
For anyQY∈P(Y), we have:
−logPs(M)≤infs>0θ˜∈R,ϑ≥0[(1+s)θ˜H1+θ˜(PXY|QY)−H1+(1+s)θ˜(PXY|QY)
−(1+s)log1−2e−−ϑR+(θ˜+ϑ(1+θ˜))Hθ˜+ϑ(1+θ˜)(PXY|QY)−(1+ϑ)θ˜H1+θ˜(PXY|QY)1+ϑ]/s≤infs>0−1<θ˜<θ(a(R))[(1+s)θ˜H1+θ˜(PXY|QY)−H1+(1+s)θ˜(PXY|QY)
−(1+s)log1−2e(θ(a(R))−θ˜)a(R)−θ(a(R))H1+θ(a(R))(PXY|QY)+θ˜H1+θ˜(PXY|QY)]/s,
whereR=logM, andθ(a)=θQ(a)anda(R)=aQ(R) are the inverse functions defined in (29) and (32), respectively.
Proof.
See Appendix K. □
In particular, by takingQY=PY(1+θ(a(R)))in Theorem 5, we have the following.
Corollary 1.
We have:
−logPs(M)≤infs>0−1<θ˜<θ(a(R))[(1+s)θ˜H1+θ˜,1+θ(a(R))(X|Y)−H1+(1+s)θ˜,1+θ(a(R))(X|Y)
−(1+s)log1−2e(θ(a(R))−θ˜)a(R)−θ(a(R))H1+θ(a(R))↑(X|Y)+θ˜H1+θ˜,1+θ(a(R))(X|Y)]/s,
whereθ(a)=θ↑(a)anda(R)=a↑(R) are the inverse functions defined in (35) and (36).
Remark 5.
Here, we discuss the possibility for extension to the continuous case. As explained in Remark 1, we can define the information quantities for the case whenYis continuous, butXis a discrete finite set. The discussions in this subsection still hold even in this continuous case. In particular, in the n-i.i.d. extension case with this continuous setting, Lemma 14 and Corollary 1 hold when the information measures are replaced by n times the single-shot information measures.
3.3. Finite-Length Bounds for Markov Source In this subsection, we derive several finite-length bounds for the Markov source with a computable form. Unfortunately, it is not easy to evaluate how tight those bounds are only with their formula. Their tightness will be discussed by considering the asymptotic limit in the remaining subsections of this section. Since we assume the irreducibility for the transition matrix describing the Markov chain, the following bound holds with any initial distribution.
To derive a lower bound on−logP¯s(Mn)in terms of the Rényi entropy of the transition matrix, we substitute the formula for the Rényi entropy given in Lemma 10 into Lemma 15. Then, we can derive the following achievability bound.
Theorem 6
(Direct, Ass. 1). Suppose that transition matrix W satisfies Assumption 1. LetR:=1nlogMn. Then, for everyn≥1, we have:
−logP¯s(n)(Mn)≥sup−1≤θ≤0−θnR+(n−1)θH1+θ↓,W(X|Y)+δ̲(θ),
whereδ̲(θ)is given by (91).
For the source coding without side-information, from Lemma 16 and a special case of Lemma 10, we have the following achievability bound.
Theorem 7
(Direct, no-side-information). LetR:=1nlogMn. Then, for everyn≥1, we have:
−logPe(n)(Mn)≥sup−1<θ≤0−nθR+(n−1)θH1+θW(X)+δ̲(θ)1+θ.
To derive an upper bound on−logPs(Mn)in terms of the Rényi entropy of transition matrix, we substitute the formula for the Rényi entropy given in Lemma 10 for Theorem 5. Then, we have the following converse bound.
Theorem 8
(Converse, Ass. 1). Suppose that transition matrix W satisfies Assumption 1. LetR:=1nlogMn. For anyHW(X|Y)<R<H0↓,W(X|Y), we have:
−logPs(n)(Mn)≤infs>0−1<θ˜<θ(a(R))[(n−1)(1+s)θ˜H1+θ˜↓,W(X|Y)−H1+(1+s)θ˜↓,W(X|Y)+δ1
−(1+s)log1−2e(n−1)[(θW(aW(R))−θ˜)aW(R)−θW(aW(R))H1+θW(aW(R))↓,W(X|Y)+θ˜H1+θ˜↓,W(X|Y)]+δ2]/s,
whereθ(a)=θ↓(a)anda(R)=a↓(R) are the inverse functions defined by (67) and (70), respectively,
δ1:=(1+s)δ¯(θ˜)−δ̲((1+s)θ˜),
δ2:=(θW(aW(R))−θ˜)R−(1+θ˜)δ̲(θW(aW(R)))+(1+θW(aW(R)))δ¯(θ˜)1+θW(aW(R)),
andδ¯(·)andδ̲(·) are given by (90) and (91), respectively.
Proof.
We first use (124) of Theorem 5 forQYn =PYn and Lemma 10. Then, we restrict the range ofθ˜as−1<θ˜<θW(aW(R))and setϑ=θW(aW(R))−θ˜1+θ˜. Then, we have the assertion of the theorem. □
Next, we derive tighter bounds under Assumption 2. To derive a lower bound on−logP¯s(Mn)in terms of the Rényi entropy of the transition matrix, we substitute the formula for the Rényi entropy in Lemma 11 for Lemma 14. Then, we have the following achievability bound.
Theorem 9
(Direct, Ass. 2). Suppose that transition matrix W satisfies Assumption 2. LetR:=1nlogMn. Then, we have:
−logP¯s(n)(Mn)≥sup−12≤θ≤0−θnR+(n−1)θH1+θ↑,W(X|Y)1+θ+ξ̲(θ),
whereξ̲(θ)is given by (98).
Finally, to derive an upper bound on $-\log P_s(M_n)$ in terms of the Rényi entropy for the transition matrix, we substitute the formula for the Rényi entropy given in Lemma 12 into Theorem 5 with $Q_{Y^n}=P_{Y^n}^{(1+\theta^W(a^W(R)))}$. Then, we can derive the following converse bound.

Theorem 10

(Converse, Ass. 2). Suppose that the transition matrix $W$ satisfies Assumption 2. Let $R:=\frac{1}{n}\log M_n$. For any $H^W(X|Y)<R<H_0^{\uparrow,W}(X|Y)$, we have:

$-\log P_s^{(n)}(M_n)\le\inf_{s>0}\inf_{-1<\tilde\theta<\theta^W(a^W(R))}\Big[(n-1)(1+s)\tilde\theta\big(H_{1+\tilde\theta,1+\theta^W(a^W(R))}^{W}(X|Y)-H_{1+(1+s)\tilde\theta,1+\theta^W(a^W(R))}^{W}(X|Y)\big)+\delta_1-(1+s)\log\Big(1-2e^{(n-1)\big[(\theta^W(a^W(R))-\tilde\theta)a^W(R)-\theta^W(a^W(R))H_{1+\theta^W(a^W(R))}^{\uparrow,W}(X|Y)+\tilde\theta H_{1+\tilde\theta,1+\theta^W(a^W(R))}^{W}(X|Y)\big]+\delta_2}\Big)\Big]\Big/s,$

where $\theta^W(a)=\theta^{\uparrow,W}(a)$ and $a^W(R)=a^{\uparrow,W}(R)$ are the inverse functions defined by (71) and (73), respectively,

$\delta_1:=(1+s)\bar{\zeta}(\tilde\theta,\theta^W(a^W(R)))-\underline{\zeta}((1+s)\tilde\theta,\theta^W(a^W(R))),$

$\delta_2:=(\theta^W(a^W(R))-\tilde\theta)R-\frac{(1+\tilde\theta)\underline{\zeta}(\theta^W(a^W(R)),\theta^W(a^W(R)))+(1+\theta^W(a^W(R)))\bar{\zeta}(\tilde\theta,\theta^W(a^W(R)))}{1+\theta^W(a^W(R))},$

and $\bar{\zeta}(\cdot,\cdot)$ and $\underline{\zeta}(\cdot,\cdot)$ are given by (102)–(105).

Proof.

We first use (124) of Theorem 5 with $Q_{Y^n}=P_{Y^n}^{(1+\theta^W(a^W(R)))}$ and Lemma 12. Then, we restrict the range of $\tilde\theta$ to $-1<\tilde\theta<\theta^W(a^W(R))$ and set $\vartheta=\frac{\theta^W(a^W(R))-\tilde\theta}{1+\tilde\theta}$. Then, we have the assertion of the theorem. □
3.4. Second-Order
By applying the central limit theorem for Markov chains (cf. [67] (Theorem 27.4, Example 27.6)) to Lemma 13 and to Lemma 18 with $Q_Y=P_Y$, and by using Theorem 2, we have the following.

Theorem 11.

Suppose that the transition matrix $W$ on $\mathcal{X}\times\mathcal{Y}$ satisfies Assumption 1. For arbitrary $\varepsilon\in(0,1)$, we have:

$\log M(n,\varepsilon)=\log\bar{M}(n,\varepsilon)+o(\sqrt{n})=nH^W(X|Y)+\sqrt{nV^W(X|Y)}\,\Phi^{-1}(1-\varepsilon)+o(\sqrt{n}).$
Proof.
The central limit theorem for the Markov process (cf. [67] (Theorem 27.4, Example 27.6)) guarantees that the random variable $(-\log P_{X^n|Y^n}(X^n|Y^n)-nH^W(X|Y))/\sqrt{n}$ asymptotically obeys the normal distribution with mean zero and variance $V^W(X|Y)$, where we use Theorem 2 to show that the limit of the variance is given by $V^W(X|Y)$. Let $R=\sqrt{V^W(X|Y)}\,\Phi^{-1}(1-\varepsilon)$. Substituting $M=e^{nH^W(X|Y)+\sqrt{n}R}$ and $\gamma=nH^W(X|Y)+\sqrt{n}R-n^{1/4}$ into Lemma 13, we have:

$\lim_{n\to\infty}\bar{P}_s^{(n)}\big(e^{nH^W(X|Y)+\sqrt{n}R}\big)\le\varepsilon.$

On the other hand, substituting $M=e^{nH^W(X|Y)+\sqrt{n}R}$ and $\gamma=nH^W(X|Y)+\sqrt{n}R+n^{1/4}$ into Lemma 18 with $Q_Y=P_Y$, we have:

$\lim_{n\to\infty}P_s^{(n)}\big(e^{nH^W(X|Y)+\sqrt{n}R}\big)\ge\varepsilon.$

Combining (140) and (141), we have the statement of the theorem. □

From the above theorem, the (first-order) compression limit of source coding with side-information for a Markov source under Assumption 1 is given by (although the compression limit of source coding with side-information for a Markov chain is known more generally [68], we need Assumption 1 to obtain a single-letter characterization):

$\lim_{n\to\infty}\frac{1}{n}\log M(n,\varepsilon)=\lim_{n\to\infty}\frac{1}{n}\log\bar{M}(n,\varepsilon)=H^W(X|Y)$

for any $\varepsilon\in(0,1)$. In the next subsections, we consider the asymptotic behavior of the error probability when the rate is larger than the compression limit $H^W(X|Y)$, in the moderate deviation regime and the large deviation regime, respectively.
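As an illustration of how Theorem 11 is used, the following sketch evaluates the Gaussian approximation $nH^W(X|Y)+\sqrt{nV^W(X|Y)}\,\Phi^{-1}(1-\varepsilon)$ of $\log M(n,\varepsilon)$. The entropy rate and variance values below are placeholder assumptions, which in practice would be computed via Theorem 2.

```python
from math import sqrt
from statistics import NormalDist

def second_order_log_M(n, entropy_rate, variance, eps):
    """Gaussian approximation of log M(n, eps) from Theorem 11 (in nats)."""
    return n * entropy_rate + sqrt(n * variance) * NormalDist().inv_cdf(1 - eps)

# placeholder values for H^W(X|Y) and V^W(X|Y) (assumptions, not computed here)
H, V = 0.38, 0.25
for n in (1_000, 10_000, 100_000):
    print(n, round(second_order_log_M(n, H, V, 1e-3) / n, 4))  # per-symbol rate
```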
3.5. Moderate Deviation

From Theorems 6 and 8, we have the following.

Theorem 12.

Suppose that the transition matrix $W$ satisfies Assumption 1. For arbitrary $t\in(0,1/2)$ and $\delta>0$, we have:

$\lim_{n\to\infty}-\frac{1}{n^{1-2t}}\log P_s^{(n)}\big(e^{nH^W(X|Y)+n^{1-t}\delta}\big)=\lim_{n\to\infty}-\frac{1}{n^{1-2t}}\log\bar{P}_s^{(n)}\big(e^{nH^W(X|Y)+n^{1-t}\delta}\big)=\frac{\delta^2}{2V^W(X|Y)}.$
Proof.
We apply Theorems 6 and 8 to the case with $R=H^W(X|Y)+n^{-t}\delta$, i.e., $\theta(a(R))=-\frac{n^{-t}\delta}{V^W(X|Y)}+o(n^{-t})$. For the achievability part, from (88) and Theorem 6, we have:

$-\log \bar{P}_s^{(n)}(M_n)\ge\sup_{-1\le\theta\le 0}\big[-\theta nR+(n-1)\theta H_{1+\theta}^{\downarrow,W}(X|Y)\big]+\inf_{-1\le\theta\le 0}\underline{\delta}(\theta)\ge\frac{n^{1-2t}\delta^2}{2V^W(X|Y)}+o(n^{1-2t}).$

To prove the converse part, we fix an arbitrary $s>0$ and choose $\tilde\theta$ to be $-\frac{n^{-t}\delta}{V^W(X|Y)}+n^{-2t}$. Then, Theorem 8 implies that:

$\limsup_{n\to\infty}-\frac{1}{n^{1-2t}}\log P_s(M_n)\le\limsup_{n\to\infty}n^{2t}\frac{1+s}{s}\tilde\theta\big(H_{1+\tilde\theta}^{\downarrow,W}(X|Y)-H_{1+(1+s)\tilde\theta}^{\downarrow,W}(X|Y)\big)=\limsup_{n\to\infty}n^{2t}\frac{1+s}{s}\big(-s\tilde\theta^2\big)\frac{dH_{1+\theta}^{\downarrow,W}(X|Y)}{d\theta}\bigg|_{\theta=\tilde\theta}=\frac{(1+s)\delta^2}{2V^W(X|Y)}.$

Since $s>0$ is arbitrary, letting $s\to 0$ yields the converse part.
□
Remark 6.
In the literature [13,69], the moderate deviation results are stated for $\epsilon_n$ such that $\epsilon_n\to 0$ and $n\epsilon_n^2\to\infty$, instead of $n^{-t}$ for $t\in(0,1/2)$. Although the former is slightly more general than the latter, we employ the latter formulation in Theorem 12 since the order of convergence is clearer. In fact, $n^{-t}$ in Theorem 12 can be replaced by a general $\epsilon_n$ without modifying the argument of the proof.
3.6. Large Deviation

From Theorems 6 and 8, we have the following.

Theorem 13.

Suppose that the transition matrix $W$ satisfies Assumption 1. For $H^W(X|Y)<R$, we have:

$\liminf_{n\to\infty}-\frac{1}{n}\log\bar{P}_s^{(n)}(e^{nR})\ge\sup_{-1\le\theta\le 0}\big[-\theta R+\theta H_{1+\theta}^{\downarrow,W}(X|Y)\big].$

On the other hand, for $H^W(X|Y)<R<H_0^{\downarrow,W}(X|Y)$, we have:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_s^{(n)}(e^{nR})\le-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\downarrow,W}(X|Y)=\sup_{-1<\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\downarrow,W}(X|Y)}{1+\theta}.$
Proof.
The achievability bound (151) follows from Theorem 6. The converse part (152) is proven from Theorem 8 as follows. We first fix $s>0$ and $-1<\tilde\theta<\theta(a(R))$. Then, Theorem 8 implies:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_s^{(n)}(e^{nR})\le\frac{1+s}{s}\tilde\theta\big(H_{1+\tilde\theta}^{\downarrow,W}(X|Y)-H_{1+(1+s)\tilde\theta}^{\downarrow,W}(X|Y)\big).$

By taking the limits $s\to 0$ and $\tilde\theta\to\theta(a(R))$, we have:

$\frac{1+s}{s}\tilde\theta\big(H_{1+\tilde\theta}^{\downarrow,W}(X|Y)-H_{1+(1+s)\tilde\theta}^{\downarrow,W}(X|Y)\big)=\frac{1}{s}\big(\tilde\theta H_{1+\tilde\theta}^{\downarrow,W}(X|Y)-(1+s)\tilde\theta H_{1+(1+s)\tilde\theta}^{\downarrow,W}(X|Y)\big)+\tilde\theta H_{1+\tilde\theta}^{\downarrow,W}(X|Y)$

$\to-\tilde\theta\frac{d[\theta H_{1+\theta}^{\downarrow,W}(X|Y)]}{d\theta}\bigg|_{\theta=\tilde\theta}+\tilde\theta H_{1+\tilde\theta}^{\downarrow,W}(X|Y)\quad(\text{as }s\to 0)$

$\to-\theta(a(R))\frac{d[\theta H_{1+\theta}^{\downarrow,W}(X|Y)]}{d\theta}\bigg|_{\theta=\theta(a(R))}+\theta(a(R))H_{1+\theta(a(R))}^{\downarrow,W}(X|Y)\quad(\text{as }\tilde\theta\to\theta(a(R)))$

$=-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\downarrow,W}(X|Y).$

Thus, (152) is proven. The alternative expression (153) is derived via Lemma 9. □
Under Assumption 2, from Theorems 9 and 10, we have the following tighter bound.
Theorem 14.
Suppose that the transition matrix $W$ satisfies Assumption 2. For $H^W(X|Y)<R$, we have:

$\liminf_{n\to\infty}-\frac{1}{n}\log\bar{P}_s^{(n)}(e^{nR})\ge\sup_{-\frac{1}{2}\le\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\uparrow,W}(X|Y)}{1+\theta}.$

On the other hand, for $H^W(X|Y)<R<H_0^{\uparrow,W}(X|Y)$, we have:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_s^{(n)}(e^{nR})\le-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\uparrow,W}(X|Y)=\sup_{-1<\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\uparrow,W}(X|Y)}{1+\theta}.$
Proof.
The achievability bound (160) follows from Theorem 9. The converse part (161) is proven from Theorem 10 as follows. We first fix $s>0$ and $-1<\tilde\theta<\theta(a(R))$. Then, Theorem 10 implies:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_s^{(n)}(e^{nR})\le\frac{1+s}{s}\tilde\theta\big(H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)-H_{1+(1+s)\tilde\theta,1+\theta(a(R))}^{W}(X|Y)\big).$

By taking the limits $s\to 0$ and $\tilde\theta\to\theta(a(R))$, we have:

$\frac{1+s}{s}\tilde\theta\big(H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)-H_{1+(1+s)\tilde\theta,1+\theta(a(R))}^{W}(X|Y)\big)=\frac{1}{s}\big(\tilde\theta H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)-(1+s)\tilde\theta H_{1+(1+s)\tilde\theta,1+\theta(a(R))}^{W}(X|Y)\big)+\tilde\theta H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)$

$\to-\tilde\theta\frac{d[\theta H_{1+\theta,1+\theta(a(R))}^{W}(X|Y)]}{d\theta}\bigg|_{\theta=\tilde\theta}+\tilde\theta H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)\quad(\text{as }s\to 0)$

$\to-\theta(a(R))\frac{d[\theta H_{1+\theta,1+\theta(a(R))}^{W}(X|Y)]}{d\theta}\bigg|_{\theta=\theta(a(R))}+\theta(a(R))H_{1+\theta(a(R))}^{\uparrow,W}(X|Y)\quad(\text{as }\tilde\theta\to\theta(a(R)))$

$=-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\uparrow,W}(X|Y).$

Thus, (161) is proven. The alternative expression (162) is derived via Lemma 9. □
Remark 7.
For $R\le R_{\mathrm{cr}}$, where (cf. (72) for the definition of $R(a)$):

$R_{\mathrm{cr}}:=R\!\left(\frac{d[\theta H_{1+\theta}^{\uparrow,W}(X|Y)]}{d\theta}\bigg|_{\theta=-\frac{1}{2}}\right)$

is the critical rate, the left-hand side of (76) in Lemma 9 is attained by parameters in the range $-1/2\le\theta\le 0$. Thus, the lower bound in (160) can be rewritten as:

$\sup_{-\frac{1}{2}\le\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\uparrow,W}(X|Y)}{1+\theta}=-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\uparrow,W}(X|Y).$

Thus, the lower bound and the upper bound coincide up to the critical rate.
Remark 8.
For source coding without side-information, by taking the limit of Theorem 7, we have:

$\liminf_{n\to\infty}-\frac{1}{n}\log\bar{P}_s^{(n)}(e^{nR})\ge\sup_{-1\le\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{W}(X)}{1+\theta}.$

On the other hand, as a special case of (152) without side-information, we have:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_s^{(n)}(e^{nR})\le\sup_{-1<\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{W}(X)}{1+\theta}$

for $H^W(X)<R<H_0^{W}(X)$. Thus, we can recover the results in [40,41] by our approach.
3.7. Numerical Example
In this subsection, to demonstrate the advantage of our finite-length bounds, we numerically evaluate the achievability bound in Theorem 7 and a special case of the converse bound in Theorem 8 for source coding without side-information. Thanks to aspect (A2), our numerical calculation shows that the upper finite-length bound is very close to the lower finite-length bound when the block length n is sufficiently large. Thanks to aspect (A1), we can calculate both bounds even for the large block length $n=1\times 10^{5}$ because the computational complexity behaves as $O(1)$.
We consider the binary transition matrix $W$ given by Figure 2, i.e.,

$W=\begin{pmatrix}1-p & q\\ p & 1-q\end{pmatrix}.$

In this case, the stationary distribution is:

$\tilde{P}(0)=\frac{q}{p+q},\qquad \tilde{P}(1)=\frac{p}{p+q}.$

The entropy is:

$H^W(X)=\frac{q}{p+q}h(p)+\frac{p}{p+q}h(q),$

where $h(\cdot)$ is the binary entropy function. The tilted transition matrix is:

$W_\theta=\begin{pmatrix}(1-p)^{1+\theta} & q^{1+\theta}\\ p^{1+\theta} & (1-q)^{1+\theta}\end{pmatrix}.$

The Perron–Frobenius eigenvalue is:

$\lambda_\theta=\frac{(1-p)^{1+\theta}+(1-q)^{1+\theta}+\sqrt{\big\{(1-p)^{1+\theta}-(1-q)^{1+\theta}\big\}^2+4p^{1+\theta}q^{1+\theta}}}{2}$

and its normalized eigenvector is:

$\tilde{P}_\theta(0)=\frac{q^{1+\theta}}{\lambda_\theta-(1-p)^{1+\theta}+q^{1+\theta}},\qquad \tilde{P}_\theta(1)=\frac{\lambda_\theta-(1-p)^{1+\theta}}{\lambda_\theta-(1-p)^{1+\theta}+q^{1+\theta}}.$

The normalized eigenvector of $W_\theta^{T}$ is also given by:

$\hat{P}_\theta(0)=\frac{p^{1+\theta}}{\lambda_\theta-(1-p)^{1+\theta}+p^{1+\theta}},\qquad \hat{P}_\theta(1)=\frac{\lambda_\theta-(1-p)^{1+\theta}}{\lambda_\theta-(1-p)^{1+\theta}+p^{1+\theta}}.$
From these formulas, we can evaluate the bounds in Theorems 7 and 8. For $p=0.1$, $q=0.2$, the bounds are plotted in Figure 3 for the fixed error probability $\varepsilon=10^{-3}$. Although there is a gap between the achievability bound and the converse bound for rather small $n$, the gap is less than approximately 5% of the entropy rate for $n$ larger than 10,000. We also plot the bounds in Figure 4 for the fixed block length $n=10{,}000$ and varying $\varepsilon$. The gap between the achievability bound and the converse bound remains approximately 5% of the entropy rate even for $\varepsilon$ as small as $10^{-10}$.

The gap between the achievability bound and the converse bound in Figure 3 is rather large compared to a similar numerical experiment conducted in [1]. One reason for the gap is that our bounds are exponential-type bounds; for instance, when the source is i.i.d., the achievability bound essentially reduces to the so-called Gallager bound [64]. However, an advantage of our bounds is that the computational complexity does not depend on the block length. The computational complexities of the bounds plotted in [1] depend on the block length, and numerical computation of those bounds for Markov sources seems to be difficult.

When $p=q$, an alternative approach to derive tighter bounds is to encode the Markov transitions, i.e., $1[X_i=X_{i+1}]$, instead of the source itself (cf. [45] (Example 4)). Then, the analysis reduces to the i.i.d. case. However, such an approach is possible only when $p=q$.
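The closed-form expressions above translate directly into a numerical routine. The following sketch (illustrative only, not the authors' code) computes $\lambda_\theta$ and the leading per-symbol term of the achievability bound in Theorem 7, namely $\sup_{-1<\theta\le 0}[-\theta R-\log\lambda_\theta]/(1+\theta)$, using the relation $\theta H_{1+\theta}^{W}(X)=-\log\lambda_\theta$; the correction term $\underline{\delta}(\theta)$ and the distinction between $n$ and $n-1$ are ignored here for brevity.

```python
import numpy as np

def lambda_theta(p, q, theta):
    """Perron-Frobenius eigenvalue of the tilted matrix W_theta
    for the binary transition matrix W = [[1-p, q], [p, 1-q]]."""
    a = (1 - p) ** (1 + theta)
    d = (1 - q) ** (1 + theta)
    bc = (p ** (1 + theta)) * (q ** (1 + theta))
    return (a + d + np.sqrt((a - d) ** 2 + 4 * bc)) / 2

def achievability_exponent(R, p, q, thetas=np.linspace(-0.999, 0.0, 2000)):
    """Leading per-symbol term of the Theorem 7 bound:
    sup_theta [-theta*R - log(lambda_theta)] / (1 + theta).
    The O(1) correction term underline-delta(theta) is omitted (assumption)."""
    vals = (-thetas * R - np.log(lambda_theta(p, q, thetas))) / (1 + thetas)
    return vals.max()

if __name__ == "__main__":
    p, q = 0.1, 0.2
    h = lambda x: -x * np.log(x) - (1 - x) * np.log(1 - x)   # binary entropy (nats)
    HW = q / (p + q) * h(p) + p / (p + q) * h(q)             # entropy rate H^W(X)
    for R in (HW + 0.05, HW + 0.10):
        print(f"R = {R:.3f}, exponent >= {achievability_exponent(R, p, q):.4f}")
```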
3.8. Summary of the Results
The results obtained in this section are summarized in Table 2. The check marks ✓ indicate that the tight asymptotic bounds (large deviation, moderate deviation, and second-order) can be obtained from those bounds. The marks ✓* indicate that the large deviation bound can be derived up to the critical rate. The computational complexity “Tail” indicates that the computational complexities of those bounds depend on the computational complexities of tail probabilities. It should be noted that Theorem 8 is derived from a special case ($Q_Y=P_Y$) of Theorem 5. The asymptotically optimal choice is $Q_Y=P_Y^{(1+\theta)}$, which corresponds to Corollary 1. Under Assumption 1, we can derive the bound of the Markov case only for that special choice of $Q_Y$, while under Assumption 2, we can derive the bound of the Markov case for the optimal choice of $Q_Y$.
4. Channel Coding
In this section, we investigate channel coding for a conditional additive channel. The first part of this section discusses general properties of channel coding for a conditional additive channel. The second part discusses the properties of channel coding when the conditional additive noise of the channel is Markovian. The first part starts by presenting the problem setting in Section 4.1, where we introduce the conditional additive channel. Section 4.2 gives a canonical method to convert a regular channel into a conditional additive channel. Section 4.3 gives a method to convert a BPSK-AWGN channel into a conditional additive channel. Then, we show some single-shot achievability bounds in Section 4.4 and single-shot converse bounds in Section 4.5.

As the second part, we derive finite-length bounds for the Markov noise channel in Section 4.6. Then, we derive the second-order rate in Section 4.7. In Section 4.8 and Section 4.9, we show the asymptotic characterizations in the moderate deviation regime and the large deviation regime, respectively, by using those finite-length bounds.
4.1. Formulation for the Conditional Additive Channel
4.1.1. Single-Shot Case
We first present the problem formulation in the single-shot setting. For a channel $P_{B|A}(b|a)$ with input alphabet $\mathcal{A}$ and output alphabet $\mathcal{B}$, a channel code $\Psi=(e,d)$ consists of an encoder $e:\{1,\dots,M\}\to\mathcal{A}$ and a decoder $d:\mathcal{B}\to\{1,\dots,M\}$. The average decoding error probability is defined by:

$P_c[\Psi]:=\sum_{m=1}^{M}\frac{1}{M}P_{B|A}(\{b: d(b)\ne m\}\,|\,e(m)).$

For notational convenience, we introduce the error probability under the condition that the message size is $M$:

$P_c(M):=\inf_{\Psi}P_c[\Psi].$

Assume that the input alphabet $\mathcal{A}$ is the same set as the output alphabet $\mathcal{B}$ and that they equal an additive group $\mathcal{X}$. When the transition matrix $P_{B|A}(b|a)$ is given as $P_X(b-a)$ by using a distribution $P_X$ on $\mathcal{X}$, the channel is called additive.

To extend the concept of the additive channel, we consider the case when the input alphabet $\mathcal{A}$ is an additive group $\mathcal{X}$ and the output alphabet $\mathcal{B}$ is the product set $\mathcal{X}\times\mathcal{Y}$. When the transition matrix $P_{B|A}(x,y|a)$ is given as $P_{XY}(x-a,y)$ by using a distribution $P_{XY}$ on $\mathcal{X}\times\mathcal{Y}$, the channel is called conditional additive. In this paper, we are exclusively interested in conditional additive channels. As explained in Section 4.2, a channel is a conditional additive channel if and only if it is a regular channel in the sense of [31]. When we need to express the underlying noise distribution explicitly, we denote the average decoding error probability by $P_c[\Psi|P_{XY}]$.
4.1.2. n-Fold Extension
When we consider the n-fold extension, the channel code is denoted with the subscript $n$, such as $\Psi_n=(e_n,d_n)$. The error probabilities given in (183) and (184) are written with the superscript $(n)$ as $P_c^{(n)}[\Psi_n]$ and $P_c^{(n)}(M_n)$, respectively. Instead of evaluating the error probability $P_c^{(n)}(M_n)$ for given $M_n$, we are also interested in evaluating:

$M(n,\varepsilon):=\sup\{M_n : P_c^{(n)}(M_n)\le\varepsilon\}$

for given $0\le\varepsilon\le 1$.

When the n-fold channel is conditional additive, the channel is given by:

$P_{B^n|A^n}(x^n,y^n|a^n)=P_{X^nY^n}(x^n-a^n,y^n),$

where $P_{X^nY^n}$ is a noise distribution on $\mathcal{X}^n\times\mathcal{Y}^n$.
For the code construction, we investigate linear codes. For an $(n,k)$ linear code $C_n\subset\mathcal{A}^n$, there exists a parity check matrix $f_n:\mathcal{A}^n\to\mathcal{A}^{n-k}$ such that the kernel of $f_n$ is $C_n$. That is, given a parity check matrix $f_n:\mathcal{A}^n\to\mathcal{A}^{n-k}$, we define the encoder $I_{\mathrm{Ker}(f_n)}:C_n\to\mathcal{A}^n$ as the embedding of the kernel $\mathrm{Ker}(f_n)$. Then, using the decoder $d_{f_n}:=\mathrm{argmin}_{d}P_c[(I_{\mathrm{Ker}(f_n)},d)]$, we define $\Psi(f_n)=(I_{\mathrm{Ker}(f_n)},d_{f_n})$.

Here, we employ a randomized choice of the parity check matrix. In particular, instead of a general two-universal hash function, we focus on linear two-universal hash functions, because the linearity is required in the above relation with source coding. Therefore, denoting the set of linear two-universal hash functions from $\mathcal{A}^n$ to $\mathcal{A}^{n-k}$ by $\mathcal{F}_l$, we introduce the quantity:

$\bar{P}_c(n,k):=\sup_{F_n\in\mathcal{F}_l}\mathbb{E}_{F_n}P_c^{(n)}[\Psi(F_n)].$

Taking the infimum over all linear codes associated with $F_n$ (cf. (113)), we obviously have:

$P_c^{(n)}(|\mathcal{A}|^{k})\le\bar{P}_c(n,k).$

When we consider the error probability for conditional additive channels, we use the notation $\bar{P}_c(n,k|P_{XY})$ so that the underlying noise distribution is explicit. We are also interested in characterizing:

$k(n,\varepsilon):=\sup\{k : \bar{P}_c(n,k)\le\varepsilon\}$

for given $0\le\varepsilon\le 1$.
4.2. Conversion from the Regular Channel to the Conditional Additive Channel

The aim of this subsection is to show the following theorem by presenting the conversion rule between these two types of channels. We also see that the binary erasure symmetric channel is an example of a regular channel.
Theorem 15.
A channel is a regular channel in the sense of [31] if and only if it can be written as a conditional additive channel.
To show the conversion from a conditional additive channel to a regular channel, we assume that the input alphabet $\mathcal{A}$ has an additive group structure. Let $P_{\tilde{X}}$ be a distribution on the output alphabet $\mathcal{B}$. Let $\pi_a$ be a representation of the group $\mathcal{A}$ on $\mathcal{B}$, and let $G=\{\pi_a : a\in\mathcal{A}\}$. A regular channel [31] is defined by:

$P_{B|A}(b|a)=P_{\tilde{X}}(\pi_a(b)).$

The group action induces the orbit:

$\mathrm{Orb}(b):=\{\pi_a(b) : a\in\mathcal{A}\}.$

The set of all orbits constitutes a disjoint partition of $\mathcal{B}$. The set of orbits is denoted by $\bar{\mathcal{B}}$, and we let $\varpi:\mathcal{B}\to\bar{\mathcal{B}}$ be the map that sends each element $b$ to its orbit $\mathrm{Orb}(b)$.
Example 4
(Binary erasure symmetric channel). Let $\mathcal{A}=\{0,1\}$, $\mathcal{B}=\{0,1,?\}$, and:

$P_{\tilde{X}}(b)=\begin{cases}1-p-p' & \text{if } b=0,\\ p & \text{if } b=1,\\ p' & \text{if } b=?.\end{cases}$

Then, let:

$\pi_0=\begin{pmatrix}0&1&?\\ 0&1&?\end{pmatrix},\qquad \pi_1=\begin{pmatrix}0&1&?\\ 1&0&?\end{pmatrix}.$

The channel defined in this way is a regular channel (see Figure 5). In this case, there are two orbits: $\{0,1\}$ and $\{?\}$.

Let $\mathcal{B}=\mathcal{X}\times\mathcal{Y}$ and $P_{\tilde{X}}=P_{XY}$ for some joint distribution on $\mathcal{X}\times\mathcal{Y}$. Now, we consider a conditional additive channel whose transition matrix $P_{B|A}(x,y|a)$ is given as $P_{XY}(x-a,y)$. When the group action is given by $\pi_a(x,y)=(x-a,y)$, the above conditional additive channel is given as a regular channel. In this case, there are $|\mathcal{Y}|$ orbits, and the size of each orbit is $|\mathcal{X}|$. This fact shows that any conditional additive channel can be written as a regular channel; that is, it shows the “if” part of Theorem 15.
Conversely, we present the conversion from a regular channel to a conditional additive channel. We first explain the construction for the single-shot channel. For the random variable $\tilde{X}\sim P_{\tilde{X}}$, let $\mathcal{Y}=\bar{\mathcal{B}}$ and let $Y=\varpi(\tilde{X})$ be the random variable describing the representatives of the orbits. For $y=\mathrm{Orb}(b)$ and each orbit $\mathrm{Orb}(b)$, we fix an element $0_y\in\mathrm{Orb}(b)$. Then, we define:

$P_Y(y):=P_{\tilde{X}}(\mathrm{Orb}(b)),\qquad P_{X,Y}(a,y):=\frac{P_{\tilde{X}}(\pi_a(0_y))}{|\{a'\in\mathcal{A} \mid \pi_a(0_y)=\pi_{a'}(0_y)\}|}.$

Then, we obtain the virtual channel $P_{X,Y|A}$ as $P_{X,Y|A}(x,y|a):=P_{X,Y}(x-a,y)$. Using the conditional distributions $P_{X,Y|B}$ and $P_{B|X,Y}$ given by:

$P_{X,Y|B}(a,y|b)=\begin{cases}\dfrac{1}{|\{a'\in\mathcal{A}\mid\pi_a(0_y)=\pi_{a'}(0_y)\}|} & \text{when } b=\pi_a(0_y),\\ 0 & \text{otherwise},\end{cases}$

$P_{B|X,Y}(b|a,y)=\begin{cases}1 & \text{when } b=\pi_a(0_y),\\ 0 & \text{otherwise},\end{cases}$

we obtain the relations:

$P_{B|A}(b|a)=\sum_{x,y}P_{B|X,Y}(b|x,y)P_{X,Y|A}(x,y|a),\qquad P_{X,Y|A}(x,y|a)=\sum_{b}P_{X,Y|B}(x,y|b)P_{B|A}(b|a).$

These two equations show that the receiver's information of the virtual conditional additive channel $P_{X,Y|A}$ and the receiver's information of the regular channel $P_{B|A}$ can be converted into each other. Hence, we can say that a regular channel in the sense of [31] can be written as a conditional additive channel, which shows the “only if” part of Theorem 15.
Example 5
(Binary erasure symmetric channel revisited). We convert the regular channel of Example 4 into a conditional additive channel. Let us label the orbit $\{0,1\}$ as $y=0$ and $\{?\}$ as $y=1$. Let $0_0=0$ and $0_1=?$. Then:

$P_{X,Y}(x,0)=\begin{cases}1-p-p' & \text{if } x=0,\\ p & \text{if } x=1,\end{cases}\qquad P_{X,Y}(x,1)=\frac{p'}{2}.$
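To make the conversion concrete, the following sketch (illustrative only) carries out the orbit construction of this subsection for the binary erasure symmetric channel of Examples 4 and 5; representing each $\pi_a$ as a Python dictionary is an implementation choice, not the paper's notation.

```python
from fractions import Fraction

# Regular channel data for the binary erasure symmetric channel (Example 4):
# noise distribution on B = {0, 1, '?'} and group actions pi_a as permutations of B.
p, pp = Fraction(1, 10), Fraction(1, 5)
P_tilde = {0: 1 - p - pp, 1: p, '?': pp}
pi = {0: {0: 0, 1: 1, '?': '?'},      # pi_0 = identity
      1: {0: 1, 1: 0, '?': '?'}}      # pi_1 swaps 0 and 1, fixes '?'

# Orbits of the group action and a fixed representative 0_y in each orbit.
orbits = [{0, 1}, {'?'}]
rep = {y: sorted(orbit, key=str)[0] for y, orbit in enumerate(orbits)}

# P_{X,Y}(a, y) = P_tilde(pi_a(0_y)) / |{a' : pi_{a'}(0_y) = pi_a(0_y)}|
P_XY = {}
for y, orbit in enumerate(orbits):
    zero_y = rep[y]
    for a in pi:
        b = pi[a][zero_y]
        stab = sum(1 for a2 in pi if pi[a2][zero_y] == b)
        P_XY[(a, y)] = P_tilde[b] / stab

print(P_XY)  # {(0,0): 1-p-p', (1,0): p, (0,1): p'/2, (1,1): p'/2}
```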
When we consider the $n$th extension, the channel is given by:

$P_{B^n|A^n}(b^n|a^n)=P_{\tilde{X}^n}(\pi_{a^n}(b^n)),$

where the $n$th extension of the group action is defined by $\pi_{a^n}(b^n)=(\pi_{a_1}(b_1),\dots,\pi_{a_n}(b_n))$.

Similarly, for the $n$-fold extension, we can also construct the virtual conditional additive channel. More precisely, for $\tilde{X}^n\sim P_{\tilde{X}^n}$, we set $Y^n=\varpi(\tilde{X}^n)=(\varpi(\tilde{X}_1),\dots,\varpi(\tilde{X}_n))$ and:

$P_{X^n,Y^n}(x^n,y^n):=\frac{P_{\tilde{X}^n}(\pi_{x^n}(0_{y^n}))}{|\{a'^n\in\mathcal{A}^n\mid\pi_{x^n}(0_{y^n})=\pi_{a'^n}(0_{y^n})\}|}.$
4.3. Conversion of the BPSK-AWGN Channel into the Conditional Additive Channel
Although we have only considered finite input/output sources and channels throughout the paper, in order to demonstrate the utility of the conditional additive channel framework, let us consider the additive white Gaussian noise (AWGN) channel with binary phase shift keying (BPSK) in this section. Let $\mathcal{A}=\{0,1\}$ be the input alphabet of the channel, and let $\mathcal{B}=\mathbb{R}$ be the output alphabet of the channel. For an input $a\in\mathcal{A}$ and Gaussian noise $Z$ with mean zero and variance $\sigma^2$, the output of the channel is given by $B=(-1)^a+Z$. Then, the conditional probability density function of this channel is given as:

$P_{B|A}(b|a)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(b-(-1)^a)^2}{2\sigma^2}}.$

Now, to define a conditional additive channel, we choose $\mathcal{Y}:=\mathbb{R}_+$ and define the probability density function $p_Y$ on $\mathcal{Y}$ with respect to the Lebesgue measure and the conditional distribution $P_{X|Y}(x|y)$ as:

$p_Y(y):=\frac{1}{\sqrt{2\pi}\sigma}\Big(e^{-\frac{(y-1)^2}{2\sigma^2}}+e^{-\frac{(y+1)^2}{2\sigma^2}}\Big),$

$P_{X|Y}(0|y):=\frac{e^{-\frac{(y-1)^2}{2\sigma^2}}}{e^{-\frac{(y-1)^2}{2\sigma^2}}+e^{-\frac{(y+1)^2}{2\sigma^2}}},\qquad P_{X|Y}(1|y):=\frac{e^{-\frac{(y+1)^2}{2\sigma^2}}}{e^{-\frac{(y-1)^2}{2\sigma^2}}+e^{-\frac{(y+1)^2}{2\sigma^2}}}$

for $y\in\mathbb{R}_+$. When we define $b:=(-1)^x y\in\mathbb{R}$ for $x\in\{0,1\}$ and $y\in\mathbb{R}_+$, we have:

$p_{XY|A}(y,x|a)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y-(-1)^{a+x})^2}{2\sigma^2}}=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{((-1)^x y-(-1)^a)^2}{2\sigma^2}}=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(b-(-1)^a)^2}{2\sigma^2}}.$
The relations (202) and (206) show that the AWGN channel with BPSK is given as a conditional additive channel in the above sense.
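The following sketch (illustrative only) implements the mapping described above: a BPSK-AWGN output $b$ is converted into the pair $(x,y)$ with $y=|b|$ and $x$ the sign bit, and the conditional distribution $P_{X|Y}(\cdot|y)$ is evaluated; the toy simulation at the end is an assumption added for demonstration.

```python
import numpy as np

def bpsk_awgn_to_conditional_additive(b):
    """Map a BPSK-AWGN output b in R to the pair (x, y) with y = |b| in R_+
    and x in {0, 1} the sign bit, so that b = (-1)^x * y (Section 4.3)."""
    y = np.abs(b)
    x = (b < 0).astype(int)
    return x, y

def posterior_given_y(y, sigma):
    """P_{X|Y}(0|y) and P_{X|Y}(1|y) for the noise pair (X, Y)."""
    w0 = np.exp(-(y - 1) ** 2 / (2 * sigma ** 2))
    w1 = np.exp(-(y + 1) ** 2 / (2 * sigma ** 2))
    return w0 / (w0 + w1), w1 / (w0 + w1)

# toy usage: transmit the all-zero codeword (+1 symbols) over the channel
rng = np.random.default_rng(0)
sigma = 0.8
b = 1.0 + sigma * rng.standard_normal(5)
x, y = bpsk_awgn_to_conditional_additive(b)
p0, p1 = posterior_given_y(y, sigma)
print(x, np.round(p0, 3))
```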
By noting this observation, as explained in Remark 5, the single-shot achievability bounds in Section 3.2 are also valid for continuous Y. Furthermore, the discussions for the single-shot converse bounds in Section 4.5 hold even for continuous Y. Therefore, the bounds in Section 4.4 and Section 4.5 are also applicable to the BPSK-AWGN channel.
In particular, in the $n$-fold memoryless extension of the BPSK-AWGN channel, the information measures for the noise distribution are given as $n$ times the single-shot information measures for the noise distribution. Even in this case, the upper and lower bounds in Section 4.4 and Section 4.5 are applicable once the information measures are replaced by $n$ times the single-shot information measures. Therefore, we obtain finite-length upper and lower bounds on the optimal coding length for the memoryless BPSK-AWGN channel. Furthermore, even when the additive noise is not Gaussian, if the probability density function $p_Z$ of the additive noise $Z$ satisfies the symmetry $p_Z(z)=p_Z(-z)$, the BPSK channel with additive noise $Z$ can be converted into a conditional additive channel in the same way.
4.4. Achievability Bound Derived by Source Coding with Side-Information

In this subsection, we construct a code for a conditional additive channel from a code for source coding with side-information in a canonical way. In this construction, the decoding error probability of the channel code equals that of the source code.

When the channel is the conditional additive channel with noise distribution $P_{X^nY^n}$ as in (186) and $\mathcal{X}=\mathcal{A}$ is the finite field $\mathbb{F}_q$, we can construct a linear channel code from a source code with full side-information whose encoder and decoder are $f_n$ and $d_n$ as follows. First, we assume linearity of the source encoder $f_n$. Let $C_n(f_n)$ be the kernel of the linear encoder $f_n$ of the source code. Suppose that the sender sends a codeword $c_n\in C_n(f_n)$ and $(c_n+X^n,Y^n)$ is received. Then, the receiver computes the syndrome $f_n(c_n+X^n)=f_n(c_n)+f_n(X^n)=f_n(X^n)$, estimates $X^n$ from $f_n(X^n)$ and $Y^n$, and subtracts the estimate from $c_n+X^n$. That is, we choose the channel decoder $\tilde{d}_n$ as:

$\tilde{d}_n(x'^n,y^n):=x'^n-d_n(f_n(x'^n),y^n).$

Decoding succeeds for this channel code if and only if $d_n(f_n(X^n),Y^n)$ equals $X^n$. Thus, the error probability of this channel code coincides with that of the source code for the correlated source $(X^n,Y^n)$. In summary, we have the following lemma, which was first pointed out in [27]; a small illustrative sketch of this construction is given after the lemma.
Lemma 19
([27], (19)). Given a linear encoder $f_n$ and a decoder $d_n$ for a source code with side-information with distribution $P_{X^nY^n}$, let $I_{\mathrm{Ker}(f_n)}$ and $\tilde{d}_n$ be the channel encoder and decoder induced by $(f_n,d_n)$. Then, the error probability of channel coding for the conditional additive channel with noise distribution $P_{X^nY^n}$ satisfies:

$P_c^{(n)}[(I_{\mathrm{Ker}(f_n)},\tilde{d}_n)|P_{X^nY^n}]=P_s^{(n)}[(f_n,d_n)|P_{X^nY^n}].$

Furthermore (in fact, when we additionally impose linearity on the random function $F$ in the definition (114) of $\bar{P}_s(M|P_{X^nY^n})$, the result in [27] implies that the equality in (209) holds), taking the infimum for $F_n$ chosen to be a linear two-universal hash function, we also have:

$\bar{P}_c(n,k)=\sup_{F_n\in\mathcal{F}_l}\mathbb{E}_{F_n}P_c^{(n)}[\Psi(F_n)]\le\sup_{F_n\in\mathcal{F}_l}\mathbb{E}_{F_n}P_c^{(n)}[(I_{\mathrm{Ker}(F_n)},\tilde{d}_n)]=\sup_{F_n\in\mathcal{F}_l}\mathbb{E}_{F_n}P_s^{(n)}[(F_n,d_n)]\le\sup_{F_n\in\mathcal{F}}\mathbb{E}_{F_n}P_s^{(n)}[(F_n,d_n)]=\bar{P}_s^{(n)}(|\mathcal{A}^{n-k}|).$
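The following is a minimal sketch (not the authors' implementation) of the syndrome-based construction above over $\mathbb{F}_2$: a parity check matrix plays the role of the linear source encoder $f_n$, and a brute-force minimum-weight search stands in for the source decoder $d_n$ (a real decoder would also exploit the side-information $Y^n$).

```python
import itertools
import numpy as np

# Minimal illustration of the syndrome construction over F_2 (q = 2).
H = np.array([[1, 0, 1, 1, 0],
              [0, 1, 1, 0, 1]])           # f_n : F_2^5 -> F_2^2
n = H.shape[1]

def f(x):                                  # syndrome map f_n(x) = Hx over F_2
    return H.dot(x) % 2

def source_decoder(syndrome):
    """Stand-in for d_n: return a minimum-weight noise pattern with the given
    syndrome (brute force; no side-information is used in this toy example)."""
    best = None
    for bits in itertools.product([0, 1], repeat=n):
        e = np.array(bits)
        if np.array_equal(f(e), syndrome) and (best is None or e.sum() < best.sum()):
            best = e
    return best

def channel_decoder(received):
    """d~_n(x') = x' - d_n(f_n(x'))."""
    return (received - source_decoder(f(received))) % 2

# pick a codeword in Ker(f_n), add a one-bit additive noise, and decode
codeword = np.array([1, 1, 1, 0, 0])
assert not f(codeword).any()               # codeword lies in the kernel of f_n
noise = np.array([0, 0, 1, 0, 0])
print(channel_decoder((codeword + noise) % 2))   # recovers the codeword
```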
By using this observation and the results in Section 3.2, we can derive the achievability bounds. By using the conversion argument in Section 4.2, we can also construct a channel code for a regular channel from a source code with full side-information. Although the following bounds are just a specialization of known bounds for conditional additive channels, we review these bounds here to clarify the correspondence between the bounds in source coding with side-information and channel coding.
From Lemma 13 and (209), we have the following.
Lemma 20
([2]). The following bound holds:
$\bar{P}_c(n,k)\le\inf_{\gamma\ge 0}\Big[P_{X^nY^n}\Big\{\log\frac{1}{P_{X^n|Y^n}(x^n|y^n)}>\gamma\Big\}+\frac{e^{\gamma}}{|\mathcal{A}|^{n-k}}\Big].$
From Lemma 14 and (209), we have the following exponential-type bound.
Lemma 21
([6]). The following bound holds:
$\bar{P}_c(n,k)\le\inf_{-\frac{1}{2}\le\theta\le 0}|\mathcal{A}|^{\frac{\theta(n-k)}{1+\theta}}e^{-\frac{\theta}{1+\theta}H_{1+\theta}^{\uparrow}(X^n|Y^n)}.$
From Lemma 15 and (209), we have the following slightly loose exponential bound.
Lemma 22
([3,70]). The following bound holds (The bound (212) was derived in the original Japanese edition of [3], but it is not written in the English edition [3]. The quantum analogue was derived in [70].):
$\bar{P}_c(n,k)\le\inf_{-1\le\theta\le 0}|\mathcal{A}|^{\theta(n-k)}e^{-\theta H_{1+\theta}^{\downarrow}(X^n|Y^n)}.$
When X has no side-information, i.e., the virtual channel is additive, we have the following special case of Lemma 21.
Lemma 23
([6]). Suppose that X has no side-information. Then, the following bound holds:
$\bar{P}_c(n,k)\le\inf_{-\frac{1}{2}\le\theta\le 0}|\mathcal{A}|^{\frac{\theta(n-k)}{1+\theta}}e^{-\frac{\theta}{1+\theta}H_{1+\theta}(X^n)}.$
4.5. Converse Bound
In this subsection, we show some converse bounds. The following is the information spectrum-type converse shown in [4].
Lemma 24
([4], Lemma 4). For any code $\Psi_n=(e_n,d_n)$ and any output distribution $Q_{B^n}\in\mathcal{P}(\mathcal{B}^n)$, we have:

$P_c^{(n)}[\Psi_n]\ge\sup_{\gamma\ge 0}\Big[\sum_{m=1}^{M_n}\frac{1}{M_n}P_{B^n|A^n}\Big\{\log\frac{P_{B^n|A^n}(b^n|e_n(m))}{Q_{B^n}(b^n)}<\gamma\Big\}-\frac{e^{\gamma}}{M_n}\Big].$
When a channel is a conditional additive channel, we have:

$P_{B^n|A^n}(a^n+x^n,y^n|a^n)=P_{X^nY^n}(x^n,y^n).$

By taking the output distribution $Q_{B^n}$ as:

$Q_{B^n}(a^n+x^n,y^n)=\frac{1}{|\mathcal{A}|^n}Q_{Y^n}(y^n)$

for some $Q_{Y^n}\in\mathcal{P}(\mathcal{Y}^n)$, as a corollary of Lemma 24, we have the following bound.
Lemma 25.
When a channel is a conditional additive channel, for any distribution $Q_{Y^n}\in\mathcal{P}(\mathcal{Y}^n)$, we have:

$P_c^{(n)}(M_n)\ge\sup_{\gamma\ge 0}\Big[P_{X^nY^n}\Big\{\log\frac{Q_{Y^n}(y^n)}{P_{X^nY^n}(x^n,y^n)}>n\log|\mathcal{A}|-\gamma\Big\}-\frac{e^{\gamma}}{M_n}\Big].$
Proof.
By noting (215) and (216), the first term on the right-hand side of (214) can be rewritten as:

$\sum_{m=1}^{M_n}\frac{1}{M_n}P_{B^n|A^n}\Big\{\log\frac{P_{B^n|A^n}(b^n|e_n(m))}{Q_{B^n}(b^n)}<\gamma\Big\}=\sum_{m=1}^{M_n}\frac{1}{M_n}P_{X^nY^n}\Big\{\log\frac{P_{B^n|A^n}(e_n(m)+x^n,y^n|e_n(m))}{Q_{B^n}(e_n(m)+x^n,y^n)}<\gamma\Big\}=P_{X^nY^n}\Big\{\log\frac{Q_{Y^n}(y^n)}{P_{X^nY^n}(x^n,y^n)}>n\log|\mathcal{A}|-\gamma\Big\},$
which implies the statement of the lemma. □
A similar argument to that of Theorem 5 also yields the following converse bound.
Theorem 16.
For any $Q_{Y^n}\in\mathcal{P}(\mathcal{Y}^n)$, we have:

$-\log P_c^{(n)}(M_n)\le\inf_{s>0,\,\tilde\theta\in\mathbb{R},\,\vartheta\ge 0}\Big[(1+s)\tilde\theta\big(H_{1+\tilde\theta}(P_{X^nY^n}|Q_{Y^n})-H_{1+(1+s)\tilde\theta}(P_{X^nY^n}|Q_{Y^n})\big)-(1+s)\log\Big(1-2e^{-\frac{-\vartheta R+(\tilde\theta+\vartheta(1+\tilde\theta))H_{1+\tilde\theta+\vartheta(1+\tilde\theta)}(P_{X^nY^n}|Q_{Y^n})-(1+\vartheta)\tilde\theta H_{1+\tilde\theta}(P_{X^nY^n}|Q_{Y^n})}{1+\vartheta}}\Big)\Big]\Big/s$

$\le\inf_{s>0}\inf_{-1<\tilde\theta<\theta(a(R))}\Big[(1+s)\tilde\theta\big(H_{1+\tilde\theta}(P_{X^nY^n}|Q_{Y^n})-H_{1+(1+s)\tilde\theta}(P_{X^nY^n}|Q_{Y^n})\big)-(1+s)\log\Big(1-2e^{(\theta(a(R))-\tilde\theta)a(R)-\theta(a(R))H_{1+\theta(a(R))}(P_{X^nY^n}|Q_{Y^n})+\tilde\theta H_{1+\tilde\theta}(P_{X^nY^n}|Q_{Y^n})}\Big)\Big]\Big/s,$

where $R=n\log|\mathcal{A}|-\log M_n$, and $\theta(a)$ and $a(R)$ are the inverse functions defined in (29) and (32), respectively.
Proof.
See Appendix L. □
4.6. Finite-Length Bound for the Markov Noise Channel
From this subsection on, we address the conditional additive channel whose conditional additive noise is subject to a Markov chain. Here, the input alphabet $\mathcal{A}^n$ equals the additive group $\mathcal{X}^n=\mathbb{F}_q^n$, and the output alphabet $\mathcal{B}^n$ is $\mathcal{X}^n\times\mathcal{Y}^n$. That is, the transition matrix describing the channel is given by using a transition matrix $W$ on $\mathcal{X}\times\mathcal{Y}$ and an initial distribution $Q$ as:

$P_{B^n|A^n}(x^n+a^n,y^n|a^n)=Q(x_1,y_1)\prod_{i=2}^{n}W(x_i,y_i|x_{i-1},y_{i-1}).$
As in Section 2.2, we consider two assumptions on the transition matrix W of the noise process(X,Y) , i.e., Assumptions 1 and 2. We also use the same notations as in Section 2.2.
Example 6
(Gilbert–Elliott channel with state-information available at the receiver). The Gilbert–Elliott channel [29,30] is characterized by a channel state $Y^n$ on $\mathcal{Y}^n=\{0,1\}^n$ and an additive noise $X^n$ on $\mathcal{X}^n=\{0,1\}^n$. The noise process $(X^n,Y^n)$ is a Markov chain induced by the transition matrix $W$ introduced in Example 3. For the channel input $a^n$, the channel output is given by $(a^n+X^n,Y^n)$ when the state-information is available at the receiver. Thus, this channel can be regarded as a conditional additive channel, and the transition matrix of the noise process satisfies Assumption 2.
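To make Example 6 concrete, the following sketch builds a joint noise transition matrix $W((x,y)|(x',y'))$ for a Gilbert–Elliott-type noise process. The parameters and the factorization into a state chain with state-dependent crossover probabilities are illustrative assumptions; they are not the ones of Example 3, which is not restated here.

```python
import numpy as np

# Two-state channel state Y with column-stochastic transition matrix P_Y,
# and noise bit X that equals 1 with probability eps[y] in state y.
g, b = 0.1, 0.3                 # P(Y_i=1 | Y_{i-1}=0), P(Y_i=0 | Y_{i-1}=1)
P_Y = np.array([[1 - g, b],     # P_Y[y, y'] = P(Y_i = y | Y_{i-1} = y')
                [g, 1 - b]])
eps = np.array([0.01, 0.2])     # crossover probability in each state (placeholder)

# index (x, y) as 2*y + x; W[(x,y), (x',y')] = P_Y[y, y'] * P(X_i = x | Y_i = y)
W = np.zeros((4, 4))
for y in range(2):
    for x in range(2):
        px = eps[y] if x == 1 else 1 - eps[y]
        for yp in range(2):
            for xp in range(2):
                W[2 * y + x, 2 * yp + xp] = P_Y[y, yp] * px

assert np.allclose(W.sum(axis=0), 1.0)   # each column is a conditional distribution
print(W.round(3))
```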
Proofs of the following bounds are almost the same as those in Section 3.3, and thus omitted. The combination of Lemmas 10 and 22 derives the following achievability bound.
Theorem 17
(Direct, Ass. 1). Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 1. Let $R:=\frac{n-k}{n}\log|\mathcal{A}|$. Then, we have:

$-\log\bar{P}_c(n,k)\ge\sup_{-1\le\theta\le 0}\big[-\theta nR+(n-1)\theta H_{1+\theta}^{\downarrow,W}(X|Y)+\underline{\delta}(\theta)\big].$

Theorem 16 with $Q_{Y^n}=P_{Y^n}$ and Lemma 10 yield the following converse bound.
Theorem 18
(Converse, Ass. 1). Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 1. Let $R:=\log|\mathcal{A}|-\frac{1}{n}\log M_n$. If $H^W(X|Y)<R<H_0^{\downarrow,W}(X|Y)$, then we have:

$-\log P_c^{(n)}(M_n)\le\inf_{s>0}\inf_{-1<\tilde\theta<\theta(a(R))}\Big[(n-1)(1+s)\tilde\theta\big(H_{1+\tilde\theta}^{\downarrow,W}(X|Y)-H_{1+(1+s)\tilde\theta}^{\downarrow,W}(X|Y)\big)+\delta_1-(1+s)\log\Big(1-2e^{(n-1)\big[(\theta(a(R))-\tilde\theta)a(R)-\theta(a(R))H_{1+\theta(a(R))}^{\downarrow,W}(X|Y)+\tilde\theta H_{1+\tilde\theta}^{\downarrow,W}(X|Y)\big]+\delta_2}\Big)\Big]\Big/s,$

where $\theta(a)=\theta^{\downarrow,W}(a)$ and $a(R)=a^{\downarrow,W}(R)$ are the inverse functions defined by (67) and (70), respectively, and:

$\delta_1:=(1+s)\bar{\delta}(\tilde\theta)-\underline{\delta}((1+s)\tilde\theta),$

$\delta_2:=(\theta(a(R))-\tilde\theta)R-\frac{(1+\tilde\theta)\underline{\delta}(\theta(a(R)))+(1+\theta(a(R)))\bar{\delta}(\tilde\theta)}{1+\theta(a(R))}.$
Next, we derive tighter bounds under Assumption 2. From Lemmas 11 and 21, we have the following achievability bound.
Theorem 19
(Direct, Ass. 2). Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 2. Let $R:=\frac{n-k}{n}\log|\mathcal{A}|$. Then, we have:

$-\log\bar{P}_c(n,k)\ge\sup_{-\frac{1}{2}\le\theta\le 0}\Big[\frac{-\theta nR+(n-1)\theta H_{1+\theta}^{\uparrow,W}(X|Y)}{1+\theta}+\underline{\xi}(\theta)\Big].$

By using Theorem 16 with $Q_{Y^n}=P_{Y^n}^{(1+\theta(a(R)))}$ and Lemma 12, we obtain the following converse bound.
Theorem 20
(Converse, Ass. 2). Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 2. Let $R:=\log|\mathcal{A}|-\frac{1}{n}\log M_n$. If $H^W(X|Y)<R<H_0^{\uparrow,W}(X|Y)$, we have:

$-\log P_c^{(n)}(M_n)\le\inf_{s>0}\inf_{-1<\tilde\theta<\theta(a(R))}\Big[(n-1)(1+s)\tilde\theta\big(H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)-H_{1+(1+s)\tilde\theta,1+\theta(a(R))}^{W}(X|Y)\big)+\delta_1-(1+s)\log\Big(1-2e^{(n-1)\big[(\theta(a(R))-\tilde\theta)a(R)-\theta(a(R))H_{1+\theta(a(R))}^{\uparrow,W}(X|Y)+\tilde\theta H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)\big]+\delta_2}\Big)\Big]\Big/s,$

where $\theta(a)=\theta^{\uparrow,W}(a)$ and $a(R)=a^{\uparrow,W}(R)$ are the inverse functions defined by (71) and (73), respectively, and:

$\delta_1:=(1+s)\bar{\zeta}(\tilde\theta,\theta(a(R)))-\underline{\zeta}((1+s)\tilde\theta,\theta(a(R))),$

$\delta_2:=(\theta(a(R))-\tilde\theta)R-\frac{(1+\tilde\theta)\underline{\zeta}(\theta(a(R)),\theta(a(R)))+(1+\theta(a(R)))\bar{\zeta}(\tilde\theta,\theta(a(R)))}{1+\theta(a(R))}.$
Finally, when X has no side-information, i.e., the channel is additive, we obtain the following achievability bound from Lemma 23.
Theorem 21
(Direct, no-side-information). Let $R:=\frac{n-k}{n}\log|\mathcal{A}|$. Then, we have:

$-\log\bar{P}_c(n,k)\ge\sup_{-\frac{1}{2}\le\theta\le 0}\frac{-\theta nR+(n-1)\theta H_{1+\theta}^{W}(X)+\underline{\delta}(\theta)}{1+\theta}.$
Remark 9.
Our treatment of the Markov conditional additive channel covers Markov regular channels, because Markov regular channels can be reduced to Markov conditional additive channels as follows. Let $\tilde{X}=\{\tilde{X}_n\}_{n=1}^{\infty}$ be a Markov chain on $\mathcal{B}$ whose distribution is given by:

$P_{\tilde{X}^n}(\tilde{x}^n)=Q(\tilde{x}_1)\prod_{i=2}^{n}\tilde{W}(\tilde{x}_i|\tilde{x}_{i-1})$

for a transition matrix $\tilde{W}$ and an initial distribution $Q$. Let $(X,Y)=\{(X_n,Y_n)\}_{n=1}^{\infty}$ be the noise process of the conditional additive channel derived from the noise process $\tilde{X}$ of the regular channel by the argument of Section 4.2. Since we can write:

$P_{X^nY^n}(x^n,y^n)=Q\big(\iota_{y_1}^{-1}(\vartheta_{y_1}(x_1))\big)\frac{1}{|\mathrm{Stb}(0_{y_1})|}\prod_{i=2}^{n}\tilde{W}\big(\iota_{y_i}^{-1}(\vartheta_{y_i}(x_i))\,\big|\,\iota_{y_{i-1}}^{-1}(\vartheta_{y_{i-1}}(x_{i-1}))\big)\frac{1}{|\mathrm{Stb}(0_{y_i})|},$

the process $(X,Y)$ is also a Markov chain. Thus, the regular channel given by $\tilde{X}$ reduces to the conditional additive channel given by $(X,Y)$.
4.7. Second-Order
To discuss the asymptotic performance, we introduce the quantity:
$C:=\log|\mathcal{A}|-H^W(X|Y).$

By applying the central limit theorem (cf. [67] (Theorem 27.4, Example 27.6)) to Lemma 20 and to Lemma 25 with $Q_{Y^n}=P_{Y^n}$, and by using Theorem 2, we have the following.
Theorem 22.
Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 1. For arbitrary $\varepsilon\in(0,1)$, we have:

$\log M(n,\varepsilon)=k(n,\varepsilon)\log|\mathcal{A}|+o(\sqrt{n})=Cn+\sqrt{nV^W(X|Y)}\,\Phi^{-1}(\varepsilon)+o(\sqrt{n}).$
Proof. This theorem follows in the same manner as the proof of Theorem 11 by replacing Lemma 13 with Lemma 20 (achievability) and Lemma 18 with Lemma 25 (converse). □
From the above theorem, the (first-order) capacity of the conditional additive channel under Assumption 1 is given by:

$\lim_{n\to\infty}\frac{1}{n}\log M(n,\varepsilon)=\lim_{n\to\infty}\frac{1}{n}k(n,\varepsilon)\log|\mathcal{A}|=C$

for every $0<\varepsilon<1$. In the next subsections, we consider the asymptotic behavior of the error probability when the rate is smaller than the capacity, in the moderate deviation regime and the large deviation regime, respectively.
4.8. Moderate Deviation

From Theorems 17 and 18, we have the following.
Theorem 23.
Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 1. For arbitrary $t\in(0,1/2)$ and $\delta>0$, we have:

$\lim_{n\to\infty}-\frac{1}{n^{1-2t}}\log P_c^{(n)}\big(e^{nC-n^{1-t}\delta}\big)=\lim_{n\to\infty}-\frac{1}{n^{1-2t}}\log\bar{P}_c\Big(n,\frac{nC-n^{1-t}\delta}{\log|\mathcal{A}|}\Big)=\frac{\delta^2}{2V^W(X|Y)}.$
Proof. The theorem follows in the same manner as Theorem 12 by replacing Theorem 6 with Theorem 17 (achievability) and Theorem 8 with Theorem 18 (converse). □
4.9. Large Deviation

From Theorems 17 and 18, we have the following.
Theorem 24.
Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 1. For $H^W(X|Y)<R$, we have:

$\liminf_{n\to\infty}-\frac{1}{n}\log\bar{P}_c\Big(n,n\Big(1-\frac{R}{\log|\mathcal{A}|}\Big)\Big)\ge\sup_{-1\le\theta\le 0}\big[-\theta R+\theta H_{1+\theta}^{\downarrow,W}(X|Y)\big].$

On the other hand, for $H^W(X|Y)<R<H_0^{\downarrow,W}(X|Y)$, we have:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_c^{(n)}\big(e^{n(\log|\mathcal{A}|-R)}\big)\le-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\downarrow,W}(X|Y)=\sup_{-1<\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\downarrow,W}(X|Y)}{1+\theta}.$
Proof. The theorem follows in the same manner as Theorem 13 by replacing Theorem 6 with Theorem 17 (achievability) and Theorem 8 with Theorem 18 (converse). □
Under Assumption 2, from Theorems 19 and 20, we have the following tighter bound.
Theorem 25.
Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 2. For $H^W(X|Y)<R$, we have:

$\liminf_{n\to\infty}-\frac{1}{n}\log\bar{P}_c\Big(n,n\Big(1-\frac{R}{\log|\mathcal{A}|}\Big)\Big)\ge\sup_{-\frac{1}{2}\le\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\uparrow,W}(X|Y)}{1+\theta}.$

On the other hand, for $H^W(X|Y)<R<H_0^{\uparrow,W}(X|Y)$, we have:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_c^{(n)}\big(e^{n(\log|\mathcal{A}|-R)}\big)\le-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\uparrow,W}(X|Y)=\sup_{-1<\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\uparrow,W}(X|Y)}{1+\theta}.$
Proof. The theorem follows the same manner as Theorem 14 by replacing Theorem 9 with Theorem 19 and Theorem 10 with Theorem 20. □
When X has no side-information, i.e., the channel is additive, from Theorem 21 and (245), we have the following.
Theorem 26.
For $H^W(X)<R$, we have:

$\liminf_{n\to\infty}-\frac{1}{n}\log\bar{P}_c\Big(n,n\Big(1-\frac{R}{\log|\mathcal{A}|}\Big)\Big)\ge\sup_{-\frac{1}{2}\le\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{W}(X)}{1+\theta}.$

On the other hand, for $H^W(X)<R<H_0^{W}(X)$, we have:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_c^{(n)}\big(e^{n(\log|\mathcal{A}|-R)}\big)\le\sup_{-1<\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{W}(X)}{1+\theta}.$
Proof.
The first claim follows by taking the limit of Theorem 21, and the second claim follows as a special case of (245) without side-information. □
4.10. Summary of the Results
The results shown in this section for the Markov conditional additive noise are summarized in Table 3. The check marks ✓ indicate that the tight asymptotic bounds (large deviation, moderate deviation, and second-order) can be obtained from those bounds. The marks ✓* indicate that the large deviation bound can be derived up to the critical rate. The computational complexity “Tail” indicates that the computational complexities of those bounds depend on the computational complexities of tail probabilities. It should be noted that Theorem 18 is derived from a special case ($Q_Y=P_Y$) of Theorem 16. The asymptotically optimal choice is $Q_Y=P_Y^{(1+\theta)}$. Under Assumption 1, we can derive the bound of the Markov case only for that special choice of $Q_Y$, while under Assumption 2, we can derive the bound of the Markov case for the optimal choice of $Q_Y$. Furthermore, Theorem 18 is not asymptotically tight in the large deviation regime in general, but it is tight if X has no side-information, i.e., the channel is additive. It should also be noted that Theorem 20 does not imply Theorem 18 even for the additive channel case since Assumption 2 restricts the structure of transition matrices even when X has no side-information.
5. Discussion and Conclusions
In this paper, we developed a unified approach to source coding with side-information and channel coding for a conditional additive channel for finite-length and asymptotic analyses of Markov chains. In our approach, the conditional Rényi entropies defined for transition matrices played important roles. Although we only illustrated the source coding with side-information and the channel coding for a conditional additive channel as applications of our approach, it could be applied to some other problems in information theory such as random number generation problems, as shown in another paper [60].
Our results for source coding with side-information and channel coding over the conditional additive channel have been extended to the case when the side-information is continuous (e.g., real-valued) and the joint distribution of X and Y is memoryless. Since this case covers the BPSK-AWGN channel, it can be expected to cover the MPSK-AWGN channel as well. Since such channels are often employed in practical channel coding, it is an interesting future topic to investigate finite-length bounds for these channels. However, we could not define the conditional Rényi entropy for transition matrices with continuous Y; hence, our Markov results could not be extended to such a continuous case. It is another interesting future topic to extend the obtained results to the case with continuous Y.
Problem | First-Order | Large Deviation | Moderate Deviation | Second-Order
---|---|---|---|---
SC with SI | Solved (Ass. 1) | Solved* (Ass. 2), O(1) | Solved (Ass. 1), O(1) | Solved (Ass. 1), Tail
CC for Conditional Additive Channels | Solved (Ass. 1) | Solved* (Ass. 2), O(1) | Solved (Ass. 1), O(1) | Solved (Ass. 1), Tail
Ach./Conv. | Markov | Single-Shot | Ps/P̄s | Complexity | Large Deviation | Moderate Deviation | Second Order
---|---|---|---|---|---|---|---
Achievability | Theorem 6 (Ass. 1) | Lemma 15 | P̄s | O(1) | | ✓ |
 | Theorem 9 (Ass. 2) | Lemma 14 | P̄s | O(1) | ✓* | ✓ |
 | Theorem 7 (No-side) | Lemma 16 | P̄s | O(1) | ✓* | ✓ |
 | | Lemma 13 | P̄s | Tail | | ✓ | ✓
Converse | Theorem 8 (Ass. 1) | (Theorem 5) | Ps | O(1) | | ✓ |
 | Theorem 10 (Ass. 2) | Corollary 1 | Ps | O(1) | ✓* | ✓ |
 | Theorem 8 (No-side) | (Theorem 5) | Ps | O(1) | ✓* | ✓ |
 | | Lemma 18 | Ps | Tail | | ✓ | ✓
Ach./Conv. | Markov | Single-Shot | Pc/P̄c | Complexity | Large Deviation | Moderate Deviation | Second Order
---|---|---|---|---|---|---|---
Achievability | Theorem 17 (Ass. 1) | Lemma 22 | P̄c | O(1) | | ✓ |
 | Theorem 19 (Ass. 2) | Lemma 21 | P̄c | O(1) | ✓* | ✓ |
 | Theorem 21 (Additive) | Lemma 23 | P̄c | O(1) | ✓* | ✓ |
 | | Lemma 20 | P̄c | Tail | | ✓ | ✓
Converse | Theorem 18 (Ass. 1) | (Theorem 16) | Pc | O(1) | | ✓ |
 | Theorem 20 (Ass. 2) | Theorem 16 | Pc | O(1) | ✓* | ✓ |
 | Theorem 18 (Additive) | (Theorem 16) | Pc | O(1) | ✓* | ✓ |
 | | Lemma 25 | Pc | Tail | | ✓ | ✓
Author Contributions
Conceptualization, M.H.; methodology, S.W.; formal analysis, S.W. and M.H.; writing, original draft preparation, S.W.; writing, review and editing, M.H. All authors read and agreed to the published version of the manuscript.
Funding
M.H. is partially supported by the Japan Society of the Promotion of Science (JSPS) Grant-in-Aid for Scientific Research (A) No. 23246071, (A) No. 17H01280, (B) No. 16KT0017, the Okawa Research Grant, and Kayamori Foundation of Informational Science Advancement. He is also partially supported by the National Institute of Information and Communication Technology (NICT), Japan. S.W. is supported in part by the Japan Society of the Promotion of Science (JSPS) Grant-in-Aid for Young Scientists (A) No. 16H06091.
Acknowledgments
The authors would like to thank Vincent Y. F. Tan for pointing out Remark 6. The authors are also grateful to Ryo Yaguchi for his helpful comments.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
RCU | random coding union |
BSC | binary symmetric channel |
DMC | discrete memoryless channel |
DT | dependence testing |
LDPC | low-density parity check |
BPSK | binary phase shift keying |
AWGN | additive white Gaussian noise |
CGF | cumulant generating function |
MPSK | M-ary phase shift keying |
CC | channel coding |
SC | source coding |
SI | side-information |
Appendix A. Preparation for the Proofs
When we prove some properties of Rényi entropies or derive converse bounds, some properties of cumulant generating functions (CGFs) become useful. For this purpose, we introduce some terminologies in statistics from [22,23]. Then, in Appendix B, we show the relation between the terminologies in statistics and those in information theory. For the proofs, see [22,23].
Appendix A.1. Single-Shot Setting
Let $Z$ be a random variable with distribution $P$. Let:

$\phi(\rho):=\log\mathbb{E}\,e^{\rho Z}=\log\sum_z P(z)e^{\rho z}$

be the cumulant generating function (CGF). Let us introduce the exponential family:

$P_\rho(z):=P(z)e^{\rho z-\phi(\rho)}.$

By differentiating the CGF, we find that:

$\phi'(\rho)=\mathbb{E}_\rho[Z]:=\sum_z P_\rho(z)z.$

We also find that:

$\phi''(\rho)=\sum_z P_\rho(z)\big(z-\mathbb{E}_\rho[Z]\big)^2.$

We assume that $Z$ is not constant. Then, (A6) implies that $\phi(\rho)$ is a strictly convex function and $\phi'(\rho)$ is monotonically increasing. Thus, we can define the inverse function $\rho(a)$ of $\phi'(\rho)$ by:

$\phi'(\rho(a))=a.$

Let:

$D_{1+s}(P\|Q):=\frac{1}{s}\log\sum_z P(z)^{1+s}Q(z)^{-s}$

be the Rényi divergence. Then, we have the following relation:

$sD_{1+s}(P_{\tilde\rho}\|P_\rho)=\phi((1+s)\tilde\rho-s\rho)-(1+s)\phi(\tilde\rho)+s\phi(\rho).$
Appendix A.2. Transition Matrix
Let $\{W(z|z')\}_{(z,z')\in\mathcal{Z}^2}$ be an ergodic and irreducible transition matrix, and let $\tilde{P}$ be its stationary distribution. For a function $g:\mathcal{Z}\times\mathcal{Z}\to\mathbb{R}$, let:

$\mathbb{E}[g]:=\sum_{z,z'}\tilde{P}(z')W(z|z')g(z,z').$

We also introduce the following tilted matrix:

$W_\rho(z|z'):=W(z|z')e^{\rho g(z,z')}.$

Let $\lambda_\rho$ be the Perron–Frobenius eigenvalue of $W_\rho$. Then, the CGF for $W$ with generator $g$ is defined by:

$\phi(\rho):=\log\lambda_\rho.$

Lemma A1.

The function $\phi(\rho)$ is a convex function of $\rho$, and it is strictly convex iff $\phi''(0)>0$.

From Lemma A1, $\phi'(\rho)$ is a monotonically increasing function. Thus, we can define the inverse function $\rho(a)$ of $\phi'(\rho)$ by:

$\phi'(\rho(a))=a.$
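The definition above translates into a short numerical routine. The following sketch (illustrative, with a toy two-state chain that is not taken from the paper) computes $\phi(\rho)=\log\lambda_\rho$ and checks numerically that $\phi'(0)=\mathbb{E}[g]$; here $g(z,z')=-\log W(z|z')$, so $\phi'(0)$ equals the entropy rate of the chain.

```python
import numpy as np

def markov_cgf(W, g, rho):
    """CGF phi(rho) = log of the Perron-Frobenius eigenvalue of the tilted matrix
    W_rho(z|z') = W(z|z') * exp(rho * g(z, z')), for a column-stochastic W."""
    W_rho = W * np.exp(rho * g)
    eigenvalues = np.linalg.eigvals(W_rho)
    return np.log(np.max(eigenvalues.real))   # PF eigenvalue is real and maximal

# toy check: for g(z, z') = -log W(z|z'), phi'(0) recovers the entropy rate
W = np.array([[0.9, 0.2],
              [0.1, 0.8]])                    # columns sum to one: W[z, z'] = W(z|z')
g = -np.log(W)
h = 1e-5
entropy_rate = (markov_cgf(W, g, h) - markov_cgf(W, g, -h)) / (2 * h)
print(entropy_rate)   # approx. 0.383 nats for this toy chain
```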
Appendix A.3. Markov Chain
Let $\boldsymbol{Z}=\{Z_n\}_{n=1}^{\infty}$ be the Markov chain induced by $W(z|z')$ and an initial distribution $P_{Z_1}$. For functions $g:\mathcal{Z}\times\mathcal{Z}\to\mathbb{R}$ and $\tilde{g}:\mathcal{Z}\to\mathbb{R}$, let $S_n:=\sum_{i=2}^{n}g(Z_i,Z_{i-1})+\tilde{g}(Z_1)$. Then, the CGF for $S_n$ is given by:

$\phi_n(\rho):=\log\mathbb{E}\,e^{\rho S_n}.$

We will use the following finite evaluation of $\phi_n(\rho)$.

Lemma A2.

Let $v_\rho$ be the eigenvector of $W_\rho^{T}$ with respect to the Perron–Frobenius eigenvalue $\lambda_\rho$ such that $\min_z v_\rho(z)=1$. Let $w_\rho(z):=P_{Z_1}(z)e^{\rho\tilde{g}(z)}$. Then, we have:

$(n-1)\phi(\rho)+\underline{\delta}_\phi(\rho)\le\phi_n(\rho)\le(n-1)\phi(\rho)+\bar{\delta}_\phi(\rho),$

where:

$\bar{\delta}_\phi(\rho):=\log\langle v_\rho|w_\rho\rangle,\qquad \underline{\delta}_\phi(\rho):=\log\langle v_\rho|w_\rho\rangle-\log\max_z v_\rho(z).$
From this lemma, we have the following.
Corollary A1.
For any initial distribution and $\rho\in\mathbb{R}$, we have:

$\lim_{n\to\infty}\frac{1}{n}\phi_n(\rho)=\phi(\rho).$

The relation:

$\lim_{n\to\infty}\frac{1}{n}\mathbb{E}[S_n]=\phi'(0)=\mathbb{E}[g]$

is well known. Furthermore, we also have the following.

Lemma A3.

For any initial distribution, we have:

$\lim_{n\to\infty}\frac{1}{n}\mathrm{Var}[S_n]=\phi''(0).$
Appendix B. Relation Between CGF and Conditional Rényi Entropies
Appendix B.1. Single-Shot Setting
For a correlated random variable $(X,Y)$, let us consider $Z=\log\frac{Q_Y(Y)}{P_{XY}(X,Y)}$. Then, the relation between the CGF and the conditional Rényi entropy relative to $Q_Y$ is given by:

$\theta H_{1+\theta}(P_{XY}|Q_Y)=-\phi(-\theta;P_{XY}|Q_Y).$

From this, we also find the relationship between the inverse functions (cf. (29) and (A7)):

$\theta(a)=-\rho(a).$

Thus, the inverse function defined in (32) also satisfies:

$(1-\rho(a(R)))a(R)+\phi(\rho(a(R));P_{XY}|Q_Y)=R.$

Similarly, by setting $Z=\log\frac{1}{P_{X|Y}(X|Y)}$, we have:

$\theta H_{1+\theta}^{\downarrow}(X|Y)=-\phi(-\theta;P_{XY}|P_Y).$

Then, the variance (cf. (11)) satisfies:

$V(X|Y)=\phi''(0;P_{XY}|P_Y).$

Let $\phi(\rho,\rho')$ be the CGF of $Z=\log\frac{P_Y^{(1-\rho')}(Y)}{P_{XY}(X,Y)}$ (cf. (15) for the definition of $P_Y^{(1-\rho')}$). Then, we have:

$\theta H_{1+\theta,1+\theta'}(X|Y)=-\phi(-\theta,-\theta').$

It should be noted that $\phi(\rho,\rho')$ is a CGF for fixed $\rho'$, but $\phi(\rho,\rho)$ cannot be treated as a CGF.
Appendix B.2. Transition Matrix
For a transition matrix $W(x,y|x',y')$, we consider the function given by:

$g((x,y),(x',y')):=\log\frac{W(y|y')}{W(x,y|x',y')}.$

Then, the relation between the CGF and the lower conditional Rényi entropy is given by:

$\theta H_{1+\theta}^{\downarrow,W}(X|Y)=-\phi(-\theta).$

Then, the variance defined in (51) satisfies:

$V^W(X|Y)=\phi''(0).$
Appendix C. Proof of Lemma 2
We use the following lemma.
Lemma A4.
Forθ∈(-1,0)∪(0,1), we have:
H11-θ↓(X|Y)≤H11-θ↑(X|Y)≤H1+θ↓(X|Y).
Proof.
The left hand side inequality of (A31) is obvious from the definition of two Rényi entropies (the latter is defined by taking the maximum). The right-hand side inequality was proven in [71] (Lemma 6). □
Now, we go back to the proof of Lemma 2. From (10) and (11), by the Taylor approximation, we have:
H1+θ↓(X|Y)=H(X|Y)-12V(X|Y)θ+o(θ).
Furthermore, since11-θ=1+θ+o(θ), we also have:
H11-θ↓(X|Y)=H(X|Y)-12V(X|Y)θ+o(θ).
Thus, from Lemma A4, we can derive (16) and (17).
Appendix D. Proof of Lemma 3
Statements 1 and 3 follow from the relationships in (A22) and (A25) and the strict convexity of the CGFs.
To prove Statement 5, we first prove the strict convexity of the Gallager function:
E0(τ;PXY):=log∑yPY(y)∑xPX|Y (x|y)11+τ1+τ
forτ>-1. We use the Hölder inequality:
∑iaiα biβ≤∑iaiα ∑ibiβ
forα,β>0such thatα+β=1, where the equality holds iffai=cbifor some constant c. Forλ∈(0,1), let1+τ3=λ(1+τ1)+(1-λ)(1+τ2), which implies:
11+τ3=11+τ1λ(1+τ1)1+τ3+11+τ2(1-λ)(1+τ2)1+τ3
and:
λ(1+τ1)1+τ3+(1-λ)(1+τ2)1+τ3=1.
Then, by applying the Hölder inequality twice, we have:
∑yPY(y)∑xPX|Y (x|y)11+τ31+τ3
=∑yPY(y)∑xPX|Y (x|y)11+τ1λ(1+τ1)1+τ3 PX|Y (x|y)11+τ2(1-λ)(1+τ2)1+τ31+τ3
≤∑yPY(y)∑xPX|Y (x|y)11+τ1λ(1+τ1)1+τ3 ∑xPX|Y (x|y)11+τ2(1-λ)(1+τ2)1+τ31+τ3
=∑yPY(y)∑xPX|Y (x|y)11+τ1λ(1+τ1) ∑xPX|Y (x|y)11+τ2(1-λ)(1+τ2)
=∑yPY (y)λ ∑xPX|Y (x|y)11+τ1λ(1+τ1) PY (y)1-λ ∑xPX|Y (x|y)11+τ2(1-λ)(1+τ2)
≤∑yPY(y)∑xPX|Y (x|y)11+τ1(1+τ1)λ ∑yPY(y)∑xPX|Y (x|y)11+τ2(1+τ2)1-λ.
The equality in the second inequality holds iff:
∑xPX|Y (x|y)11+τ11+τ1=c∑xPX|Y (x|y)11+τ21+τ2∀y∈Y
for some constant c. Furthermore, the equality in the first inequality holds iffPX|Y(x|y)=1|supp(PX|Y(·|y))| . Substituting this into (A44), we find that|supp(PX|Y(·|y))|is irrespective of y. Thus, both the equalities hold simultaneously iffV(X|Y)=0. Now, since:
θH1+θ↑(X|Y)=-(1+θ)E0-θ1+θ;PXY,
we have:
d2[θH1+θ↑(X|Y)]dθ2=-1(1+θ)4E0''-θ1+θ;PXY
≤0
forθ∈(-1,∞), where the equality holds iffV(X|Y)=0.
Statement 7 is obvious from the definitions of the two measures. The first part of Statement 8 follows from (A27) and the convexity of the CGF, but we need another argument to check the conditions for strict concavity. Since the second term of:
θH1+θ,1+θ'(X|Y)=-log∑yPY(y)∑xPX|Y (x|y)1+θ∑xPX|Y (x|y)1+θ'θ1+θ'+θθ'1+θ'H1+θ'↑(X|Y)
is linear with respect toθ, it suffices to show the strict concavity of the first term. By using the Hölder inequality twice, forθ3=λθ1+(1-λ)θ2, we have:
∑yPY(y)∑xPX|Y (x|y)1+θ3∑xPX|Y (x|y)1+θ'θ31+θ'
≤∑yPY(y)∑xPX|Y (x|y)1+θ1λ ∑xPX|Y (x|y)1+θ21-λ ∑xPX|Y (x|y)1+θ'λθ1+(1-λ)θ21+θ'
≤∑yPY(y)∑xPX|Y (x|y)1+θ1∑xPX|Y (x|y)1+θ'θ11+θ'λ
∑yPY(y)∑xPX|Y (x|y)1+θ2∑xPX|Y (x|y)1+θ'θ21+θ'1-λ,
where both the equalities hold simultaneously iffV(X|Y)=0, which can be proven in a similar manner as the equality conditions in (A40) and (A43). Thus, we have the latter part of Statement 8.
Statements 10-12 are also obvious from the definitions. Statements 2, 4, 6, and 9, follow from Statements 1, 3, 5, and 8, (cf. [71], Lemma 1).
Appendix E. Proof of Lemma 4
Since (24) and (28) are obvious from the definitions, we only prove (26). We note that:
∑yPY(y)∑xPX|Y (x|y)1+θ11+θ1+θ
≤∑yPY(y)|supp(PX|Y(·|y))|11+θ1+θ
≤maxy∈supp(PY)|supp(PX|Y(·|y))|
and:
∑yPY(y)∑xPX|Y (x|y)1+θ11+θ1+θ
≥PY (y*)1+θ∑xPX|Y (x|y*)1+θ
→θ→-1|supp(PX|Y(·|y*))|,
where:
y*:=argmaxy∈supp(PY)|supp(PX|Y(·|y))|.
Appendix F. Proof of Lemma 6
From Lemma A4, Theorems 1 and 3, we have:
H11-θ↓,W(X|Y)≤H11-θ↑,W(X|Y)≤H1+θ↓,W(X|Y)
forθ∈(-1,0)∪(0,1). Thus, we can prove Lemma 6 in the same manner as Lemma 2.
Appendix G. Proof of (63)
First, in the same manner as Theorem 1, we can show:
limn→∞1nH1+θ(PXn Yn|QYn )=H1+θW|V(X|Y),
whereQYn is a Markov chain induced by V for some initial distribution. Then, sinceH1+θ(PXn Yn|QYn )≤H1+θ↑(Xn|Yn)for each n, by using Theorem 3, we have:
H1+θW|V(X|Y)≤H1+θ↑,W(X|Y).
Thus, the rest of the proof is to show thatH1+θ↑,W(X|Y)is attainable by some V.
LetQ^θbe the normalized left eigenvector ofKθ, and let:
Vθ(y|y'):=Q^θ(y)κθ Q^θ(y')Kθ(y|y').
Then,Vθattains the maximum. To prove this, we will show thatκθ1+θis the Perron-Frobenius eigenvalue of:
W(x,y|x',y')1+θ Vθ (y|y')-θ.
We first confirm that(Q^θ (y)1+θ:(x,y)∈X×Y) is an eigenvector of (A64) as follows:
∑x,yQ^θ (y)1+θW(x,y|x',y')1+θ Vθ (y|y')-θ
=∑yQ^θ (y)1+θ Wθ(y|y')Q^θ(y)κθ Q^θ(y')Wθ (y|y')11+θ-θ
=κθθ Q^θ (y')θ∑yQ^θ(y)Wθ (y|y')11+θ
=κθ1+θ Q^θ (y')1+θ.
Since(Q^θ (y)1+θ:(x,y)∈X×Y)is a positive vector and the Perron-Frobenius eigenvector is the unique positive eigenvector, we find thatκθ1+θis the Perron-Frobenius eigenvalue. Thus, we have:
H1+θW|Vθ(X|Y)=-1+θθlogκθ
=H1+θ↑,W(X|Y).
Appendix H. Proof of Lemma 7
Statement 1 follows from (A29) and the strict convexity of the CGF. Statements 5 and 8-10 follow from the corresponding statements in Lemma 3, Theorems 1, 3 and 4.
Now, we prove Statement 3. (The concavity of $\theta H_{1+\theta}^{\uparrow,W}(X|Y)$ follows from a limiting argument, i.e., from the concavity of $\theta H_{1+\theta}^{\uparrow}(X^n|Y^n)$ (cf. Lemma 3) and Theorem 3; however, the strict concavity does not follow from the limiting argument.) For this purpose, we introduce the transition matrix counterpart of the Gallager function as follows. Let:
K¯τ(y|y'):=W(y|y')∑xW(x|x',y',y)11+τ1+τ
forτ>-1, which is well defined under Assumption 2. Letκ¯τbe the Perron-Frobenius eigenvalue ofK¯τ, and letQ˜τandQ^τbe its normalized right and left eigenvectors. Then, let:
Lτ(y|y'):=Q^τ(y)κ¯τ Q^τ(y')K¯τ(y|y')
be a parametrized transition matrix. The stationary distribution ofLτis given by:
Qτ(y'):=Q^τ(y')Q˜τ(y')∑y'' Q^τ(y'')Q˜τ(y'').
We prove the strict convexity ofE0W(τ):=logκ¯τforτ>-1 . Then, by the same reason as (A46), we can show Statement 3. LetQτ(y,y'):=Lτ(y|y')Qτ(y') . By the same calculation as [22] (Proof of Lemmas 13 and 14), we have:
∑y,y'Qτ(y,y')ddτlogLτ(y|y')2=-∑y,y'Qτ(y,y')d2dτ2logLτ(y|y').
Furthermore, from the definition ofLτ, we have:
-∑y,y'Qτ(y,y')d2dτ2logLτ(y|y')
=-∑y,y'Qτ(y,y')d2dτ2log1κτ+d2dτ2logQ^τ(y)Q^τ(y')+d2dτ2logKτ(y|y')
=d2dτ2logκτ-∑y,y'Qτ(y,y')d2dτ2logKτ(y|y').
Now, we show the convexity oflogK¯τ(y|y')for each(y,y') . By using the Hölder inequality (cf. Appendix D), forτ3=λτ1+(1-λ)τ2, we have:
∑xW(x|x',y',y)11+τ31+τ3≤∑xW(x|x',y',y)11+τ1λ(1+τ1) ∑xW(x|x',y',y)11+τ2(1-λ)(1+τ2).
Thus,E0W(τ) is convex. To check strict convexity, we note that the equality in (A78) holds iffW(x|x',y',y)=1|supp(W(·|x',y',y))|. Since:
∑xW(x|x',y',y)1+θ=1|supp(W(·|x',y',y))|θ
does not depend onx'from Assumption 2, we have|supp(W(·|x',y',y))|=Cyy'for some integerCyy'. By substituting this intoK¯τ, we have:
K¯τ(y|y')=W(y|y')Cyy'τ.
On the other hand, we note that the CGFϕ(ρ)is defined as the logarithm of the Perron-Frobenius eigenvalue of:
W(x,y|w',y')1-ρW(y|y')ρ=W(y|y')1Cyy'1-ρ1[x∈supp(W(·|x',y',y))].
Since:
∑x,yQ^τ(y)W(y|y')1Cyy'1-τ1[x∈supp(W(·|x',y',y))]
=∑yQ^τ(y)W(y|y')Cyy'τ
=κ¯τ Q^τ(y'),
κ¯τ is the Perron-Frobenius eigenvalue of (A81), and thus, we haveE0W(τ)=ϕ(τ) when the equality in (A78) holds for every(y,y')such thatW(y|y')>0. Sinceϕ(τ)is strict convex ifVW(X|Y)>0,E0W(τ)is strict convex ifVW(X|Y)>0. Thus,θH1+θ↑,W(X|Y)is strict concave ifVW(X|Y)>0 . On the other hand, from (57),θH1+θ↑,W(X|Y)is strict concave only ifVW(X|Y)>0.
Statement 6 can be proven by modifying the proof of Statement 8 of Lemma 3 to a transition matrix in a similar manner as Statement 3 of the present lemma.
Finally, Statements 2, 4 and 7 follow from Statements 1, 3 and 6 (cf. [71], Lemma 1).
Appendix I. Proof of Lemma 9
We only prove (75) since we can prove (76) exactly in the same manner by replacingH1+θ↓,W(X|Y),θ↓(a), anda↓(R)byH1+θ↑,W(X|Y),θ↑(a), anda↑(R). Let:
f(θ):=-θR+θH1+θ↓,W(X|Y)1+θ.
Then, we have:
f'(θ)=-R+(1+θ)d[θH1+θ↓,W(X|Y)]dθ-θH1+θ↓,W(X|Y)(1+θ)2
=-R+Rd[θH1+θ↓,W(X|Y)]dθ(1+θ)2.
SinceR(a)is monotonically increasing andd[θH1+θ↓,W(X|Y)]dθis monotonically decreasing, we havef'(θ)≥0forθ≤θ(a(R))andf'(θ)≤0forθ≥θ(a(R)). Thus,f(θ)takes its maximum atθ(a(R)). Furthermore, since-1≤θ(a(R))≤0forHW(X|Y)≤R≤H0↓,W(X|Y), we have:
sup-1≤θ≤0-θR+θH1+θ↓,W(X|Y)1+θ
=-θ(a(R))R+θ(a(R))H1+θ(a(R))↓,W(X|Y)1+θ(a(R))
=-θ(a(R))[(1+θ(a(R)))a(R)-θ(a(R))H1+θ(a(R))↓,W(X|Y)]+θ(a(R))H1+θ(a(R))↓,W(X|Y)1+θ(a(R))
=-θ(a(R))a(R)+θ(a(R))H1+θ(a(R))↓,W(X|Y),
where we substitutedR=R(a(R))in the second equality.
Appendix J. Proof of Lemma 11
Let u be the vector such thatu(y)=1for everyy∈Y. From the definition ofH1+θ↑(Xn|Yn), we have the following sequence of calculations:
e-θ1+θθH1+θ↑(Xn|Yn)
=∑y1,...,yn∑xn,...,x1P(x1,y1)1+θ∏i=2nW(xi,yi|xi-1,yi-1)1+θ11+θ
=(a)∑yn,...,y1∑x1P(x1,y1)1+θ11+θ∏i=2nWθ (yi|yi-1)11+θ
=〈u|Kθn-1 wθ〉
≤〈vτ|Kθn-1 wθ〉
=〈(KθT)n-1 vθ|wθ〉
=κθn-1〈vθ|wθ〉
=e-(n-1)θ1+θH1+θ↑,W(X|Y)〈vθ|wθ〉,
which implies the left-hand side inequality, where we used Assumption 2 in(a). On the other hand, we have the following sequence of calculations:
e-θ1+θθH1+θ↑(Xn|Yn)
=〈u|Kθn-1 wθ〉
≥1maxy vθ(y)〈vθ|Kθn-1 wθ〉
=1maxy vθ(y)〈(KθT)n-1 vθ|wθ〉
=κθn-1〈vθ|wθ〉maxy vθ(y)
=e-(n-1)θ1+θH1+θ↑,W(X|Y)〈vθ|wθ〉maxy vθ(y),
which implies the right-hand side inequality.
Appendix K. Proof of Theorem 5
For arbitraryρ˜∈R, we setα:=PXY{X≠d(e(X),Y)}andβ:=PXY,ρ˜{X≠d(e(X),Y)}, where:
PXY,ρ(x,y):=PXY (x,y)1-ρ QY (y)ρ e-ϕ(ρ;PXY|QY).
Then, by the monotonicity of the Rényi divergence, we have:
sD1+s(PXY,ρ˜∥PXY)≥logβ1+s α-s+(1-β)1+s (1-α)-s
≥logβ1+s α-s.
Thus, we have:
-logα≤ϕ((1+s)ρ˜;PXY|QY)-(1+s)ϕ(ρ˜;PXY|QY)-(1+s)logβs.
Now, by using Lemma 18, we have:
1-β≤PXY,ρ˜logQY(y)PXY,ρ˜(x,y)≤γ+Meγ.
We also have, for anyσ≤0,
PXY,ρ˜logQY(y)PXY,ρ˜(x,y)≤γ
≤∑x,yPXY,ρ˜(x,y)eσlogQY(y)PXY,ρ˜(x,y)-γ
=e-[σγ-ϕ(σ;PXY,ρ˜|QY)].
Thus, by settingγso that:
σγ-ϕ(σ;PXY,ρ˜|QY)=γ-R,
we have
1-β≤2e-σR-ϕ(σ;PXY,ρ˜|QY)1-σ.
Furthermore, we have the relation:
ϕ(σ;PXY,ρ˜|QY)=log∑x,yPXY,ρ˜ (x,y)1-σ QY (y)σ
=log∑x,yPXY (x,y)1-ρ˜ QY (y)ρ˜ e-ϕ(ρ˜;PXY|QY)1-σ QY (y)σ
=-(1-σ)ϕ(ρ˜;PXY|QY)+log∑x,yPXY (x,y)1-ρ˜-σ(1-ρ˜) QY (y)ρ˜+σ(1-ρ˜)
=ϕ(ρ˜+σ(1-ρ˜);PXY|QY)-(1-σ)ϕ(ρ˜;PXY|QY).
Thus, by substitutingρ˜=-θ˜andσ=-ϑ and by using (A22), we can derive (124).
Now, we restrict the range ofρ˜so thatρ(a(R))<ρ˜<1and take:
σ=ρ(a(R))-ρ˜1-ρ˜.
Then, by substituting this into (A119) and (A119) into (A115), we have (ϕ(ρ;PXY|QY)is omitted asϕ(ρ)):
σR-ϕ(ρ˜+σ(1-ρ˜))+(1-σ)ϕ(ρ˜)1-σ
=(ρ(a(R))-ρ˜)R-(1-ρ˜)ϕ(ρ(a(R)))+(1-ρ(a(R)))ϕ(ρ˜)1-ρ(a(R))
=(ρ(a(R))-ρ˜)(1-ρ(a(R)))a(R)+ϕ(ρ(a(R)))-(1-ρ˜)ϕ(ρ(a(R)))+(1-ρ(a(R)))ϕ(ρ˜)1-ρ(a(R))
=(ρ(a(R))-ρ˜)a(R)-ϕ(ρ(a(R)))+ϕ(ρ˜),
where we used (A24) in the second equality. Thus, by substitutingρ˜=-θ˜ and by using (A22) again, we have (125).
Appendix L. Proof of Theorem 16
Let:
PXn Yn,ρ(xn,yn):=PXn Yn (xn,yn)1-ρ QYn (yn)ρ e-ϕ(ρ;PXn Yn|QYn ),
and letPBn|An,ρbe a conditional additive channel defined by:
PBn|An,ρ(an+xn|an)=PXn Yn,ρ(xn,yn).
We also define the joint distribution of the message, the input, the output, and the decoded message for each channel:
PMn An Bn M^n(m,an,bn,m^):=1Mn1[en(m)=an]PBn|An(bn|an)1[dn(bn)=m^],
PMn An Bn M^n,ρ(m,an,bn,m^):=1Mn1[en(m)=an]PBn|An,ρ(bn|an)1[dn(bn)=m^].
For arbitraryρ˜∈R, letα:=PMn M^n{m≠m^}andβ:=PMn M^n,ρ˜{m≠m^}. Then, by the monotonicity of the Rényi divergence, we have:
sD1+s(PAn Bn,ρ˜∥PAn Bn)≥sD1+s(PMn M^n,ρ˜∥PMn M^n)
≥logβ1+s α-s+(1-β)1+s (1-α)-s
≥logβ1+s α-s.
Thus, we have:
-logα≤sD1+s(PAn Bn,ρ˜∥PAn Bn)-(1+s)logβs.
Here, we have:
D1+s(PAn Bn,ρ˜∥PAn Bn)=D1+s(PXn Yn,ρ˜∥PXn Yn).
On the other hand, from Lemma 25, we have:
1-β≤PXn Yn,ρ˜logQYn (yn)PXn Yn,ρ˜(xn,yn)≤nlog|A|-γ+eR enlog|A|-γ.
Thus, by the same argument as in (A111)-(A119) and by noting (A22), we can derive (222).
Now, we restrict the range ofρ˜so thatρ(a(R))<ρ˜<1and take:
σ=ρ(a(R))-ρ˜1-ρ˜.
Then, by noting (A22), we have (223).
Abstract
We derive finite-length bounds for two problems with Markov chains: source coding with side-information where the source and side-information are a joint Markov chain and channel coding for channels with Markovian conditional additive noise. For this purpose, we point out two important aspects of finite-length analysis that must be argued when finite-length bounds are proposed. The first is the asymptotic tightness, and the other is the efficient computability of the bound. Then, we derive finite-length upper and lower bounds for the coding length in both settings such that their computational complexity is low. We argue the first of the above-mentioned aspects by deriving the large deviation bounds, the moderate deviation bounds, and second-order bounds for these two topics and show that these finite-length bounds achieve the asymptotic optimality in these senses. Several kinds of information measures for transition matrices are introduced for the purpose of this discussion.