1. Introduction
In recent years, finite-length analyses for coding problems have been attracting considerable attention [1]. This paper focuses on finite-length analyses for two representative coding problems: One is source coding with side-information for Markov sources, i.e., the Markov–Slepian–Wolf problem on the systemXnwith full side-informationYnat the decoder, where only the decoder observes the side-information and the source and the side-information are a joint Markov chain. The other is channel coding for channels with Markovian conditional additive noise. Although the main purpose of this paper is finite-length analyses, we also present a unified approach we developed to investigate these topics including asymptotic analyses. Since this discussion is spread across a number of subtopics, we explain them separately in the Introduction.
1.1. Two Aspects of Finite-Length Analysis
We explain the motivations of this research by starting with two aspects of finite-length analysis that must be argued when finite-length bounds are proposed. For concreteness, we consider channel coding here even though the problems treated in this paper are not restricted to channel coding. To date, many types of finite-length achievability bounds have been proposed. For example, Verdú and Han derived a finite-length bound by using the information-spectrum approach in order to derive the general formula [2] (see also [3]), which we term as the information-spectrum bound. One of the authors and Nagaoka derived a bound (for the classical-quantum channel) by relating the error probability to binary hypothesis testing [4] (Remark 15) (see also [5]), which we refer to as the hypothesis-testing bound. Polyanskiy et. al. derived the random coding union (RCU) bound and the dependence testing (DT) bound [1] (a bound slightly looser (coefficients are worse) than the DT bound can be derived from the hypothesis-testing bound of [4]). Moreover, Gallager’s bound [6] is known as an efficient bound to derive the exponentially decreasing rate.
Here, we focus on two important aspects of finite-length analysis:
(A1) Computational complexity for the bound and
(A2) Asymptotic tightness for the bound.
Both aspects are required for the bound in finite-length analysis as follows. As the first aspect, we consider the computational complexity for the bound. For the BSC (binary symmetric channel), the computational complexity of the RCU bound isO(n2), and that of the DT bound isO(n) [7]. However, the computational complexities of these bounds are much larger for general DMCs (discrete memoryless channels) or channels with memory. It is known that the hypothesis testing bound can be described as a linear programming problem (e.g., see [8,9] (in the the case of a quantum channel, the bound is described as a semi-definite programming problem)) and can be efficiently computed under certain symmetry. However, the number of variables in the linear programming problem grows exponentially with the block length, and it is difficult to compute in general. The computation of the information-spectrum bound depends on the evaluation of the tail probability. The hypothesis testing bound gives a tighter bound than the information-spectrum bound, as pointed out by [8], and the computational complexity of the former is much smaller than that of the latter. However, the computation of the tail probability continues to remain challenging unless the channel is a DMC. For DMCs, the computational complexity of Gallager’s bound isO(1) since the Gallager function is an additive quantity for DMCs. However, this is not the case if there is a memory (the Gallager bound for finite-state channels was considered in [10] (Section 5.9), but a closed form expression for the exponent was not derived). Consequently, no efficiently computable bound currently exists for channel coding with Markov additive noise. The situation is the same for source coding with side-information.
Since the actual computation time may depend on the computational resource we can use for numerical experiment, it is not possible to provide a concrete requirement of computational complexity. However, in order to conduct a numerical experiment for a meaningful blocklength, it is reasonable to require the computational complexity to be, at most, a polynomial order of the blocklength n.
Next, let us consider the second aspect, i.e., asymptotic tightness. Thus far, three kinds of asymptotic regimes have been studied in information theory [1,11,12,13,14,15,16]:
-
A large deviation regime in which the error probabilityεasymptotically behaves ase−nrfor somer>0;
-
A moderate deviation regime in whichεasymptotically behaves ase−n1−2trfor somer>0andt∈(0,1/2); and
-
A second-order regime in whichεis a constant.
We shall claim that a good finite-length bound should be asymptotically optimal for at least one of the above-mentioned three regimes. In fact, the information-spectrum bound, the hypothesis-testing bound, and the DT bound are asymptotically optimal in both the moderate deviation and second-order regimes, whereas the Gallager bound is asymptotically optimal in the large deviation regime and the RCU bound asymptotically optimal in all the regimes (Both the Gallager and RCU bounds are asymptotically optimal in the large deviation regime only up to the critical rate). Recently, for DMCs, Yang and Meng derived an efficiently computable bound for low-density parity check (LDPC) codes [17], which is asymptotically optimal in both the moderate deviation and second-order regimes.
1.2. Main Contribution for Finite-Length Analysis
We derive the finite-length achievability bounds on the problems by basically using the exponential-type bounds (for channel coding, it corresponds to the Gallager bound.). In source coding with side-information, the exponential-type upper bounds on error probabilityP¯e(Mn)for a given message sizeMnare described by using the conditional Rényi entropies as follows (cf. Lemmas 14 and 15):
P¯e(Mn)≤inf−12≤θ≤0Mnθ1+θ e−θ1+θH1+θ↑(Xn|Yn)
and:
P¯e(Mn)≤inf−1≤θ≤0Mnθ e−θH1+θ↓(Xn|Yn).
Here,Xnis the information to be compressed andYnis the side-information that can be accessed only by the decoder.H1+θ↑(Xn|Yn) is the conditional Rényi entropy introduced by Arimoto [18], which we shall refer to as the upper conditional Rényi entropy (cf. (12)). On the other hand,H1+θ↓(Xn|Yn) is the conditional Rényi entropy introduced in [19], which we shall refer to as the lower conditional Rényi entropy (cf. (7)). Although there are several other definitions of conditional Rényi entropies, we only use these two in this paper; see [20,21] for an extensive review on conditional Rényi entropies.
Although the above-mentioned conditional Rényi entropies are additive for i.i.d. random variables, they are not additive for joint Markov chains overXnandYn , for which the derivation of finite-length bounds for Markov chains are challenging. Because it is generally not easy to evaluate the conditional Rényi entropies for Markov chains, we consider two assumptions in relation to transition matrices: the first assumption, which we refer to as non-hidden, is that the Y-marginal process is a Markov chain, which enables us to derive the single-letter expression of the conditional entropy rate and the lower conditional Rényi entropy rate; the second assumption, which we refer to as strongly non-hidden, enables us to derive the single-letter expression of the upper conditional Rényi entropy rate; see Assumptions 1 and 2 of Section 2 for more detail (Indeed, as explained later, our result on the data compression can be converted to a result on the channel coding for a specific class of channels. Under this conversion, we obtain certain assumptions for channels. As explained later, these assumptions for channels are more meaningful from a practical point of view.). Under Assumption 1, we introduce the lower conditional Rényi entropy for transition matricesH1+θ↓,W(X|Y) (cf. (47)). Then, we evaluate the lower conditional Rényi entropy for the Markov chain in terms of its transition matrix counterpart. More specifically, we derive an approximation:
H1+θ↓(Xn|Yn)=nH1+θ↓,W(X|Y)+O(1),
where an explicit form of theO(1) term is also derived. Using the evaluation (2) with this evaluation, we obtain finite-length bounds under Assumption 1. Under a more restrictive assumption, i.e., Assumption 2, we also introduce the upper conditional Rényi entropy for a transition matrixH1+θ↑,W(X|Y) (cf. (55)). Then, we evaluate the upper Rényi entropy for the Markov chain in terms of its transition matrix counterpart. More specifically, we derive an approximation:
H1+θ↑(Xn|Yn)=nH1+θ↑,W(X|Y)+O(1),
where an explicit form of theO(1) term is also derived. Using the evaluation (1) with this evaluation, we obtain finite-length bounds that are tighter than those obtained under Assumption 1. It should be noted that, without Assumption 1, even the conditional entropy rate is challenging to evaluate. For evaluation of the conditional entropy rate of the X process given the Y process, the assumption of the X process being Markov seems to be not helpful. This is the reason why we consider the Y process being Markov instead of the X process being Markov in this paper.
We also derive converse bounds by using the change of measure argument for Markov chains developed by the authors in the accompanying paper on information geometry [22,23]. For this purpose, we further introduce two-parameter conditional Rényi entropy and its transition matrix counterpart (cf. (18) and (59)). This novel information measure includes the lower conditional Rényi entropy and the upper conditional Rényi entropy as special cases. We clarify the relation among bounds based on these quantities by numerically calculating the upper and lower bounds for the optimal coding rate in source coding with a Markov source in Section 3.7. Owing to the second aspect (A2), this calculation shows that our finite-length bounds are very close to the optimal value. Although this numerical calculation contains a case with a very large sizen=1×105, its calculation is not as difficult because the calculation complexity behaves asO(1). That is, this calculation shows the advantage of the first aspect (A1).
Here, we would like to remark about the terminologies because there are a few ways to express exponential-type bounds. In statistics or large deviation theory, we usually use the cumulant generating function (CGF) to describe exponents. In information theory, we employ the Gallager function or the Rényi entropies. Although these three terminologies are essentially the same quantity and are related by the change of variables, the CGF and the Gallager function are convenient for some calculations because of their desirable properties such as convexity. On the other hand, the minimum entropy and collision entropy are often used as alternative information measures of Shannon entropy in the community of cryptography. Since the Rényi entropies are a generalization of the minimum entropy and collision entropy, we can regard the Rényi entropies as information measures. The information theoretic meaning of the CGF and the Gallager function are less clear. Thus, the Rényi entropies are intuitively familiar to the readers’ of this journal. The Rényi entropies have an additional advantage in that two types of bounds (e.g., (152) and (161)) can be expressed in a unified manner. Therefore, we state our main results in terms of the Rényi entropies, whereas we use the CGF and the Gallager function in the proofs. For the readers’ convenience, the relation between the Rényi entropies and corresponding CGFs are summarized in Appendix A and Appendix B.
1.3. Main Contribution for Channel Coding
An intimate relationship is known to exist between channel coding and source coding with side-information (e.g., [24,25,26]). In particular, for an additive channel, the error probability of channel coding by a linear code can be related to the corresponding source coding problem with side-information [24]. Chen et. al. also showed that the error probability of source coding with side-information by a linear encoder can be related to the error probability of a dual channel coding problem and vice versa [27] (see also [28]). Since these dual channels can be regarded as additive channels conditioned on state information, we refer to these channels as conditional additive channels (In [28], we termed these channels general additive channels, but we think “conditional” more suitably describes the situation.). In this paper, we mainly discuss a conditional additive channel, in which the additive noise is operated subject to a distribution conditioned on additional output information. Then, we convert our obtained results of source coding with side-information to the analysis on conditional additive channels. That is, using the aforementioned duality between channel coding and source coding with side-information enables us to evaluate the error probability of channel coding for additive channels. Then, we derive several finite-length analyses on additive channels.
For the same reason as source coding with side-information, we make two assumptions, Assumptions 1 and 2, on the noise process of a conditional additive channel. In this context, Assumption 1 means that the marginal systemYndeciding the behavior of the additive noiseXn is a Markov chain. It should be noted that the Gilbert–Elliott channel [29,30] with state information available at the receiver can be regarded as a conditional additive channel such that the noise process is a Markov chain satisfying both Assumptions 1 and 2 (see Example 6). Thus, we believe that Assumptions 1 and 2 are quite reasonable assumptions.
In fact, our analysis is applicable for a broader class of channels known as regular channels [31]. The class of regular channels includes conditional additive channels as a special case, and it is known as a class of channels that are similarly symmetrical. To show it, we propose a method to convert a regular channel into a conditional additive channel such that our treatment covers regular channels. Additionally, we show that the BPSK (binary phase shift keying)-AWGN (additive white Gaussian noise) channel is included in conditional additive channels.
1.4. Asymptotic Bounds and Asymptotic Tightness for Finite-Length Bounds
We present asymptotic analyses of the large and moderate deviation regimes by deriving the characterizations (for the large deviation regime, we only derive the characterizations up to the critical rate) with the use of our finite-length achievability and converse bounds, which implies that our finite-length bounds are tight in both of these deviation regimes. We also derive the second-order rate. Although this rate can be derived by the application of the central limit theorem to the information-spectrum bound, the variance involves the limit with respect to the block length because of memory. In this paper, we derive a single-letter form of the variance by using the conditional Rényi entropy for transition matrices (An alternative way to derive a single-letter characterization of the variance for the Markov chain was shown in [32] (Lemma 20). It should also be noted that a single-letter characterization can be derived by using the fundamental matrix [33]. The single-letter characterization of the variance in [12] (Section VII) and [11] (Section III) contains an error, which is corrected in this paper.).
As we will see in Theorems 11–14 and 22–25, our asymptotic results have the same forms as the counterparts of the i.i.d. case (cf. [1,6,11,12,13,14]) when the information measures for distributions in the i.i.d. case are replaced by the information measures for the transition matrices introduced in this paper.
We determine the asymptotic tightness for finite-length bounds by summarizing the relation between the asymptotic results and the finite-length bounds in Table 1. The table also describes the computational complexity of the finite-length bounds. “Solved*” indicates that those problems are solved up to the critical rates. “Ass. 1” and “Ass. 2” indicate that those problems are solved either under Assumption 1 or Assumption 2. “O(1)” indicates that both the achievability and converse parts of those asymptotic results are derived from our finite-length achievability bounds and converse bounds whose computational complexities areO(1). “Tail” indicates that both the achievability and converse parts of those asymptotic results are derived from the information-spectrum-type achievability bounds and converse bounds of which the computational complexities depend on the computational complexities of the tail probabilities.
In general, the exact computations of tail probabilities are difficult, although they may be feasible for a simple case such as an i.i.d. case. One way to compute tail probabilities approximately is to use the Berry–Esséen theorem [34] (Theorem 16.5.1) or its variant [35]. This direction of research is still ongoing [36,37], and an evaluation of the constant was conducted [37], although its tightness has not been clarified. If we can derive a tight Berry–Esséen-type bound for the Markov chain, this would enable us to derive a finite-length bound that is asymptotically tight in the second-order regime. However, the approximation errors of Berry–Esséen-type bounds converge only in the order of1/nand cannot be applied whenεis rather small. Even in cases in which the exact computations of tail probabilities are possible, the information-spectrum-type bounds are looser than the exponential type bounds whenεis rather small, and we need to use appropriate bounds depending on the size ofε . In fact, this observation was explicitly clarified in [38] for random number generation with side-information. Consequently, we believe that our exponential-type finite-length bounds are very useful. It should be also noted that, for source coding with side-information and channel coding for regular channels, even the first-order results have not been revealed as far as the authors know, and they are clarified in this paper (General formulae for those problems were known [2,3], but single-letter expressions for Markov sources or channels were not clarified in the literature. For the source coding without side-information, the single-letter expression for entropy rate of Markov source is well known (e.g., see [39]).).
1.5. Related Work on Markov Chains
Since related work concerning the finite-length analysis is reviewed in Section 1.1, we only review work related to the asymptotic analysis here. Some studies on Markov chains for the large deviation regime have been reported [40,41,42]. The derivation in [40] used the Markov-type method. A drawback of this method is that it involves a term that stems from the number of types, which does not affect the asymptotic analysis, but does hurt the finite-length analysis. Our achievability is derived by following a similar approach as in [41,42], i.e., the Perron–Frobenius theorem, but our derivation separates the single-shot part and the evaluation of the Rényi entropy, and thus is more transparent. Furthermore, the converse part of [41,42] is based on the Shannon–McMillan–Breiman limiting theorem and does not yield finite-length bounds.
For the second-order regime, Polyanskiy et. al. studied the second-order rate (dispersion) of the Gilbert–Elliott channel [43]. Tomamichel and Tan studied the second-order rate of channel coding with state information such that the state information may be a general source and derived a formula for the Markov chain as a special case [32]. Kontoyiannis studied the second-order variable length source coding for the Markov chain [44]. In [45], Kontoyiannis and Verdú derived the second-order rate of lossless source coding under the overflow probability criterion.
For channel coding of the i.i.d. case, Scarlett et al. derived a saddle-point approximation, which unifies all three regimes [46,47].
1.6. Organization of the Paper
In Section 2, we introduce the information measures and their properties that will be used in Section 3 and Section 4. Then, source coding with side-information and channel coding is discussed in Section 3 and Section 4, respectively. As we mentioned above, we state our main result in terms of the Rényi entropies, and we use the CGFs and the Gallager function in the proofs. We explain how to cover the continuous case in Remarks 1 and 5. In Appendix A and Appendix B, the relation between the Rényi entropies and corresponding CGFs are summarized. The relation between the Rényi entropies and the Gallager function are explained as necessary. Proofs of some technical results are also provided in the remaining Appendices.
1.7. Notations
For a setX, the set of all distributions onXis denoted byP(X). The set of all sub-normalized non-negative functions onXis denoted byP¯(X). The cumulative distribution function of the standard Gaussian random variable is denoted by:
Φ(t)=∫−∞t12πexp−x22dx.
Throughout the paper, the base of the logarithm is the natural base e. 2. Information Measures
Since this paper discusses the second-order tightness, we need to discuss the central limit theorem for the Markov process. For this purpose, we usually employ advanced mathematical methods from probability theory. For example, the paper [48] (Theorem 4) showed the Markov version of the central limit theorem by using a martingale stopping technique. Lalley [49] employed the regular perturbation theory of operators on the infinite-dimensional space [50] (Chapter 7, #1, Chapter 4, #3, and Chapter 3, #5). The papers [51,52] and [53] (Lemma 1.5 of Chapter 1) employed the spectral measure, while it is hard to calculate the spectral measure in general even in the finite-state case. Further, the papers [36,51,54,55] showed the central limit theorem by using the asymptotic variance, but they did not give any computable expression of the asymptotic variance without the infinite sum. In summary, to derive the central limit theorem with the variance of a computable form, these papers needed to use very advanced mathematics beyond calculus and linear algebra.
To overcome the difficulty of the Markov version of the central limit theorem, we employed the method used in our recent paper [23]. The paper [23] employed the method based on the cumulant generating function for transition matrices, which is defined by the Perron eigenvalue of a specific non-negative-entry matrix. Since a Perron eigenvalue can be explained in the framework of linear algebra, the method can be described with elementary mathematics. To employ this method, we need to define the information measure in a way similar to the cumulant generating function for transition matrices. That is, we define the information measures for transition matrices, e.g., the conditional Rényi entropy for transition matrices, etc, by using Perron eigenvalues.
Fortunately, these information measures for transition matrices are very useful even for large deviation-type evaluation and finite-length bounds. For example, our recent paper [23] derived finite-length bounds for simple hypothesis testing for the Markov chain by using the cumulant generating function for transition matrices. Therefore, using these information measures for transition matrices, this paper derives finite-length bounds for source coding and channel coding with Markov chains and discusses their asymptotic bounds with large deviation, moderate deviation, and the second-order type.
Since they are natural extensions of information measures for single-shot setting, we first review information measures for the single-shot setting in Section 2.1. Next, we introduce information measures for transition matrices in Section 2.2. Then, we show that information measures for Markov chains can be approximated by information measures for transition matrices generating those Markov chains in Section 2.3.
2.1. Information Measures for the Single-Shot Setting
In this section, we introduce conditional Rényi entropies for the single-shot setting. For more a detailed review of conditional Rényi entropies, see [21]. For a correlated random variable(X,Y)onX×Ywith probability distributionPXYand a marginal distributionQYonY, we introduce the conditional Rényi entropy of order1+θrelative toQYas:
H1+θ(PXY|QY):=−1θlog∑x,yPXY (x,y)1+θ QY (y)−θ,
whereθ∈(−1,0)∪(0,∞). The conditional Rényi entropy of order zero relative toQYis defined by the limit with respect toθ. When X has no side-information, it is nothing but the ordinary Rényi entropy, and it is denoted byH1+θ(X)=H1+θ(PX)throughout the paper.
One of the important special cases ofH1+θ(PXY|QY)is the case withQY=PY, wherePYis the marginal ofPXY. We shall call this special case the lower conditional Rényi entropy of order1+θ and denote ( this notation was first introduced in [56]):
H1+θ↓(X|Y):=H1+θ(PXY|PY)
=−1θlog∑x,yPXY (x,y)1+θ PY (y)−θ.
When we consider the second-order analysis, the variance of the entropy density plays an important role:
V(X|Y):=Varlog1PX|Y(X|Y).
We have the following property, which follows from the correspondence between the conditional Rényi entropy and the cumulant generating function (cf. Appendix B).
Lemma 1.
We have:
limθ→0H1+θ↓(X|Y)=H(X|Y)
and (as seen in the proof (cf. (A26)), the left-hand side of (11) corresponds to the second derivative of the cumulant generating function):
limθ→02H(X|Y)−H1+θ↓(X|Y)θ=V(X|Y).
Proof.
(10) follows from the relation in (A25) and the fact that the first-order derivative of the cumulant generating function is the expectation. (11) follows from (A25), (10) and (A26). □
The other important special case ofH1+θ(PXY|QY)is the measure maximized overQY. We shall call this special case the upper conditional Rényi entropy of order1+θand denote (Equation (13) for−1<θ<0follows from the Hölder inequality, and Equation (13) for0<θ follows from the reverse Hölder inequality [57] (Lemma 8). Similar optimization has appeared in the context of Rényi mutual information in [58] (see also [59]).):
H1+θ↑(X|Y):=maxQY∈P(Y)H1+θ(PXY|QY)
=H1+θ(PXY|PY(1+θ))
=−1+θθlog∑yPY(y)∑xPX|Y (x|y)1+θ11+θ,
where:
PY(1+θ)(y):=∑x PXY (x,y)1+θ11+θ∑y′ ∑x PXY (x,y′)1+θ11+θ.
For this measure, we also have the same properties as Lemma 1. This lemma will be proven in Appendix C.
Lemma 2.
We have:
limθ→0H1+θ↑(X|Y)=H(X|Y)
and:
limθ→02H(X|Y)−H1+θ↑(X|Y)θ=V(X|Y).
When we derive converse bounds, we need to consider the case such that the order of the Rényi entropy is different from the order of conditioning distribution defined in (15). For this purpose, we introduce two-parameter conditional Rényi entropy, which connects the two kinds of conditional Rényi entropiesH1+θ↓(X|Y)andH1+θ↑(X|Y)in the way as Statements 10 and 11 of Lemma 3:
H1+θ,1+θ′(X|Y)
:=H1+θ(PXY|PY(1+θ′))
=−1θlog∑yPY(y)∑xPX|Y (x|y)1+θ∑xPX|Y (x|y)1+θ′−θ1+θ′+θ′1+θ′H1+θ′↑(X|Y).
Next, we investigate some properties of the measures defined above, which will be proven in Appendix D.
Lemma 3.
1.
For fixedQY,θH1+θ(PXY|QY)is a concave function of θ, and it is strict concave iffVarlogQY(Y)PXY(X,Y)>0.
2.
For fixedQY,H1+θ(PXY|QY)is a monotonically decreasing (Technically,H1+θ(PXY|QY)is always non-increasing, and it is monotonically decreasing iff strict concavity holds in Statement 1. Similar remarks are also applied for other information measures throughout the paper.) function of θ.
3.
The functionθH1+θ↓(X|Y)is a concave function of θ, and it is strict concave iffV(X|Y)>0.
4.
H1+θ↓(X|Y)is a monotonically decreasing function of θ.
5.
The functionθH1+θ↑(X|Y)is a concave function of θ, and it is strict concave iffV(X|Y)>0.
6.
H1+θ↑(X|Y)is a monotonically decreasing function of θ.
7.
For everyθ∈(−1,0)∪(0,∞), we haveH1+θ↓(X|Y)≤H1+θ↑(X|Y).
8.
For fixedθ′, the functionθH1+θ,1+θ′(X|Y)is a concave function of θ, and it is strict concave iffV(X|Y)>0.
9.
For fixedθ′,H1+θ,1+θ′(X|Y)is a monotonically decreasing function of θ.
10.
We have:
H1+θ,1(X|Y)=H1+θ↓(X|Y).
11.
We have:
H1+θ,1+θ(X|Y)=H1+θ↑(X|Y).
12.
For everyθ∈(−1,0)∪(0,∞),H1+θ,1+θ′(X|Y)is maximized atθ′=θ.
The following lemma expresses explicit forms of the conditional Rényi entropies of order zero.
Lemma 4.
We have:
limθ→−1H1+θ(PXY|QY)=H0(PXY|QY)
:=log∑yQY(y)|supp(PX|Y(·|y))|,
limθ→−1H1+θ↑(X|Y)=H0↑(X|Y)
:=logmaxy∈supp(PY)|supp(PX|Y(·|y))|,
limθ→−1H1+θ↓(X|Y)=H0↓(X|Y)
:=log∑yPY(y)|supp(PX|Y(·|y))|.
Proof.
See Appendix E. □
The definition (6) guarantees the existence of the derivative ofd[θH1+θ(PXY|QY)]dθ. From Statement 1 of Lemma 3,d[θH1+θ(PXY|QY)]/dθis monotonically decreasing. Thus, the inverse function (Throughout the paper, the notationsθ(a)anda(R)are reused for several inverse functions. Although the meanings of those notations are obvious from the context, we occasionally put superscript Q, ↓ or ↑ to emphasize that those inverse functions are induced from corresponding conditional Rényi entropies. This definition is related to the Legendre transform of the concave functionθ↦θH1+θ↓(X|Y).) ofθ↦d[θH1+θ(PXY|QY)]/dθexists so that the functionθ(a)=θQ(a)is defined as:
d[θH1+θ(PXY|QY)]dθ|θ=θ(a)=a
fora̲<a≤a¯, wherea̲=a̲Q:=limθ→∞d[θH1+θ(PXY|QY)]/dθanda¯=a¯Q:=limθ→−1d[θH1+θ(PXY|QY)]/dθ. Let:
R(a)=RQ(a):=(1+θ(a))a−θ(a)H1+θ(a)(PXY|QY).
Since:
R′(a)=dR′(a)da=dθ(a)daa+1+θ(a)−d(θH1+θ(PXY|QY))dθdθ(a)da=dθ(a)daa+1+θ(a)−adθ(a)da=1+θ(a),
R(a)is a monotonic increasing function ofa̲<a≤R(a¯). Thus, we can define the inverse functiona(R)=aQ(R)ofR(a)by:
(1+θ(a(R)))a(R)−θ(a(R))H1+θ(a(R))(PXY|QY)=R
forR(a̲)<R≤H0(PXY|QY).
ForθH1+θ↓(X|Y), by the same reason as above, we can define the inverse functionsθ(a)=θ↓(a)anda(R)=a↓(R)by:
d[θH1+θ↓(X|Y)]dθ|θ=θ(a)=a
and:
(1+θ(a(R)))a(R)−θ(a(R))H1+θ(a(R))↓(X|Y)=R,
forR(a̲)<R≤H0↓(X|Y). ForθH1+θ↑(X|Y), we also introduce the inverse functionsθ(a)=θ↑(a)anda(R)=a↑(R)by:
dθH1+θ↑(X|Y)dθ|θ=θ(a)=a
and:
(1+θ(a(R)))a(R)−θ(a(R))H1+θ(a(R))↑(X|Y)=R
forR(a̲)<R≤H0↑(X|Y).
Remark 1.
Here, we discuss the possibility for extension to the continuous case. Since the entropy in the continuous case diverges, we cannot extend the information quantities to the case whenXis continuous. However, it is possible to extend these quantities to the case whenYis continuous, butXis a discrete finite set. In this case, we prepare a general measure μ (like the Lebesgue measure) onYand probability density functionpYandqYsuch that the distributionsPYandQYare given aspY(y)μ(dy)andqY(y)μ(dy), respectively. Then, it is sufficient to replace ∑,Q(y), andPXY(x,y)by∫Yμ(dy),PX|Y(x|y)pY(y), andqY(y), respectively. Hence, in the n-independent and identically distributed case, these information measures are given as n times the original information measures.
One might consider the information quantities for transition matrices given in the next subsection for this continuous case. However, this is not so easy because it needs a continuous extension of the Perron eigenvalue.
2.2. Information Measures for the Transition Matrix
Let{W(x,y|x′,y′)}((x,y),(x′,y′))∈(X×Y)2 be an ergodic and irreducible transition matrix. The purpose of this section is to introduce transition matrix counterparts of those measures in Section 2.1. For this purpose, we first need to introduce some assumptions on transition matrices:
Assumption 1
(Non-hidden). We say that a transition matrix W is non-hidden (with respect to Y) if the Y-marginal process is a Markov process, i.e., (The reason for the name “non-hidden” is the following. In general, the Y-marginal process is a hidden Markov process. However, when the condition (37) holds, the Y-marginal process is a Markov process. Hence, we call the condition (37) non-hidden.):
∑xW(x,y|x′,y′)=W(y|y′)
for everyx′∈Xandy,y′∈Y. This condition is equivalent to the existence of the following decomposition ofW(x,y|x′,y′):
W(x,y|x′,y′)=W(y|y′)W(x|x′,y′,y).
Assumption 2
(Strongly non-hidden). We say that a transition matrix W is strongly non-hidden (with respect to Y) if, for everyθ∈(−1,∞)andy,y′∈Y (The reason for the name “strongly non-hidden” is the following. When we compute the upper conditional Rényi entropy rate of the Markov source, the effect of the Y process may propagate infinitely even if it is non-hidden. When (39) holds, the effect of the Y process in the computation of the upper conditional Rényi entropy rate is only one step.):
Wθ(y|y′):=∑xW(x,y|x′,y′)1+θ
is well defined, i.e., the right-hand side of (39) is independent ofx′.
Assumption 1 requires (39) to hold only forθ=0, and thus, Assumption 2 implies Assumption 1. However, Assumption 2 is a strictly stronger condition than Assumption 1. For example, let us consider the case such that the transition matrix is a product form, i.e.,W(x,y|x′,y′)=W(x|x′)W(y|y′). In this case, Assumption 1 is obviously satisfied. However, Assumption 2 is not satisfied in general.
Assumption 2 has another expression as follows.
Lemma 5.
Assumption 2 holds if and only if, for everyx′≠x˜′, there exists a permutationπx′;x˜′onXsuch thatW(x|x′,y′,y)=W(πx′;x˜′(x)|x˜′,y′,y).
Proof.
Since the part “if” is trivial, we show the part “only if” as follows. By noting (38), Assumption 2 can be rephrased as:
∑xW(x|x′,y′,y)1+θ
does not depend onx′for everyθ∈(−1,∞). Furthermore, this condition can be rephrased as follows. Forx′≠x˜′, if the largest values of{W(x|x′,y′)}x∈Xand{W(x|x˜′,y′)}x∈Xare different, say the former is larger, then∑xW(x|x′,y′)1+θ>∑xW(x|x˜′,y′)1+θfor sufficiently largeθ , which contradicts the fact that (40) does not depend onx′. Thus, the largest values of{W(x|x′,y′)}x∈Xand{W(x|x˜′,y′)}x∈Xmust coincide. By repeating this argument for the second largest value of{W(x|x′,y′)}x∈Xand{W(x|x˜′,y′)}x∈X, and so on, we find that Assumption 2 implies that for everyx′≠x˜′, there exists a permutationπx′;x˜′onXsuch thatW(x|x′,y′,y)=W(πx′;x˜′(x)|x˜′,y′,y). □
Now, we fix an elementx0∈Xand transform a sequence of random numbers(X1,Y1,X2,Y2,…,Xn,Yn)to the sequence of random numbers(X1′,Y1′,X2′,Y2′,…,Xn′,Yn′):=(X1,Y1,πx0;X1−1(X2),Y2,…,πx0;X1−1(Xn),Yn). Then, lettingW′(x|y′,y):=W(x|x0,y′,y), we havePXi′,Yi′|Xi−1′,Yi−1′=W′(yi′|yi−1′)W(xi′|yi′,yi−1′). That is, essentially, the transition matrix of this case can be written by the transition matrixW(yi′|yi−1′)W′(xi′|yi′,yi−1′). Therefore, the transition matrix can be written by using the positive-entry matrixWxi′ (yi′|yi−1′):=W(yi′|yi−1′)W′(xi′|yi′,yi−1′).
The following are non-trivial examples satisfying Assumptions 1 and 2.
Example 1.
Suppose thatX=Yis a module (an additive group). Let P and Q be transition matrices onX. Then, the transition matrix given by:
W(x,y|x′,y′)=Q(y|y′)P(x−y|x′−y′)
satisfies Assumption 1. Furthermore, if transition matrixP(z|z′)can be written as:
P(z|z′)=PZ(πz′ (z))
for permutationπz′ and a distributionPZonX , then transition matrix W defined by (41) satisfies Assumption 2 as well.
Example 2.
Suppose thatXis a module and W is (strongly) non-hidden with respect toY. Let Q be a transition matrix onZ=X. Then, the transition matrix given by:
V(x,y,z|x′,y′,z′)=W(x−z,y|x′−z′,y)Q(z|z′)
is (strongly) non-hidden with respect toY×Z.
The following is also an example satisfying Assumption 2, which describes a noise process of an important class of channels with memory (cf. the Gilbert-Elliot channel in Example 6).
Example 3.
LetX=Y={0,1}. Then, let:
W(y|y′)=1−qy′ ify=y′qy′ ify≠y′
for some0<q0,q1<1, and let:
W(x|x′,y′,y)=1−pyifx=0pyifx=1
for some0<p0,p1<1. By choosingπx′;x˜′to be the identity, this transition matrix satisfies the condition given in Remark 5, which is equivalent to Assumption 2.
First, we introduce information measures under Assumption 1. In order to define a transition matrix counterpart of (7), let us introduce the following tilted matrix:
W˜θ(x,y|x′,y′):=W(x,y|x′,y′)1+θW(y|y′)−θ.
Here, we should notice that the tilted matrixW˜θis not normalized, i.e., is not a transition matrix. Letλθbe the Perron–Frobenius eigenvalue ofW˜θandP˜θ,XYbe its normalized eigenvector. Then, we define the lower conditional Rényi entropy for W by:
H1+θ↓,W(X|Y):=−1θlogλθ,
whereθ∈(−1,0)∪(0,∞). Forθ=0, we define the lower conditional Rényi entropy for W by:
HW(X|Y)=H1↓,W(X|Y)
:=limθ→0H1+θ↓,W(X|Y),
and we just call it the conditional entropy for W. In fact, the definition ofHW(X|Y)above coincides with:
−∑x′,y′P0,XY(x′,y′)∑x,yW(x,y|x′,y′)logW(x,y|x′,y′)W(y|y′),
whereP0,XY is the stationary distribution of W (cf. [60] (Equation (30))). Forθ=−1,H0↓,W(X|Y)is also defined by taking the limit. When X has no side-information, the Rényi entropyH1+θW(X)for W is defined as a special case ofH1+θ↓,W(X|Y).
As a counterpart of (11), we also define (Since the limiting expression in (51) coincides with the second derivative of the CGF (cf. (A30)) and since the second derivative of the CGF exists (cf. [22] (Appendix D)), the variance in (51) is well defined. While the definition (51) contains the limitθ→0 , it can be calculated without this type of limit by using the fundamental matrix [61] (Theorem 4.3.1), [23] (Theorem 7.7 and Remark 7.8).):
VW(X|Y):=limθ→02HW(X|Y)−H1+θ↓,W(X|Y)θ.
Remark 2.
When transition matrix W satisfies Assumption 2,H1+θ↓,W(X|Y)can be written as:
H1+θ↓,W(X|Y)=−1θlogλθ′,
whereλθ′is the Perron–Frobenius eigenvalue ofWθ(y|y′)W(y|y′)−θ. In fact, for the left Perron–Frobenius eigenvectorQ^θofWθ(y|y′)W(y|y′)−θ, we have:
∑x,yQ^θ(y)W(x,y|x′,y′)1+θW(y|y′)−θ=λθ′ Qθ(y′),
which implies thatλθ′is the Perron–Frobenius eigenvalue ofW˜θ. Consequently, we can evaluateH1+θ↓,W(X|Y)by calculating the Perron–Frobenius eigenvalue of the|Y|×|Y|matrix instead of the|X||Y|×|X||Y|matrix when W satisfies Assumption 2.
Next, we introduce information measures under Assumption 2. In order to define a transition matrix counterpart of (12), let us introduce the following|Y|×|Y|matrix:
Kθ(y|y′):=Wθ (y|y′)11+θ,
whereWθ is defined by (39). Letκθbe the Perron–Frobenius eigenvalue ofKθ. Then, we define the upper conditional Rényi entropy for W by:
H1+θ↑,W(X|Y):=−1+θθlogκθ,
whereθ∈(−1,0)∪(0,∞). Forθ=−1andθ=0,H1+θ↑,W(X|Y) is defined by taking the limit. We have the following properties, which will be proven in Appendix F.
Lemma 6.
We have:
limθ→0H1+θ↑,W(X|Y)=HW(X|Y)
and:
limθ→02HW(X|Y)−H1+θ↑,W(X|Y)θ=VW(X|Y).
Now, let us introduce a transition matrix counterpart of (18). For this purpose, we introduce the following|Y|×|Y|matrix:
Nθ,θ′(y|y′):=Wθ(y|y′)Wθ′ (y|y′)−θ1+θ′.
Letνθ,θ′be the Perron–Frobenius eigenvalue ofNθ,θ′. Then, we define the two-parameter conditional Rényi entropy by:
H1+θ,1+θ′W(X|Y):=−1θlogνθ,θ′+θ′1+θ′H1+θ′↑,W(X|Y).
Remark 3.
Although we definedH1+θ↓,W(X|Y)andH1+θ↑,W(X|Y) by (47) and (55), respectively, we can alternatively define these measures in the same spirit as the single-shot setting by introducing a transition matrix counterpart ofH1+θ(PXY|QY)as follows. For the marginalW(y|y′)ofW(x,y|x′,y′), letYW2:={(y,y′):W(y|y′)>0}. For another transition matrix V onY, we defineYV2in a similar manner. For V satisfyingYW2⊂YV2, we define (although we can also defineH1+θW|V(X|Y)even ifYW2⊂YV2 is not satisfied (see [22] for the detail), for our purpose of definingH1+θ↓,W(X|Y)andH1+θ↑,W(X|Y), other cases are irrelevant):
H1+θW|V(X|Y):=−1θlogλθW|V
forθ∈(−1,0)∪(0,∞), whereλθW|Vis the Perron–Frobenius eigenvalue of:
W(x,y|x′,y′)1+θV(y|y′)−θ.
By using this measure, we obviously have:
H1+θ↓,W(X|Y)=H1+θW|W(X|Y).
Furthermore, under Assumption 2, the relation:
H1+θ↑,W(X|Y)=maxVH1+θW|V(X|Y)
holds (see Appendix G for the proof), where the maximum is taken over all transition matrices satisfyingYW2⊂YV2.
Next, we investigate some properties of the information measures introduced in this section. The following lemma is proven in Appendix H.
Lemma 7.
1.
The functionθH1+θ↓,W(X|Y)is a concave function of θ, and it is strict concave iffVW(X|Y)>0.
2.
H1+θ↓,W(X|Y)is a monotonically decreasing function of θ.
3.
The functionθH1+θ↑,W(X|Y)is a concave function of θ, and it is strict concave iffVW(X|Y)>0.
4.
H1+θ↑,W(X|Y)is a monotonically decreasing function of θ.
5.
For everyθ∈(−1,0)∪(0,∞), we haveH1+θ↓,W(X|Y)≤H1+θ↑,W(X|Y).
6.
For fixedθ′, the functionθH1+θ,1+θ′W(X|Y)is a concave function of θ, and it is strict concave iffVW(X|Y)>0.
7.
For fixedθ′,H1+θ,1+θ′W(X|Y)is a monotonically decreasing function of θ.
8.
We have:
H1+θ,1W(X|Y)=H1+θ↓,W(X|Y).
9.
We have:
H1+θ,1+θW(X|Y)=H1+θ↑,W(X|Y).
10.
For everyθ∈(−1,0)∪(0,∞),H1+θ,1+θ′W(X|Y)is maximized atθ′=θ, i.e.,
d[H1+θ,1+θ′W(X|Y)]dθ′|θ′=θ=0.
From Statement 1 of Lemma 7,d[θH1+θ↓,W(X|Y)]/dθis monotonically decreasing. Thus, we can define the inverse functionθW(a)=θ↓,W(a)ofd[θH1+θ↓,W(X|Y)]/dθby:
d[θH1+θ↓,W(X|Y)]dθ|θ=θW(a)=a
fora̲<a≤a¯, wherea̲:=limθ→∞d[θH1+θ↓,W(X|Y)]/dθanda¯:=limθ→−1d[θH1+θ↓,W(X|Y)]/dθ. Let:
RW(a):=(1+θ(a))a−θ(a)H1+θ(a)↓,W(X|Y).
Since
RW′(a)=(1+θ(a)),
RW(a)is a monotonic increasing function ofa̲<a<RW(a¯). Thus, we can define the inverse functionaW(R)=a↓,W(R)ofRW(a)by:
(1+θ(aW(R)))aW(R)−θW(aW(R))H1+θW(aW(R))↓,W(X|Y)=R
forRW(a̲)<R<H0↓,W(X|Y), whereH0↓,W(X|Y):=limθ→−1 H1+θ↓,W(X|Y).
ForθH1+θ↑,W(X|Y), by the same reason, we can define the inverse functionθW(a)=θ↑,W(a)by:
d[θH1+θ,1+θW(a)W(X|Y)]dθ|θ=θW(a)=d[θH1+θ↑,W(X|Y)]dθ|θ=θW(a)=a,
and the inverse functionaW(R)=a↑,W(R)of:
RW(a):=(1+θW(a))a−θW(a)H1+θW(a)↑,W(X|Y)
by:
(1+θW(aW(R)))aW(R)−θW(aW(R))H1+θW(aW(R))↑,W(X|Y)=R,
forR(a̲)<R<H0↑,W(X|Y), whereH0↑,W(X|Y):=limθ→−1 H1+θ↑,W(X|Y) . Here, the first equality in (71) follows from (66).
Sinceθ↦θH1+θ↓,W(X|Y)is concave, the supremum of[−θR+θH1+θ↓,W(X|Y)]is attained at the stationary point. Furthermore, note that−1≤θ↓,W(R)≤0forHW(X|Y)≤R≤H0↓,W(X|Y). Thus, we have the following property.
Lemma 8.
The functionθW(R) defined in (67) satisfies:
sup−1≤θ≤0[−θR+θH1+θ↓,W(X|Y)]=−θW(R)R+θW(R)H1+θW(R)↓,W(X|Y)
forHW(X|Y)≤R≤H0↓,W(X|Y).
Furthermore, we have the following characterization for another type of maximization.
Lemma 9.
The functionθW(aW(R)) defined by (70) satisfies:
sup−1≤θ≤0−θR+θH1+θ↓,W(X|Y)1+θ=−θW(aW(R))aW(R)+θW(aW(R))H1+θW(aW(R))↓,W(X|Y)
forHW(X|Y)≤R≤H0↓,W(X|Y), and the functionθ(a(R)) defined in (73) satisfies:
sup−1≤θ≤0−θR+θH1+θ↑,W(X|Y)1+θ=−θW(aW(R))aW(R)+θW(aW(R))H1+θW(aW(R))↑,W(X|Y)
forHW(X|Y)≤R≤H0↑,W(X|Y).
Proof.
See Appendix I. □
Remark 4.
The combination of (49), (51), and Lemma 6 guarantees that both the conditional Rényi entropies expand as:
H1+θ↓,W(X|Y)=HW(X|Y)−12VW(X|Y)θ+o(θ),
H1+θ↑,W(X|Y)=HW(X|Y)−12VW(X|Y)θ+o(θ)
aroundθ=0. Thus, the difference of these measures significantly appears only when|θ|is rather large. For the transition matrix of Example 3 withq0=q1=0.1,p0=0.1, andp1=0.4 , we plotted the values of the information measures in Figure 1. Although the values atθ=−1 coincide in Figure 1, note that the values atθ=−1may differ in general.
In Example 1, we mentioned that the transition matrix W in (41) satisfies Assumption 2 when transition matrix P is given by (42). By computing the conditional Rényi entropies for this special case, we have:
H1+θ↑,W(X|Y)=H1+θ↓,W(X|Y)
=H1+θ(PZ),
i.e., the two kinds of conditional Rényi entropies coincide.
Now, let us consider the asymptotic behavior ofH1+θ↓,W(X|Y)aroundθ=0. Whenθ(a)is close to zero, we have:
θW(a)H1+θW(a)↓,W(X|Y)=θW(a)HW(X|Y)−12VW(X|Y)θW (a)2+o(θW (a)2).
Taking the derivative, (67) implies that:
a=HW(X|Y)−VW(X|Y)θW(a)+o(θW(a)).
Hence, when R is close toHW(X|Y), we have:
R=(1+θW(aW(R))aW(R)−θW(aW(R))H1+θW(aW(R))↓,W(X|Y)
=HW(X|Y)−1+θW(aW(R))2θW(aW(R))VW(X|Y)+o(θW(aW(R)),
i.e.,
θW(aW(R))=−R+HW(X|Y)VW(X|Y)+oR−HW(X|Y)VW(X|Y).
Furthermore, (81) and (82) imply:
−θW(aW(R))aW(R)+θW(aW(R))H1+θW(aW(R))↓,W(X|Y)
=VW(X|Y)θW (aW(R))22+o(θW (aW(R))2)
=VW(X|Y)2R−HW(X|Y)VW(X|Y)2+oR−HW(X|Y)VW(X|Y)2.
2.3. Information Measures for the Markov Chain
Let(X,Y)be the Markov chain induced by transition matrix W and some initial distributionPX1 Y1 . Now, we show how information measures introduced in Section 2.2 are related to the conditional Rényi entropy rates. First, we introduce the following lemma, which gives finite upper and lower bounds on the lower conditional Rényi entropy.
Lemma 10.
Suppose that transition matrix W satisfies Assumption 1. Letvθbe the eigenvector ofWθTwith respect to the Perron–Frobenius eigenvalueλθsuch thatminx,y vθ(x,y)=1 (since the eigenvector corresponding to the Perron–Frobenius eigenvalue for an irreducible non-negative matrix has always strictly positive entries [62] (Theorem 8.4.4, p. 508), we can choose the eigenvectorvθsatisfying this condition). Letwθ(x,y):=PX1 Y1 (x,y)1+θ PY1 (y)−θ. Then, for everyn≥1, we have:
(n−1)θH1+θ↓,W(X|Y)+δ̲(θ)≤θH1+θ↓(Xn|Yn)≤(n−1)θH1+θ↓,W(X|Y)+δ¯(θ),
where:
δ¯(θ):=−log〈vθ|wθ〉+logmaxx,yvθ(x,y),
δ̲(θ):=−log〈vθ|wθ〉,
and〈vθ|wθ〉is defined as∑x,y vθ(x,y)wθ(x,y).
Proof.
This follows from (A29) and Lemma A2. □
From Lemma 10, we have the following.
Theorem 1.
Suppose that transition matrix W satisfies Assumption 1. For any initial distribution, we have (When there is no side-information, (93) reduces to the well-known expression of the entropy rate of the Markov process [39]. Without Assumption 1, it is not clear if (93) holds or not.):
limn→∞1nH1+θ↓(Xn|Yn)=H1+θ↓,W(X|Y),
limn→∞1nH(Xn|Yn)=HW(X|Y).
We also have the following asymptotic evaluation of the variance, which follows from Lemma A3 in Appendix A.
Theorem 2.
Suppose that transition matrix W satisfies Assumption 1. For any initial distribution, we have:
limn→∞1nV(Xn|Yn)=VW(X|Y).
Theorem 2 is practically important since the limit of the variance can be described by a single-letter characterized quantity. A method to calculateVW(X|Y) can be found in [23].
Next, we show the lemma that gives the finite upper and lower bounds on the upper conditional Rényi entropy in terms of the upper conditional Rényi entropy for the transition matrix.
Lemma 11.
Suppose that transition matrix W satisfies Assumption 2. Letvθbe the eigenvector ofKθTwith respect to the Perron–Frobenius eigenvalueκθsuch thatminy vθ(y)=1. Letwθbe the|Y|-dimensional vector defined by:
wθ(y):=∑xPX1 Y1 (x,y)1+θ11+θ.
Then, we have:
(n−1)θ1+θH1+θ↑,W(X|Y)+ξ̲(θ)≤θ1+θH1+θ↑(Xn|Yn)≤(n−1)θ1+θH1+θ↑,W(X|Y)+ξ¯(θ),
where:
ξ¯(θ):=−log〈vθ|wθ〉+logmaxyvθ(y),
ξ̲(θ):=−log〈vθ|wθ〉.
Proof.
See Appendix J. □
From Lemma 11, we have the following.
Theorem 3.
Suppose that transition matrix W satisfies Assumption 2. For any initial distribution, we have:
limn→∞1nH1+θ↑(Xn|Yn)=H1+θ↑,W(X|Y).
Finally, we show the lemma that gives the finite upper and lower bounds on the two-parameter conditional Rényi entropy in terms of the two-parameter conditional Rényi entropy for the transition matrix.
Lemma 12.
Suppose that transition matrix W satisfies Assumption 2. Letvθ,θ′be the eigenvector ofNθ,θ′Twith respect to the Perron–Frobenius eigenvalueνθ,θ′such thatminy vθ,θ′(y)=1. Letwθ,θ′be the|Y|-dimensional vector defined by:
wθ,θ′(y):=∑xPX1 Y1 (x,y)1+θ∑xPX1 Y1 (x,y)1+θ′−θ1+θ′.
Then, we have:
(n−1)θH1+θ,1+θ′W(X|Y)+ζ̲(θ,θ′)≤θH1+θ,1+θ′(Xn|Yn)≤(n−1)θH1+θ,1+θ′W(X|Y)+ζ¯(θ,θ′),
where:
ζ¯(θ,θ′):=−log〈vθ,θ′|wθ,θ′〉+logmaxyvθ,θ′(y)+θξ¯(θ′),
ζ̲(θ,θ′):=−log〈vθ,θ′|wθ,θ′〉+θξ̲(θ′)
forθ>0and:
ζ¯(θ,θ′):=−log〈vθ,θ′|wθ,θ′〉+logmaxyvθ,θ′(y)+θξ̲(θ′),
ζ̲(θ,θ′):=−log〈vθ,θ′|wθ,θ′〉+θξ¯(θ′)
forθ<0
Proof.
By multiplyingθin the definition ofH1+θ,1+θ′(Xn|Yn), we have:
θH1+θ,1+θ′(Xn|Yn)
=−log∑yn∑xnPXn Yn (xn,yn)1+θ∑xnPXn Yn (xn,yn)1+θ′−θ1+θ′+θθ′1+θ′H1+θ′↑(Xn|Yn).
The second term is evaluated by Lemma 11. The first term can be evaluated almost in the same manner as Lemma 11. □
From Lemma 12, we have the following.
Theorem 4.
Suppose that transition matrix W satisfies Assumption 2. For any initial distribution, we have:
limn→∞1nH1+θ,1+θ′(Xn|Yn)=H1+θ,1+θ′W(X|Y).
3. Source Coding with Full Side-Information
In this section, we investigate source coding with side-information. We start this section by showing the problem setting in Section 3.1. Then, we review and introduce some single-shot bounds in Section 3.2. We derive finite-length bounds for the Markov chain in Section 3.3. Then, in Section 3.5 and Section 3.6, we show the asymptotic characterization for the large deviation regime and the moderate deviation regime by using those finite-length bounds. We also derive the second-order rate in Section 3.4.
3.1. Problem Formulation
A codeΨ=(e,d)consists of one encodere:X→{1,…,M}and one decoderd:{1,…,M}×Y→X. The decoding error probability is defined by:
Ps[Ψ]=Ps[Ψ|PXY]
:=Pr{X≠d(e(X),Y)}.
For notational convenience, we introduce the infimum of error probabilities under the condition that the message size is M:
Ps(M)=Ps(M|PXY)
:=infΨPs[Ψ].
For theoretical simplicity, we focus on a randomized choice of our encoder. For this purpose, we employ a randomized hash function F fromXto{1,…,M}. A randomized hash function F is called a two-universal hash whenPr{F(x)=F(x′)}≤1Mfor any distinctive x andx′ [63]; the so-called bin coding [39] is an example of the two-universal hash function. In the following, we denote the set of two-universal hash functions byF. Given an encoder f as a function fromXto{1,…,M}, we define the decoderdfas the optimal decoder byargmindPs[(f,d)]. Then, we denote the code(f,df)byΨ(f). Then, we bound the error probabilityPs[Ψ(F)]averaged over the random function F by only using the property of two-universality. In order to consider the worst case of such schemes, we introduce the following quantity:
P¯s(M)=P¯s(M|PXY)
:=supF∈FEF[Ps[Ψ(F)]],.
When we consider n-fold extension, the source code and related quantities are denoted with the superscript(n) . For example, the quantities in (112) and (114) are written asPs(n)(M)andP¯s(n)(M), respectively. Instead of evaluating them, we are often interested in evaluating:
M(n,ε):=inf{Mn:Ps(n)(Mn)≤ε},
M¯(n,ε):=inf{Mn:P¯s(n)(Mn)≤ε}
for given0≤ε<1.
3.2. Single-Shot Bounds
In this section, we review existing single-shot bounds and also show novel converse bounds. For the information measures used below, see Section 2.
By using the standard argument on information-spectrum approach, we have the following achievability bound.
Lemma 13
(Lemma 7.2.1 of [3]). The following bound holds:
P¯s(M)≤infγ≥0PXYlog1PX|Y(x|y)>γ+eγM.
Although Lemma 13 is useful for the second-order regime, it is known to be not tight in the large deviation regime. By using the large deviation technique of Gallager, we have the following exponential-type achievability bound.
Lemma 14
([64]). The following bound holds: (note that the Gallager function and the upper conditional Rényi entropy are related by (A45)):
P¯s(M)≤inf−12≤θ≤0Mθ1+θ e−θ1+θH1+θ↑(X|Y).
Although Lemma 14 is known to be tight in the large deviation regime for i.i.d. sources,H1+θ↑(X|Y)for Markov chains can only be evaluated under the strongly non-hidden assumption. For this reason, even though the following bound is looser than Lemma 14, it is useful to have another bound in terms ofH1+θ↓(X|Y), which can be evaluated for Markov chains under the non-hidden assumption.
Lemma 15.
The following bound holds:
P¯s(M)≤inf−1≤θ≤0Mθ e−θH1+θ↓(X|Y).
Proof.
To derive this bound, we change the variable in (118) asθ=θ′1−θ′. Then,−1≤θ′≤0, and we have:
Mθ′ e−θ′ H11−θ′↑(X|Y)≤Mθ′ e−θ′ H1+θ′↓(X|Y),
where we use Lemma A4 in Appendix C. □
For the source coding without side-information, i.e., when X has no side-information, we have the following bound, which is tighter than Lemma 14.
Lemma 16
((2.39) [65]). The following bound holds:
Ps(M)≤inf−1<θ≤0Mθ1+θ e−θ1+θH1+θ(X).
For the converse part, we first have the following bound, which is very close to the operational definition of source coding with side-information.
Lemma 17
([66]). Let{Ωy}y∈Ybe a family of subsetsΩy⊂X, and letΩ=∪y∈Y Ωy×{y}. Then, for anyQY∈P(Y), the following bound holds:
Ps(M)≥min{Ωy}PXY(Ωc):∑yQY(y)|Ωy|≤M.
Since Lemma 17 is close to the operational definition, it is not easy to evaluate Lemma 17. Thus, we derive another bound by loosening Lemma 17, which is more tractable for evaluation. Slightly weakening Lemma 17, we have the following.
Lemma 18
([3,4]). For anyQY∈P(Y), we have (In fact, a special case forQY=PY corresponds to Lemma 7.2.2 of [3]. A bound that involvesQY was introduced in [4] for channel coding, and it can be regarded as a source coding counterpart of that result.):
Ps(M)≥supγ≥0PXYlogQY(y)PXY(x,y)>γ−Meγ.
By using the change-of-measure argument, we also obtain the following converse bound.
Theorem 5.
For anyQY∈P(Y), we have:
−logPs(M)≤infs>0θ˜∈R,ϑ≥0[(1+s)θ˜H1+θ˜(PXY|QY)−H1+(1+s)θ˜(PXY|QY)
−(1+s)log1−2e−−ϑR+(θ˜+ϑ(1+θ˜))Hθ˜+ϑ(1+θ˜)(PXY|QY)−(1+ϑ)θ˜H1+θ˜(PXY|QY)1+ϑ]/s≤infs>0−1<θ˜<θ(a(R))[(1+s)θ˜H1+θ˜(PXY|QY)−H1+(1+s)θ˜(PXY|QY)
−(1+s)log1−2e(θ(a(R))−θ˜)a(R)−θ(a(R))H1+θ(a(R))(PXY|QY)+θ˜H1+θ˜(PXY|QY)]/s,
whereR=logM, andθ(a)=θQ(a)anda(R)=aQ(R) are the inverse functions defined in (29) and (32), respectively.
Proof.
See Appendix K. □
In particular, by takingQY=PY(1+θ(a(R)))in Theorem 5, we have the following.
Corollary 1.
We have:
−logPs(M)≤infs>0−1<θ˜<θ(a(R))[(1+s)θ˜H1+θ˜,1+θ(a(R))(X|Y)−H1+(1+s)θ˜,1+θ(a(R))(X|Y)
−(1+s)log1−2e(θ(a(R))−θ˜)a(R)−θ(a(R))H1+θ(a(R))↑(X|Y)+θ˜H1+θ˜,1+θ(a(R))(X|Y)]/s,
whereθ(a)=θ↑(a)anda(R)=a↑(R) are the inverse functions defined in (35) and (36).
Remark 5.
Here, we discuss the possibility for extension to the continuous case. As explained in Remark 1, we can define the information quantities for the case whenYis continuous, butXis a discrete finite set. The discussions in this subsection still hold even in this continuous case. In particular, in the n-i.i.d. extension case with this continuous setting, Lemma 14 and Corollary 1 hold when the information measures are replaced by n times the single-shot information measures.
3.3. Finite-Length Bounds for Markov Source In this subsection, we derive several finite-length bounds for the Markov source with a computable form. Unfortunately, it is not easy to evaluate how tight those bounds are only with their formula. Their tightness will be discussed by considering the asymptotic limit in the remaining subsections of this section. Since we assume the irreducibility for the transition matrix describing the Markov chain, the following bound holds with any initial distribution.
To derive a lower bound on−logP¯s(Mn)in terms of the Rényi entropy of the transition matrix, we substitute the formula for the Rényi entropy given in Lemma 10 into Lemma 15. Then, we can derive the following achievability bound.
Theorem 6
(Direct, Ass. 1). Suppose that transition matrix W satisfies Assumption 1. LetR:=1nlogMn. Then, for everyn≥1, we have:
−logP¯s(n)(Mn)≥sup−1≤θ≤0−θnR+(n−1)θH1+θ↓,W(X|Y)+δ̲(θ),
whereδ̲(θ)is given by (91).
For the source coding without side-information, from Lemma 16 and a special case of Lemma 10, we have the following achievability bound.
Theorem 7
(Direct, no-side-information). LetR:=1nlogMn. Then, for everyn≥1, we have:
−logPe(n)(Mn)≥sup−1<θ≤0−nθR+(n−1)θH1+θW(X)+δ̲(θ)1+θ.
To derive an upper bound on−logPs(Mn)in terms of the Rényi entropy of transition matrix, we substitute the formula for the Rényi entropy given in Lemma 10 for Theorem 5. Then, we have the following converse bound.
Theorem 8
(Converse, Ass. 1). Suppose that transition matrix W satisfies Assumption 1. LetR:=1nlogMn. For anyHW(X|Y)<R<H0↓,W(X|Y), we have:
−logPs(n)(Mn)≤infs>0−1<θ˜<θ(a(R))[(n−1)(1+s)θ˜H1+θ˜↓,W(X|Y)−H1+(1+s)θ˜↓,W(X|Y)+δ1
−(1+s)log1−2e(n−1)[(θW(aW(R))−θ˜)aW(R)−θW(aW(R))H1+θW(aW(R))↓,W(X|Y)+θ˜H1+θ˜↓,W(X|Y)]+δ2]/s,
whereθ(a)=θ↓(a)anda(R)=a↓(R) are the inverse functions defined by (67) and (70), respectively,
δ1:=(1+s)δ¯(θ˜)−δ̲((1+s)θ˜),
δ2:=(θW(aW(R))−θ˜)R−(1+θ˜)δ̲(θW(aW(R)))+(1+θW(aW(R)))δ¯(θ˜)1+θW(aW(R)),
andδ¯(·)andδ̲(·) are given by (90) and (91), respectively.
Proof.
We first use (124) of Theorem 5 forQYn =PYn and Lemma 10. Then, we restrict the range ofθ˜as−1<θ˜<θW(aW(R))and setϑ=θW(aW(R))−θ˜1+θ˜. Then, we have the assertion of the theorem. □
Next, we derive tighter bounds under Assumption 2. To derive a lower bound on−logP¯s(Mn)in terms of the Rényi entropy of the transition matrix, we substitute the formula for the Rényi entropy in Lemma 11 for Lemma 14. Then, we have the following achievability bound.
Theorem 9
(Direct, Ass. 2). Suppose that transition matrix W satisfies Assumption 2. LetR:=1nlogMn. Then, we have:
−logP¯s(n)(Mn)≥sup−12≤θ≤0−θnR+(n−1)θH1+θ↑,W(X|Y)1+θ+ξ̲(θ),
whereξ̲(θ)is given by (98).
Finally, to derive an upper bound on $-\log P_s(M_n)$ in terms of the Rényi entropy for the transition matrix, we substitute the formula for the Rényi entropy given in Lemma 12 into Theorem 5 with $Q_{Y^n}=P_{Y^n}^{(1+\theta^W(a^W(R)))}$. Then, we can derive the following converse bound.

Theorem 10

(Converse, Ass. 2). Suppose that the transition matrix $W$ satisfies Assumption 2. Let $R:=\frac{1}{n}\log M_n$. For any $H^W(X|Y)<R<H_0^{\uparrow,W}(X|Y)$, we have:

$-\log P_s^{(n)}(M_n)\le\inf_{s>0}\inf_{-1<\tilde\theta<\theta^W(a^W(R))}\Big[(n-1)(1+s)\tilde\theta\big(H_{1+\tilde\theta,1+\theta^W(a^W(R))}^{W}(X|Y)-H_{1+(1+s)\tilde\theta,1+\theta^W(a^W(R))}^{W}(X|Y)\big)+\delta_1-(1+s)\log\Big(1-2e^{(n-1)\big[(\theta^W(a^W(R))-\tilde\theta)a^W(R)-\theta^W(a^W(R))H_{1+\theta^W(a^W(R))}^{\uparrow,W}(X|Y)+\tilde\theta H_{1+\tilde\theta,1+\theta^W(a^W(R))}^{W}(X|Y)\big]+\delta_2}\Big)\Big]\Big/s,$

where $\theta^W(a)=\theta^{\uparrow,W}(a)$ and $a^W(R)=a^{\uparrow,W}(R)$ are the inverse functions defined by (71) and (73), respectively,

$\delta_1:=(1+s)\bar{\zeta}(\tilde\theta,\theta^W(a^W(R)))-\underline{\zeta}((1+s)\tilde\theta,\theta^W(a^W(R))),$

$\delta_2:=(\theta^W(a^W(R))-\tilde\theta)R-\frac{(1+\tilde\theta)\underline{\zeta}(\theta^W(a^W(R)),\theta^W(a^W(R)))+(1+\theta^W(a^W(R)))\bar{\zeta}(\tilde\theta,\theta^W(a^W(R)))}{1+\theta^W(a^W(R))},$

and $\bar{\zeta}(\cdot,\cdot)$ and $\underline{\zeta}(\cdot,\cdot)$ are given by (102)–(105).

Proof.

We first use (124) of Theorem 5 with $Q_{Y^n}=P_{Y^n}^{(1+\theta^W(a^W(R)))}$ and Lemma 12. Then, we restrict the range of $\tilde\theta$ to $-1<\tilde\theta<\theta^W(a^W(R))$ and set $\vartheta=\frac{\theta^W(a^W(R))-\tilde\theta}{1+\tilde\theta}$. Then, we have the assertion of the theorem. □
3.4. Second-Order
By applying the central limit theorem for Markov chains (cf. [67] (Theorem 27.4, Example 27.6)) to Lemma 13 and to Lemma 18 with $Q_Y=P_Y$, and by using Theorem 2, we have the following.

Theorem 11.

Suppose that the transition matrix $W$ on $\mathcal{X}\times\mathcal{Y}$ satisfies Assumption 1. For arbitrary $\varepsilon\in(0,1)$, we have:

$\log M(n,\varepsilon)=\log\bar{M}(n,\varepsilon)+o(\sqrt{n})=nH^W(X|Y)+\sqrt{nV^W(X|Y)}\,\Phi^{-1}(1-\varepsilon)+o(\sqrt{n}).$
Proof.
The central limit theorem for the Markov process (cf. [67] (Theorem 27.4, Example 27.6)) guarantees that the random variable $(-\log P_{X^n|Y^n}(X^n|Y^n)-nH^W(X|Y))/\sqrt{n}$ asymptotically obeys the normal distribution with mean zero and variance $V^W(X|Y)$, where we use Theorem 2 to show that the limit of the variance is given by $V^W(X|Y)$. Let $R=\sqrt{V^W(X|Y)}\,\Phi^{-1}(1-\varepsilon)$. Substituting $M=e^{nH^W(X|Y)+\sqrt{n}R}$ and $\gamma=nH^W(X|Y)+\sqrt{n}R-n^{1/4}$ into Lemma 13, we have:

$\lim_{n\to\infty}\bar{P}_s^{(n)}\big(e^{nH^W(X|Y)+\sqrt{n}R}\big)\le\varepsilon.$

On the other hand, substituting $M=e^{nH^W(X|Y)+\sqrt{n}R}$ and $\gamma=nH^W(X|Y)+\sqrt{n}R+n^{1/4}$ into Lemma 18 with $Q_Y=P_Y$, we have:

$\lim_{n\to\infty}P_s^{(n)}\big(e^{nH^W(X|Y)+\sqrt{n}R}\big)\ge\varepsilon.$

Combining (140) and (141), we have the statement of the theorem. □

From the above theorem, the (first-order) compression limit of source coding with side-information for a Markov source under Assumption 1 is given by (although the compression limit of source coding with side-information for a Markov chain is known more generally [68], we need Assumption 1 to obtain a single-letter characterization):

$\lim_{n\to\infty}\frac{1}{n}\log M(n,\varepsilon)=\lim_{n\to\infty}\frac{1}{n}\log\bar{M}(n,\varepsilon)=H^W(X|Y)$

for any $\varepsilon\in(0,1)$. In the next subsections, we consider the asymptotic behavior of the error probability when the rate is larger than the compression limit $H^W(X|Y)$, in the moderate deviation regime and the large deviation regime, respectively.
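As an illustration of how Theorem 11 is used, the following sketch evaluates the Gaussian approximation $nH^W(X|Y)+\sqrt{nV^W(X|Y)}\,\Phi^{-1}(1-\varepsilon)$ of $\log M(n,\varepsilon)$. The entropy rate and variance values below are placeholder assumptions, which in practice would be computed via Theorem 2.

```python
from math import sqrt
from statistics import NormalDist

def second_order_log_M(n, entropy_rate, variance, eps):
    """Gaussian approximation of log M(n, eps) from Theorem 11 (in nats)."""
    return n * entropy_rate + sqrt(n * variance) * NormalDist().inv_cdf(1 - eps)

# placeholder values for H^W(X|Y) and V^W(X|Y) (assumptions, not computed here)
H, V = 0.38, 0.25
for n in (1_000, 10_000, 100_000):
    print(n, round(second_order_log_M(n, H, V, 1e-3) / n, 4))  # per-symbol rate
```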
3.5. Moderate Deviation

From Theorems 6 and 8, we have the following.

Theorem 12.

Suppose that the transition matrix $W$ satisfies Assumption 1. For arbitrary $t\in(0,1/2)$ and $\delta>0$, we have:

$\lim_{n\to\infty}-\frac{1}{n^{1-2t}}\log P_s^{(n)}\big(e^{nH^W(X|Y)+n^{1-t}\delta}\big)=\lim_{n\to\infty}-\frac{1}{n^{1-2t}}\log\bar{P}_s^{(n)}\big(e^{nH^W(X|Y)+n^{1-t}\delta}\big)=\frac{\delta^2}{2V^W(X|Y)}.$
Proof.
We apply Theorems 6 and 8 to the case with $R=H^W(X|Y)+n^{-t}\delta$, i.e., $\theta(a(R))=-\frac{n^{-t}\delta}{V^W(X|Y)}+o(n^{-t})$. For the achievability part, from (88) and Theorem 6, we have:

$-\log \bar{P}_s^{(n)}(M_n)\ge\sup_{-1\le\theta\le 0}\big[-\theta nR+(n-1)\theta H_{1+\theta}^{\downarrow,W}(X|Y)\big]+\inf_{-1\le\theta\le 0}\underline{\delta}(\theta)\ge\frac{n^{1-2t}\delta^2}{2V^W(X|Y)}+o(n^{1-2t}).$

To prove the converse part, we fix an arbitrary $s>0$ and choose $\tilde\theta$ to be $-\frac{n^{-t}\delta}{V^W(X|Y)}+n^{-2t}$. Then, Theorem 8 implies that:

$\limsup_{n\to\infty}-\frac{1}{n^{1-2t}}\log P_s(M_n)\le\limsup_{n\to\infty}n^{2t}\frac{1+s}{s}\tilde\theta\big(H_{1+\tilde\theta}^{\downarrow,W}(X|Y)-H_{1+(1+s)\tilde\theta}^{\downarrow,W}(X|Y)\big)=\limsup_{n\to\infty}n^{2t}\frac{1+s}{s}\big(-s\tilde\theta^2\big)\frac{dH_{1+\theta}^{\downarrow,W}(X|Y)}{d\theta}\bigg|_{\theta=\tilde\theta}=\frac{(1+s)\delta^2}{2V^W(X|Y)}.$

Since $s>0$ is arbitrary, letting $s\to 0$ yields the converse part.
□
Remark 6.
In the literature [13,69], the moderate deviation results are stated for $\epsilon_n$ such that $\epsilon_n\to 0$ and $n\epsilon_n^2\to\infty$, instead of $n^{-t}$ for $t\in(0,1/2)$. Although the former is slightly more general than the latter, we employ the latter formulation in Theorem 12 since the order of convergence is clearer. In fact, $n^{-t}$ in Theorem 12 can be replaced by a general $\epsilon_n$ without modifying the argument of the proof.
3.6. Large Deviation

From Theorems 6 and 8, we have the following.

Theorem 13.

Suppose that the transition matrix $W$ satisfies Assumption 1. For $H^W(X|Y)<R$, we have:

$\liminf_{n\to\infty}-\frac{1}{n}\log\bar{P}_s^{(n)}(e^{nR})\ge\sup_{-1\le\theta\le 0}\big[-\theta R+\theta H_{1+\theta}^{\downarrow,W}(X|Y)\big].$

On the other hand, for $H^W(X|Y)<R<H_0^{\downarrow,W}(X|Y)$, we have:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_s^{(n)}(e^{nR})\le-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\downarrow,W}(X|Y)=\sup_{-1<\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\downarrow,W}(X|Y)}{1+\theta}.$
Proof.
The achievability bound (151) follows from Theorem 6. The converse part (152) is proven from Theorem 8 as follows. We first fix $s>0$ and $-1<\tilde\theta<\theta(a(R))$. Then, Theorem 8 implies:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_s^{(n)}(e^{nR})\le\frac{1+s}{s}\tilde\theta\big(H_{1+\tilde\theta}^{\downarrow,W}(X|Y)-H_{1+(1+s)\tilde\theta}^{\downarrow,W}(X|Y)\big).$

By taking the limits $s\to 0$ and $\tilde\theta\to\theta(a(R))$, we have:

$\frac{1+s}{s}\tilde\theta\big(H_{1+\tilde\theta}^{\downarrow,W}(X|Y)-H_{1+(1+s)\tilde\theta}^{\downarrow,W}(X|Y)\big)=\frac{1}{s}\big(\tilde\theta H_{1+\tilde\theta}^{\downarrow,W}(X|Y)-(1+s)\tilde\theta H_{1+(1+s)\tilde\theta}^{\downarrow,W}(X|Y)\big)+\tilde\theta H_{1+\tilde\theta}^{\downarrow,W}(X|Y)$

$\to-\tilde\theta\frac{d[\theta H_{1+\theta}^{\downarrow,W}(X|Y)]}{d\theta}\bigg|_{\theta=\tilde\theta}+\tilde\theta H_{1+\tilde\theta}^{\downarrow,W}(X|Y)\quad(\text{as }s\to 0)$

$\to-\theta(a(R))\frac{d[\theta H_{1+\theta}^{\downarrow,W}(X|Y)]}{d\theta}\bigg|_{\theta=\theta(a(R))}+\theta(a(R))H_{1+\theta(a(R))}^{\downarrow,W}(X|Y)\quad(\text{as }\tilde\theta\to\theta(a(R)))$

$=-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\downarrow,W}(X|Y).$

Thus, (152) is proven. The alternative expression (153) is derived via Lemma 9. □
Under Assumption 2, from Theorems 9 and 10, we have the following tighter bound.
Theorem 14.
Suppose that the transition matrix $W$ satisfies Assumption 2. For $H^W(X|Y)<R$, we have:

$\liminf_{n\to\infty}-\frac{1}{n}\log\bar{P}_s^{(n)}(e^{nR})\ge\sup_{-\frac{1}{2}\le\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\uparrow,W}(X|Y)}{1+\theta}.$

On the other hand, for $H^W(X|Y)<R<H_0^{\uparrow,W}(X|Y)$, we have:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_s^{(n)}(e^{nR})\le-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\uparrow,W}(X|Y)=\sup_{-1<\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\uparrow,W}(X|Y)}{1+\theta}.$
Proof.
The achievability bound (160) follows from Theorem 9. The converse part (161) is proven from Theorem 10 as follows. We first fix $s>0$ and $-1<\tilde\theta<\theta(a(R))$. Then, Theorem 10 implies:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_s^{(n)}(e^{nR})\le\frac{1+s}{s}\tilde\theta\big(H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)-H_{1+(1+s)\tilde\theta,1+\theta(a(R))}^{W}(X|Y)\big).$

By taking the limits $s\to 0$ and $\tilde\theta\to\theta(a(R))$, we have:

$\frac{1+s}{s}\tilde\theta\big(H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)-H_{1+(1+s)\tilde\theta,1+\theta(a(R))}^{W}(X|Y)\big)=\frac{1}{s}\big(\tilde\theta H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)-(1+s)\tilde\theta H_{1+(1+s)\tilde\theta,1+\theta(a(R))}^{W}(X|Y)\big)+\tilde\theta H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)$

$\to-\tilde\theta\frac{d[\theta H_{1+\theta,1+\theta(a(R))}^{W}(X|Y)]}{d\theta}\bigg|_{\theta=\tilde\theta}+\tilde\theta H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)\quad(\text{as }s\to 0)$

$\to-\theta(a(R))\frac{d[\theta H_{1+\theta,1+\theta(a(R))}^{W}(X|Y)]}{d\theta}\bigg|_{\theta=\theta(a(R))}+\theta(a(R))H_{1+\theta(a(R))}^{\uparrow,W}(X|Y)\quad(\text{as }\tilde\theta\to\theta(a(R)))$

$=-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\uparrow,W}(X|Y).$

Thus, (161) is proven. The alternative expression (162) is derived via Lemma 9. □
Remark 7.
For $R\le R_{\mathrm{cr}}$, where (cf. (72) for the definition of $R(a)$):

$R_{\mathrm{cr}}:=R\!\left(\frac{d[\theta H_{1+\theta}^{\uparrow,W}(X|Y)]}{d\theta}\bigg|_{\theta=-\frac{1}{2}}\right)$

is the critical rate, the left-hand side of (76) in Lemma 9 is attained by parameters in the range $-1/2\le\theta\le 0$. Thus, the lower bound in (160) can be rewritten as:

$\sup_{-\frac{1}{2}\le\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\uparrow,W}(X|Y)}{1+\theta}=-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\uparrow,W}(X|Y).$

Thus, the lower bound and the upper bound coincide up to the critical rate.
Remark 8.
For source coding without side-information, by taking the limit of Theorem 7, we have:

$\liminf_{n\to\infty}-\frac{1}{n}\log\bar{P}_s^{(n)}(e^{nR})\ge\sup_{-1\le\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{W}(X)}{1+\theta}.$

On the other hand, as a special case of (152) without side-information, we have:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_s^{(n)}(e^{nR})\le\sup_{-1<\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{W}(X)}{1+\theta}$

for $H^W(X)<R<H_0^{W}(X)$. Thus, we can recover the results in [40,41] by our approach.
3.7. Numerical Example
In this subsection, to demonstrate the advantage of our finite-length bounds, we numerically evaluate the achievability bound in Theorem 7 and a special case of the converse bound in Theorem 8 for source coding without side-information. Thanks to aspect (A2), our numerical calculation shows that the upper finite-length bound is very close to the lower finite-length bound when the block length n is sufficiently large. Thanks to aspect (A1), we can calculate both bounds even for the large block length $n=1\times 10^{5}$ because the computational complexity behaves as $O(1)$.
We consider the binary transition matrix $W$ given by Figure 2, i.e.,

$W=\begin{pmatrix}1-p & q\\ p & 1-q\end{pmatrix}.$

In this case, the stationary distribution is:

$\tilde{P}(0)=\frac{q}{p+q},\qquad \tilde{P}(1)=\frac{p}{p+q}.$

The entropy is:

$H^W(X)=\frac{q}{p+q}h(p)+\frac{p}{p+q}h(q),$

where $h(\cdot)$ is the binary entropy function. The tilted transition matrix is:

$W_\theta=\begin{pmatrix}(1-p)^{1+\theta} & q^{1+\theta}\\ p^{1+\theta} & (1-q)^{1+\theta}\end{pmatrix}.$

The Perron–Frobenius eigenvalue is:

$\lambda_\theta=\frac{(1-p)^{1+\theta}+(1-q)^{1+\theta}+\sqrt{\big\{(1-p)^{1+\theta}-(1-q)^{1+\theta}\big\}^2+4p^{1+\theta}q^{1+\theta}}}{2}$

and its normalized eigenvector is:

$\tilde{P}_\theta(0)=\frac{q^{1+\theta}}{\lambda_\theta-(1-p)^{1+\theta}+q^{1+\theta}},\qquad \tilde{P}_\theta(1)=\frac{\lambda_\theta-(1-p)^{1+\theta}}{\lambda_\theta-(1-p)^{1+\theta}+q^{1+\theta}}.$

The normalized eigenvector of $W_\theta^{T}$ is also given by:

$\hat{P}_\theta(0)=\frac{p^{1+\theta}}{\lambda_\theta-(1-p)^{1+\theta}+p^{1+\theta}},\qquad \hat{P}_\theta(1)=\frac{\lambda_\theta-(1-p)^{1+\theta}}{\lambda_\theta-(1-p)^{1+\theta}+p^{1+\theta}}.$
From these formulas, we can evaluate the bounds in Theorems 7 and 8. For $p=0.1$, $q=0.2$, the bounds are plotted in Figure 3 for the fixed error probability $\varepsilon=10^{-3}$. Although there is a gap between the achievability bound and the converse bound for rather small $n$, the gap is less than approximately 5% of the entropy rate for $n$ larger than 10,000. We also plot the bounds in Figure 4 for the fixed block length $n=10{,}000$ and varying $\varepsilon$. The gap between the achievability bound and the converse bound remains approximately 5% of the entropy rate even for $\varepsilon$ as small as $10^{-10}$.

The gap between the achievability bound and the converse bound in Figure 3 is rather large compared to a similar numerical experiment conducted in [1]. One reason for the gap is that our bounds are exponential-type bounds; for instance, when the source is i.i.d., the achievability bound essentially reduces to the so-called Gallager bound [64]. However, an advantage of our bounds is that the computational complexity does not depend on the block length. The computational complexities of the bounds plotted in [1] depend on the block length, and numerical computation of those bounds for Markov sources seems to be difficult.

When $p=q$, an alternative approach to derive tighter bounds is to encode the Markov transitions, i.e., $1[X_i=X_{i+1}]$, instead of the source itself (cf. [45] (Example 4)). Then, the analysis reduces to the i.i.d. case. However, such an approach is possible only when $p=q$.
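The closed-form expressions above translate directly into a numerical routine. The following sketch (illustrative only, not the authors' code) computes $\lambda_\theta$ and the leading per-symbol term of the achievability bound in Theorem 7, namely $\sup_{-1<\theta\le 0}[-\theta R-\log\lambda_\theta]/(1+\theta)$, using the relation $\theta H_{1+\theta}^{W}(X)=-\log\lambda_\theta$; the correction term $\underline{\delta}(\theta)$ and the distinction between $n$ and $n-1$ are ignored here for brevity.

```python
import numpy as np

def lambda_theta(p, q, theta):
    """Perron-Frobenius eigenvalue of the tilted matrix W_theta
    for the binary transition matrix W = [[1-p, q], [p, 1-q]]."""
    a = (1 - p) ** (1 + theta)
    d = (1 - q) ** (1 + theta)
    bc = (p ** (1 + theta)) * (q ** (1 + theta))
    return (a + d + np.sqrt((a - d) ** 2 + 4 * bc)) / 2

def achievability_exponent(R, p, q, thetas=np.linspace(-0.999, 0.0, 2000)):
    """Leading per-symbol term of the Theorem 7 bound:
    sup_theta [-theta*R - log(lambda_theta)] / (1 + theta).
    The O(1) correction term underline-delta(theta) is omitted (assumption)."""
    vals = (-thetas * R - np.log(lambda_theta(p, q, thetas))) / (1 + thetas)
    return vals.max()

if __name__ == "__main__":
    p, q = 0.1, 0.2
    h = lambda x: -x * np.log(x) - (1 - x) * np.log(1 - x)   # binary entropy (nats)
    HW = q / (p + q) * h(p) + p / (p + q) * h(q)             # entropy rate H^W(X)
    for R in (HW + 0.05, HW + 0.10):
        print(f"R = {R:.3f}, exponent >= {achievability_exponent(R, p, q):.4f}")
```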
3.8. Summary of the Results
The results obtained in this section are summarized in Table 2. The check marks ✓ indicate that the tight asymptotic bounds (large deviation, moderate deviation, and second-order) can be obtained from those bounds. The marks ✓* indicate that the large deviation bound can be derived up to the critical rate. The computational complexity “Tail” indicates that the computational complexities of those bounds depend on the computational complexities of tail probabilities. It should be noted that Theorem 8 is derived from a special case ($Q_Y=P_Y$) of Theorem 5. The asymptotically optimal choice is $Q_Y=P_Y^{(1+\theta)}$, which corresponds to Corollary 1. Under Assumption 1, we can derive the bound of the Markov case only for that special choice of $Q_Y$, while under Assumption 2, we can derive the bound of the Markov case for the optimal choice of $Q_Y$.
4. Channel Coding
In this section, we investigate channel coding for a conditional additive channel. The first part of this section discusses general properties of channel coding for a conditional additive channel. The second part discusses the properties of channel coding when the conditional additive noise of the channel is Markovian. The first part starts by presenting the problem setting in Section 4.1, where we introduce the conditional additive channel. Section 4.2 gives a canonical method to convert a regular channel into a conditional additive channel. Section 4.3 gives a method to convert a BPSK-AWGN channel into a conditional additive channel. Then, we show some single-shot achievability bounds in Section 4.4 and single-shot converse bounds in Section 4.5.

As the second part, we derive finite-length bounds for the Markov noise channel in Section 4.6. Then, we derive the second-order rate in Section 4.7. In Section 4.8 and Section 4.9, we show the asymptotic characterizations in the moderate deviation regime and the large deviation regime, respectively, by using those finite-length bounds.
4.1. Formulation for the Conditional Additive Channel
4.1.1. Single-Shot Case
We first present the problem formulation in the single-shot setting. For a channel $P_{B|A}(b|a)$ with input alphabet $\mathcal{A}$ and output alphabet $\mathcal{B}$, a channel code $\Psi=(e,d)$ consists of an encoder $e:\{1,\dots,M\}\to\mathcal{A}$ and a decoder $d:\mathcal{B}\to\{1,\dots,M\}$. The average decoding error probability is defined by:

$P_c[\Psi]:=\sum_{m=1}^{M}\frac{1}{M}P_{B|A}(\{b: d(b)\ne m\}\,|\,e(m)).$

For notational convenience, we introduce the error probability under the condition that the message size is $M$:

$P_c(M):=\inf_{\Psi}P_c[\Psi].$

Assume that the input alphabet $\mathcal{A}$ is the same set as the output alphabet $\mathcal{B}$ and that they equal an additive group $\mathcal{X}$. When the transition matrix $P_{B|A}(b|a)$ is given as $P_X(b-a)$ by using a distribution $P_X$ on $\mathcal{X}$, the channel is called additive.

To extend the concept of the additive channel, we consider the case when the input alphabet $\mathcal{A}$ is an additive group $\mathcal{X}$ and the output alphabet $\mathcal{B}$ is the product set $\mathcal{X}\times\mathcal{Y}$. When the transition matrix $P_{B|A}(x,y|a)$ is given as $P_{XY}(x-a,y)$ by using a distribution $P_{XY}$ on $\mathcal{X}\times\mathcal{Y}$, the channel is called conditional additive. In this paper, we are exclusively interested in conditional additive channels. As explained in Section 4.2, a channel is a conditional additive channel if and only if it is a regular channel in the sense of [31]. When we need to express the underlying noise distribution explicitly, we denote the average decoding error probability by $P_c[\Psi|P_{XY}]$.
4.1.2. n-Fold Extension
When we consider the n-fold extension, the channel code is denoted with the subscript $n$, such as $\Psi_n=(e_n,d_n)$. The error probabilities given in (183) and (184) are written with the superscript $(n)$ as $P_c^{(n)}[\Psi_n]$ and $P_c^{(n)}(M_n)$, respectively. Instead of evaluating the error probability $P_c^{(n)}(M_n)$ for given $M_n$, we are also interested in evaluating:

$M(n,\varepsilon):=\sup\{M_n : P_c^{(n)}(M_n)\le\varepsilon\}$

for given $0\le\varepsilon\le 1$.

When the n-fold channel is conditional additive, the channel is given by:

$P_{B^n|A^n}(x^n,y^n|a^n)=P_{X^nY^n}(x^n-a^n,y^n),$

where $P_{X^nY^n}$ is a noise distribution on $\mathcal{X}^n\times\mathcal{Y}^n$.
For the code construction, we investigate linear codes. For an $(n,k)$ linear code $C_n\subset\mathcal{A}^n$, there exists a parity check matrix $f_n:\mathcal{A}^n\to\mathcal{A}^{n-k}$ such that the kernel of $f_n$ is $C_n$. That is, given a parity check matrix $f_n:\mathcal{A}^n\to\mathcal{A}^{n-k}$, we define the encoder $I_{\mathrm{Ker}(f_n)}:C_n\to\mathcal{A}^n$ as the embedding of the kernel $\mathrm{Ker}(f_n)$. Then, using the decoder $d_{f_n}:=\mathrm{argmin}_{d}P_c[(I_{\mathrm{Ker}(f_n)},d)]$, we define $\Psi(f_n)=(I_{\mathrm{Ker}(f_n)},d_{f_n})$.

Here, we employ a randomized choice of the parity check matrix. In particular, instead of a general two-universal hash function, we focus on linear two-universal hash functions, because the linearity is required in the above relation with source coding. Therefore, denoting the set of linear two-universal hash functions from $\mathcal{A}^n$ to $\mathcal{A}^{n-k}$ by $\mathcal{F}_l$, we introduce the quantity:

$\bar{P}_c(n,k):=\sup_{F_n\in\mathcal{F}_l}\mathbb{E}_{F_n}P_c^{(n)}[\Psi(F_n)].$

Taking the infimum over all linear codes associated with $F_n$ (cf. (113)), we obviously have:

$P_c^{(n)}(|\mathcal{A}|^{k})\le\bar{P}_c(n,k).$

When we consider the error probability for conditional additive channels, we use the notation $\bar{P}_c(n,k|P_{XY})$ so that the underlying noise distribution is explicit. We are also interested in characterizing:

$k(n,\varepsilon):=\sup\{k : \bar{P}_c(n,k)\le\varepsilon\}$

for given $0\le\varepsilon\le 1$.
4.2. Conversion from the Regular Channel to the Conditional Additive Channel

The aim of this subsection is to show the following theorem by presenting the conversion rule between these two types of channels. We also see that the binary erasure symmetric channel is an example of a regular channel.
Theorem 15.
A channel is a regular channel in the sense of [31] if and only if it can be written as a conditional additive channel.
To show the conversion from a conditional additive channel to a regular channel, we assume that the input alphabet $\mathcal{A}$ has an additive group structure. Let $P_{\tilde{X}}$ be a distribution on the output alphabet $\mathcal{B}$. Let $\pi_a$ be a representation of the group $\mathcal{A}$ on $\mathcal{B}$, and let $G=\{\pi_a : a\in\mathcal{A}\}$. A regular channel [31] is defined by:

$P_{B|A}(b|a)=P_{\tilde{X}}(\pi_a(b)).$

The group action induces the orbit:

$\mathrm{Orb}(b):=\{\pi_a(b) : a\in\mathcal{A}\}.$

The set of all orbits constitutes a disjoint partition of $\mathcal{B}$. The set of orbits is denoted by $\bar{\mathcal{B}}$, and we let $\varpi:\mathcal{B}\to\bar{\mathcal{B}}$ be the map that sends each element $b$ to its orbit $\mathrm{Orb}(b)$.
Example 4
(Binary erasure symmetric channel). Let $\mathcal{A}=\{0,1\}$, $\mathcal{B}=\{0,1,?\}$, and:

$P_{\tilde{X}}(b)=\begin{cases}1-p-p' & \text{if } b=0,\\ p & \text{if } b=1,\\ p' & \text{if } b=?.\end{cases}$

Then, let:

$\pi_0=\begin{pmatrix}0&1&?\\ 0&1&?\end{pmatrix},\qquad \pi_1=\begin{pmatrix}0&1&?\\ 1&0&?\end{pmatrix}.$

The channel defined in this way is a regular channel (see Figure 5). In this case, there are two orbits: $\{0,1\}$ and $\{?\}$.

Let $\mathcal{B}=\mathcal{X}\times\mathcal{Y}$ and $P_{\tilde{X}}=P_{XY}$ for some joint distribution on $\mathcal{X}\times\mathcal{Y}$. Now, we consider a conditional additive channel whose transition matrix $P_{B|A}(x,y|a)$ is given as $P_{XY}(x-a,y)$. When the group action is given by $\pi_a(x,y)=(x-a,y)$, the above conditional additive channel is given as a regular channel. In this case, there are $|\mathcal{Y}|$ orbits, and the size of each orbit is $|\mathcal{X}|$. This fact shows that any conditional additive channel can be written as a regular channel; that is, it shows the “if” part of Theorem 15.
Conversely, we present the conversion from a regular channel to a conditional additive channel. We first explain the construction for the single-shot channel. For the random variable $\tilde{X}\sim P_{\tilde{X}}$, let $\mathcal{Y}=\bar{\mathcal{B}}$ and let $Y=\varpi(\tilde{X})$ be the random variable describing the representatives of the orbits. For $y=\mathrm{Orb}(b)$ and each orbit $\mathrm{Orb}(b)$, we fix an element $0_y\in\mathrm{Orb}(b)$. Then, we define:

$P_Y(y):=P_{\tilde{X}}(\mathrm{Orb}(b)),\qquad P_{X,Y}(a,y):=\frac{P_{\tilde{X}}(\pi_a(0_y))}{|\{a'\in\mathcal{A} \mid \pi_a(0_y)=\pi_{a'}(0_y)\}|}.$

Then, we obtain the virtual channel $P_{X,Y|A}$ as $P_{X,Y|A}(x,y|a):=P_{X,Y}(x-a,y)$. Using the conditional distributions $P_{X,Y|B}$ and $P_{B|X,Y}$ given by:

$P_{X,Y|B}(a,y|b)=\begin{cases}\dfrac{1}{|\{a'\in\mathcal{A}\mid\pi_a(0_y)=\pi_{a'}(0_y)\}|} & \text{when } b=\pi_a(0_y),\\ 0 & \text{otherwise},\end{cases}$

$P_{B|X,Y}(b|a,y)=\begin{cases}1 & \text{when } b=\pi_a(0_y),\\ 0 & \text{otherwise},\end{cases}$

we obtain the relations:

$P_{B|A}(b|a)=\sum_{x,y}P_{B|X,Y}(b|x,y)P_{X,Y|A}(x,y|a),\qquad P_{X,Y|A}(x,y|a)=\sum_{b}P_{X,Y|B}(x,y|b)P_{B|A}(b|a).$

These two equations show that the receiver's information of the virtual conditional additive channel $P_{X,Y|A}$ and the receiver's information of the regular channel $P_{B|A}$ can be converted into each other. Hence, we can say that a regular channel in the sense of [31] can be written as a conditional additive channel, which shows the “only if” part of Theorem 15.
Example 5
(Binary erasure symmetric channel revisited). We convert the regular channel of Example 4 into a conditional additive channel. Let us label the orbit $\{0,1\}$ as $y=0$ and $\{?\}$ as $y=1$. Let $0_0=0$ and $0_1=?$. Then:

$P_{X,Y}(x,0)=\begin{cases}1-p-p' & \text{if } x=0,\\ p & \text{if } x=1,\end{cases}\qquad P_{X,Y}(x,1)=\frac{p'}{2}.$
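To make the conversion concrete, the following sketch (illustrative only) carries out the orbit construction of this subsection for the binary erasure symmetric channel of Examples 4 and 5; representing each $\pi_a$ as a Python dictionary is an implementation choice, not the paper's notation.

```python
from fractions import Fraction

# Regular channel data for the binary erasure symmetric channel (Example 4):
# noise distribution on B = {0, 1, '?'} and group actions pi_a as permutations of B.
p, pp = Fraction(1, 10), Fraction(1, 5)
P_tilde = {0: 1 - p - pp, 1: p, '?': pp}
pi = {0: {0: 0, 1: 1, '?': '?'},      # pi_0 = identity
      1: {0: 1, 1: 0, '?': '?'}}      # pi_1 swaps 0 and 1, fixes '?'

# Orbits of the group action and a fixed representative 0_y in each orbit.
orbits = [{0, 1}, {'?'}]
rep = {y: sorted(orbit, key=str)[0] for y, orbit in enumerate(orbits)}

# P_{X,Y}(a, y) = P_tilde(pi_a(0_y)) / |{a' : pi_{a'}(0_y) = pi_a(0_y)}|
P_XY = {}
for y, orbit in enumerate(orbits):
    zero_y = rep[y]
    for a in pi:
        b = pi[a][zero_y]
        stab = sum(1 for a2 in pi if pi[a2][zero_y] == b)
        P_XY[(a, y)] = P_tilde[b] / stab

print(P_XY)  # {(0,0): 1-p-p', (1,0): p, (0,1): p'/2, (1,1): p'/2}
```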
When we consider the $n$th extension, the channel is given by:

$P_{B^n|A^n}(b^n|a^n)=P_{\tilde{X}^n}(\pi_{a^n}(b^n)),$

where the $n$th extension of the group action is defined by $\pi_{a^n}(b^n)=(\pi_{a_1}(b_1),\dots,\pi_{a_n}(b_n))$.

Similarly, for the $n$-fold extension, we can also construct the virtual conditional additive channel. More precisely, for $\tilde{X}^n\sim P_{\tilde{X}^n}$, we set $Y^n=\varpi(\tilde{X}^n)=(\varpi(\tilde{X}_1),\dots,\varpi(\tilde{X}_n))$ and:

$P_{X^n,Y^n}(x^n,y^n):=\frac{P_{\tilde{X}^n}(\pi_{x^n}(0_{y^n}))}{|\{a'^n\in\mathcal{A}^n\mid\pi_{x^n}(0_{y^n})=\pi_{a'^n}(0_{y^n})\}|}.$
4.3. Conversion of the BPSK-AWGN Channel into the Conditional Additive Channel
Although we have only considered finite input/output sources and channels throughout the paper, in order to demonstrate the utility of the conditional additive channel framework, let us consider the additive white Gaussian noise (AWGN) channel with binary phase shift keying (BPSK) in this section. Let $\mathcal{A}=\{0,1\}$ be the input alphabet of the channel, and let $\mathcal{B}=\mathbb{R}$ be the output alphabet of the channel. For an input $a\in\mathcal{A}$ and Gaussian noise $Z$ with mean zero and variance $\sigma^2$, the output of the channel is given by $B=(-1)^a+Z$. Then, the conditional probability density function of this channel is given as:

$P_{B|A}(b|a)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(b-(-1)^a)^2}{2\sigma^2}}.$

Now, to define a conditional additive channel, we choose $\mathcal{Y}:=\mathbb{R}_+$ and define the probability density function $p_Y$ on $\mathcal{Y}$ with respect to the Lebesgue measure and the conditional distribution $P_{X|Y}(x|y)$ as:

$p_Y(y):=\frac{1}{\sqrt{2\pi}\sigma}\Big(e^{-\frac{(y-1)^2}{2\sigma^2}}+e^{-\frac{(y+1)^2}{2\sigma^2}}\Big),$

$P_{X|Y}(0|y):=\frac{e^{-\frac{(y-1)^2}{2\sigma^2}}}{e^{-\frac{(y-1)^2}{2\sigma^2}}+e^{-\frac{(y+1)^2}{2\sigma^2}}},\qquad P_{X|Y}(1|y):=\frac{e^{-\frac{(y+1)^2}{2\sigma^2}}}{e^{-\frac{(y-1)^2}{2\sigma^2}}+e^{-\frac{(y+1)^2}{2\sigma^2}}}$

for $y\in\mathbb{R}_+$. When we define $b:=(-1)^x y\in\mathbb{R}$ for $x\in\{0,1\}$ and $y\in\mathbb{R}_+$, we have:

$p_{XY|A}(y,x|a)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y-(-1)^{a+x})^2}{2\sigma^2}}=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{((-1)^x y-(-1)^a)^2}{2\sigma^2}}=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(b-(-1)^a)^2}{2\sigma^2}}.$
The relations (202) and (206) show that the AWGN channel with BPSK is given as a conditional additive channel in the above sense.
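The following sketch (illustrative only) implements the mapping described above: a BPSK-AWGN output $b$ is converted into the pair $(x,y)$ with $y=|b|$ and $x$ the sign bit, and the conditional distribution $P_{X|Y}(\cdot|y)$ is evaluated; the toy simulation at the end is an assumption added for demonstration.

```python
import numpy as np

def bpsk_awgn_to_conditional_additive(b):
    """Map a BPSK-AWGN output b in R to the pair (x, y) with y = |b| in R_+
    and x in {0, 1} the sign bit, so that b = (-1)^x * y (Section 4.3)."""
    y = np.abs(b)
    x = (b < 0).astype(int)
    return x, y

def posterior_given_y(y, sigma):
    """P_{X|Y}(0|y) and P_{X|Y}(1|y) for the noise pair (X, Y)."""
    w0 = np.exp(-(y - 1) ** 2 / (2 * sigma ** 2))
    w1 = np.exp(-(y + 1) ** 2 / (2 * sigma ** 2))
    return w0 / (w0 + w1), w1 / (w0 + w1)

# toy usage: transmit the all-zero codeword (+1 symbols) over the channel
rng = np.random.default_rng(0)
sigma = 0.8
b = 1.0 + sigma * rng.standard_normal(5)
x, y = bpsk_awgn_to_conditional_additive(b)
p0, p1 = posterior_given_y(y, sigma)
print(x, np.round(p0, 3))
```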
By noting this observation, as explained in Remark 5, the single-shot achievability bounds in Section 3.2 are also valid for continuous Y. Furthermore, the discussions for the single-shot converse bounds in Section 4.5 hold even for continuous Y. Therefore, the bounds in Section 4.4 and Section 4.5 are also applicable to the BPSK-AWGN channel.
In particular, in the $n$-fold memoryless extension of the BPSK-AWGN channel, the information measures for the noise distribution are given as $n$ times the single-shot information measures for the noise distribution. Even in this case, the upper and lower bounds in Section 4.4 and Section 4.5 are applicable once the information measures are replaced by $n$ times the single-shot information measures. Therefore, we obtain finite-length upper and lower bounds on the optimal coding length for the memoryless BPSK-AWGN channel. Furthermore, even when the additive noise is not Gaussian, if the probability density function $p_Z$ of the additive noise $Z$ satisfies the symmetry $p_Z(z)=p_Z(-z)$, the BPSK channel with additive noise $Z$ can be converted into a conditional additive channel in the same way.
4.4. Achievability Bound Derived by Source Coding with Side-Information

In this subsection, we construct a code for a conditional additive channel from a code for source coding with side-information in a canonical way. In this construction, the decoding error probability of the channel code equals that of the source code.

When the channel is the conditional additive channel with noise distribution $P_{X^nY^n}$ as in (186) and $\mathcal{X}=\mathcal{A}$ is the finite field $\mathbb{F}_q$, we can construct a linear channel code from a source code with full side-information whose encoder and decoder are $f_n$ and $d_n$ as follows. First, we assume linearity of the source encoder $f_n$. Let $C_n(f_n)$ be the kernel of the linear encoder $f_n$ of the source code. Suppose that the sender sends a codeword $c_n\in C_n(f_n)$ and $(c_n+X^n,Y^n)$ is received. Then, the receiver computes the syndrome $f_n(c_n+X^n)=f_n(c_n)+f_n(X^n)=f_n(X^n)$, estimates $X^n$ from $f_n(X^n)$ and $Y^n$, and subtracts the estimate from $c_n+X^n$. That is, we choose the channel decoder $\tilde{d}_n$ as:

$\tilde{d}_n(x'^n,y^n):=x'^n-d_n(f_n(x'^n),y^n).$

Decoding succeeds for this channel code if and only if $d_n(f_n(X^n),Y^n)$ equals $X^n$. Thus, the error probability of this channel code coincides with that of the source code for the correlated source $(X^n,Y^n)$. In summary, we have the following lemma, which was first pointed out in [27]; a small illustrative sketch of this construction is given after the lemma.
Lemma 19
([27], (19)). Given a linear encoder $f_n$ and a decoder $d_n$ for a source code with side-information with distribution $P_{X^nY^n}$, let $I_{\mathrm{Ker}(f_n)}$ and $\tilde{d}_n$ be the channel encoder and decoder induced by $(f_n,d_n)$. Then, the error probability of channel coding for the conditional additive channel with noise distribution $P_{X^nY^n}$ satisfies:

$P_c^{(n)}[(I_{\mathrm{Ker}(f_n)},\tilde{d}_n)|P_{X^nY^n}]=P_s^{(n)}[(f_n,d_n)|P_{X^nY^n}].$

Furthermore (in fact, when we additionally impose linearity on the random function $F$ in the definition (114) of $\bar{P}_s(M|P_{X^nY^n})$, the result in [27] implies that the equality in (209) holds), taking the infimum for $F_n$ chosen to be a linear two-universal hash function, we also have:

$\bar{P}_c(n,k)=\sup_{F_n\in\mathcal{F}_l}\mathbb{E}_{F_n}P_c^{(n)}[\Psi(F_n)]\le\sup_{F_n\in\mathcal{F}_l}\mathbb{E}_{F_n}P_c^{(n)}[(I_{\mathrm{Ker}(F_n)},\tilde{d}_n)]=\sup_{F_n\in\mathcal{F}_l}\mathbb{E}_{F_n}P_s^{(n)}[(F_n,d_n)]\le\sup_{F_n\in\mathcal{F}}\mathbb{E}_{F_n}P_s^{(n)}[(F_n,d_n)]=\bar{P}_s^{(n)}(|\mathcal{A}^{n-k}|).$
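The following is a minimal sketch (not the authors' implementation) of the syndrome-based construction above over $\mathbb{F}_2$: a parity check matrix plays the role of the linear source encoder $f_n$, and a brute-force minimum-weight search stands in for the source decoder $d_n$ (a real decoder would also exploit the side-information $Y^n$).

```python
import itertools
import numpy as np

# Minimal illustration of the syndrome construction over F_2 (q = 2).
H = np.array([[1, 0, 1, 1, 0],
              [0, 1, 1, 0, 1]])           # f_n : F_2^5 -> F_2^2
n = H.shape[1]

def f(x):                                  # syndrome map f_n(x) = Hx over F_2
    return H.dot(x) % 2

def source_decoder(syndrome):
    """Stand-in for d_n: return a minimum-weight noise pattern with the given
    syndrome (brute force; no side-information is used in this toy example)."""
    best = None
    for bits in itertools.product([0, 1], repeat=n):
        e = np.array(bits)
        if np.array_equal(f(e), syndrome) and (best is None or e.sum() < best.sum()):
            best = e
    return best

def channel_decoder(received):
    """d~_n(x') = x' - d_n(f_n(x'))."""
    return (received - source_decoder(f(received))) % 2

# pick a codeword in Ker(f_n), add a one-bit additive noise, and decode
codeword = np.array([1, 1, 1, 0, 0])
assert not f(codeword).any()               # codeword lies in the kernel of f_n
noise = np.array([0, 0, 1, 0, 0])
print(channel_decoder((codeword + noise) % 2))   # recovers the codeword
```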
By using this observation and the results in Section 3.2, we can derive the achievability bounds. By using the conversion argument in Section 4.2, we can also construct a channel code for a regular channel from a source code with full side-information. Although the following bounds are just a specialization of known bounds for conditional additive channels, we review these bounds here to clarify the correspondence between the bounds in source coding with side-information and channel coding.
From Lemma 13 and (209), we have the following.
Lemma 20
([2]). The following bound holds:
$\bar{P}_c(n,k)\le\inf_{\gamma\ge 0}\Big[P_{X^nY^n}\Big\{\log\frac{1}{P_{X^n|Y^n}(x^n|y^n)}>\gamma\Big\}+\frac{e^{\gamma}}{|\mathcal{A}|^{n-k}}\Big].$
From Lemma 14 and (209), we have the following exponential-type bound.
Lemma 21
([6]). The following bound holds:
$\bar{P}_c(n,k)\le\inf_{-\frac{1}{2}\le\theta\le 0}|\mathcal{A}|^{\frac{\theta(n-k)}{1+\theta}}e^{-\frac{\theta}{1+\theta}H_{1+\theta}^{\uparrow}(X^n|Y^n)}.$
From Lemma 15 and (209), we have the following slightly loose exponential bound.
Lemma 22
([3,70]). The following bound holds (The bound (212) was derived in the original Japanese edition of [3], but it is not written in the English edition [3]. The quantum analogue was derived in [70].):
$\bar{P}_c(n,k)\le\inf_{-1\le\theta\le 0}|\mathcal{A}|^{\theta(n-k)}e^{-\theta H_{1+\theta}^{\downarrow}(X^n|Y^n)}.$
When X has no side-information, i.e., the virtual channel is additive, we have the following special case of Lemma 21.
Lemma 23
([6]). Suppose that X has no side-information. Then, the following bound holds:
$\bar{P}_c(n,k)\le\inf_{-\frac{1}{2}\le\theta\le 0}|\mathcal{A}|^{\frac{\theta(n-k)}{1+\theta}}e^{-\frac{\theta}{1+\theta}H_{1+\theta}(X^n)}.$
4.5. Converse Bound
In this subsection, we show some converse bounds. The following is the information spectrum-type converse shown in [4].
Lemma 24
([4], Lemma 4). For any code $\Psi_n=(e_n,d_n)$ and any output distribution $Q_{B^n}\in\mathcal{P}(\mathcal{B}^n)$, we have:

$P_c^{(n)}[\Psi_n]\ge\sup_{\gamma\ge 0}\Big[\sum_{m=1}^{M_n}\frac{1}{M_n}P_{B^n|A^n}\Big\{\log\frac{P_{B^n|A^n}(b^n|e_n(m))}{Q_{B^n}(b^n)}<\gamma\Big\}-\frac{e^{\gamma}}{M_n}\Big].$
When a channel is a conditional additive channel, we have:

$P_{B^n|A^n}(a^n+x^n,y^n|a^n)=P_{X^nY^n}(x^n,y^n).$

By taking the output distribution $Q_{B^n}$ as:

$Q_{B^n}(a^n+x^n,y^n)=\frac{1}{|\mathcal{A}|^n}Q_{Y^n}(y^n)$

for some $Q_{Y^n}\in\mathcal{P}(\mathcal{Y}^n)$, as a corollary of Lemma 24, we have the following bound.
Lemma 25.
When a channel is a conditional additive channel, for any distribution $Q_{Y^n}\in\mathcal{P}(\mathcal{Y}^n)$, we have:

$P_c^{(n)}(M_n)\ge\sup_{\gamma\ge 0}\Big[P_{X^nY^n}\Big\{\log\frac{Q_{Y^n}(y^n)}{P_{X^nY^n}(x^n,y^n)}>n\log|\mathcal{A}|-\gamma\Big\}-\frac{e^{\gamma}}{M_n}\Big].$
Proof.
By noting (215) and (216), the first term on the right-hand side of (214) can be rewritten as:

$\sum_{m=1}^{M_n}\frac{1}{M_n}P_{B^n|A^n}\Big\{\log\frac{P_{B^n|A^n}(b^n|e_n(m))}{Q_{B^n}(b^n)}<\gamma\Big\}=\sum_{m=1}^{M_n}\frac{1}{M_n}P_{X^nY^n}\Big\{\log\frac{P_{B^n|A^n}(e_n(m)+x^n,y^n|e_n(m))}{Q_{B^n}(e_n(m)+x^n,y^n)}<\gamma\Big\}=P_{X^nY^n}\Big\{\log\frac{Q_{Y^n}(y^n)}{P_{X^nY^n}(x^n,y^n)}>n\log|\mathcal{A}|-\gamma\Big\},$
which implies the statement of the lemma. □
A similar argument to that of Theorem 5 also yields the following converse bound.
Theorem 16.
For any $Q_{Y^n}\in\mathcal{P}(\mathcal{Y}^n)$, we have:

$-\log P_c^{(n)}(M_n)\le\inf_{s>0,\,\tilde\theta\in\mathbb{R},\,\vartheta\ge 0}\Big[(1+s)\tilde\theta\big(H_{1+\tilde\theta}(P_{X^nY^n}|Q_{Y^n})-H_{1+(1+s)\tilde\theta}(P_{X^nY^n}|Q_{Y^n})\big)-(1+s)\log\Big(1-2e^{-\frac{-\vartheta R+(\tilde\theta+\vartheta(1+\tilde\theta))H_{1+\tilde\theta+\vartheta(1+\tilde\theta)}(P_{X^nY^n}|Q_{Y^n})-(1+\vartheta)\tilde\theta H_{1+\tilde\theta}(P_{X^nY^n}|Q_{Y^n})}{1+\vartheta}}\Big)\Big]\Big/s$

$\le\inf_{s>0}\inf_{-1<\tilde\theta<\theta(a(R))}\Big[(1+s)\tilde\theta\big(H_{1+\tilde\theta}(P_{X^nY^n}|Q_{Y^n})-H_{1+(1+s)\tilde\theta}(P_{X^nY^n}|Q_{Y^n})\big)-(1+s)\log\Big(1-2e^{(\theta(a(R))-\tilde\theta)a(R)-\theta(a(R))H_{1+\theta(a(R))}(P_{X^nY^n}|Q_{Y^n})+\tilde\theta H_{1+\tilde\theta}(P_{X^nY^n}|Q_{Y^n})}\Big)\Big]\Big/s,$

where $R=n\log|\mathcal{A}|-\log M_n$, and $\theta(a)$ and $a(R)$ are the inverse functions defined in (29) and (32), respectively.
Proof.
See Appendix L. □
4.6. Finite-Length Bound for the Markov Noise Channel
From this subsection on, we address the conditional additive channel whose conditional additive noise is subject to a Markov chain. Here, the input alphabet $\mathcal{A}^n$ equals the additive group $\mathcal{X}^n=\mathbb{F}_q^n$, and the output alphabet $\mathcal{B}^n$ is $\mathcal{X}^n\times\mathcal{Y}^n$. That is, the transition matrix describing the channel is given by using a transition matrix $W$ on $\mathcal{X}\times\mathcal{Y}$ and an initial distribution $Q$ as:

$P_{B^n|A^n}(x^n+a^n,y^n|a^n)=Q(x_1,y_1)\prod_{i=2}^{n}W(x_i,y_i|x_{i-1},y_{i-1}).$
As in Section 2.2, we consider two assumptions on the transition matrix W of the noise process(X,Y) , i.e., Assumptions 1 and 2. We also use the same notations as in Section 2.2.
Example 6
(Gilbert–Elliott channel with state-information available at the receiver). The Gilbert–Elliott channel [29,30] is characterized by a channel state $Y^n$ on $\mathcal{Y}^n=\{0,1\}^n$ and an additive noise $X^n$ on $\mathcal{X}^n=\{0,1\}^n$. The noise process $(X^n,Y^n)$ is a Markov chain induced by the transition matrix $W$ introduced in Example 3. For the channel input $a^n$, the channel output is given by $(a^n+X^n,Y^n)$ when the state-information is available at the receiver. Thus, this channel can be regarded as a conditional additive channel, and the transition matrix of the noise process satisfies Assumption 2.
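To make Example 6 concrete, the following sketch builds a joint noise transition matrix $W((x,y)|(x',y'))$ for a Gilbert–Elliott-type noise process. The parameters and the factorization into a state chain with state-dependent crossover probabilities are illustrative assumptions; they are not the ones of Example 3, which is not restated here.

```python
import numpy as np

# Two-state channel state Y with column-stochastic transition matrix P_Y,
# and noise bit X that equals 1 with probability eps[y] in state y.
g, b = 0.1, 0.3                 # P(Y_i=1 | Y_{i-1}=0), P(Y_i=0 | Y_{i-1}=1)
P_Y = np.array([[1 - g, b],     # P_Y[y, y'] = P(Y_i = y | Y_{i-1} = y')
                [g, 1 - b]])
eps = np.array([0.01, 0.2])     # crossover probability in each state (placeholder)

# index (x, y) as 2*y + x; W[(x,y), (x',y')] = P_Y[y, y'] * P(X_i = x | Y_i = y)
W = np.zeros((4, 4))
for y in range(2):
    for x in range(2):
        px = eps[y] if x == 1 else 1 - eps[y]
        for yp in range(2):
            for xp in range(2):
                W[2 * y + x, 2 * yp + xp] = P_Y[y, yp] * px

assert np.allclose(W.sum(axis=0), 1.0)   # each column is a conditional distribution
print(W.round(3))
```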
Proofs of the following bounds are almost the same as those in Section 3.3, and thus omitted. The combination of Lemmas 10 and 22 derives the following achievability bound.
Theorem 17
(Direct, Ass. 1). Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 1. Let $R:=\frac{n-k}{n}\log|\mathcal{A}|$. Then, we have:

$-\log\bar{P}_c(n,k)\ge\sup_{-1\le\theta\le 0}\big[-\theta nR+(n-1)\theta H_{1+\theta}^{\downarrow,W}(X|Y)+\underline{\delta}(\theta)\big].$

Theorem 16 with $Q_{Y^n}=P_{Y^n}$ and Lemma 10 yield the following converse bound.
Theorem 18
(Converse, Ass. 1). Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 1. Let $R:=\log|\mathcal{A}|-\frac{1}{n}\log M_n$. If $H^W(X|Y)<R<H_0^{\downarrow,W}(X|Y)$, then we have:

$-\log P_c^{(n)}(M_n)\le\inf_{s>0}\inf_{-1<\tilde\theta<\theta(a(R))}\Big[(n-1)(1+s)\tilde\theta\big(H_{1+\tilde\theta}^{\downarrow,W}(X|Y)-H_{1+(1+s)\tilde\theta}^{\downarrow,W}(X|Y)\big)+\delta_1-(1+s)\log\Big(1-2e^{(n-1)\big[(\theta(a(R))-\tilde\theta)a(R)-\theta(a(R))H_{1+\theta(a(R))}^{\downarrow,W}(X|Y)+\tilde\theta H_{1+\tilde\theta}^{\downarrow,W}(X|Y)\big]+\delta_2}\Big)\Big]\Big/s,$

where $\theta(a)=\theta^{\downarrow,W}(a)$ and $a(R)=a^{\downarrow,W}(R)$ are the inverse functions defined by (67) and (70), respectively, and:

$\delta_1:=(1+s)\bar{\delta}(\tilde\theta)-\underline{\delta}((1+s)\tilde\theta),$

$\delta_2:=(\theta(a(R))-\tilde\theta)R-\frac{(1+\tilde\theta)\underline{\delta}(\theta(a(R)))+(1+\theta(a(R)))\bar{\delta}(\tilde\theta)}{1+\theta(a(R))}.$
Next, we derive tighter bounds under Assumption 2. From Lemmas 11 and 21, we have the following achievability bound.
Theorem 19
(Direct, Ass. 2). Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 2. Let $R:=\frac{n-k}{n}\log|\mathcal{A}|$. Then, we have:

$-\log\bar{P}_c(n,k)\ge\sup_{-\frac{1}{2}\le\theta\le 0}\Big[\frac{-\theta nR+(n-1)\theta H_{1+\theta}^{\uparrow,W}(X|Y)}{1+\theta}+\underline{\xi}(\theta)\Big].$

By using Theorem 16 with $Q_{Y^n}=P_{Y^n}^{(1+\theta(a(R)))}$ and Lemma 12, we obtain the following converse bound.
Theorem 20
(Converse, Ass. 2). Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 2. Let $R:=\log|\mathcal{A}|-\frac{1}{n}\log M_n$. If $H^W(X|Y)<R<H_0^{\uparrow,W}(X|Y)$, we have:

$-\log P_c^{(n)}(M_n)\le\inf_{s>0}\inf_{-1<\tilde\theta<\theta(a(R))}\Big[(n-1)(1+s)\tilde\theta\big(H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)-H_{1+(1+s)\tilde\theta,1+\theta(a(R))}^{W}(X|Y)\big)+\delta_1-(1+s)\log\Big(1-2e^{(n-1)\big[(\theta(a(R))-\tilde\theta)a(R)-\theta(a(R))H_{1+\theta(a(R))}^{\uparrow,W}(X|Y)+\tilde\theta H_{1+\tilde\theta,1+\theta(a(R))}^{W}(X|Y)\big]+\delta_2}\Big)\Big]\Big/s,$

where $\theta(a)=\theta^{\uparrow,W}(a)$ and $a(R)=a^{\uparrow,W}(R)$ are the inverse functions defined by (71) and (73), respectively, and:

$\delta_1:=(1+s)\bar{\zeta}(\tilde\theta,\theta(a(R)))-\underline{\zeta}((1+s)\tilde\theta,\theta(a(R))),$

$\delta_2:=(\theta(a(R))-\tilde\theta)R-\frac{(1+\tilde\theta)\underline{\zeta}(\theta(a(R)),\theta(a(R)))+(1+\theta(a(R)))\bar{\zeta}(\tilde\theta,\theta(a(R)))}{1+\theta(a(R))}.$
Finally, when X has no side-information, i.e., the channel is additive, we obtain the following achievability bound from Lemma 23.
Theorem 21
(Direct, no-side-information). Let $R:=\frac{n-k}{n}\log|\mathcal{A}|$. Then, we have:

$-\log\bar{P}_c(n,k)\ge\sup_{-\frac{1}{2}\le\theta\le 0}\frac{-\theta nR+(n-1)\theta H_{1+\theta}^{W}(X)+\underline{\delta}(\theta)}{1+\theta}.$
Remark 9.
Our treatment of the Markov conditional additive channel covers Markov regular channels, because Markov regular channels can be reduced to Markov conditional additive channels as follows. Let $\tilde{X}=\{\tilde{X}_n\}_{n=1}^{\infty}$ be a Markov chain on $\mathcal{B}$ whose distribution is given by:

$P_{\tilde{X}^n}(\tilde{x}^n)=Q(\tilde{x}_1)\prod_{i=2}^{n}\tilde{W}(\tilde{x}_i|\tilde{x}_{i-1})$

for a transition matrix $\tilde{W}$ and an initial distribution $Q$. Let $(X,Y)=\{(X_n,Y_n)\}_{n=1}^{\infty}$ be the noise process of the conditional additive channel derived from the noise process $\tilde{X}$ of the regular channel by the argument of Section 4.2. Since we can write:

$P_{X^nY^n}(x^n,y^n)=Q\big(\iota_{y_1}^{-1}(\vartheta_{y_1}(x_1))\big)\frac{1}{|\mathrm{Stb}(0_{y_1})|}\prod_{i=2}^{n}\tilde{W}\big(\iota_{y_i}^{-1}(\vartheta_{y_i}(x_i))\,\big|\,\iota_{y_{i-1}}^{-1}(\vartheta_{y_{i-1}}(x_{i-1}))\big)\frac{1}{|\mathrm{Stb}(0_{y_i})|},$

the process $(X,Y)$ is also a Markov chain. Thus, the regular channel given by $\tilde{X}$ reduces to the conditional additive channel given by $(X,Y)$.
4.7. Second-Order
To discuss the asymptotic performance, we introduce the quantity:
$C:=\log|\mathcal{A}|-H^W(X|Y).$

By applying the central limit theorem (cf. [67] (Theorem 27.4, Example 27.6)) to Lemma 20 and to Lemma 25 with $Q_{Y^n}=P_{Y^n}$, and by using Theorem 2, we have the following.
Theorem 22.
Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 1. For arbitrary $\varepsilon\in(0,1)$, we have:

$\log M(n,\varepsilon)=k(n,\varepsilon)\log|\mathcal{A}|+o(\sqrt{n})=Cn+\sqrt{nV^W(X|Y)}\,\Phi^{-1}(\varepsilon)+o(\sqrt{n}).$
Proof. This theorem follows in the same manner as the proof of Theorem 11 by replacing Lemma 13 with Lemma 20 (achievability) and Lemma 18 with Lemma 25 (converse). □
From the above theorem, the (first-order) capacity of the conditional additive channel under Assumption 1 is given by:

$\lim_{n\to\infty}\frac{1}{n}\log M(n,\varepsilon)=\lim_{n\to\infty}\frac{1}{n}k(n,\varepsilon)\log|\mathcal{A}|=C$

for every $0<\varepsilon<1$. In the next subsections, we consider the asymptotic behavior of the error probability when the rate is smaller than the capacity, in the moderate deviation regime and the large deviation regime, respectively.
4.8. Moderate Deviation

From Theorems 17 and 18, we have the following.
Theorem 23.
Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 1. For arbitrary $t\in(0,1/2)$ and $\delta>0$, we have:

$\lim_{n\to\infty}-\frac{1}{n^{1-2t}}\log P_c^{(n)}\big(e^{nC-n^{1-t}\delta}\big)=\lim_{n\to\infty}-\frac{1}{n^{1-2t}}\log\bar{P}_c\Big(n,\frac{nC-n^{1-t}\delta}{\log|\mathcal{A}|}\Big)=\frac{\delta^2}{2V^W(X|Y)}.$
Proof. The theorem follows in the same manner as Theorem 12 by replacing Theorem 6 with Theorem 17 (achievability) and Theorem 8 with Theorem 18 (converse). □
4.9. Large Deviation

From Theorems 17 and 18, we have the following.
Theorem 24.
Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 1. For $H^W(X|Y)<R$, we have:

$\liminf_{n\to\infty}-\frac{1}{n}\log\bar{P}_c\Big(n,n\Big(1-\frac{R}{\log|\mathcal{A}|}\Big)\Big)\ge\sup_{-1\le\theta\le 0}\big[-\theta R+\theta H_{1+\theta}^{\downarrow,W}(X|Y)\big].$

On the other hand, for $H^W(X|Y)<R<H_0^{\downarrow,W}(X|Y)$, we have:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_c^{(n)}\big(e^{n(\log|\mathcal{A}|-R)}\big)\le-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\downarrow,W}(X|Y)=\sup_{-1<\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\downarrow,W}(X|Y)}{1+\theta}.$
Proof. The theorem follows in the same manner as Theorem 13 by replacing Theorem 6 with Theorem 17 (achievability) and Theorem 8 with Theorem 18 (converse). □
Under Assumption 2, from Theorems 19 and 20, we have the following tighter bound.
Theorem 25.
Suppose that the transition matrix $W$ of the conditional additive noise satisfies Assumption 2. For $H^W(X|Y)<R$, we have:

$\liminf_{n\to\infty}-\frac{1}{n}\log\bar{P}_c\Big(n,n\Big(1-\frac{R}{\log|\mathcal{A}|}\Big)\Big)\ge\sup_{-\frac{1}{2}\le\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\uparrow,W}(X|Y)}{1+\theta}.$

On the other hand, for $H^W(X|Y)<R<H_0^{\uparrow,W}(X|Y)$, we have:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_c^{(n)}\big(e^{n(\log|\mathcal{A}|-R)}\big)\le-\theta(a(R))a(R)+\theta(a(R))H_{1+\theta(a(R))}^{\uparrow,W}(X|Y)=\sup_{-1<\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{\uparrow,W}(X|Y)}{1+\theta}.$
Proof. The theorem follows the same manner as Theorem 14 by replacing Theorem 9 with Theorem 19 and Theorem 10 with Theorem 20. □
When X has no side-information, i.e., the channel is additive, from Theorem 21 and (245), we have the following.
Theorem 26.
For $H^W(X)<R$, we have:

$\liminf_{n\to\infty}-\frac{1}{n}\log\bar{P}_c\Big(n,n\Big(1-\frac{R}{\log|\mathcal{A}|}\Big)\Big)\ge\sup_{-\frac{1}{2}\le\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{W}(X)}{1+\theta}.$

On the other hand, for $H^W(X)<R<H_0^{W}(X)$, we have:

$\limsup_{n\to\infty}-\frac{1}{n}\log P_c^{(n)}\big(e^{n(\log|\mathcal{A}|-R)}\big)\le\sup_{-1<\theta\le 0}\frac{-\theta R+\theta H_{1+\theta}^{W}(X)}{1+\theta}.$
Proof.
The first claim follows by taking the limit of Theorem 21, and the second claim follows as a special case of (245) without side-information. □
4.10. Summary of the Results
The results shown in this section for the Markov conditional additive noise are summarized in Table 3. The check marks ✓ indicate that the tight asymptotic bounds (large deviation, moderate deviation, and second-order) can be obtained from those bounds. The marks ✓* indicate that the large deviation bound can be derived up to the critical rate. The computational complexity “Tail” indicates that the computational complexities of those bounds depend on the computational complexities of tail probabilities. It should be noted that Theorem 18 is derived from a special case ($Q_Y=P_Y$) of Theorem 16. The asymptotically optimal choice is $Q_Y=P_Y^{(1+\theta)}$. Under Assumption 1, we can derive the bound of the Markov case only for that special choice of $Q_Y$, while under Assumption 2, we can derive the bound of the Markov case for the optimal choice of $Q_Y$. Furthermore, Theorem 18 is not asymptotically tight in the large deviation regime in general, but it is tight if X has no side-information, i.e., the channel is additive. It should also be noted that Theorem 20 does not imply Theorem 18 even for the additive channel case since Assumption 2 restricts the structure of transition matrices even when X has no side-information.
5. Discussion and Conclusions
In this paper, we developed a unified approach to source coding with side-information and channel coding for a conditional additive channel for finite-length and asymptotic analyses of Markov chains. In our approach, the conditional Rényi entropies defined for transition matrices played important roles. Although we only illustrated the source coding with side-information and the channel coding for a conditional additive channel as applications of our approach, it could be applied to some other problems in information theory such as random number generation problems, as shown in another paper [60].
Our results for source coding with side-information and channel coding over the conditional additive channel have been extended to the case when the side-information is continuous (e.g., real-valued) and the joint distribution of X and Y is memoryless. Since this case covers the BPSK-AWGN channel, it can be expected to cover the MPSK-AWGN channel as well. Since such channels are often employed in practical channel coding, it is an interesting future topic to investigate finite-length bounds for these channels. However, we could not define the conditional Rényi entropy for transition matrices with continuous Y; hence, our Markov results could not be extended to such a continuous case. It is another interesting future topic to extend the obtained results to the case with continuous Y.
Problem | First-Order | Large Deviation | Moderate Deviation | Second-Order
---|---|---|---|---
SC with SI | Solved (Ass. 1) | Solved* (Ass. 2), O(1) | Solved (Ass. 1), O(1) | Solved (Ass. 1), Tail
CC for Conditional Additive Channels | Solved (Ass. 1) | Solved* (Ass. 2), O(1) | Solved (Ass. 1), O(1) | Solved (Ass. 1), Tail
Ach./Conv. | Markov | Single-Shot | Ps/P̄s | Complexity | Large Deviation | Moderate Deviation | Second Order
---|---|---|---|---|---|---|---
Achievability | Theorem 6 (Ass. 1) | Lemma 15 | P̄s | O(1) | | ✓ |
 | Theorem 9 (Ass. 2) | Lemma 14 | P̄s | O(1) | ✓* | ✓ |
 | Theorem 7 (No-side) | Lemma 16 | P̄s | O(1) | ✓* | ✓ |
 | | Lemma 13 | P̄s | Tail | | ✓ | ✓
Converse | Theorem 8 (Ass. 1) | (Theorem 5) | Ps | O(1) | | ✓ |
 | Theorem 10 (Ass. 2) | Corollary 1 | Ps | O(1) | ✓* | ✓ |
 | Theorem 8 (No-side) | (Theorem 5) | Ps | O(1) | ✓* | ✓ |
 | | Lemma 18 | Ps | Tail | | ✓ | ✓
Ach./Conv. | Markov | Single-Shot | Pc/P̄c | Complexity | Large Deviation | Moderate Deviation | Second Order
---|---|---|---|---|---|---|---
Achievability | Theorem 17 (Ass. 1) | Lemma 22 | P̄c | O(1) | | ✓ |
 | Theorem 19 (Ass. 2) | Lemma 21 | P̄c | O(1) | ✓* | ✓ |
 | Theorem 21 (Additive) | Lemma 23 | P̄c | O(1) | ✓* | ✓ |
 | | Lemma 20 | P̄c | Tail | | ✓ | ✓
Converse | Theorem 18 (Ass. 1) | (Theorem 16) | Pc | O(1) | | ✓ |
 | Theorem 20 (Ass. 2) | Theorem 16 | Pc | O(1) | ✓* | ✓ |
 | Theorem 18 (Additive) | (Theorem 16) | Pc | O(1) | ✓* | ✓ |
 | | Lemma 25 | Pc | Tail | | ✓ | ✓
Author Contributions
Conceptualization, M.H.; methodology, S.W.; formal analysis, S.W. and M.H.; writing, original draft preparation, S.W.; writing, review and editing, M.H. All authors read and agreed to the published version of the manuscript.
Funding
M.H. is partially supported by the Japan Society of the Promotion of Science (JSPS) Grant-in-Aid for Scientific Research (A) No. 23246071, (A) No. 17H01280, (B) No. 16KT0017, the Okawa Research Grant, and Kayamori Foundation of Informational Science Advancement. He is also partially supported by the National Institute of Information and Communication Technology (NICT), Japan. S.W. is supported in part by the Japan Society of the Promotion of Science (JSPS) Grant-in-Aid for Young Scientists (A) No. 16H06091.
Acknowledgments
The authors would like to thank Vincent Y. F. Tan for pointing out Remark 6. The authors are also grateful to Ryo Yaguchi for his helpful comments.
Conflicts of Interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
RCU | random coding union |
BSC | binary symmetric channel |
DMC | discrete memoryless channel |
DT | dependence testing |
LDPC | low-density parity check |
BPSK | binary phase shift keying |
AWGN | additive white Gaussian noise |
CGF | cumulant generating function |
MPSK | M-ary phase shift keying |
CC | channel coding |
SC | source coding |
SI | side-information |
Appendix A. Preparation for the Proofs
When we prove some properties of Rényi entropies or derive converse bounds, some properties of cumulant generating functions (CGFs) become useful. For this purpose, we introduce some terminologies in statistics from [22,23]. Then, in Appendix B, we show the relation between the terminologies in statistics and those in information theory. For the proofs, see [22,23].
Appendix A.1. Single-Shot Setting
Let $Z$ be a random variable with distribution $P$. Let:

$\phi(\rho):=\log\mathbb{E}\,e^{\rho Z}=\log\sum_z P(z)e^{\rho z}$

be the cumulant generating function (CGF). Let us introduce the exponential family:

$P_\rho(z):=P(z)e^{\rho z-\phi(\rho)}.$

By differentiating the CGF, we find that:

$\phi'(\rho)=\mathbb{E}_\rho[Z]:=\sum_z P_\rho(z)z.$

We also find that:

$\phi''(\rho)=\sum_z P_\rho(z)\big(z-\mathbb{E}_\rho[Z]\big)^2.$

We assume that $Z$ is not constant. Then, (A6) implies that $\phi(\rho)$ is a strictly convex function and $\phi'(\rho)$ is monotonically increasing. Thus, we can define the inverse function $\rho(a)$ of $\phi'(\rho)$ by:

$\phi'(\rho(a))=a.$

Let:

$D_{1+s}(P\|Q):=\frac{1}{s}\log\sum_z P(z)^{1+s}Q(z)^{-s}$

be the Rényi divergence. Then, we have the following relation:

$sD_{1+s}(P_{\tilde\rho}\|P_\rho)=\phi((1+s)\tilde\rho-s\rho)-(1+s)\phi(\tilde\rho)+s\phi(\rho).$
Appendix A.2. Transition Matrix
Let $\{W(z|z')\}_{(z,z')\in\mathcal{Z}^2}$ be an ergodic and irreducible transition matrix, and let $\tilde{P}$ be its stationary distribution. For a function $g:\mathcal{Z}\times\mathcal{Z}\to\mathbb{R}$, let:

$\mathbb{E}[g]:=\sum_{z,z'}\tilde{P}(z')W(z|z')g(z,z').$

We also introduce the following tilted matrix:

$W_\rho(z|z'):=W(z|z')e^{\rho g(z,z')}.$

Let $\lambda_\rho$ be the Perron–Frobenius eigenvalue of $W_\rho$. Then, the CGF for $W$ with generator $g$ is defined by:

$\phi(\rho):=\log\lambda_\rho.$

Lemma A1.

The function $\phi(\rho)$ is a convex function of $\rho$, and it is strictly convex iff $\phi''(0)>0$.

From Lemma A1, $\phi'(\rho)$ is a monotonically increasing function. Thus, we can define the inverse function $\rho(a)$ of $\phi'(\rho)$ by:

$\phi'(\rho(a))=a.$
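The definition above translates into a short numerical routine. The following sketch (illustrative, with a toy two-state chain that is not taken from the paper) computes $\phi(\rho)=\log\lambda_\rho$ and checks numerically that $\phi'(0)=\mathbb{E}[g]$; here $g(z,z')=-\log W(z|z')$, so $\phi'(0)$ equals the entropy rate of the chain.

```python
import numpy as np

def markov_cgf(W, g, rho):
    """CGF phi(rho) = log of the Perron-Frobenius eigenvalue of the tilted matrix
    W_rho(z|z') = W(z|z') * exp(rho * g(z, z')), for a column-stochastic W."""
    W_rho = W * np.exp(rho * g)
    eigenvalues = np.linalg.eigvals(W_rho)
    return np.log(np.max(eigenvalues.real))   # PF eigenvalue is real and maximal

# toy check: for g(z, z') = -log W(z|z'), phi'(0) recovers the entropy rate
W = np.array([[0.9, 0.2],
              [0.1, 0.8]])                    # columns sum to one: W[z, z'] = W(z|z')
g = -np.log(W)
h = 1e-5
entropy_rate = (markov_cgf(W, g, h) - markov_cgf(W, g, -h)) / (2 * h)
print(entropy_rate)   # approx. 0.383 nats for this toy chain
```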
Appendix A.3. Markov Chain
Let $\boldsymbol{Z}=\{Z_n\}_{n=1}^{\infty}$ be the Markov chain induced by $W(z|z')$ and an initial distribution $P_{Z_1}$. For functions $g:\mathcal{Z}\times\mathcal{Z}\to\mathbb{R}$ and $\tilde{g}:\mathcal{Z}\to\mathbb{R}$, let $S_n:=\sum_{i=2}^{n}g(Z_i,Z_{i-1})+\tilde{g}(Z_1)$. Then, the CGF for $S_n$ is given by:

$\phi_n(\rho):=\log\mathbb{E}\,e^{\rho S_n}.$

We will use the following finite evaluation of $\phi_n(\rho)$.

Lemma A2.

Let $v_\rho$ be the eigenvector of $W_\rho^{T}$ with respect to the Perron–Frobenius eigenvalue $\lambda_\rho$ such that $\min_z v_\rho(z)=1$. Let $w_\rho(z):=P_{Z_1}(z)e^{\rho\tilde{g}(z)}$. Then, we have:

$(n-1)\phi(\rho)+\underline{\delta}_\phi(\rho)\le\phi_n(\rho)\le(n-1)\phi(\rho)+\bar{\delta}_\phi(\rho),$

where:

$\bar{\delta}_\phi(\rho):=\log\langle v_\rho|w_\rho\rangle,\qquad \underline{\delta}_\phi(\rho):=\log\langle v_\rho|w_\rho\rangle-\log\max_z v_\rho(z).$
From this lemma, we have the following.
Corollary A1.
For any initial distribution and $\rho\in\mathbb{R}$, we have:

$\lim_{n\to\infty}\frac{1}{n}\phi_n(\rho)=\phi(\rho).$

The relation:

$\lim_{n\to\infty}\frac{1}{n}\mathbb{E}[S_n]=\phi'(0)=\mathbb{E}[g]$

is well known. Furthermore, we also have the following.

Lemma A3.

For any initial distribution, we have:

$\lim_{n\to\infty}\frac{1}{n}\mathrm{Var}[S_n]=\phi''(0).$
Appendix B. Relation Between CGF and Conditional Rényi Entropies
Appendix B.1. Single-Shot Setting
For a correlated random variable $(X,Y)$, let us consider $Z=\log\frac{Q_Y(Y)}{P_{XY}(X,Y)}$. Then, the relation between the CGF and the conditional Rényi entropy relative to $Q_Y$ is given by:

$\theta H_{1+\theta}(P_{XY}|Q_Y)=-\phi(-\theta;P_{XY}|Q_Y).$

From this, we also find the relationship between the inverse functions (cf. (29) and (A7)):

$\theta(a)=-\rho(a).$

Thus, the inverse function defined in (32) also satisfies:

$(1-\rho(a(R)))a(R)+\phi(\rho(a(R));P_{XY}|Q_Y)=R.$

Similarly, by setting $Z=\log\frac{1}{P_{X|Y}(X|Y)}$, we have:

$\theta H_{1+\theta}^{\downarrow}(X|Y)=-\phi(-\theta;P_{XY}|P_Y).$

Then, the variance (cf. (11)) satisfies:

$V(X|Y)=\phi''(0;P_{XY}|P_Y).$

Let $\phi(\rho,\rho')$ be the CGF of $Z=\log\frac{P_Y^{(1-\rho')}(Y)}{P_{XY}(X,Y)}$ (cf. (15) for the definition of $P_Y^{(1-\rho')}$). Then, we have:

$\theta H_{1+\theta,1+\theta'}(X|Y)=-\phi(-\theta,-\theta').$

It should be noted that $\phi(\rho,\rho')$ is a CGF for fixed $\rho'$, but $\phi(\rho,\rho)$ cannot be treated as a CGF.
Appendix B.2. Transition Matrix
For a transition matrix $W(x,y|x',y')$, we consider the function given by:

$g((x,y),(x',y')):=\log\frac{W(y|y')}{W(x,y|x',y')}.$

Then, the relation between the CGF and the lower conditional Rényi entropy is given by:

$\theta H_{1+\theta}^{\downarrow,W}(X|Y)=-\phi(-\theta).$

Then, the variance defined in (51) satisfies:

$V^W(X|Y)=\phi''(0).$
Appendix C. Proof of Lemma 2
We use the following lemma.
Lemma A4.
Forθ∈(-1,0)∪(0,1), we have:
H11-θ↓(X|Y)≤H11-θ↑(X|Y)≤H1+θ↓(X|Y).
Proof.
The left hand side inequality of (A31) is obvious from the definition of two Rényi entropies (the latter is defined by taking the maximum). The right-hand side inequality was proven in [71] (Lemma 6). □
Now, we go back to the proof of Lemma 2. From (10) and (11), by the Taylor approximation, we have:
H1+θ↓(X|Y)=H(X|Y)-12V(X|Y)θ+o(θ).
Furthermore, since11-θ=1+θ+o(θ), we also have:
H11-θ↓(X|Y)=H(X|Y)-12V(X|Y)θ+o(θ).
Thus, from Lemma A4, we can derive (16) and (17).
Appendix D. Proof of Lemma 3
Statements 1 and 3 follow from the relationships in (A22) and (A25) and the strict convexity of the CGFs.
To prove Statement 5, we first prove the strict convexity of the Gallager function:
E0(τ;PXY):=log∑yPY(y)∑xPX|Y (x|y)11+τ1+τ
forτ>-1. We use the Hölder inequality:
∑iaiα biβ≤∑iaiα ∑ibiβ
forα,β>0such thatα+β=1, where the equality holds iffai=cbifor some constant c. Forλ∈(0,1), let1+τ3=λ(1+τ1)+(1-λ)(1+τ2), which implies:
11+τ3=11+τ1λ(1+τ1)1+τ3+11+τ2(1-λ)(1+τ2)1+τ3
and:
λ(1+τ1)1+τ3+(1-λ)(1+τ2)1+τ3=1.
Then, by applying the Hölder inequality twice, we have:
∑yPY(y)∑xPX|Y (x|y)11+τ31+τ3
=∑yPY(y)∑xPX|Y (x|y)11+τ1λ(1+τ1)1+τ3 PX|Y (x|y)11+τ2(1-λ)(1+τ2)1+τ31+τ3
≤∑yPY(y)∑xPX|Y (x|y)11+τ1λ(1+τ1)1+τ3 ∑xPX|Y (x|y)11+τ2(1-λ)(1+τ2)1+τ31+τ3
=∑yPY(y)∑xPX|Y (x|y)11+τ1λ(1+τ1) ∑xPX|Y (x|y)11+τ2(1-λ)(1+τ2)
=∑yPY (y)λ ∑xPX|Y (x|y)11+τ1λ(1+τ1) PY (y)1-λ ∑xPX|Y (x|y)11+τ2(1-λ)(1+τ2)
≤∑yPY(y)∑xPX|Y (x|y)11+τ1(1+τ1)λ ∑yPY(y)∑xPX|Y (x|y)11+τ2(1+τ2)1-λ.
The equality in the second inequality holds iff:
∑xPX|Y (x|y)11+τ11+τ1=c∑xPX|Y (x|y)11+τ21+τ2∀y∈Y
for some constant c. Furthermore, the equality in the first inequality holds iffPX|Y(x|y)=1|supp(PX|Y(·|y))| . Substituting this into (A44), we find that|supp(PX|Y(·|y))|is irrespective of y. Thus, both the equalities hold simultaneously iffV(X|Y)=0. Now, since:
θH1+θ↑(X|Y)=-(1+θ)E0-θ1+θ;PXY,
we have:
d2[θH1+θ↑(X|Y)]dθ2=-1(1+θ)4E0''-θ1+θ;PXY
≤0
forθ∈(-1,∞), where the equality holds iffV(X|Y)=0.
Statement 7 is obvious from the definitions of the two measures. The first part of Statement 8 follows from (A27) and the convexity of the CGF, but we need another argument to check the conditions for strict concavity. Since the second term of:
θH1+θ,1+θ'(X|Y)=-log∑yPY(y)∑xPX|Y (x|y)1+θ∑xPX|Y (x|y)1+θ'θ1+θ'+θθ'1+θ'H1+θ'↑(X|Y)
is linear with respect toθ, it suffices to show the strict concavity of the first term. By using the Hölder inequality twice, forθ3=λθ1+(1-λ)θ2, we have:
∑yPY(y)∑xPX|Y (x|y)1+θ3∑xPX|Y (x|y)1+θ'θ31+θ'
≤∑yPY(y)∑xPX|Y (x|y)1+θ1λ ∑xPX|Y (x|y)1+θ21-λ ∑xPX|Y (x|y)1+θ'λθ1+(1-λ)θ21+θ'
≤∑yPY(y)∑xPX|Y (x|y)1+θ1∑xPX|Y (x|y)1+θ'θ11+θ'λ
∑yPY(y)∑xPX|Y (x|y)1+θ2∑xPX|Y (x|y)1+θ'θ21+θ'1-λ,
where both the equalities hold simultaneously iffV(X|Y)=0, which can be proven in a similar manner as the equality conditions in (A40) and (A43). Thus, we have the latter part of Statement 8.
Statements 10-12 are also obvious from the definitions. Statements 2, 4, 6, and 9, follow from Statements 1, 3, 5, and 8, (cf. [71], Lemma 1).
Appendix E. Proof of Lemma 4
Since (24) and (28) are obvious from the definitions, we only prove (26). We note that:
∑yPY(y)∑xPX|Y (x|y)1+θ11+θ1+θ
≤∑yPY(y)|supp(PX|Y(·|y))|11+θ1+θ
≤maxy∈supp(PY)|supp(PX|Y(·|y))|
and:
∑yPY(y)∑xPX|Y (x|y)1+θ11+θ1+θ
≥PY (y*)1+θ∑xPX|Y (x|y*)1+θ
→θ→-1|supp(PX|Y(·|y*))|,
where:
y*:=argmaxy∈supp(PY)|supp(PX|Y(·|y))|.
Appendix F. Proof of Lemma 6
From Lemma A4, Theorems 1 and 3, we have:
H11-θ↓,W(X|Y)≤H11-θ↑,W(X|Y)≤H1+θ↓,W(X|Y)
forθ∈(-1,0)∪(0,1). Thus, we can prove Lemma 6 in the same manner as Lemma 2.
Appendix G. Proof of (63)
First, in the same manner as Theorem 1, we can show:
limn→∞1nH1+θ(PXn Yn|QYn )=H1+θW|V(X|Y),
whereQYn is a Markov chain induced by V for some initial distribution. Then, sinceH1+θ(PXn Yn|QYn )≤H1+θ↑(Xn|Yn)for each n, by using Theorem 3, we have:
H1+θW|V(X|Y)≤H1+θ↑,W(X|Y).
Thus, the rest of the proof is to show thatH1+θ↑,W(X|Y)is attainable by some V.
LetQ^θbe the normalized left eigenvector ofKθ, and let:
Vθ(y|y'):=Q^θ(y)κθ Q^θ(y')Kθ(y|y').
Then,Vθattains the maximum. To prove this, we will show thatκθ1+θis the Perron-Frobenius eigenvalue of:
W(x,y|x',y')1+θ Vθ (y|y')-θ.
We first confirm that(Q^θ (y)1+θ:(x,y)∈X×Y) is an eigenvector of (A64) as follows:
∑x,yQ^θ (y)1+θW(x,y|x',y')1+θ Vθ (y|y')-θ
=∑yQ^θ (y)1+θ Wθ(y|y')Q^θ(y)κθ Q^θ(y')Wθ (y|y')11+θ-θ
=κθθ Q^θ (y')θ∑yQ^θ(y)Wθ (y|y')11+θ
=κθ1+θ Q^θ (y')1+θ.
Since(Q^θ (y)1+θ:(x,y)∈X×Y)is a positive vector and the Perron-Frobenius eigenvector is the unique positive eigenvector, we find thatκθ1+θis the Perron-Frobenius eigenvalue. Thus, we have:
H1+θW|Vθ(X|Y)=-1+θθlogκθ
=H1+θ↑,W(X|Y).
Appendix H. Proof of Lemma 7
Statement 1 follows from (A29) and the strict convexity of the CGF. Statements 5 and 8-10 follow from the corresponding statements in Lemma 3, Theorems 1, 3 and 4.
Now, we prove Statement 3. (The concavity of $\theta H_{1+\theta}^{\uparrow,W}(X|Y)$ follows from a limiting argument, i.e., from the concavity of $\theta H_{1+\theta}^{\uparrow}(X^n|Y^n)$ (cf. Lemma 3) and Theorem 3; however, the strict concavity does not follow from the limiting argument.) For this purpose, we introduce the transition matrix counterpart of the Gallager function as follows. Let:
K¯τ(y|y'):=W(y|y')∑xW(x|x',y',y)11+τ1+τ
forτ>-1, which is well defined under Assumption 2. Letκ¯τbe the Perron-Frobenius eigenvalue ofK¯τ, and letQ˜τandQ^τbe its normalized right and left eigenvectors. Then, let:
Lτ(y|y'):=Q^τ(y)κ¯τ Q^τ(y')K¯τ(y|y')
be a parametrized transition matrix. The stationary distribution ofLτis given by:
Qτ(y'):=Q^τ(y')Q˜τ(y')∑y'' Q^τ(y'')Q˜τ(y'').
We prove the strict convexity ofE0W(τ):=logκ¯τforτ>-1 . Then, by the same reason as (A46), we can show Statement 3. LetQτ(y,y'):=Lτ(y|y')Qτ(y') . By the same calculation as [22] (Proof of Lemmas 13 and 14), we have:
∑y,y'Qτ(y,y')ddτlogLτ(y|y')2=-∑y,y'Qτ(y,y')d2dτ2logLτ(y|y').
Furthermore, from the definition ofLτ, we have:
-∑y,y'Qτ(y,y')d2dτ2logLτ(y|y')
=-∑y,y'Qτ(y,y')d2dτ2log1κτ+d2dτ2logQ^τ(y)Q^τ(y')+d2dτ2logKτ(y|y')
=d2dτ2logκτ-∑y,y'Qτ(y,y')d2dτ2logKτ(y|y').
Now, we show the convexity oflogK¯τ(y|y')for each(y,y') . By using the Hölder inequality (cf. Appendix D), forτ3=λτ1+(1-λ)τ2, we have:
∑xW(x|x',y',y)11+τ31+τ3≤∑xW(x|x',y',y)11+τ1λ(1+τ1) ∑xW(x|x',y',y)11+τ2(1-λ)(1+τ2).
Thus,E0W(τ) is convex. To check strict convexity, we note that the equality in (A78) holds iffW(x|x',y',y)=1|supp(W(·|x',y',y))|. Since:
∑xW(x|x',y',y)1+θ=1|supp(W(·|x',y',y))|θ
does not depend onx'from Assumption 2, we have|supp(W(·|x',y',y))|=Cyy'for some integerCyy'. By substituting this intoK¯τ, we have:
K¯τ(y|y')=W(y|y')Cyy'τ.
On the other hand, we note that the CGFϕ(ρ)is defined as the logarithm of the Perron-Frobenius eigenvalue of:
W(x,y|w',y')1-ρW(y|y')ρ=W(y|y')1Cyy'1-ρ1[x∈supp(W(·|x',y',y))].
Since:
∑x,yQ^τ(y)W(y|y')1Cyy'1-τ1[x∈supp(W(·|x',y',y))]
=∑yQ^τ(y)W(y|y')Cyy'τ
=κ¯τ Q^τ(y'),
κ¯τ is the Perron-Frobenius eigenvalue of (A81), and thus, we haveE0W(τ)=ϕ(τ) when the equality in (A78) holds for every(y,y')such thatW(y|y')>0. Sinceϕ(τ)is strict convex ifVW(X|Y)>0,E0W(τ)is strict convex ifVW(X|Y)>0. Thus,θH1+θ↑,W(X|Y)is strict concave ifVW(X|Y)>0 . On the other hand, from (57),θH1+θ↑,W(X|Y)is strict concave only ifVW(X|Y)>0.
Statement 6 can be proven by modifying the proof of Statement 8 of Lemma 3 to a transition matrix in a similar manner as Statement 3 of the present lemma.
Finally, Statements 2, 4 and 7 follow from Statements 1, 3 and 6 (cf. [71], Lemma 1).
Appendix I. Proof of Lemma 9
We only prove (75) since we can prove (76) exactly in the same manner by replacingH1+θ↓,W(X|Y),θ↓(a), anda↓(R)byH1+θ↑,W(X|Y),θ↑(a), anda↑(R). Let:
f(θ):=-θR+θH1+θ↓,W(X|Y)1+θ.
Then, we have:
f'(θ)=-R+(1+θ)d[θH1+θ↓,W(X|Y)]dθ-θH1+θ↓,W(X|Y)(1+θ)2
=-R+Rd[θH1+θ↓,W(X|Y)]dθ(1+θ)2.
SinceR(a)is monotonically increasing andd[θH1+θ↓,W(X|Y)]dθis monotonically decreasing, we havef'(θ)≥0forθ≤θ(a(R))andf'(θ)≤0forθ≥θ(a(R)). Thus,f(θ)takes its maximum atθ(a(R)). Furthermore, since-1≤θ(a(R))≤0forHW(X|Y)≤R≤H0↓,W(X|Y), we have:
sup-1≤θ≤0-θR+θH1+θ↓,W(X|Y)1+θ
=-θ(a(R))R+θ(a(R))H1+θ(a(R))↓,W(X|Y)1+θ(a(R))
=-θ(a(R))[(1+θ(a(R)))a(R)-θ(a(R))H1+θ(a(R))↓,W(X|Y)]+θ(a(R))H1+θ(a(R))↓,W(X|Y)1+θ(a(R))
=-θ(a(R))a(R)+θ(a(R))H1+θ(a(R))↓,W(X|Y),
where we substitutedR=R(a(R))in the second equality.
Appendix J. Proof of Lemma 11
Let u be the vector such thatu(y)=1for everyy∈Y. From the definition ofH1+θ↑(Xn|Yn), we have the following sequence of calculations:
e-θ1+θθH1+θ↑(Xn|Yn)
=∑y1,...,yn∑xn,...,x1P(x1,y1)1+θ∏i=2nW(xi,yi|xi-1,yi-1)1+θ11+θ
=(a)∑yn,...,y1∑x1P(x1,y1)1+θ11+θ∏i=2nWθ (yi|yi-1)11+θ
=〈u|Kθn-1 wθ〉
≤〈vτ|Kθn-1 wθ〉
=〈(KθT)n-1 vθ|wθ〉
=κθn-1〈vθ|wθ〉
=e-(n-1)θ1+θH1+θ↑,W(X|Y)〈vθ|wθ〉,
which implies the left-hand side inequality, where we used Assumption 2 in(a). On the other hand, we have the following sequence of calculations:
e-θ1+θθH1+θ↑(Xn|Yn)
=〈u|Kθn-1 wθ〉
≥1maxy vθ(y)〈vθ|Kθn-1 wθ〉
=1maxy vθ(y)〈(KθT)n-1 vθ|wθ〉
=κθn-1〈vθ|wθ〉maxy vθ(y)
=e-(n-1)θ1+θH1+θ↑,W(X|Y)〈vθ|wθ〉maxy vθ(y),
which implies the right-hand side inequality.
Appendix K. Proof of Theorem 5
For arbitraryρ˜∈R, we setα:=PXY{X≠d(e(X),Y)}andβ:=PXY,ρ˜{X≠d(e(X),Y)}, where:
PXY,ρ(x,y):=PXY (x,y)1-ρ QY (y)ρ e-ϕ(ρ;PXY|QY).
Then, by the monotonicity of the Rényi divergence, we have:
sD1+s(PXY,ρ˜∥PXY)≥logβ1+s α-s+(1-β)1+s (1-α)-s
≥logβ1+s α-s.
Thus, we have:
-logα≤ϕ((1+s)ρ˜;PXY|QY)-(1+s)ϕ(ρ˜;PXY|QY)-(1+s)logβs.
Now, by using Lemma 18, we have:
1-β≤PXY,ρ˜logQY(y)PXY,ρ˜(x,y)≤γ+Meγ.
We also have, for anyσ≤0,
PXY,ρ˜logQY(y)PXY,ρ˜(x,y)≤γ
≤∑x,yPXY,ρ˜(x,y)eσlogQY(y)PXY,ρ˜(x,y)-γ
=e-[σγ-ϕ(σ;PXY,ρ˜|QY)].
Thus, by settingγso that:
σγ-ϕ(σ;PXY,ρ˜|QY)=γ-R,
we have
1-β≤2e-σR-ϕ(σ;PXY,ρ˜|QY)1-σ.
Furthermore, we have the relation:
ϕ(σ;PXY,ρ˜|QY)=log∑x,yPXY,ρ˜ (x,y)1-σ QY (y)σ
=log∑x,yPXY (x,y)1-ρ˜ QY (y)ρ˜ e-ϕ(ρ˜;PXY|QY)1-σ QY (y)σ
=-(1-σ)ϕ(ρ˜;PXY|QY)+log∑x,yPXY (x,y)1-ρ˜-σ(1-ρ˜) QY (y)ρ˜+σ(1-ρ˜)
=ϕ(ρ˜+σ(1-ρ˜);PXY|QY)-(1-σ)ϕ(ρ˜;PXY|QY).
Thus, by substitutingρ˜=-θ˜andσ=-ϑ and by using (A22), we can derive (124).
Now, we restrict the range ofρ˜so thatρ(a(R))<ρ˜<1and take:
σ=ρ(a(R))-ρ˜1-ρ˜.
Then, by substituting this into (A119) and (A119) into (A115), we have (ϕ(ρ;PXY|QY)is omitted asϕ(ρ)):
σR-ϕ(ρ˜+σ(1-ρ˜))+(1-σ)ϕ(ρ˜)1-σ
=(ρ(a(R))-ρ˜)R-(1-ρ˜)ϕ(ρ(a(R)))+(1-ρ(a(R)))ϕ(ρ˜)1-ρ(a(R))
=(ρ(a(R))-ρ˜)(1-ρ(a(R)))a(R)+ϕ(ρ(a(R)))-(1-ρ˜)ϕ(ρ(a(R)))+(1-ρ(a(R)))ϕ(ρ˜)1-ρ(a(R))
=(ρ(a(R))-ρ˜)a(R)-ϕ(ρ(a(R)))+ϕ(ρ˜),
where we used (A24) in the second equality. Thus, by substitutingρ˜=-θ˜ and by using (A22) again, we have (125).
Appendix L. Proof of Theorem 16
Let:
PXn Yn,ρ(xn,yn):=PXn Yn (xn,yn)1-ρ QYn (yn)ρ e-ϕ(ρ;PXn Yn|QYn ),
and letPBn|An,ρbe a conditional additive channel defined by:
PBn|An,ρ(an+xn|an)=PXn Yn,ρ(xn,yn).
We also define the joint distribution of the message, the input, the output, and the decoded message for each channel:
PMn An Bn M^n(m,an,bn,m^):=1Mn1[en(m)=an]PBn|An(bn|an)1[dn(bn)=m^],
PMn An Bn M^n,ρ(m,an,bn,m^):=1Mn1[en(m)=an]PBn|An,ρ(bn|an)1[dn(bn)=m^].
For arbitraryρ˜∈R, letα:=PMn M^n{m≠m^}andβ:=PMn M^n,ρ˜{m≠m^}. Then, by the monotonicity of the Rényi divergence, we have:
sD1+s(PAn Bn,ρ˜∥PAn Bn)≥sD1+s(PMn M^n,ρ˜∥PMn M^n)
≥logβ1+s α-s+(1-β)1+s (1-α)-s
≥logβ1+s α-s.
Thus, we have:
-logα≤sD1+s(PAn Bn,ρ˜∥PAn Bn)-(1+s)logβs.
Here, we have:
D1+s(PAn Bn,ρ˜∥PAn Bn)=D1+s(PXn Yn,ρ˜∥PXn Yn).
On the other hand, from Lemma 25, we have:
1-β≤PXn Yn,ρ˜logQYn (yn)PXn Yn,ρ˜(xn,yn)≤nlog|A|-γ+eR enlog|A|-γ.
Thus, by the same argument as in (A111)-(A119) and by noting (A22), we can derive (222).
Now, we restrict the range ofρ˜so thatρ(a(R))<ρ˜<1and take:
σ=ρ(a(R))-ρ˜1-ρ˜.
Then, by noting (A22), we have (223).
Abstract
We derive finite-length bounds for two problems with Markov chains: source coding with side-information where the source and side-information are a joint Markov chain and channel coding for channels with Markovian conditional additive noise. For this purpose, we point out two important aspects of finite-length analysis that must be argued when finite-length bounds are proposed. The first is the asymptotic tightness, and the other is the efficient computability of the bound. Then, we derive finite-length upper and lower bounds for the coding length in both settings such that their computational complexity is low. We argue the first of the above-mentioned aspects by deriving the large deviation bounds, the moderate deviation bounds, and second-order bounds for these two topics and show that these finite-length bounds achieve the asymptotic optimality in these senses. Several kinds of information measures for transition matrices are introduced for the purpose of this discussion.