Eur. Phys. J. C (2015) 75:474    DOI 10.1140/epjc/s10052-015-3703-3
Regular Article - Theoretical Physics
A compression algorithm for the combination of PDF sets
Stefano Carrazza1, José I. Latorre2, Juan Rojo3,a, Graeme Watt4
1 Dipartimento di Fisica, Università di Milano and INFN, Sezione di Milano, Via Celoria 16, 20133 Milan, Italy
2 Departament d'Estructura i Constituents de la Matèria, Universitat de Barcelona, Diagonal 647, 08028 Barcelona, Spain
3 Rudolf Peierls Centre for Theoretical Physics, University of Oxford, 1 Keble Road, Oxford OX1 3NP, UK
4 Institute for Particle Physics Phenomenology, Durham University, Durham DH1 3LE, UK
Received: 13 May 2015 / Accepted: 25 September 2015 / © The Author(s) 2015. This article is published with open access at Springerlink.com
Abstract The current PDF4LHC recommendation to estimate uncertainties due to parton distribution functions (PDFs) in theoretical predictions for LHC processes involves the combination of separate predictions computed using PDF sets from different groups, each of which comprises a relatively large number of either Hessian eigenvectors or Monte Carlo (MC) replicas. While many fixed-order and parton shower programs allow the evaluation of PDF uncertainties for a single PDF set at no additional CPU cost, this feature is not universal, and, moreover, the a posteriori combination of the predictions using at least three different PDF sets is still required. In this work, we present a strategy for the statistical combination of individual PDF sets, based on the MC representation of Hessian sets, followed by a compression algorithm for the reduction of the number of MC replicas. We illustrate our strategy with the combination and compression of the recent NNPDF3.0, CT14 and MMHT14 NNLO PDF sets. The resulting compressed Monte Carlo PDF sets are validated at the level of parton luminosities and LHC inclusive cross sections and differential distributions. We determine that around 100 replicas provide an adequate representation of the probability distribution for the original combined PDF set, suitable for general applications to LHC phenomenology.
Contents
1 Introduction
2 Combining PDF sets using the Monte Carlo method
  2.1 Combination strategy
  2.2 PDF dependence of benchmark LHC cross sections
3 The compression algorithm
  3.1 Compression: mathematical framework
  3.2 A compression algorithm for Monte Carlo PDF sets
    3.2.1 Definition of the error function for the compression
    3.2.2 Central values, variances, and higher moments
    3.2.3 The Kolmogorov–Smirnov distance
    3.2.4 PDF correlations
    3.2.5 Choice of GA parameters in compressor v1.0.0
  3.3 Results of the compression for native MC PDF sets
4 The compressed Monte Carlo PDF sets
  4.1 Compression of native MC PDF sets
  4.2 Compression of the CMC-PDFs
5 CMC-PDFs and LHC phenomenology
  5.1 LHC cross sections and differential distributions
  5.2 Correlations between LHC cross sections
6 Summary and delivery
A: The compression code
References

a e-mail: [email protected]
1 Introduction
Parton distribution functions (PDFs) are an essential ingredient for LHC phenomenology [1–7]. They are one of the limiting theory factors for the extraction of Higgs couplings from LHC data [8], they reduce the reach of many BSM searches, particularly in the high-mass region [9–11], and they are the dominant source of systematic uncertainty in precision electroweak measurements such as the W mass at the LHC [12–14]. A crucial question is therefore how to estimate the total PDF uncertainty that affects the various processes listed above.
While modern PDF sets [15–22] provide their own estimates of the associated PDF error, using a single set might not lead to a robust enough estimate of the total uncertainty arising from our imperfect knowledge of the PDFs in LHC computations. For instance, different global PDF sets, based
on similar input datasets and theory assumptions, while in reasonable agreement, can still differ for some PDF flavors and (x, Q²) regions by a non-negligible amount [4,5,23]. These differences are likely to arise from the different fitting methodologies or from sources of theoretical uncertainty that are not yet accounted for, such as missing higher orders or parametric uncertainties. For these reasons, until an improved understanding of the origin of these differences is achieved, from the practical point of view it is necessary to combine different PDF sets to obtain a more reliable estimate of the total PDF uncertainty in LHC applications.
That was the motivation underlying the original 2010 recommendation from the PDF4LHC Working Group to compute the total PDF uncertainty in LHC processes [24,25]. The prescription was to take the envelope and midpoint of the three global sets available at the time (CTEQ6.6 [26], MSTW08 [27] and NNPDF2.0 [28]), each at their default value of αs(MZ), and where each set included the combined PDF+αs uncertainty using the corresponding prescription [29–31]. This prescription has been updated [32] to the most recent sets from each group, and currently these are CT14 [22], MMHT14 [19] and NNPDF3.0 [16]. More recently, PDF4LHC has simplified the prescription for the combination of PDF+αs uncertainties: the current recommendation [33] is now to take the three global sets at a common value of αs(MZ) = 0.118, close enough to the most recent PDG average [34], and then add in quadrature the additional uncertainty due to αs. This procedure has been shown to be exact within the Gaussian approximation in the case of CT [29], and close enough to the exact prescription for practical applications in the cases of MMHT and NNPDF [30,31].
One criticism that has been raised against this PDF4LHC recommendation is that defining the total PDF uncertainty by the envelope of the predictions from different sets does not have a well-defined statistical interpretation. However, as originally proposed by Forte in [1], and developed in more detail later by Forte and Watt in [2,35,36], it is possible to modify the PDF4LHC prescription to give the combination of PDF sets a robust statistical meaning as follows. The first step consists in transforming the Hessian PDF sets into Monte Carlo (MC) PDF sets using the Watt and Thorne method [35]. Then one can consider each of the replicas from each set as a different instance of a common probability distribution, so that the combination of the different sets can be achieved by simply adding together their Monte Carlo replicas. Assuming that each PDF set that enters the combination has the same a priori probability, the same number of replicas should be chosen from each set. The predictions from this combined Monte Carlo PDF set, which now clearly have a well-defined statistical meaning, turn out to be in reasonable agreement with those of the original envelope-and-midpoint method proposed by PDF4LHC. However, the resulting PDF uncertainties will generally be slightly smaller, since the envelope method gives more weight to the outliers than the MC combination method.
In general, any method for the combination of PDF sets from different groups presents practical difficulties at the implementation level. The first one is purely computational: theoretical predictions have to be computed from all the eigenvectors/replicas of the various PDF sets, which in total requires the same calculation to be redone around O(200) times for the PDF4LHC envelope or around O(900) times for the Monte Carlo combination, a very CPU-intensive task. Fortunately, some of the most widely used Monte Carlo event generators, such as MadGraph5_aMC@NLO [37,38] or POWHEG [39], and NNLO codes like FEWZ [40], now allow the computation of PDF uncertainties at no extra cost. However, this is not the case for all the theory tools used by the LHC experiments, and even when this feature is available, in the case of the envelope method the a posteriori combination of the results obtained with the three sets still needs to be performed, which can be quite cumbersome (as well as error-prone), especially in the case of exclusive calculations that require very large event files.
The above discussion provides the motivation to develop new strategies for the combination of individual PDF sets, and the subsequent reduction to a small number of eigenvectors or replicas. One possible approach in this direction, the Meta-PDFs method, has been proposed in [41]. The basic idea is to fit a common meta-parameterization to the PDFs from different groups at some common scale Q0, and then use the Monte Carlo combination of the different input sets to define the 68% confidence-level intervals of these fit parameters. A Meta-PDF set combining MSTW08 [27], CT10 [18] and NNPDF2.3 [42] at NNLO was produced in [41], based on Neig = 50 asymmetric eigenvectors. In addition, using the dataset diagonalization method proposed in [43], it is possible to further reduce the number of eigenvectors in the Meta-PDF sets for specific physical applications, such as Higgs production processes.
The main limitation of the Meta-PDF method is the possible dependence on the choice of input meta-parametrization. Indeed, the statement that the common parameterization used to refit all PDF sets is flexible enough depends on which input sets enter the combination, and thus needs to be checked and adjusted every time the procedure is repeated. In addition, at least for NNPDF, the Meta-PDF parameterization is bound to be insufficient, particularly in extrapolation regions like large x, which are crucial for New Physics searches.
Recently, an alternative Hessian reduction approach, the MC2H method, has been developed [44]. This method adopts the MC replicas themselves as the expansion basis, thus avoiding the need to choose a specific functional form. It uses Singular Value Decomposition methods with Principal Component
Analysis to construct a representation of the PDF covariance matrix as a linear combination of MC replicas. The main advantage of the MC2H method is that the construction is exact, meaning that the accuracy of the new Hessian representation is only limited by machine precision. In practice, eigenvectors which carry little information are discarded, but even so, with Neig = 100 eigenvectors the central values and covariances of the prior combination can be reproduced with O(0.1%) accuracy or better.
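For orientation, the following minimal numpy sketch illustrates the kind of SVD/PCA construction described above; it is not the MC2H code itself (whose precise treatment of central values and of the discarded directions is specified in Ref. [44]), and the replica matrix X is a hypothetical array of PDF values tabulated on an (x, Q, flavor) grid.

```python
import numpy as np

def svd_reduce(X, n_eig=100):
    """Sketch of an SVD/PCA reduction of an MC replica ensemble.

    X : array of shape (n_rep, n_points) with one PDF grid per replica.
    Returns the central value and n_eig direction vectors whose outer
    products approximately reproduce the replica covariance matrix.
    """
    central = X.mean(axis=0)
    # deviations from the central value, normalized so that dX^T dX = covariance
    dX = (X - central) / np.sqrt(X.shape[0] - 1)
    U, S, Vt = np.linalg.svd(dX, full_matrices=False)
    # keep the n_eig largest singular values; the discarded directions
    # carry little information on the covariance matrix
    directions = S[:n_eig, None] * Vt[:n_eig]
    return central, directions
```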
However, a central limitation of any Hessian reduction method is the impossibility of reproducing non-Gaussian features present in the input combination. It should be noted that even in the case where all the input sets in the combination are approximately Gaussian, their combination will in general be non-Gaussian. This is particularly relevant in extrapolation regions where PDF uncertainties are large and the underlying probability distributions for the PDFs are far from Gaussian. Failing to reproduce non-Gaussianities implies that the assumption of equal prior likelihood of the individual sets that enter the combination is artificially modified: for instance, if two sets peak at some value and another one at some other value (so we have a double-hump structure), a Gaussian reduction will effectively give more weight to the second set as compared to the first two. Overcoming this limitation is the main motivation for this work, where we propose an alternative reduction strategy based on the compression of the original Monte Carlo combined set into a smaller subset of replicas which, however, reproduces the main statistical features of the input distribution.
The starting point of our method is, as in the case of the Meta-PDF and MC2H methods, the Monte Carlo combination of individual PDF sets; a compression algorithm then follows in order to select a reduced number of replicas while reproducing the basic statistical properties of the original probability distribution, such as means, variances, correlations, and higher moments. This compression is based on a genetic algorithm (GA) exploration of the space of minima of suitably defined error functions, a strategy similar to that used for the neural network training in the NNPDF fits [45,46]. The resulting compressed Monte Carlo PDFs, or CMC-PDFs for short, are then validated for a wide variety of LHC observables, at the level of inclusive cross sections, differential distributions, and correlations, finding that around Nrep = 100 replicas are enough to reproduce the original results for all the processes we have considered.
Another important application of the compression algorithm is to native Monte Carlo PDF sets. For instance, in the NNPDF framework, a large number of replicas, around Nrep = 1000, is required to reproduce fine details of the underlying probability distribution, such as small correlations. We can therefore apply the same compression algorithm to native MC PDF sets, ending up with a much smaller number of replicas conveying the same information as the original probability distribution. In this work we will thus also present results of this compression for the NNPDF3.0 NLO Nrep = 1000 set. Note that
despite the availability of the compressed sets, PDF sets with Nrep = 1000 replicas are still needed for other applications,
for instance for Bayesian reweighting [47,48].
The outline of this paper is as follows. First of all, in Sect. 2 we review the Monte Carlo method for the combination of individual PDF sets, and we present results for the combination of the NNPDF3.0, MMHT14, and CT14 NNLO sets, both at the level of PDFs and for selected benchmark LHC cross sections. Then in Sect. 3 we describe the compression algorithm used to reduce the number of replicas of an MC PDF set. Following this, in Sect. 4 we present our main results for the CMC-PDFs, and validate our approach for the PDF central values, variances and correlations, together with selected parton luminosities. We also validate the compression of native MC sets, in particular using NNPDF3.0 NLO with Nrep = 1000 replicas. Then in Sect. 5 we perform the validation of the CMC-PDFs at the level of LHC cross sections and differential distributions. Finally, in Sect. 6 we summarize and discuss the delivery of our results, both for the CMC-PDFs to be made available in LHAPDF6 [49] and for the compression code, which is also made publicly available [50]. The appendix contains a concise user manual for the compression code, which allows the construction of CMC-PDFs starting from an arbitrary input combination of PDF sets.
The detailed comparison of the CMC-PDFs with those of the Meta-PDF and MC2H methods will be presented in the upcoming PDF4LHC report with the recommendations about PDF usage at Run II.
2 Combining PDF sets using the Monte Carlo method
In this section we review the Monte Carlo method for the combination of different PDF sets, and we provide results for the combination of the recent NNPDF3.0, CT14, and MMHT14 NNLO PDF sets. We then compare this combined PDF set with the predictions from the three individual sets for a number of benchmark LHC inclusive cross sections and their correlations.
2.1 Combination strategy
Our starting point is the same as that originally suggested by Forte in Ref. [1]. First of all, we decide which sets enter the combination, then transform the Hessian sets into a Monte Carlo representation using the Watt and Thorne method [35], and finally combine the desired number of replicas from each set to construct the joint probability distribution of the combination. This strategy was already used in [2,35,36] to compare the predictions of the Monte Carlo combination of PDF
sets with those of the original PDF4LHC envelope recommendation [24,25].
Let us recall that a Monte Carlo representation of a Hessian set can be constructed [35] by generating a multi-Gaussian distribution in the space of fit parameters, with mean value corresponding to the best-fit result, and with width determined by the Hessian matrix. This is most efficiently done in the basis where the Hessian matrix is diagonal, and in this case Monte Carlo replicas can be generated using

F^{(k)} = F(q_0) + \frac{1}{2} \sum_{j=1}^{N_{\rm eig}} \left[ F(q_j^{+}) - F(q_j^{-}) \right] R_{kj} , \qquad k = 1, \ldots, N_{\rm rep} ,   (1)

where q_0 and q_j^{\pm} are, respectively, the best-fit and the asymmetric jth eigenvector PDF members, and R_{kj} are univariate Gaussian random numbers. For most practical applications, Nrep = 100 replicas are enough to provide an accurate representation of the original Hessian set [35]. In this work we use the LHAPDF6 [49] implementation^1 of Eq. (1). In particular, we use the LHAPDF6 program examples/hessian2replicas.cc to convert an entire Hessian set into its corresponding MC representation. In Eq. (1) the quantity F represents the value of a particular PDF at the (x, Q) points and flavors corresponding to the original LHAPDF6 grids.

Once Hessian PDF sets have been converted into their Monte Carlo representations, one needs to decide how many replicas N_rep^(i) of each PDF set i will be included in the combination. The combined probability distribution is simply P = \sum_{i=1}^{n} w_i P_i, where P_i (i = 1, ..., n) are the probability distributions for each of the n individual PDF sets and the weights are w_i = N_rep^(i)/Ñ_rep (i = 1, ..., n), where \sum_{i=1}^{n} w_i = 1 and Ñ_rep = \sum_{i=1}^{n} N_rep^(i) is the total number of replicas. The simplest case, corresponding to an equal degree of belief in the predictions from each of the PDF sets in the combination, is to use the same number of replicas, say N_rep^(i) = 300, from each set. This approach is justified in the case of fits based on a similar global dataset and comparable theory inputs, as will be the case in this work. Choosing the correct value of N_rep^(i) for sets based on a reduced dataset, or with very different theory inputs, is a more complex problem which is not discussed here. Note that taking the average over a large number of Monte Carlo replicas generated using Eq. (1) will recover the best-fit PDF member F(q_0) only up to statistical fluctuations.

Using this Monte Carlo combination method, we have produced a combined set with Ñ_rep = 900 replicas by adding together N_rep^(i) = 300 replicas of the NNPDF3.0, CT14, and MMHT14 NNLO sets. A study of the properties of the prior with respect to Ñ_rep shows that at least 900 replicas are required to reduce the statistical fluctuations from Eq. (1) to an acceptable level. For the three groups we use a common value of αs(MZ) = 0.118. One requirement for the validation of this procedure is that the combination of the same number of instances of n different probability distributions should have mean \mu = \frac{1}{n} \sum_{i=1}^{n} \mu_i and variance \sigma^2 \simeq \sum_{i=1}^{n} \left( \sigma_i^2 + \mu_i^2 \right)/n - \mu^2. The equality only holds when the three input distributions are Gaussian, which in the case of NNPDF is approximately true in the experimental data region.

^1 Note that Eq. (6.5) of Ref. [35] and the current LHAPDF 6.1.5 code contain a mistake which has been corrected in the Mercurial repository; the correction will be included in the upcoming LHAPDF 6.1.6 release, see Eq. (22) of Ref. [49].
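As a minimal illustration of Eq. (1) and of the equal-weight pooling described above (not the LHAPDF6 implementation used in this work), the sketch below generates replicas from hypothetical arrays f0, f_plus, and f_minus holding the best-fit and asymmetric eigenvector members on a common grid, and then combines an equal number of replicas from each input set.

```python
import numpy as np

def hessian_to_mc(f0, f_plus, f_minus, n_rep, seed=0):
    """Eq. (1): MC replicas from Hessian eigenvector members.

    f0      : array (n_points,)        -- best-fit member F(q_0)
    f_plus  : array (n_eig, n_points)  -- F(q_j^+) members
    f_minus : array (n_eig, n_points)  -- F(q_j^-) members
    """
    rng = np.random.default_rng(seed)
    # univariate Gaussian random numbers R_kj, one per (replica, eigenvector)
    R = rng.standard_normal((n_rep, f_plus.shape[0]))
    return f0 + 0.5 * R @ (f_plus - f_minus)

def combine_sets(replica_sets, n_per_set=300):
    """Equal-weight MC combination: pool N_rep^(i) = n_per_set replicas per set."""
    return np.concatenate([reps[:n_per_set] for reps in replica_sets], axis=0)
```

For a native MC set such as NNPDF3.0 the replicas would be used directly, while for the Hessian sets hessian_to_mc would be called with n_rep = 300 before pooling.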
Fig. 1 Comparison of the individual NNPDF3.0, CT14 and MMHT14 NNLO sets with the corresponding Monte Carlo combination MC900. The comparison is performed at a typical LHC scale of Q = 100 GeV, and the PDFs are normalized to the central value of the combined set MC900
the three individual sets, and the resulting combined set is a good measure of their common overlap. Note that at large x the differences between the three sets are rather marked, and we expect the resulting combined probability distribution to be rather non-Gaussian.
In Fig. 2 we show the histograms representing the distribution of Monte Carlo replicas in the individual PDF sets and in the combined set, for different flavors and values of (x, Q). From top to bottom and from left to right we show the gluon at x = 0.01 (relevant for Higgs production in gluon fusion), the up quark at x = 5 × 10⁻⁵ (at the lower edge of the region covered by HERA data), the down antiquark at x = 0.2 (relevant for high-mass searches) and the strange PDF at x = 0.05 (accessible at the LHC through W+charm production). All PDFs have been evaluated at Q = 100 GeV.
The histograms for the MC900 prior allow us to determine in each case how close the combined distribution is to a normal distribution, by comparison with a Gaussian computed using the same mean and variance as the MC900 set. From this comparison in Fig. 2, we see that while in some cases the underlying distribution of the MC900 PDFs is reasonably Gaussian, as for g(x = 0.01) and u(x = 5 × 10⁻⁵), in others, such as d̄(x = 0.2) and s(x = 0.05), the Gaussian approximation is not satisfactory. Deviations from a Gaussian distribution are in general more important for PDFs in extrapolation regions with limited experimental information.
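The degree of Gaussianity visible in Fig. 2 can be quantified point by point by comparing the low moments of the pooled replicas with their Gaussian expectations; vals below is a hypothetical array holding the MC900 replica values of one flavor at one (x, Q) point.

```python
import numpy as np

def gaussianity_check(vals):
    """Low moments of a replica ensemble vs. Gaussian expectations (0 and 3)."""
    mu, sigma = vals.mean(), vals.std(ddof=1)
    z = (vals - mu) / sigma
    return {"mean": mu, "std": sigma,
            "skewness": np.mean(z**3),   # 0 for a Gaussian
            "kurtosis": np.mean(z**4)}   # 3 for a Gaussian
```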
Concerning the treatment of the PDF+αs uncertainties, the updated PDF4LHC recommendation [33] proposes a simplified prescription based on the addition in quadrature of the separate PDF and αs uncertainties, based on the realization that this always gives approximately the same answer as more sophisticated methods, and in some procedures exactly the same answer. In the case of the Monte Carlo combination, this prescription can be implemented by
Fig. 2 Histograms representing the probability distribution of Monte Carlo replicas for both the individual PDF sets and the combined set, for different flavors and values of (x, Q). From top to bottom and from left to right we show the gluon at x = 0.01, the up quark at x = 5 × 10⁻⁵, the down antiquark at x = 0.2, and the strange PDF at x = 0.05. All PDFs have been evaluated at Q = 100 GeV. A Gaussian distribution computed from the mean and variance of the MC900 prior is also shown
simply constructing the central values of the MC900 prior with, say, αs(MZ) = 0.1165 and αs(MZ) = 0.1195 as the mean of the central values of the NNPDF3.0, MMHT14, and CT14 sets, each with the corresponding value of αs(MZ). Half of the spread of the predictions computed with the central values of the CMC-PDFs with αs(MZ) = 0.1165 and αs(MZ) = 0.1195 then defines the one-sigma αs uncertainty. This assumes αs(MZ) = 0.118 ± 0.0015 as an external input, but a different value of δαs can be implemented by a simple rescaling. Note also that for the MC900 sets with αs(MZ) ≠ 0.118, only the central values are required.
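In code, the prescription above is a quadrature sum; the arrays and names in this sketch are hypothetical: sig_replicas holds the predictions over the αs(MZ) = 0.118 combined replicas, while sig_lo and sig_hi are the predictions from the central members of the αs(MZ) = 0.1165 and 0.1195 sets.

```python
import numpy as np

def pdf_alphas_uncertainty(sig_replicas, sig_lo, sig_hi, delta_alphas=0.0015):
    """PDF and alpha_s uncertainties added in quadrature."""
    sigma_pdf = sig_replicas.std(ddof=1)        # MC definition of the PDF uncertainty
    sigma_as = 0.5 * abs(sig_hi - sig_lo)       # half the spread of the two central predictions
    sigma_as *= delta_alphas / 0.0015           # simple rescaling for a different delta(alpha_s)
    return np.sqrt(sigma_pdf**2 + sigma_as**2)
```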
2.2 PDF dependence of benchmark LHC cross sections
As stated in the introduction, the goal of this work is to compress the MC900 prior by roughly an order of magnitude,
from the starting Ñrep = 900 to at least Nrep ≃ 100, and to
validate the results of this compression for a number of LHC observables. In Sect. 4 we will show the results of applying the compression strategy of Sect. 3 to the combined MC set. But first let us explore how the predictions from the MC900 prior compare with those of the individual PDF sets for a variety of LHC cross sections. We also compare the correlations between physical observables for the individual PDF sets to those of their combination.
In the following we consider a number of NNLO inclusive cross sections: Higgs production in gluon fusion, computed using ggHiggs [52], top-quark pair production, using top++ [53], and inclusive W and Z production, using VRAP [54]. In all cases we use the default settings in each of these codes, since our goal is to study similarities and differences between the predictions of each of the PDF sets, for fixed theory settings.
Fig. 3 Comparison of the predictions from the NNPDF3.0, MMHT14 and CT14 NNLO sets with those of their Monte Carlo combination MC900, for a number of inclusive benchmark LHC cross sections. For illustration, we also indicate the envelope of the predictions of the three different PDF sets, which would determine the total PDF uncertainty in the current PDF4LHC recommendation. From top to bottom and from left to right: Higgs production in gluon fusion, W⁺, W⁻, and Z production, and top-quark pair production. All processes have been computed at the LHC with a center-of-mass energy of 13 TeV
The results for these inclusive cross sections are shown in Fig. 3. We also show with dashed lines the envelope of the one-sigma range obtained from the three individual sets, which would correspond to the total PDF uncertainty for each process if obtained following the present PDF4LHC recommendation. We see that in general the two methods, the MC
combination and the envelope, give similar results, the former leading to a smaller estimate of the total PDF uncertainty since the envelope assigns more weight to outliers than what would be required on a statistical basis.
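The two definitions compared in Fig. 3 can be summarized for a single observable as follows; per_set is a hypothetical list with one array of per-replica predictions for each input set (all already in MC form).

```python
import numpy as np

def envelope_and_midpoint(per_set):
    """PDF4LHC-style envelope of the one-sigma intervals of the individual sets."""
    lows  = [p.mean() - p.std(ddof=1) for p in per_set]
    highs = [p.mean() + p.std(ddof=1) for p in per_set]
    lo, hi = min(lows), max(highs)
    return 0.5 * (lo + hi), 0.5 * (hi - lo)     # midpoint, half-width

def mc_combination(per_set, n_per_set=300):
    """Equal-weight pooling of replicas, as in the MC900 prior."""
    pooled = np.concatenate([p[:n_per_set] for p in per_set])
    return pooled.mean(), pooled.std(ddof=1)    # central value, one-sigma uncertainty
```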
It is also useful to compare the correlations between LHC cross sections computed with the individual PDF sets and
Fig. 4 Comparison of the correlation coefficients between a number of representative NLO and NNLO LHC inclusive cross sections computed from the three individual sets, NNPDF3.0, CT14, and MMHT14 (using the MC representation for the Hessian sets), and with their MC combination MC900
with the MC900 combined set. A representative set of these correlations is shown in Fig. 4, computed using the same settings as above. In addition to the processes shown in Fig. 3, here we also show correlations for the WW and Wh production NLO total cross sections computed with MCFM. For MMHT14 and CT14, correlations are computed from their Monte Carlo representation.
From the comparison of the correlation coefficients shown in Fig. 4 we note that the correlation coefficients between LHC cross sections for the three global sets, NNPDF3.0, CT14, and MMHT14, can differ substantially more than for central values and variances. This effect was also noticed in the Higgs Cross-Section Working Group study of PDF-induced correlations between Higgs production channels [55].
By construction, the correlation coefficient for the combined MC prior reproduces the correct weighted average of the correlations from the individual sets.
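For an MC set, the PDF-induced correlation between two observables is simply the correlation of their predictions across replicas (the Nrep/(Nrep−1) normalization factor of the exact formula in Ref. [31] is immaterial for the replica numbers used here); sigma_A and sigma_B are hypothetical per-replica predictions.

```python
import numpy as np

def mc_correlation(sigma_A, sigma_B):
    """Correlation coefficient between two observables over MC replicas."""
    dA = sigma_A - sigma_A.mean()
    dB = sigma_B - sigma_B.mean()
    return (dA * dB).mean() / (sigma_A.std() * sigma_B.std())
```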
3 The compression algorithm
In the previous section we have described and validated the combination of different PDF sets, based on the Monte Carlo method. We have shown that the probability distribution of such a combined PDF set can be represented by Ñrep Monte Carlo replicas. Now in this section we introduce a compression algorithm that aims to determine, for a fixed (smaller) number of MC replicas Nrep < Ñrep, the optimal subset of the original representation that most faithfully reproduces the statistical properties of the combined PDF prior distribution.

First of all, we begin with a presentation of the mathematical problem, followed by a description of the technical aspects of the compression strategy, where we describe the choice of error function and related parameters that have been chosen in this work. Then we apply the compression method to the combined Monte Carlo PDFs, producing what will be dubbed CMC-PDFs in the rest of this paper. We also show how the compression strategy can be applied to native Monte Carlo PDF sets, using the NNPDF3.0 NLO set with Ñrep = 1000 as an illustration. The validation of the compression at the level of parton distributions and physical observables is then performed in Sects. 4 and 5.

3.1 Compression: mathematical framework

Let us begin by presenting an overview of the mathematical framework for the problem that we aim to address, namely the compression of a given probability distribution function. The starting point is to consider a representation of a probability distribution p = (p_1, ..., p_n), using a finite number n of instances. In the case at hand, the number of instances is given by the number of Monte Carlo replicas Ñrep. Any smaller set of replicas, Nrep < Ñrep, produces a corresponding probability distribution q, which entails a loss of information with respect to the original distribution p.

The problem of optimal compression can be mathematically stated as follows. We would like to find the specific subset of the original set of replicas such that the statistical distance between the original and the compressed probability distributions is minimal. In other words, we look for a subset of replicas that delivers a probability distribution as indistinguishable from the prior set as possible.

A number of different figures of merit to quantify the distinguishability of probability distributions were proposed many decades ago. Some of the first efforts are recounted in the book of Hardy et al. [56], where ideas about strong ordering (majorization) were introduced. Later on, the problem of distinguishability was quantified using the concept of statistical distance among probability distributions. In particular, the Kolmogorov distance

K(p, q) = \sum_{i} |p_i - q_i| , \qquad i = 1, \ldots, n ,   (2)

where the index i runs over the number of instances of p, is a simple and powerful example of a figure of merit that quantifies how different one probability distribution is from another.

With the advent of Information Theory, Shannon introduced the concept of surprise of a probability distribution as its distance to the even prior. This can be characterized using the Shannon entropy S(p) = -\sum_i p_i \log p_i. It is then natural to quantify the distinguishability between two probability distributions p and q using entropy concepts [57]. This leads to the construction of the Kullback–Leibler divergence

D(p \| q) = \sum_{i=1,\ldots,n} q_i \log \frac{q_i}{p_i} ,   (3)

which differs from the Kolmogorov distance in that it gives more weight to the largest probabilities. Later refinements produced the ideas of symmetric statistical distances, like the symmetrized Kullback and the Chernoff distances, used nowadays in Quantum Information. As a consequence of this line of development, it is clear that there are very many well-studied options to define a distance in probability space. Since their variations are not large, any of them should be suitable for the problem of Monte Carlo PDF compression, and we present our specific choice in Sect. 3.2.

Let us now be more precise about the way we shall proceed. If we define {p} as the original representation of the probability distribution (with Ñrep replicas) and {q} its compressed version (with Nrep replicas), then given the concept of a distance d between two probability distributions there is an optimal choice of the subset with Nrep replicas, defined as

\{q\}_{\rm opt} \equiv {\rm Min}_{\{q\}} \left[ d(\{q\}, \{p\}) \right] .   (4)

Therefore, the mathematical problem at stake is reduced to finding the optimal subset {q}_opt, by a suitable exploration of the space of minima of the distance d({q}, {p}). In this work, this exploration is performed using genetic algorithms, though many other choices would also be suitable. Fortunately, many choices of subset are equally good minima, and from the practical point of view the specific choice of the minimization strategy is not critical. It is clear that the relevant point is the definition of a distance between the original and compressed replica sets. In this paper we shall take the following approach.

Many valid definitions of statistical distance differ in the way different moments are weighted.
Since we are interested in reproducing real physics, which is dominated by low moments, we shall explicitly include in our figure of merit all the distances between means and standard deviations, but also kurtosis, skewness, and correlations, as well as higher moments. As a consequence, all of them will be minimized, favoring the role of the lower moments.
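As an illustration of Eqs. (2) and (3), the sketch below evaluates the two distances for a pair of discretized probability distributions, e.g. normalized histograms of the prior and compressed replica values at one (x, Q) point; the binning and the regularization cutoff are assumptions of this example, not part of the algorithm.

```python
import numpy as np

def kolmogorov_distance(p, q):
    """Eq. (2): sum of absolute differences of the binned probabilities."""
    return np.sum(np.abs(p - q))

def kullback_leibler(p, q, eps=1e-12):
    """Eq. (3) as written above: sum_i q_i log(q_i / p_i)."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return np.sum(q * np.log(q / p))

def binned_probability(values, edges):
    """Discretize a set of replica values into a normalized histogram."""
    counts, _ = np.histogram(values, bins=edges)
    return counts / counts.sum()
```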
3.2 A compression algorithm for Monte Carlo PDF sets
As we have discussed above, the most important ingredient of the compression strategy is the choice of a suitable distance between the prior and the compressed distributions, Eq. (4), or in other words, the definition of the error function (ERF in the following) for the minimization problem. We have explored different possibilities, and the precise definition of the ERF that will be used in this work can be written generically as follows:
{\rm ERF} = \sum_{k} \frac{1}{N_k} \sum_{i} \left( \frac{C_i^{(k)} - O_i^{(k)}}{O_i^{(k)}} \right)^2 ,   (5)
where k runs over the number of statistical estimators used to quantify the distance between the original and compressed distributions, N_k is a normalization factor, O_i^{(k)} is the value of the estimator k (for example, the mean or the variance) computed at the generic point i (which could be a given value of (x, Q) in the PDFs, for instance), and C_i^{(k)} is the corresponding value of the same estimator in the compressed set. The choice of a normalized ERF is important for the accuracy of the minimization, because some statistical estimators, in particular higher moments, can span various orders of magnitude in different regions of x and Q².
A schematic diagram of our compression strategy is shown in Fig. 5. The prior set of Monte Carlo PDF replicas, the desired number of compressed replicas Nrep, and the value of the factorization scale Q0 at which the PDFs are evaluated are the required inputs of the compression algorithm. Note that it is enough to sample the PDFs in a range of values of Bjorken-x at a fixed value of Q0, since the DGLAP equation uniquely determines the evolution to higher scales Q ≥ Q0. The minimization of the error function is performed using genetic algorithms (GAs), similarly to the neural network training of the NNPDF fits. The GAs work, as usual, by finding candidate subsets of Nrep replicas leading to smaller values of the error function, Eq. (5), until a suitable convergence criterion is satisfied. The output of this algorithm is thus the list of the Nrep replicas from the prior set of Ñrep that minimize the error function. These replicas define the CMC-PDFs for each specific value of Nrep. The final step of the process is a series of validation tests in which the CMC-PDFs are compared to the prior set in terms of parton distributions at different scales, luminosities, and LHC cross sections, in a fully automated way.
Fig. 5 Schematic representation of the compression strategy used in this work: a prior PDF set and the number of compressed replicas are the input to a genetic algorithm, which selects the subset of replicas that minimizes the ERF between the prior and the compressed set
It is important to emphasize that the compression algorithm only selects replicas from a prior set, and no attempt is made to use common theoretical settings, i.e., the method for the solution of the DGLAP evolution equations, or the values of the heavy-quark masses, which are those of the corresponding original PDF sets. This important fact automatically ensures that the compressed set conserves all basic physical requirements of the original combined set such as the positivity of physical cross sections, sum rules, and the original correlations between PDFs. To avoid problems related to the different treatment of the heavy-quark thresholds between the different groups, we choose in this work to compress the combined MC PDF set at a common scale of Q0 = 2 GeV, while we use Q0 = 1 GeV when compressing
the native NNPDF3.0 NLO set.
The compression strategy is conceptually simple: reduce the size of a Monte Carlo PDF set with no substantial loss of information. In order to achieve this goal, the compression algorithm must preserve as much as possible the underlying statistical properties of the prior PDF set. This conceptual simplicity, however, hides a series of non-trivial issues that have to be addressed in the practical implementation. Some of these issues are the sampling of the PDFs in Bjorken-x, the exact definition of the error function, Eq. (5), the treatment of PDF correlations, and the settings of the GA minimization. We now discuss these various issues in turn.
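The selection step can be pictured as a simple evolutionary loop over index subsets. The sketch below only conveys the structure of the search and is not compressor v1.0.0 itself (whose production settings, e.g. the mutation rates of Table 1, differ); erf stands for any error function of the form of Eq. (5), and prior is a hypothetical array of replica values sampled at Q0.

```python
import numpy as np

def compress(prior, n_rep, erf, n_gen=15000, n_mut=5, seed=0):
    """Toy GA: evolve a subset of replica indices that minimizes erf."""
    rng = np.random.default_rng(seed)
    best = rng.choice(len(prior), n_rep, replace=False)    # random starting subset
    best_erf = erf(prior[best])
    for _ in range(n_gen):
        for _ in range(n_mut):                             # a few mutants per generation
            cand = best.copy()
            # mutation: swap one selected replica for one currently outside the subset
            outside = np.setdiff1d(np.arange(len(prior)), cand)
            cand[rng.integers(n_rep)] = rng.choice(outside)
            cand_erf = erf(prior[cand])
            if cand_erf < best_erf:                        # keep improvements only
                best, best_erf = cand, cand_erf
    return best                                            # indices of the compressed replicas
```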
3.2.1 Definition of the error function for the compression
In this work we include in the ERF, Eq. (5), the distances between the prior and the compressed sets of PDFs for the following estimators:
- The first four moments of the distribution, sampled on a grid of x points for n_f flavors in terms of central value, standard deviation, skewness, and kurtosis, at a fixed value of Q = Q0. It is important to notice that these estimators are necessary in order to obtain a realistic and optimized compressed MC set, but they are not sufficient to avoid possible biases, such as a loss of continuity and structure.
- The output of the Kolmogorov–Smirnov test. This is the simplest distance between empirical probability distributions. It complements the terms in the ERF which contain the first four moments, by ensuring that higher moments are also automatically adjusted. However, if this estimator is used alone, ambiguities can arise when defining the regions where the distance is computed, leading to large errors when working with few replicas.
- The correlations between multiple PDF flavors at different x points. This information is important for ensuring that PDF-induced correlations in physical cross sections are successfully maintained.

The final figure of merit used in the compression fit is then the sum over all these six estimators, suitably weighted by the corresponding normalization factors N_k in Eq. (5). This normalization is required because the absolute values of the various estimators can differ among them by several orders of magnitude.
3.2.2 Central values, variances, and higher moments
Let us denote by g_i^{(k)}(x_j, Q_0) and f_i^{(r)}(x_j, Q_0), respectively, the prior and the compressed sets of replicas for a flavor i at the position j of the x-grid containing N_x points, and let N_rep be the number of required compressed replicas. We then define the contribution to the ERF from the distances between the central values of the prior and compressed distributions as follows:

{\rm ERF}_{\rm CV} = \frac{1}{N_{\rm CV}} \sum_{i=-n_f}^{n_f} \sum_{j=1}^{N_x} \left( \frac{f_i^{\rm CV}(x_j, Q_0) - g_i^{\rm CV}(x_j, Q_0)}{g_i^{\rm CV}(x_j, Q_0)} \right)^2 ,   (6)

where N_CV is the normalization factor for this estimator. We only include in the sum those points for which the denominator satisfies g_i^{CV}(x_j, Q_0) ≠ 0. As usual, central values are computed as the average over the MC replicas; for the compressed set

f_i^{\rm CV}(x_j, Q_0) = \frac{1}{N_{\rm rep}} \sum_{r=1}^{N_{\rm rep}} f_i^{(r)}(x_j, Q_0) ,   (7)

while for the prior set we have

g_i^{\rm CV}(x_j, Q_0) = \frac{1}{\tilde{N}_{\rm rep}} \sum_{k=1}^{\tilde{N}_{\rm rep}} g_i^{(k)}(x_j, Q_0) .   (8)

Let us also define r_i^{t}(x_j, Q_0) as a random set of replicas extracted from the prior set, where t identifies an ensemble of random extractions. The number of random extractions of random sets is denoted by N_rand. Now, the normalization factors are extracted, for all estimators, as the lower 68% confidence-level value obtained after N_rand realizations of random sets. In particular, for this estimator we have

N_{\rm CV} = \frac{1}{N_{\rm rand}} \sum_{d=1}^{N_{\rm rand}} \sum_{i=-n_f}^{n_f} \sum_{j=1}^{N_x} \left( \frac{r_i^{d,{\rm CV}}(x_j, Q_0) - g_i^{\rm CV}(x_j, Q_0)}{g_i^{\rm CV}(x_j, Q_0)} \right)^2 \Bigg|_{68\%\ {\rm lower\ band}} .   (9)

For the contribution to the ERF from the distance between standard deviation, skewness, and kurtosis, we can build expressions analogous to that of Eq. (6) by replacing the central-value estimator with the suitable expression for the other statistical estimators, which in a Monte Carlo representation can be computed as

f_i^{\rm STD}(x_j, Q_0) = \sqrt{ \frac{1}{N_{\rm rep}-1} \sum_{r=1}^{N_{\rm rep}} \left( f_i^{(r)}(x_j, Q_0) - f_i^{\rm CV}(x_j, Q_0) \right)^2 } ,   (10)

f_i^{\rm SKE}(x_j, Q_0) = \frac{1}{N_{\rm rep}} \sum_{r=1}^{N_{\rm rep}} \left( f_i^{(r)}(x_j, Q_0) - f_i^{\rm CV}(x_j, Q_0) \right)^3 \Big/ \left( f_i^{\rm STD}(x_j, Q_0) \right)^3 ,   (11)

f_i^{\rm KUR}(x_j, Q_0) = \frac{1}{N_{\rm rep}} \sum_{r=1}^{N_{\rm rep}} \left( f_i^{(r)}(x_j, Q_0) - f_i^{\rm CV}(x_j, Q_0) \right)^4 \Big/ \left( f_i^{\rm STD}(x_j, Q_0) \right)^4 ,   (12)

for the compressed set, with analogous expressions for the original prior set. The normalization factors for these estimators are extracted using the same strategy presented in Eq. (9), by averaging
over random extractions of Nrep replicas, exchanging CV by STD, SKE, and KUR, respectively.
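A direct transcription of Eqs. (6)–(9) for the central-value estimator is sketched below; prior and subset are hypothetical arrays of shape (replicas, flavors, x-points) evaluated at Q0, the "lower 68% CL value" of Eq. (9) is taken here as the 16th percentile over the random trials, and the STD, SKE, and KUR contributions follow the same pattern.

```python
import numpy as np

def erf_central_value(subset, prior, norm):
    """Eq. (6): normalized squared relative distance between central values."""
    f_cv, g_cv = subset.mean(axis=0), prior.mean(axis=0)   # Eqs. (7) and (8)
    mask = g_cv != 0                                       # skip vanishing denominators
    return np.sum(((f_cv[mask] - g_cv[mask]) / g_cv[mask]) ** 2) / norm

def normalization_cv(prior, n_rep, n_rand=1000, seed=0):
    """Eq. (9): normalization from random subsets of size n_rep."""
    rng = np.random.default_rng(seed)
    g_cv = prior.mean(axis=0)
    mask = g_cv != 0
    vals = []
    for _ in range(n_rand):
        idx = rng.choice(len(prior), n_rep, replace=False)
        r_cv = prior[idx].mean(axis=0)
        vals.append(np.sum(((r_cv[mask] - g_cv[mask]) / g_cv[mask]) ** 2))
    return np.percentile(vals, 16)                         # lower edge of the 68% band
```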
3.2.3 The Kolmogorov–Smirnov distance

As we have mentioned above, the minimization of the Kolmogorov–Smirnov distance ensures that both lower and higher moments of the prior distribution are successfully reproduced. In our case, we define the contribution to the total ERF from the Kolmogorov–Smirnov (KS) distance as follows:

{\rm ERF}_{\rm KS} = \frac{1}{N_{\rm KS}} \sum_{i=-n_f}^{n_f} \sum_{j=1}^{N_x} \sum_{k=1}^{6} \left( \frac{F_i^{k}(x_j, Q_0) - G_i^{k}(x_j, Q_0)}{G_i^{k}(x_j, Q_0)} \right)^2 ,   (13)

where F_i^{k}(x_j, Q_0) and G_i^{k}(x_j, Q_0) are the outputs of the test for the compressed and the prior set of replicas, respectively.

The output of the test consists in counting the number of replicas contained in the k regions where the test is performed. We count the number of replicas which fall in each region and then normalize by the total number of replicas of the respective set. Here we have considered six regions, defined as multiples of the standard deviation of the distribution for each flavor i and point x_j. As an example, for the compressed set the regions are

\left[ -\infty,\ -2 f_i^{\rm STD}(x_j, Q_0),\ -f_i^{\rm STD}(x_j, Q_0),\ 0,\ f_i^{\rm STD}(x_j, Q_0),\ 2 f_i^{\rm STD}(x_j, Q_0),\ +\infty \right] ,   (14)

where the values of the PDFs have been subtracted from the corresponding central value.

In this case, the normalization factor is determined from the output of the KS test for random sets of replicas extracted from the prior, denoted by R_i^{k}(x_j, Q_0), as follows:

N_{\rm KS} = \frac{1}{N_{\rm rand}} \sum_{d=1}^{N_{\rm rand}} \sum_{i=-n_f}^{n_f} \sum_{j=1}^{N_x} \sum_{k=1}^{6} \left( \frac{R_i^{k}(x_j, Q_0) - G_i^{k}(x_j, Q_0)}{G_i^{k}(x_j, Q_0)} \right)^2 ,   (15)

and we only include in the sum those points for which the denominator satisfies G_i^{k}(x_j, Q_0) ≠ 0.

3.2.4 PDF correlations

In addition to all the moments of the prior distribution, a sensible compression should also maintain the correlations between values of x and between flavors of the PDFs. In order to achieve this, correlations are taken into account in the ERF by means of the trace method. We define a correlation matrix C for any PDF set as follows:

C_{ij} = \frac{N_{\rm rep}}{N_{\rm rep}-1} \cdot \frac{\langle ij \rangle - \langle i \rangle \langle j \rangle}{\sigma_i \, \sigma_j} ,   (16)

where we have defined

\langle i \rangle = \frac{1}{N_{\rm rep}} \sum_{r=1}^{N_{\rm rep}} f_i^{(r)}(x_i, Q_0) , \qquad \langle ij \rangle = \frac{1}{N_{\rm rep}} \sum_{r=1}^{N_{\rm rep}} f_i^{(r)}(x_i, Q_0) \, f_j^{(r)}(x_j, Q_0) ,   (17)

and \sigma_i is the usual expression for the standard deviation,

\sigma_i = \sqrt{ \frac{1}{N_{\rm rep}-1} \sum_{r=1}^{N_{\rm rep}} \left( f_i^{(r)}(x_i, Q_0) - \langle i \rangle \right)^2 } .   (18)

Now, for each of the n_f flavors we define N_x^{corr} points distributed in x where the correlations are computed. The trace method consists in computing the correlation matrix P based on Eq. (16) for the prior set and then storing its inverse P^{-1}. For n_f flavors and N_x^{corr} points we obtain

g = {\rm Tr}(P \, P^{-1}) = N_x^{\rm corr} \, (2 n_f + 1) .   (19)

After computing the correlation matrix for the prior set, for each compressed set a matrix C is computed and the trace is determined by

f = {\rm Tr}(C \, P^{-1}) .   (20)

The compression algorithm then includes the correlation ERF by minimizing the quantity

{\rm ERF}_{\rm Corr} = \frac{1}{N_{\rm Corr}} \left( \frac{f - g}{g} \right)^2 ,   (21)

where N_Corr is computed as usual from the random sets, in the same way as in Eq. (9).

3.2.5 Choice of GA parameters in compressor v1.0.0

The general strategy that has been presented in this section has been implemented in compressor v1.0.0, the name of the public code [50] released together with this paper. A more detailed description of the code usage is provided in the appendix. The availability of this code ensures that it will be possible to easily redo the compression for any further combination of PDF sets that might be considered in the future.
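The correlation estimator of Sect. 3.2.4, Eqs. (16)–(21), reduces to a few lines once the prior correlation matrix has been built and inverted; values is a hypothetical array of shape (replicas, points), where the points run over the n_f flavors and the N_x^corr values of x.

```python
import numpy as np

def correlation_matrix(values):
    """Correlation matrix of a replica ensemble, Eqs. (16)-(18)."""
    n_rep = values.shape[0]
    d = values - values.mean(axis=0)
    cov = d.T @ d / n_rep                                # <ij> - <i><j>
    sigma = np.sqrt((d ** 2).sum(axis=0) / (n_rep - 1))  # Eq. (18)
    return (n_rep / (n_rep - 1)) * cov / np.outer(sigma, sigma)

def erf_correlation(subset_values, prior_values, norm):
    """Trace-based correlation distance, Eqs. (19)-(21)."""
    P = correlation_matrix(prior_values)
    P_inv = np.linalg.inv(P)
    g = np.trace(P @ P_inv)                                   # number of sampled points, Eq. (19)
    f = np.trace(correlation_matrix(subset_values) @ P_inv)   # Eq. (20)
    return ((f - g) / g) ** 2 / norm                          # Eq. (21)
```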
This said, there is a certain flexibility in the choice of settings for the compression, for example in the choice of parameters for the genetic algorithm, the sampling of the PDFs in x, or the choice of the common scale Q0 for the compression. The compression setup used in this paper is presented in Table 1, together with the optimal set of GA parameters and mutation probability rates, determined by trial and error.

Table 1 (a) Settings of the compression algorithm used in this work. (b) Mutation rates used in the genetic algorithm minimization

(a) GA parameters, compressor v1.0.0
  N_gen^max    15,000
  N_mut        5
  N_x          70
  x_min        10^-5
  x_max        0.9
  n_f          7
  Q_0          user-defined
  N_x^corr     5
  N_rand       1000

(b) Mutation rates
  N_rep^mut    P_mut (%)
  1            30
  2            30
  3            10
  4            30

As mentioned before, in the present work the compression of the CMC-PDFs is performed at a scale of Q0 = 2 GeV, while in the next section we use Q0 = 1 GeV for the native NNPDF3.0 NLO set. The ERF includes only the contribution of the n_f = 7 light partons: u, ū, d, d̄, s, s̄, and g.

Concerning the sampling of the PDFs in x, we have limited the range of x points to the region where data is available, i.e. x ∈ [10^-5, 0.9], by selecting 35 points logarithmically spaced in [10^-5, 0.1] and 35 points linearly spaced in [0.1, 0.9]. Note that this is different from the Meta-PDF approach, where for each PDF a different range [x_min, x_max] is used for the fit with the meta-parametrization, restricted to the regions where experimental constraints are available for each flavor.

The correlation matrix is then computed for the n_f input PDFs at N_x^corr = 5 points in x, generating a 35 × 35 correlation matrix. Increasing the number of points for the calculation of the correlation matrix would be troublesome, since numerical instabilities due to the presence of large correlations between neighboring points in x would be introduced.

The genetic algorithm minimization is performed for a fixed length of 15k generations. Note that, as opposed to the neural network learning in the NNPDF fits, in the compression problem there is no risk of over-learning, since the absolute minimum of the error function always exists. On the other hand, we find that after a few thousand generations the ERF saturates and no further improvement is achieved by running the code longer, hence the maximum number of GA generations N_gen^max = 15k used in this work.

3.3 Results of the compression for native MC PDF sets

In order to illustrate the performance of the compression algorithm, we consider here the compression of a native Monte Carlo set of PDFs at Q0 = 1 GeV, based on the prior set with Ñrep = 1000 replicas of NNPDF3.0 NLO.

In Fig. 6 we show the dependence of the total ERF as a function of the number of iterations of the GA for Nrep = 10, 20, 30, 40, 50, 60, 70, 80, and 90. We observe that the first 1k iterations are extremely important during the minimization, while after 15k iterations the total error function is essentially flat for any required number of compressed replicas. For each compression, the final value of the error function is different, with deeper minima being achieved as we increase the number of compressed replicas, as expected. The flatness of the ERF as a function of the number of iterations confirms that the current parameters provide a suitably efficient minimization strategy.

Fig. 6 The value of the total error function, Eq. (5), for the compression of the 1000-replica set of NNPDF3.0 NLO, as a function of the number of GA generations, for different values of the number of replicas Nrep in the compressed set. After 15k iterations the error function saturates and no further improvement would be achieved with longer training

In order to quantify the performance of the compression algorithm, and to compare it with that of a random selection of the reduced set of replicas, Fig. 7 shows the various contributions to the ERF, Eq. (5), for the compression of the NNPDF3.0 NLO set with Ñrep = 1000 replicas.
Fig. 7 The various contributions to the ERF, Eq. (5), for the compression of the NNPDF3.0 NLO set with Ñrep = 1000 replicas. For each value of Nrep, we show the value of each contribution to the ERF for the best-fit result of the compression algorithm (red points). We compare the results of the compression with the values of the ERF averaged over Nrand = 1000 random partitions of Nrep replicas (blue points), as well as the 50, 68, and 90% confidence-level intervals computed over these random partitions. The dashed horizontal line is the 68% lower band of the ERF for the average of the random partitions with Nrep = 100, and is inserted for illustration purposes only
For each value of Nrep, we show the value of each contribution to the ERF for the best-fit result of the compression algorithm (red points). We compare the results of the compression with the values of the ERF averaged over Nrand = 1000 random partitions of Nrep replicas (blue points), as well as the 50, 68, and 90% confidence-level intervals computed over these random partitions.
Various observations can be made from the inspection of Fig. 7. First of all, the various contributions to the ERF tend to zero when the number of compressed or random replicas tends to the size of the prior set, as expected for consistency. For the random partitions of Nrep replicas, the mean and the median averaged over the Nrand trials are not identical, emphasizing the importance of taking confidence levels. From Fig. 7 we also confirm that the compression algorithm is able to provide sets of PDFs with smaller ERF values for all estimators, outperforming random selections with a much larger number of replicas. To emphasize this point, the dashed horizontal line in Fig. 7 corresponds to the lower limit of the 68% confidence level of the ERF computed over Nrand = 1000 random partitions with Nrep = 100, and is inserted for illustration purposes only. It indicates that the NNPDF3.0 NLO PDF set with Ñrep = 1000 can be compressed down to Nrep = 50 replicas in a way that reproduces the original distribution better than most of the random partitions of Nrep = 100 replicas.
The results of Fig. 7 confirm that the compression algorithm outperforms essentially any random selection of replicas for the construction of a reduced set, and provides an adequate representation of the prior probability distribution with a greatly reduced number of replicas. Similar results are obtained when compressing the CMC-PDFs, as we discuss in Sect. 4.2.
4 The compressed Monte Carlo PDF sets
In this section we present the results for the CMC-PDFs, first discussing the compression of a native Monte Carlo PDF set, in this case NNPDF3.0 with Nrep = 1000, and then the compression of the MC combination of NNPDF3.0, CT14, and MMHT14 with Nrep = 900. In both cases, we compare the PDFs from the prior and compressed sets, for different values of the number of replicas Nrep of the latter. We also verify that correlations between PDFs are successfully reproduced by the compression. The phenomenological validation of the CMC-PDF sets at the level of LHC observables is addressed in Sect. 5.
4.1 Compression of native MC PDF sets
First of all, we show the results for the compression of a native MC PDF set, for the case of the NNPDF3.0 NLO set with Nrep = 1000 replicas. In Fig. 8 we compare the original and the compressed gluon and down quark at Q² = 2 GeV², using Nrep = 50 in the compressed set. Excellent agreement can be seen at the level of central values and variances. The comparison is also shown at a typical LHC scale of Q = 100 GeV, where similar agreement is found. The plots in this section have been obtained using the APFEL-Web online PDF plotter [58,59]. The fact that the central values of the original set are accurately reproduced by the compressed set can also be seen from Fig. 9, where we show the distribution of χ² for all the experiments included in the NNPDF3.0 fit, comparing the original and the compressed PDF sets, and find that they are indistinguishable.
Next, we compare in Fig. 10 the various PDF luminosities between the original and the compressed set at the LHC with a center-of-mass energy of √s = 13 TeV. We show the gluon–gluon, quark–antiquark, quark–gluon, and quark–quark luminosities. As in the case of the individual PDF flavors, good agreement is found over the whole range of possible final-state invariant masses MX. Note that the agreement is also good in regions, such as small MX and large MX, where the underlying PDF distribution is known to be non-Gaussian.
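For reference, the parton luminosities shown in these comparisons follow the conventional definition (as used, for instance, in the benchmark study of Ref. [5]); schematically, for two partons i and j and a final-state invariant mass MX,

\[
\Phi_{ij}(M_X^2) \;=\; \frac{1}{s}\int_{\tau}^{1}\frac{dx}{x}\,
 f_i\!\left(x, M_X^2\right) f_j\!\left(\frac{\tau}{x}, M_X^2\right),
 \qquad \tau = \frac{M_X^2}{s},
\]

with the quark–quark and quark–antiquark combinations obtained by summing over the relevant flavor pairs.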
It is also important to verify not only that central values and variances are reproduced, but also that higher moments and correlations are well reproduced by the compression. Indeed, one of the main advantages of the Nrep = 1000 replica sets of NNPDF, as compared to the Nrep = 100 sets, is that correlations should be reproduced more accurately in the former case. In Fig. 11 we show the results for the correlation coefficient between different PDFs, as a function of Bjorken-x, for Q = 1 GeV. We compare the results of the original Nrep = 1000 replica set with those of the compressed sets for a number of values of Nrep. From top to bottom and from left to right we show the correlations between up and down quarks, between up and strange antiquarks, between down quarks and down antiquarks, and between up quarks and down antiquarks. The correlations between PDF flavors have been computed using the suitable expression for Monte Carlo sets [31]. As we can see, the correlations are reasonably well reproduced: already with Nrep = 50, the results of the compressed set and of the prior are very close to each other.
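For completeness, the correlation coefficient between two PDF values used here is the standard expression for an ensemble of Monte Carlo replicas, as given in Ref. [31]; in the notation of this paper it reads

\[
\rho\big[f_a(x_1,Q),f_b(x_2,Q)\big] \;=\;
\frac{N_{\rm rep}}{N_{\rm rep}-1}\,
\frac{\big\langle f_a(x_1,Q)\, f_b(x_2,Q)\big\rangle_{\rm rep}
 - \big\langle f_a(x_1,Q)\big\rangle_{\rm rep}\big\langle f_b(x_2,Q)\big\rangle_{\rm rep}}
 {\sigma_a(x_1,Q)\,\sigma_b(x_2,Q)}\,,
\]

where the averages run over the Nrep replicas and σa, σb are the corresponding standard deviations.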
Another illustration of the fact that PDF correlations are maintained in the compression is provided by Fig. 12, where we show the correlation matrix of the NNPDF3.0 set at a scale of Q = 100 GeV, comparing the prior with Nrep = 1000 replicas to the compressed set with Nrep = 50 replicas. The correlation matrices presented here are defined on a grid of Nx = 50 points in x, logarithmically distributed in the range [10⁻⁵, 1], for each flavor (s̄, ū, d̄, g, d, u, s). To facilitate the comparison, in the bottom plot we show the differences between the correlation coefficients in the two cases. It is clear from this comparison that the agreement of the PDF correlations reported in Fig. 11 holds for the complete set of possible PDF combinations, over the whole relevant range of Bjorken-x.
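As an illustration of how such a correlation matrix can be assembled in practice, the following sketch uses the LHAPDF6 Python bindings [49]; the grid settings mirror those quoted above, while the name of the compressed set is a placeholder and not part of any released code.

import numpy as np
import lhapdf

# Flavors (sbar, ubar, dbar, g, d, u, s) and an x grid of 50 points in [1e-5, 1),
# mirroring the setup described in the text (illustrative choices).
FLAVORS = [-3, -2, -1, 21, 1, 2, 3]
X_GRID = np.logspace(-5, 0, 50, endpoint=False)
Q = 100.0  # GeV

def correlation_matrix(setname):
    members = lhapdf.mkPDFs(setname)[1:]  # skip member 0 (central value)
    # values[r, k]: replica r evaluated on the flattened (flavor, x) grid
    values = np.array([[pdf.xfxQ(fl, x, Q) for fl in FLAVORS for x in X_GRID]
                       for pdf in members])
    return np.corrcoef(values, rowvar=False)  # (350 x 350) correlation matrix

corr_prior = correlation_matrix("NNPDF30_nlo_as_0118_1000")
corr_compressed = correlation_matrix("NNPDF30_compressed_50")  # hypothetical set name
difference = corr_compressed - corr_prior  # analogue of the bottom panel of Fig. 12
print("largest |difference| among correlation coefficients:", np.abs(difference).max())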
Fig. 8 Upper plots: comparison of the prior NNPDF3.0 NLO set with Nrep = 1000 and the compressed set with Nrep = 50 replicas, for the gluon and the down quark at the scale Q² = 2 GeV². Lower plots: the same comparison, this time at a typical LHC scale of Q = 100 GeV, normalized to the central value of the prior set
Fig. 9 Distribution of χ² for all the experiments included in the NNPDF3.0 fit, comparing the original and the compressed PDF sets
Having validated the compression results for a native MC set, we now turn to a discussion of the results of the compression for a combined MC PDF set.
4.2 Compression of the CMC-PDFs
Now we turn to a similar validation study, but this time for the CMC-PDFs. As we have discussed in Sect. 2, the combined MC set has been constructed by adding together Nrep = 300 replicas each of NNPDF3.0, MMHT14, and CT14, for a total of Nrep = 900 replicas. Starting from this prior set, the compression algorithm has been applied as discussed in Sect. 3, and we have produced CMC-PDF sets for a number of values of Nrep from 5 to 250 replicas, using the settings from Sect. 3.2.5.
We have verified that the performance of the compression algorithm is similar regardless of the prior. To illustrate this point, in Fig. 13 we show the corresponding version of Fig. 7, namely the various contributions to the error function, for the compression of the CMC-PDF sets. We see that also in the case of the CMC-PDF sets the compression improves the ERF, as compared to random selections, by an order of magnitude or more.
It is interesting to determine, for a given compression, how many replicas are selected from each of the three PDF sets that enter the combination.
Fig. 10 Comparison of PDF luminosities between the original and compressed NNPDF3.0 sets, for the LHC at 13 TeV, as a function of the invariant mass of the final state MX. From top to bottom and from left to right, we show the gluon–gluon, quark–antiquark, quark–gluon, and quark–quark luminosities
Given that we originally assign equal weight, that is, the same number of replicas, to the three sets, we expect that if the compression algorithm is unbiased the number of replicas from each set after the compression should also be approximately the same. We have verified that this is indeed the case. For instance, in Fig. 14 we show, for a compression with Nrep = 100 replicas, how the replicas of the original distribution are selected: a similar number has been chosen from NNPDF3.0, CT14, and MMHT14, namely 32, 36, and 32 replicas, respectively, in agreement with our expectations.
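The bookkeeping behind Fig. 14 amounts to classifying the indices of the selected replicas according to the block of the combined prior they come from. A toy sketch is given below; it assumes, purely for illustration, that the MC900 prior stores the NNPDF3.0, CT14, and MMHT14 replicas in consecutive blocks of 300, and that the selected indices are available as a plain list of integers.

from collections import Counter

# Assumed (illustrative) layout of the combined prior: three blocks of 300 replicas
BLOCKS = {
    "NNPDF3.0": range(1, 301),
    "CT14": range(301, 601),
    "MMHT14": range(601, 901),
}

def provenance(selected_indices):
    """Count how many selected replicas originate from each input set."""
    counts = Counter()
    for idx in selected_indices:
        for name, block in BLOCKS.items():
            if idx in block:
                counts[name] += 1
    return counts

# Example with made-up indices; in practice they would be read from the
# compressor output file (e.g. replica_compression_100.dat, format assumed).
print(provenance([17, 250, 305, 777, 899]))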
We now address the comparison between the MC900 prior and the new CMC-PDFs. For illustration, we will show results for Nrep = 100, with the understanding that using a larger number of replicas would further improve the agreement with the prior. In Fig. 15 we show the comparison of the PDFs between the original Monte Carlo combination of NNPDF3.0, CT14 and MMHT14, with Nrep = 900 replicas, and the corresponding compressed set with Nrep = 100 replicas. We show the gluon, up quark, down antiquark, and strange quark, as ratios to the prior set at a typical LHC scale of Q = 100 GeV. We see that in all cases the agreement is sufficiently good.
In Fig. 16 we show the same as in Fig. 2, namely the histograms representing the distribution of the values of the PDFs over the Monte Carlo replicas for different flavors and values of (x, Q), now comparing the original and compressed CMC-PDFs with Nrep = 900 and Nrep = 100 replicas, respectively. As in Fig. 2, we also show in Fig. 16 a Gaussian with mean and variance determined from the prior Nrep = 900 CMC-PDF set.
To gauge how the agreement between the prior and the compressed Monte Carlo sets depends on the number of replicas, it is illustrative to compare central values and variances for the different values of Nrep used in the compression. This comparison is shown for the gluon and the down antiquark in Fig. 17. In the left plots, we compare the central value of the PDF for different values of Nrep, normalized to the prior result. We also show the one-sigma PDF band, which is useful to compare the deviations found in the compressed sets with the typical statistical fluctuations. We see that starting from Nrep = 25 replicas, the central values of the compressed sets fluctuate much less than the size of the PDF uncertainties.
Fig. 11 Comparison between the PDF correlations among different PDF flavors, as a function of Bjorken-x, for Q = 1 GeV, for the original NNPDF3.0 set with Nrep = 1000 replicas and the compressed sets for various values of Nrep. From top to bottom and from left to right, we show the correlations between up and down quarks, between up and strange antiquarks, between down quarks and down antiquarks, and between up quarks and down antiquarks
In the right plots of Fig. 17 we show the corresponding comparison at the level of standard deviations, again normalized to the standard deviation of the prior set. Here, for reference, the green band shows the variance of the variance itself, which is typically of the order of 20–30% in a Monte Carlo PDF set [31]. We see that with Nrep = 100 replicas or more, the variance of the compressed set deviates by a few percent at most, much less than the statistical fluctuations of the PDF uncertainty itself.
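The ratios plotted in Fig. 17 can be reproduced with a few lines of Python once the prior and compressed grids are available through LHAPDF6; the set names below are placeholders for whichever grids are being compared.

import numpy as np
import lhapdf

def replica_values(setname, pid, xs, Q):
    """Return an array of shape (Nrep, len(xs)) with xf(x,Q) for each replica."""
    members = lhapdf.mkPDFs(setname)[1:]  # skip the central member
    return np.array([[pdf.xfxQ(pid, x, Q) for x in xs] for pdf in members])

xs = np.logspace(-5, -0.02, 100)
prior = replica_values("MC900_nnlo", 21, xs, 100.0)         # placeholder set name (gluon)
compressed = replica_values("CMC100_nnlo", 21, xs, 100.0)   # placeholder set name

ratio_central = compressed.mean(axis=0) / prior.mean(axis=0)
ratio_sigma = compressed.std(axis=0, ddof=1) / prior.std(axis=0, ddof=1)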
As in the case of the native Monte Carlo sets, it is also useful for the CMC-PDFs to compare the parton luminosities between the original and the compressed sets. This comparison is shown in Fig. 18, which is the analog of Fig. 10 for the CMC-PDFs. As in the case of the native sets, we also find here good agreement at the level of PDF luminosities. As we will see in the next section, this agreement also translates into all the LHC cross sections and differential distributions that we have explored.
Having verified in a number of ways that central values and variances of the PDFs are successfully preserved by the compression, we turn to a study of the PDF correlations. We have verified that a level of agreement similar to that of the native MC sets, Fig. 11, is achieved also here. To illustrate this point, in Fig. 19 we show a comparison of the correlation coefficients as a function of x, for Q = 100 GeV, for different PDF combinations, between the original CMC-PDF set with Nrep = 900 replicas and the compressed sets for different values of Nrep. From left to right and from top to bottom we show the correlation between gluon and up quark, between up and strange quarks, between gluon and charm quark, and between the down and up quarks. We see that already with Nrep = 100 replicas the result for the correlation is close enough to that of the prior with Nrep = 900 replicas.
Fig. 12 The correlation matrix of the NNPDF3.0 set with Nrep = 1000 at Q = 100 GeV. On the right, the same matrix for the NNPDF3.0 compressed set with Nrep = 50 replicas. The bottom plot represents the difference between the two matrices. See text for more details
The analogous version of Fig. 12 for the correlation matrix of the CMC-PDFs is shown in Fig. 20. As in the case of the native MC sets, also for the CMC-PDFs the broad pattern of the correlation matrix of the original combination with Nrep = 900 replicas is maintained by the compression to Nrep = 100 replicas, as is quantified by the bottom plot, representing the differences between the correlation coefficients in the two cases.
5 CMC-PDFs and LHC phenomenology
Now we present the validation of the compression algorithm applied to the combination of Monte Carlo PDF sets for a variety of LHC cross sections. We will compare the results of the original combined Monte Carlo set MC900 with those of the CMC-PDFs with Nrep = 100 replicas (CMC-PDF100).
This validation has been performed both at the level of inclusive cross sections and of differential distributions with realistic kinematical cuts. All cross sections are computed with the NNLO PDF sets, even when the hard cross sections are computed at NLO, which is adequate for the present illustrative purposes.
First of all, we compare the MC900 prior and the CMCPDFs for benchmark inclusive LHC cross sections, and then we perform the validation for LHC differential distributions including realistic kinematical cuts. In the latter case we use fast NLO interfaces for the calculation of these LHC observables: this allows us to straightforwardly repeat the validation when different PDF sets are used for the compression without the need to repeat any calculation. Finally, we verify that the correlations between physical observables are also maintained by the compression algorithm, both for inclusive cross sections and for differential distributions.
Fig. 13 Same as Fig. 7 for the CMC-PDFs, starting from the prior with Nrep = 900 replicas
5.1 LHC cross sections and differential distributions
We begin with the validation of the CMC-PDF predictions at the level of inclusive cross sections. The following results have been computed for the LHC at a centre-of-mass energy of 13 TeV. In Fig. 21 we compare the results obtained with the prior Monte Carlo combined set and with the CMC-PDFs with Nrep = 100 replicas, everything normalized to the central value of the prior set. The processes included in Fig. 21 are the same as those considered in the benchmark comparisons of Sect. 2.2. As we can see from Fig. 21, the agreement at the level of central values is always at the permille level, and the size of the PDF uncertainties is also very similar between the original and compressed sets. Taking into account the fluctuations of the PDF uncertainty itself, shown in Fig. 17, it is clear that the predictions from the original and the compressed sets are statistically equivalent.
Fig. 14 Replicas of the original combined set of Nrep = 900 replicas selected for the compression with Nrep = 100 replicas, classified for each of the three input PDF sets
Fig. 15 Comparison of the PDFs between the original Monte Carlo combination of NNPDF3.0, CT14, and MMHT14 (MC900) and the compressed CMC100 PDFs. We show the gluon, up quark, down antiquark, and total quark singlet, as ratios to the prior, for Q² = 10⁴ GeV²
[Figure 16: histograms of the replica distributions for g(x = 0.01), u(x = 5·10⁻⁵), d̄(x = 0.20), and s(x = 0.05), all at Q = 100 GeV, comparing MC900, CMC100, and a Gaussian reference.]
Fig. 16 Same as Fig. 2 now with the comparison between MC900 and CMC100. The Gaussian curve has the same mean and variance as the MC900 prior
Having established that the compression works for total cross sections, one might wonder whether the accuracy degrades when moving to differential distributions, especially in extreme regions of phase space and in the presence of realistic final-state kinematical cuts. To verify that this is not the case, we now consider a number of differential processes computed using MCFM [60] and NLOjet++ [61] interfaced to APPLgrid [62], as well as MadGraph5_aMC@NLO [37] interfaced to aMCfast [63] and APPLgrid. All processes are computed for √s = 7 TeV, and the matrix-element calculations have been performed at fixed NLO perturbative order. The advantage of using fast NLO grids is that it is straightforward to repeat the validation, without having to redo the full NLO computation, when a different set of input PDFs is used for the combination. Note that while for simplicity we only show results for selected bins, we have verified that the agreement also holds for the complete differential distributions.
The corresponding version of Fig. 21 for the case of LHC 7 TeV differential distributions is shown in Fig. 22. The theoretical calculations are provided for the following processes:
– The ATLAS high-mass Drell–Yan measurement [64], integrated over the rapidity range |yll| ≤ 2.1 and binned as a function of the dilepton invariant mass Mll. Here we show the prediction for the highest mass bin, Mll ∈ [1.0, 1.5] TeV.
– The CMS double-differential Drell–Yan measurement [65] in the low-mass region, 20 GeV ≤ Mll ≤ 30 GeV, as a function of the dilepton rapidity yll. The prediction is shown for the lowest rapidity bin, yll ∈ [0.0, 0.1].
– The CMS W+ lepton rapidity distribution [66]. The prediction is shown for the lowest rapidity bin, yl ∈ [0.0, 0.1].
– The CMS measurement of W+ production in association with charm quarks [67], as a function of the lepton rapidity yl. The prediction is shown for the lowest rapidity bin, yl ∈ [0.0, 0.3].
Fig. 17 Comparison of the central values (left plots) and one-sigma intervals (right plots) for the CMC-PDFs with different values of Nrep (5, 25, 50, 100, and 250, respectively), for the gluon (upper plots) and the down antiquark (lower plots). Results are shown normalized to the central value and the standard deviation of the MC900 prior combined set, respectively. We also show the one-sigma PDF band (left plots) and the variance of the variance (right plots) as a full green band
– The ATLAS inclusive jet production measurement [68] in the central rapidity region, |yjet| ≤ 0.3, as a function of the jet pT. The prediction is shown for the lowest pT bin, pT ∈ [20, 30] GeV.
– The same ATLAS inclusive jet production measurement [68], now in the forward rapidity region, 3.6 ≤ |yjet| ≤ 4.4, as a function of the jet pT. The prediction is shown for the highest pT bin, pT ∈ [110, 160] GeV.
More details as regards the selection cuts applied to these processes can be found in the original references and in the NNPDF3.0 paper [16], though note that here no comparison with experimental data is attempted. The various observables of Fig. 22 probe a wide range of PDF combinations, from light quarks and antiquarks (low- and high-mass Drell–Yan) and strangeness (W+charm) to the gluon (central and forward jets), over a wide range of Bjorken-x and momentum transfers Q².
As we can see from Fig. 22, the level of agreement between the MC900 prior and the CMC-PDFs with Nrep = 100 is similar to that found for the inclusive cross sections. This is also true for other related processes that we have studied but do not show explicitly here. This agreement is of course understood from the fact that the compression is performed at the level of the parton distributions, as shown in Sect. 4. Note also that the agreement found for the processes in Fig. 22 is particularly remarkable since in some cases, like forward Drell–Yan or forward jet production, the underlying PDFs are probed at large x, where deviations from Gaussian behavior are sizable: even in this case, the compression algorithm is successful in reproducing the mean and variance of the prior probability distribution.
Fig. 18 Same as Fig. 10 for the comparison between the prior set MC900 and the compressed set CMC100
Another illustrative way of checking that the compression algorithm really preserves the non-Gaussian features of the prior is provided by the probability distribution of specific LHC cross sections in which such features are clearly observed. To better visualize the probability density P(σ) estimated from the Monte Carlo sample, we use the Kernel Density Estimation (KDE) method. In this technique, the probability distribution is obtained by averaging a kernel function K centered at the predictions {σ_i} obtained for each individual PDF replica:

P(\sigma) = \frac{1}{N_{\rm rep}} \sum_{i=1}^{N_{\rm rep}} K(\sigma - \sigma_i) .   (22)

Here we choose the function K to be a normal distribution, that is,

K(\sigma - \sigma_i) = \frac{1}{\sqrt{2\pi h^2}} \exp\left( -\frac{(\sigma_i - \sigma)^2}{2 h^2} \right) ,   (23)

where we set the parameter h, known as the bandwidth, to the value that would be optimal if the underlying data were Gaussian. This choice is known as the Silverman rule.²
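A minimal sketch of Eqs. (22)–(24), assuming the per-replica predictions are available as a NumPy array, is the following; it is meant only to illustrate the estimator, not the actual plotting code used for Fig. 23.

import numpy as np

def kde(sigma_grid, sigma_replicas):
    """Gaussian KDE of Eq. (22) with the Silverman bandwidth of Eq. (24)."""
    s = np.std(sigma_replicas, ddof=1)
    nrep = len(sigma_replicas)
    h = (4.0 * s**5 / (3.0 * nrep)) ** 0.2  # Silverman rule, Eq. (24)
    diffs = sigma_grid[:, None] - sigma_replicas[None, :]
    kernels = np.exp(-diffs**2 / (2.0 * h**2)) / (np.sqrt(2.0 * np.pi) * h)
    return kernels.mean(axis=1)  # average of the kernels, Eq. (22)

# usage (illustrative): P = kde(np.linspace(grid_min, grid_max, 200), sigma_i)
# where sigma_i is the array of predictions from the individual replicas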
In Fig. 23 we compare the probability distributions, obtained using the KDE method, for two LHC cross sections: CMS W+charm production in the most forward bin (left plot) and the LHCb Z → e⁺e⁻ rapidity distribution at y_Z = 4 (right plot). We compare the original prior MC900 with the CMC-PDF100 and MCH100 reduced sets. In the case of the W+charm cross section, which is directly sensitive to the poorly known strange PDF, the prior shows a double-hump structure, which is reasonably well reproduced by the CMC-PDF100 set but disappears if a Gaussian reduction, in this case MCH100, is used.
² It can be shown that this choice amounts to using a bandwidth of

h = \left( \frac{4 s^5}{3 N_{\rm rep}} \right)^{1/5} ,   (24)

where s is the standard deviation of the sample.
Fig. 19 Same as Fig. 11 for the correlation coefficients of the CMC-PDFs, evaluated at Q = 100 GeV, for a range of values of Nrep in the compressed set, from 5 to 100 replicas, compared with the prior MC900 result. From left to right and from top to bottom we show the correlation between gluon and up quark, between up and strange quarks, between gluon and charm quark, and between the down and up quarks
For the LHCb forward Z production, both the prior and CMC-PDF100 are significantly skewed, a feature which is lost in the Gaussian reduction of MCH100.
5.2 Correlations between LHC cross sections
Any reasonable algorithm for the combination of PDF sets should reproduce not only the central values and the variances of the prior distribution, but also the correlations between physical observables. This is relevant for phenomenological applications at the LHC, where PDF-induced correlations are used for instance to determine the degree of correlation of the systematic uncertainties between different processes. Using the PDF4LHC recommendations, the PDF-induced correlations between different Higgs production channels were estimated in Ref. [55], and this information is now extensively used in the Higgs analyses of ATLAS and CMS.
To validate that the compression algorithm presented here also maintains the correlations of the original set, we have computed the correlations between all the processes used in the previous section, both for the MC900 prior and for the CMC-PDF100 set. The results are shown in Fig. 24 for the NNLO inclusive cross sections of Fig. 21, and in Fig. 25 for the NLO differential distributions of Fig. 22. We have also verified that from Nrep = 50 replicas onwards the correlations are very well reproduced by the compressed set.
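At the level of observables, the PDF-induced correlation between two processes is computed directly from their per-replica predictions, with the same Monte Carlo expression used for the PDFs themselves; a short sketch is shown below (the input arrays are assumed to contain one prediction per replica).

import numpy as np

def pdf_correlation(sigma_a, sigma_b):
    """PDF-induced correlation between two observables over a replica ensemble."""
    nrep = len(sigma_a)
    cov = np.mean(sigma_a * sigma_b) - np.mean(sigma_a) * np.mean(sigma_b)
    return (nrep / (nrep - 1.0)) * cov / (np.std(sigma_a, ddof=1) * np.std(sigma_b, ddof=1))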
To gauge the effectiveness of the compression algorithm, in Figs. 24 and 25 we also show the 68% confidence-level interval for the correlation coefficients computed from Nrand = 1000 random partitions of Nrep = 100 replicas: we see that the compression in general outperforms the results from a random selection of an Nrep = 100 replica set. The agreement of the correlations at the level of LHC observables is of course a direct consequence of the fact that correlations are maintained by the compression at the PDF level, as discussed in detail in Sect. 4.2.
Fig. 20 Same as Fig. 12 for the correlation matrix of the CMC-PDFs at Q = 100 GeV, comparing the prior combination MC900 (left plot) and the CMC-PDF100 set (right plot). In the bottom plot we show the difference between the correlation coefficients in the two cases
Only in a very few cases is the correlation coefficient of the CMC-PDF set outside the 68% confidence-level range of the random selections, and this happens only when the correlations are very small to begin with, so this fact is not relevant for phenomenology.
To summarize, the results of this section show that, at the level of LHC phenomenology, CMC-PDFs with Nrep = 100 replicas can be reliably used instead of the original Monte Carlo combination of PDF sets, thereby reducing the CPU-time burden associated with the calculation of the theory predictions, relative to the original Nrep = 900 replicas, by almost a full order of magnitude.
6 Summary and delivery
In this work we have presented a novel strategy for the combination of individual PDF sets, based on the Monte Carlo method followed by a compression algorithm. The resulting Compressed Monte Carlo PDFs, or CMC-PDFs for short, are suitable for estimating PDF uncertainties in theoretical predictions of generic LHC processes. As compared to the original PDF4LHC recommendation, the new approach we advocate here is both more straightforward to use, being based on a single combined PDF set, and less computationally expensive: Nrep ≃ 100 replicas are enough to preserve the statistical features of the prior combination with sufficient accuracy for most relevant applications. Using as an illustration the combination of the recent NNPDF3.0, CT14, and MMHT14 NNLO sets, we have verified that the compression algorithm successfully reproduces the predictions of the prior combined MC set for a wide variety of LHC processes and their correlations.
The compressed PDF sets at NLO and NNLO, with Nrep = 100 replicas each and αs(MZ) = 0.118, will be made available in LHAPDF6 [49] as part of the upcoming PDF4LHC 2015 recommendations.
Fig. 21 Comparison of the predictions of the Monte Carlo combined prior MC900 with those of the CMC-PDFs with Nrep = 100 replicas, normalized to the central value of the former, for a number of benchmark inclusive NNLO cross sections at the LHC with √s = 13 TeV. The error bands correspond to the PDF uncertainty bands for each of the sets. See text for more details
Fig. 22 Same as Fig. 21, for a variety of NLO differential distributions computed with MCFM and NLOjet++ interfaced to APPLgrid for the LHC with √s = 7 TeV. See text for the details of the choice of binning in each process
Additional members to estimate the combined PDF+αs uncertainty will also be included in the same grid files, and new functions will be provided in LHAPDF 6.1.6 to facilitate the computation of this combined PDF+αs uncertainty. In addition, we have also made publicly available the compression algorithm used in this work:
https://github.com/scarrazza/compressor
This compressor code [50] includes a script to combine Monte Carlo sets from different groups into a single MC set, the compression algorithm itself, and the validation suite. A concise user manual for the code can be found in the appendix; the code produces CMC-PDF sets directly in the LHAPDF6 format, ready to be used for phenomenological applications.
We would like to emphasize that it is beyond the scope of this paper to determine which specific PDF sets should be used in the present or future PDF4LHC combinations: this is an issue that only the PDF4LHC Steering Committee has the mandate to decide. We have used the most recent NNLO sets from NNPDF, CT, and MMHT for consistency with the current prescription, but using the publicly available code it is possible to construct CMC-PDFs from any other choice of sets. We note, however, that for the combination of PDF sets based on very different input datasets or theory assumptions as compared to the three global sets, the determination of the number of replicas from each set that should be included in the combination is a complex problem which remains to be understood.
Examples of applications where combined PDF sets, as implemented by the CMC-PDFs, should be used include the computation of PDF uncertainties for acceptances and efficiencies, due to extrapolations or interpolations, the estimation of PDF uncertainties in the extraction of Higgs couplings or other fundamental SM parameters such as MW from LHC data, and the determination of limits in searches for BSM physics. Even in these cases, whenever possible, providing results obtained using the individual PDF sets should be encouraged, since such comparisons shed light on the origin of the total PDF uncertainty for each particular application and provide guidance about how this PDF uncertainty might be reduced. Needless to say, in all PDF-sensitive Standard Model comparisons between experimental data and theory, the individual PDF sets should be used, rather than only a combined PDF set. The latter might be suitable only if PDF uncertainties are much smaller than all other theoretical and experimental uncertainties.
It is also important to emphasize that the CMC-PDFs, as well as any other method for the combination of PDF sets, do not replace the individual PDF sets: CMC-PDFs are simply a user-convenient method to easily obtain the results of the combination of the individual PDF sets. For this reason, it should be clear that whenever the CMC-PDF sets are used, not only should the present publication be cited, but also the original publications corresponding to the individual PDF sets used as input to the combination.
Let us conclude by stating the obvious fact that the availability of a method for the combination of different sets does not reduce, but if anything strengthens, the need to keep working on reducing the PDF uncertainties in the individual sets, in terms of improved theory, more constraining data, and refined methodology, as well as to continue the benchmarking exercises between groups that have been performed in the past [4,69,70] and that are instrumental to understand (and eventually reduce) the differences between different groups.
Fig. 23 The probability distribution for two LHC cross sections: CMS W+charm production in the most forward bin (left plot) and the LHCb Z → e⁺e⁻ rapidity distribution at y_Z = 4 (right plot). We compare the original prior MC900 with the results from the CMC-PDF100 and MCH100 reduced sets
Fig. 24 Comparison of the correlation coefficients computed from the reference Monte Carlo combined set and from the CMC-PDFs with Nrep = 100 replicas. We show here the results for the correlations between the inclusive LHC cross sections, using the settings described in the text. Each plot contains the correlation coefficient of a given cross section with respect to all the other inclusive cross sections considered here. To gauge the effectiveness of the compression algorithm, we also show the 68% confidence-level interval for the correlation coefficients computed from Nrand = 1000 random partitions of Nrep = 100 replicas each
Fig. 25 Same as Fig. 24 for some of the various LHC NLO differential cross sections discussed in the text. From top to bottom and from left to right we show the correlations for low-mass Drell–Yan, forward Drell–Yan, and central and forward jets
Acknowledgments We are grateful to Joey Huston, Jun Gao, and Pavel Nadolsky for many illuminating discussions on the comparison between the CMC-PDF and the Meta-PDF approaches. We are also grateful to all the members of the NNPDF Collaboration for fruitful discussions during the development of this project. The work of S. C. is partially supported by an Italian PRIN2010 grant and by a European Investment Bank EIBURS grant. The work of J. I. L. is funded by grant FIS2010-16185 of the Spanish MINECO. The work of J. R. is supported by an STFC Rutherford Fellowship ST/K005227/1 and by the European Research Council Starting Grant PDF4BSM.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Funded by SCOAP3.
A: The compression code
The numerical implementation of the compression algorithm used in this work is available in the compressor v1.0.0 package. Here we provide instructions about how to download, install, and run the code. The code was designed for Unix systems.
Download
Open a terminal and download the latest release available from

https://github.com/scarrazza/compressor/releases

or clone the master development branch from the GitHub repository:
$ git clone https://github.com/scarrazza/compressor.git
Installation
Compressor requires three external public libraries in order to work properly: LHAPDF6 [49] (http://lhapdf.hepforge.org/), ROOT (http://root.cern.ch/), and GSL (http://www.gnu.org/software/gsl/).
In order to install the package, compile with the configure script:
$ cd compressor
$ ./configure --prefix=/path/to/install/location
$ make && make install
This operation will copy the bin/compressor binary to /usr/local/bin (or to the location given by --prefix).
Running the code
After installing this package, the compressor program is available for the user.
$ compressor --help
usage: ./compressor [REP] [PDF prior name] [energy Q=1] [seed=0] [compress=1]
The first two arguments are required:
– REP: the number of required compressed replicas
– PDF prior name: the name of the prior LHAPDF6 grid
– energy Q: the input energy scale used by the compression algorithm (default = 1 GeV)
– seed: the random number seed (default = 0)
– compress: switches on/off the minimization step (default = true)
Output
After running compressor a folder with the prior set name is created.
$ compressor 100 MyPriorSet
...
$ ls MyPriorSet/
erf_compression.dat          # contains the erf values for the compressed set
erf_random.dat               # contains the erf values for the random set
replica_compression_100.dat  # list of compressed replicas from the prior
The script /bin/compressor_buildgrid creates the compressed LHAPDF6 grid:

$ ./compressor_buildgrid --help
usage: ./compressor_buildgrid [prior set name] [number of compressed replicas]
Finally, in order to generate the ERF plots, place the /bin/compressor_validate.C script in the output folder and run:
$ root -l compressor_validate.C
References
1. S. Forte, Parton distributions at the dawn of the LHC. Acta Phys. Polon. B 41, 2859–2920 (2010). arXiv:1011.5247
2. S. Forte, G. Watt, Progress in the determination of the partonic structure of the proton. Ann. Rev. Nucl. Part. Sci. 63, 291–328 (2013). arXiv:1301.6754
3. E. Perez, E. Rizvi, The quark and gluon structure of the proton. Rep. Prog. Phys. 76, 046201 (2013). arXiv:1208.1178
4. R.D. Ball, S. Carrazza, L. Del Debbio, S. Forte, J. Gao et al., Parton distribution benchmarking with LHC data. JHEP 1304, 125 (2013). arXiv:1211.5142
5. G. Watt, Parton distribution function dependence of benchmark standard model total cross sections at the 7 TeV LHC. JHEP 1109, 069 (2011). arXiv:1106.5788
6. A. De Roeck, R.S. Thorne, Structure functions. Prog. Part. Nucl. Phys. 66, 727–781 (2011). arXiv:1103.0555
7. J. Rojo et al., The PDF4LHC report on PDFs and LHC data: results from Run I and preparation for Run II. arXiv:1507.00556
8. LHC Higgs Cross Section Working Group, S. Heinemeyer, C. Mariotti, G. Passarino, R. Tanaka (eds.), Handbook of LHC Higgs Cross Sections: 3. Higgs Properties, CERN-2013-004 (CERN, Geneva, 2013). arXiv:1307.1347
9. LHeC Study Group Collaboration, J. Abelleira Fernandez et al., A Large Hadron Electron Collider at CERN: report on the physics and design concepts for machine and detector. J. Phys. G 39, 075001 (2012). arXiv:1206.2913
10. C. Borschensky, M. Krämer, A. Kulesza, M. Mangano, S. Padhi et al., Squark and gluino production cross sections in pp collisions at √s = 13, 14, 33 and 100 TeV. Eur. Phys. J. C 74(12), 3174 (2014). arXiv:1407.5066
11. M. Krämer, A. Kulesza, R. van der Leeuw, M. Mangano, S. Padhi et al., Supersymmetry production cross sections in pp collisions at √s = 7 TeV. arXiv:1206.2892
12. G. Bozzi, J. Rojo, A. Vicini, The impact of PDF uncertainties on the measurement of the W boson mass at the Tevatron and the LHC. Phys. Rev. D 83, 113008 (2011). arXiv:1104.2056
13. ATLAS Collaboration, G. Aad et al., Studies of theoretical uncertainties on the measurement of the mass of the W boson at the LHC. Tech. Rep. ATL-PHYS-PUB-2014-015, CERN, Geneva (2014)
14. G. Bozzi, L. Citelli, A. Vicini, PDF uncertainties on the W boson mass measurement from the lepton transverse momentum distribution. Phys. Rev. D 91(11), 113005 (2015). arXiv:1501.05587
15. H1 and ZEUS Collaborations, A. Cooper-Sarkar, PDF fits at HERA. PoS EPS-HEP2011, 320 (2011). arXiv:1112.2107
16. NNPDF Collaboration, R.D. Ball et al., Parton distributions for the LHC Run II. JHEP 1504, 040 (2015). arXiv:1410.8849
17. S. Alekhin, J. Bluemlein, S. Moch, The ABM parton distributions tuned to LHC data. Phys. Rev. D 89, 054028 (2014). arXiv:1310.3059
18. J. Gao, M. Guzzi, J. Huston, H.-L. Lai, Z. Li et al., CT10 next-to-next-to-leading order global analysis of QCD. Phys. Rev. D 89(3), 033009 (2014). arXiv:1302.6246
19. L.A. Harland-Lang, A.D. Martin, P. Motylinski, R.S. Thorne, Parton distributions in the LHC era: MMHT 2014 PDFs. Eur. Phys. J. C 75(5), 204 (2015). arXiv:1412.3989
20. A. Accardi, W. Melnitchouk, J. Owens, M. Christy, C. Keppel et al., Uncertainties in determining parton distributions at large x. Phys. Rev. D 84, 014008 (2011). arXiv:1102.3686
21. P. Jimenez-Delgado, E. Reya, Delineating parton distributions and the strong coupling. Phys. Rev. D 89(7), 074049 (2014). arXiv:1403.1852
22. S. Dulat, T.J. Hou, J. Gao, M. Guzzi, J. Huston, P. Nadolsky, J. Pumplin, C. Schmidt, D. Stump, C.P. Yuan, The CT14 global analysis of quantum chromodynamics. arXiv:1506.07443
23. S. Alekhin, J. Blümlein, P. Jimenez-Delgado, S. Moch, E. Reya, NNLO benchmarks for gauge and Higgs boson production at TeV hadron colliders. Phys. Lett. B 697, 127–135 (2011). arXiv:1011.6259
24. M. Botje et al., The PDF4LHC Working Group interim recommendations. arXiv:1101.0538
25. S. Alekhin, S. Alioli, R.D. Ball, V. Bertone, J. Blümlein et al., The PDF4LHC Working Group interim report. arXiv:1101.0536
26. P.M. Nadolsky et al., Implications of CTEQ global analysis for collider observables. Phys. Rev. D 78, 013004 (2008). arXiv:0802.0007
27. A.D. Martin, W.J. Stirling, R.S. Thorne, G. Watt, Parton distributions for the LHC. Eur. Phys. J. C 63, 189–285 (2009). arXiv:0901.0002
28. The NNPDF Collaboration, R.D. Ball et al., A first unbiased global NLO determination of parton distributions and their uncertainties. Nucl. Phys. B 838, 136–206 (2010). arXiv:1002.4407
29. H.-L. Lai et al., Uncertainty induced by QCD coupling in the CTEQ global analysis of parton distributions. Phys. Rev. D 82, 054021 (2010). arXiv:1004.4624
30. A.D. Martin, W.J. Stirling, R.S. Thorne, G. Watt, Uncertainties on αS in global PDF analyses. Eur. Phys. J. C 64, 653–680 (2009). arXiv:0905.3531
31. F. Demartin, S. Forte, E. Mariani, J. Rojo, A. Vicini, The impact of PDF and αs uncertainties on Higgs production in gluon fusion at hadron colliders. Phys. Rev. D 82, 014002 (2010). arXiv:1004.0962
32. PDF4LHC Steering Committee, PDF4LHC recommendations. http://www.hep.ucl.ac.uk/pdf4lhc/PDF4LHCrecom.pdf
33. PDF4LHC Steering Committee, Procedure for adding PDF and αs uncertainties. http://www.hep.ucl.ac.uk/pdf4lhc/alphasprop.pdf
34. Particle Data Group Collaboration, K. Olive et al., Review of particle physics. Chin. Phys. C 38, 090001 (2014)
35. G. Watt, R.S. Thorne, Study of Monte Carlo approach to experimental uncertainty propagation with MSTW 2008 PDFs. JHEP 1208, 052 (2012). Supplementary material available from http://mstwpdf.hepforge.org/random/. arXiv:1205.4024
36. G. Watt, Talk presented at the PDF4LHC meeting, CERN, Geneva (2013). http://indico.cern.ch/event/244768/contribution/5
37. J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni et al., The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations. JHEP 1407, 079 (2014). arXiv:1405.0301
38. R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, R. Pittau et al., Four-lepton production at hadron colliders: aMC@NLO predictions with theoretical uncertainties. JHEP 1202, 099 (2012). arXiv:1110.4738
39. S. Alioli, P. Nason, C. Oleari, E. Re, A general framework for implementing NLO calculations in shower Monte Carlo programs: the POWHEG BOX. JHEP 1006, 043 (2010). arXiv:1002.2581
40. R. Gavin, Y. Li, F. Petriello, S. Quackenbush, W physics at the LHC with FEWZ 2.1. Comput. Phys. Commun. 184, 208–214 (2013). arXiv:1201.5896
41. J. Gao, P. Nadolsky, A meta-analysis of parton distribution functions. JHEP 1407, 035 (2014). arXiv:1401.0013
42. NNPDF Collaboration, R.D. Ball, V. Bertone, S. Carrazza, C.S. Deans, L. Del Debbio et al., Parton distributions with LHC data. Nucl. Phys. B 867, 244–289 (2013). arXiv:1207.1303
43. J. Pumplin, Parametrization dependence and χ² in parton distribution fitting. Phys. Rev. D 82, 114020 (2010). arXiv:0909.5176
44. S. Carrazza, S. Forte, Z. Kassabov, J.I. Latorre, J. Rojo, An unbiased Hessian representation for Monte Carlo PDFs. Eur. Phys. J. C 75(8), 369 (2015). arXiv:1505.06736
45. J. Rojo, J.I. Latorre, Neural network parametrization of spectral functions from hadronic tau decays and determination of QCD vacuum condensates. JHEP 01, 055 (2004). arXiv:hep-ph/0401047
46. The NNPDF Collaboration, L. Del Debbio, S. Forte, J.I. Latorre, A. Piccione, J. Rojo, Unbiased determination of the proton structure function F2(p) with faithful uncertainty estimation. JHEP 03, 080 (2005). arXiv:hep-ph/0501067
47. The NNPDF Collaboration, R.D. Ball et al., Reweighting NNPDFs: the W lepton asymmetry. Nucl. Phys. B 849, 112–143 (2011). arXiv:1012.0836
48. R.D. Ball, V. Bertone, F. Cerutti, L. Del Debbio, S. Forte et al., Reweighting and unweighting of parton distributions and the LHC W lepton asymmetry data. Nucl. Phys. B 855, 608–638 (2012). arXiv:1108.1758
49. A. Buckley, J. Ferrando, S. Lloyd, K. Nordström, B. Page et al., LHAPDF6: parton density access in the LHC precision era. Eur. Phys. J. C 75(3), 132 (2015). arXiv:1412.7420
50. S. Carrazza, A compression tool for Monte Carlo PDF sets. https://github.com/scarrazza/compressor
51. G.P. Salam, J. Rojo, A higher order perturbative parton evolution toolkit (HOPPET). Comput. Phys. Commun. 180, 120–156 (2009). arXiv:0804.3755
52. R.D. Ball, M. Bonvini, S. Forte, S. Marzani, G. Ridolfi, Higgs production in gluon fusion beyond NNLO. Nucl. Phys. B 874, 746–772 (2013). arXiv:1303.3590
53. M. Czakon, A. Mitov, Top++: a program for the calculation of the top-pair cross-section at hadron colliders. Comput. Phys. Commun. 185, 2930 (2014). arXiv:1112.5675
54. C. Anastasiou, L.J. Dixon, K. Melnikov, F. Petriello, High precision QCD at hadron colliders: electroweak gauge boson rapidity distributions at NNLO. Phys. Rev. D 69, 094008 (2004). arXiv:hep-ph/0312266
55. LHC Higgs Cross Section Working Group, S. Dittmaier, C. Mariotti, G. Passarino, R. Tanaka (eds.), Handbook of LHC Higgs Cross Sections: 2. Differential Distributions, CERN-2012-002 (CERN, Geneva, 2012). arXiv:1201.3084
56. G.H. Hardy, J.E. Littlewood, G. Pólya, Inequalities (Cambridge University Press, London, 1978)
57. T.M. Cover, Elements of Information Theory (Wiley, London, 2006)
58. V. Bertone, S. Carrazza, J. Rojo, APFEL: a PDF evolution library with QED corrections. Comput. Phys. Commun. 185, 1647–1668 (2014). arXiv:1310.1394
59. S. Carrazza, A. Ferrara, D. Palazzo, J. Rojo, APFEL Web: a web-based application for the graphical visualization of parton distribution functions. J. Phys. G 42(5), 057001 (2015). arXiv:1410.5456
60. J.M. Campbell, R.K. Ellis, Radiative corrections to Z b anti-b production. Phys. Rev. D 62, 114012 (2000). arXiv:hep-ph/0006304
61. Z. Nagy, Next-to-leading order calculation of three-jet observables in hadron hadron collision. Phys. Rev. D 68, 094002 (2003). arXiv:hep-ph/0307268
62. T. Carli, D. Clements, A. Cooper-Sarkar, C. Gwenlan, G.P. Salam et al., A posteriori inclusion of parton density functions in NLO QCD final-state calculations at hadron colliders: the APPLGRID project. Eur. Phys. J. C 66, 503–524 (2010). arXiv:0911.2985
63. V. Bertone, R. Frederix, S. Frixione, J. Rojo, M. Sutton, aMCfast: automation of fast NLO computations for PDF fits. JHEP 1408, 166 (2014). arXiv:1406.7693
64. ATLAS Collaboration, G. Aad et al., Measurement of the high-mass Drell–Yan differential cross-section in pp collisions at √s = 7 TeV with the ATLAS detector. Phys. Lett. B 725, 223–242 (2013). arXiv:1305.4192
65. CMS Collaboration, S. Chatrchyan et al., Measurement of the differential and double-differential Drell–Yan cross sections in proton–proton collisions at √s = 7 TeV. JHEP 1312, 030 (2013). arXiv:1310.7291
66. CMS Collaboration, S. Chatrchyan et al., Measurement of the muon charge asymmetry in inclusive pp → WX production at √s = 7 TeV and an improved determination of light parton distribution functions. Phys. Rev. D 90, 032004 (2014). arXiv:1312.6283
67. CMS Collaboration, S. Chatrchyan et al., Measurement of associated W + charm production in pp collisions at √s = 7 TeV. JHEP 1402, 013 (2014). arXiv:1310.1138
68. ATLAS Collaboration, G. Aad et al., Measurement of inclusive jet and dijet production in pp collisions at √s = 7 TeV using the ATLAS detector. Phys. Rev. D 86, 014022 (2012). arXiv:1112.6297
69. J. Butterworth, G. Dissertori, S. Dittmaier, D. de Florian, N. Glover et al., Les Houches 2013: Physics at TeV Colliders: Standard Model Working Group report. arXiv:1405.1067
70. J. Rojo et al., Chapter 22 in: J.R. Andersen et al., The SM and NLO multileg working group: summary report (2010). arXiv:1003.1241