


Sums of normal random variables











Consider a sample of n independent normal random variables. I would like to find a systematic way of calculating the probability that the sum of a subset of them is larger than the sum of the rest.
An example case:
Population of fish. Mean: 10 kg, standard deviation: 3 kg.
I catch five fish (n=5). What is the probability that two of the fish weigh more than the other three together?
One approach is to calculate the probability for every combination of fish and then use the inclusion-exclusion formula for their union. Is there anything smarter?
Note: if four fish were considered, the probability of having two of them heavier than the other two should be one. How could this be computed immediately?
Thanks for the answers.







Tags: normal-distribution, independence

Asked yesterday by Manos; edited yesterday by Tim







  • You could certainly do simulation. – Peter Flom, yesterday

  • @Peter You might be interested to know there is an easily calculated theoretical answer, then. – whuber, yesterday

  • @whuber - You give a great answer assuming that we have a specific two in mind (or randomly choose two). My initial pass at reading thought it was asking about if there were any subsets of 2 such that the sum was greater than the remaining (as evidenced by their claim that if there were 4 fish then the probability would be 1) in which case we would want to look at the distribution of the biggest two vs the distribution of the remaining and would have to dive into the order statistics. Simulation suggests in this situation the probability is roughly .464. – Dason, yesterday

  • @Dason Thank you for pointing that out: it is a very plausible interpretation and one I had not conceived of. It also explains why Peter was suggesting simulation, because that's a much trickier problem. I think you're correct about order statistics, because we can reframe the problem as asking "what is the chance that the sum of the $k$ largest of $n$ values exceeds the sum of the $n-k$ smallest ones?" Although we can write down the value as an integral, in general it requires numerical evaluation and rapidly gets onerous as $n$ grows. – whuber, yesterday

  • @whuber - agreed. Definitely a much harder problem to get a direct solution for and for cases with larger samples an approximate calculation via simulation might be the most reasonable approach. – Dason, yesterday
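The other reading discussed in these comments, where the $k$ largest values are compared with the remaining $n-k$, can be explored with a short simulation. This is only a sketch: the seed and simulation size are arbitrary choices, and it is not the code that produced the .464 quoted above.

```r
n <- 5; k <- 2; mu <- 10; sigma <- 3
n.sim <- 1e5                                   # Arbitrary simulation size
set.seed(17)                                   # Arbitrary seed, for reproducibility

x <- matrix(rnorm(n * n.sim, mu, sigma), ncol = n)
x <- t(apply(x, 1, sort, decreasing = TRUE))   # Sort each row, largest first
top <- rowSums(x[, 1:k, drop = FALSE])         # Sum of the k largest values
rest <- rowSums(x[, -(1:k), drop = FALSE])     # Sum of the n-k smallest values
p.hat <- mean(top >= rest)
se <- sqrt(p.hat * (1 - p.hat) / n.sim)        # Binomial standard error
c(Estimate = p.hat, SE = se)
```

With `n <- 4` and `k <- 2` the comparison `top >= rest` holds in every row, because the two largest of four values always sum to at least as much as the two smallest. That is the probability of one anticipated in the question.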












1 Answer
$begingroup$

Your example suggests that not only are the $n$ variables $X_1,X_2,\ldots,X_n$ independent, they also have the same Normal distribution. Let its parameters be $\mu$ (the mean) and $\sigma^2$ (the variance) and suppose the subset consists of $k$ of these variables. We might as well index the variables so that $X_1,\ldots, X_k$ are this subset.



The question asks to compute the chance that the sum of the first $k$ variables equals or exceeds the sum of the rest:



$$p_{n,k}(\mu,\sigma) = \Pr(X_1+\cdots+X_k \ge X_{k+1}+\cdots+X_n) = \Pr(Y \le 0)$$



where



$$Y = -(X_1+\cdots+X_k) + (X_{k+1}+\cdots+X_n).$$



$Y$ is a linear combination of independent Normal variables and therefore has a Normal distribution--but which one? The laws of expectation and variance immediately tell us



$$E[Y] = -k\mu + (n-k)\mu = (n-2k)\mu$$



and



$$\operatorname{Var}(Y) = k\sigma^2 + (n-k)\sigma^2 = n\sigma^2.$$



Therefore $$Z=\frac{Y - (n-2k)\mu}{\sigma\sqrt{n}}$$ has a standard Normal distribution with distribution function $\Phi,$ whence the answer is

$$p_{n,k}(\mu,\sigma) = \Pr(Y \le 0) = \Pr\left(Z \le -\frac{(n-2k)\mu}{\sigma\sqrt{n}}\right) = \Phi\left(-\frac{(n-2k)\mu}{\sigma\sqrt{n}}\right).$$
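As a minimal sketch, this closed form can be wrapped in a small R function. The name `p.subset` is hypothetical (it does not appear in the original answer); `pnorm` is base R's standard Normal distribution function $\Phi$.

```r
# Hypothetical helper: chance that a fixed subset of k out of n iid
# Normal(mu, sigma^2) variables has a sum at least as large as the rest.
p.subset <- function(n, k, mu, sigma) {
  stopifnot(k >= 1, k < n, sigma > 0)
  pnorm(-(n - 2*k) * mu / (sigma * sqrt(n)))
}

p.subset(5, 2, 10, 3)  # The fish example from the question
p.subset(4, 2, 10, 3)  # n = 2k: the argument of pnorm is 0, so this is 1/2
```

When $n=2k$ the mean of $Y$ is zero, so for a subset fixed in advance the probability is exactly $1/2$ whatever $\mu$ and $\sigma$ are.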




In the question, $n=5,k=2,\mu=10,$ and $\sigma=3,$ whence

$$p_{5,2}(10,3) = \Phi\left(-\frac{(5-2(2))10}{3\sqrt{5}}\right)\approx 0.0680186.$$
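Spelling out the arithmetic for this case, using only the expressions for $E[Y]$ and $\operatorname{Var}(Y)$ derived above:

$$E[Y] = (5-2\cdot 2)\cdot 10 = 10,\qquad \operatorname{Var}(Y) = 5\cdot 3^2 = 45,\qquad \sqrt{45} = 3\sqrt{5}\approx 6.708,$$

$$\frac{0-E[Y]}{\sqrt{\operatorname{Var}(Y)}} = -\frac{10}{3\sqrt{5}}\approx -1.491.$$

The probability is the value of $\Phi$ at this standardized point.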




Generalization



Little needs to change in this analysis even when the $X_i$ have different normal distributions or are even correlated: you only need to assume they have an $n$-variate Normal distribution to assure their linear combination still has a Normal distribution. The calculations are carried out in the same way and result in a similar formula.
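A sketch of that generalization in R, under the assumption that the $X_i$ are jointly Normal with mean vector `m` and covariance matrix `S`. The function name `p.subset.mvn` and the numbers in the second example are illustrative only and do not come from the original answer. Writing $Y = a'X$ with $a_i=-1$ for the subset and $a_i=+1$ for the rest gives $E[Y]=a'm$ and $\operatorname{Var}(Y)=a'Sa$, so the probability is again a single call to `pnorm`.

```r
# Hypothetical helper: Pr(sum of X[subset] >= sum of the rest) when X is
# multivariate Normal with mean vector m and covariance matrix S.
p.subset.mvn <- function(m, S, subset) {
  a <- rep(1, length(m))
  a[subset] <- -1                    # Y = -(subset sum) + (sum of the rest)
  mean.Y <- sum(a * m)               # E[Y] = a'm
  var.Y <- drop(t(a) %*% S %*% a)    # Var(Y) = a'Sa, assumed positive
  pnorm(-mean.Y / sqrt(var.Y))       # Pr(Y <= 0)
}

# With equal means, equal variances and no correlation this reduces to the
# formula for the iid case:
p.subset.mvn(rep(10, 5), diag(3^2, 5), 1:2)

# Made-up unequal means and a common correlation of 0.3, for illustration:
S <- 3^2 * (0.3 + 0.7 * diag(5))
p.subset.mvn(c(12, 11, 10, 9, 8), S, 1:2)
```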




Check



A commenter suggested solving this with simulation. Although that wouldn't be a solution, it's a decent way to check a solution quickly. Thus, in R we might establish the inputs of the simulation in some arbitrary way as



n <- 5
k <- 2
mu <- 10
sigma <- 3
n.sim <- 1e6 # Simulation size
set.seed(17) # For reproducible results


and simulate such data and compare the sums with these two lines:



x <- matrix(rnorm(n*n.sim, mu, sigma), ncol=n) # One simulated sample per row
p.hat <- mean(rowSums(x[, 1:k, drop=FALSE]) >= rowSums(x[, -(1:k), drop=FALSE])) # drop=FALSE keeps k=1 or n-k=1 working


The post-processing consists of finding the fraction of simulated datasets in which one sum exceeds the other and comparing that to the theoretical solution:



se <- sqrt(p.hat * (1-p.hat) / n.sim)
p <- pnorm(-(n-2*k)*mu / (sigma * sqrt(n)))
signif(c(Simulation=p.hat, Theory=p, `Z-score`=(p.hat-p)/se), 3)


The output in this case is




Simulation     Theory    Z-score
    0.0677     0.0680    -1.1900



The agreement is close and the small absolute z-score allows us to attribute the discrepancy to random fluctuations rather than any error in the theoretical derivation.






$endgroup$












  • We can also assume without loss of generality that $\sigma=1$; intuitively, we can calculate everything in terms of $\frac{\mu}{\sigma}$. – Acccumulation, yesterday

  • @Acccumulation That's correct and it's a good way to proceed. Indeed, this fact follows immediately from observing that one can arbitrarily set the unit of measurement so that $\sigma=1$ without changing the problem. I found it convenient not to have to explain this because it didn't appreciably simplify the analysis. – whuber, yesterday
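A quick numerical illustration of this point, as a sketch: the helper `p.subset` is a hypothetical wrapper around the formula in the answer, and the rescaling factors are arbitrary.

```r
# Hypothetical wrapper around the closed form Phi(-(n-2k) mu / (sigma sqrt(n)))
p.subset <- function(n, k, mu, sigma) pnorm(-(n - 2*k) * mu / (sigma * sqrt(n)))

# Changing the unit of measurement multiplies mu and sigma by the same factor,
# which leaves mu/sigma, and hence the probability, unchanged.
p.kg <- p.subset(5, 2, 10, 3)          # Weights in kilograms
p.g <- p.subset(5, 2, 10000, 3000)     # The same weights in grams
p.unit <- p.subset(5, 2, 10/3, 1)      # Units chosen so that sigma = 1
all.equal(c(p.kg, p.g), c(p.unit, p.unit))
```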











Your Answer








StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "65"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f402405%2fsums-of-normal-random-variables%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









7












$begingroup$

Your example suggests that not only are the $n$ variables $X_1,X_2,ldots,X_n$ independent, they also have the same Normal distribution. Let its parameters be $mu$ (the mean) and $sigma^2$ (the variance) and suppose the subset consists of $k$ of these variables. We might as well index the variables so that $X_1,ldots, X_k$ are this subset.



The question asks to compute the chance that the sum of the first $k$ variables equals or exceeds the sum of the rest:



$$p_n,k(mu,sigma) = Pr(X_1+cdots+X_k ge X_k+1+cdots+X_n ) = Pr(Y le 0)$$



where



$$Y = -(X_1+cdots+X_k) + (X_k+1+cdots+X_n).$$



$Y$ is a linear combination of independent Normal variables and therefore has a Normal distribution--but which one? The laws of expectation and variance immediately tell us



$$E[Y] = -kmu + (n-k)mu = (n-2k)mu$$



and



$$operatornameVar(Y) = k sigma^2 + (n-k)sigma^2 = nsigma^2.$$



Therefore $$Z=fracY - (n-2k)musigmasqrtn$$ has a standard Normal distribution with distribution function $Phi,$ whence the answer is




$$p_n,k(mu,sigma) = Pr(Y le 0) = Prleft(Z le -frac(n-2k)musigmasqrtnright) = Phileft(-frac(n-2k)musigmasqrtnright).$$




In the question, $n=5,k=2,mu=10,$ and $sigma=3,$ whence



$$p_5,2(10,3) = Phileft(-frac(5-2(2))103sqrt10right)approx 0.0680186.$$




Generalization



Little needs to change in this analysis even when the $X_i$ have different normal distributions or are even correlated: you only need to assume they have an $n$-variate Normal distribution to assure their linear combination still has a Normal distribution. The calculations are carried out in the same way and result in a similar formula.




Check



A commenter suggested solving this with simulation. Although that wouldn't be a solution, it's a decent way to check a solution quickly. Thus, in R we might establish the inputs of the simulation in some arbitrary way as



n <- 5
k <- 2
mu <- 10
sigma <- 3
n.sim <- 1e6 # Simulation size
set.seed(17) # For reproducible results


and simulate such data and compare the sums with these two lines:



x <- matrix(rnorm(n*n.sim, mu, sigma), ncol=n)
p.hat <- mean(rowSums(x[, 1:k]) >= rowSums(x[, -(1:k)]))


The post-processing consists of finding the fraction of simulated datasets in which one sum exceeds the other and comparing that to the theoretical solution:



se <- sqrt(p.hat * (1-p.hat) / n.sim)
p <- pnorm(-(n-2*k)*mu / (sigma * sqrt(n)))
signif(c(Simulation=p.hat, Theory=p, `Z-score`=(p.hat-p)/se), 3)


The output in this case is




Simulation Theory Z-score 
0.0677 0.0680 -1.1900



The agreement is close and the small absolute z-score allows us to attribute the discrepancy to random fluctuations rather than any error in the theoretical derivation.






share|cite|improve this answer











$endgroup$












  • $begingroup$
    We can also assume without loss of generality that $sigma=1$; intuitively, we can calculate everything in terms of $frac musigma$
    $endgroup$
    – Acccumulation
    yesterday










  • $begingroup$
    @Acccumulation That's correct and it's a good way to proceed. Indeed, this fact follows immediately from observing that one can arbitrarily set the unit of measurement so that $sigma=1$ without changing the problem. I found it convenient not to have to explain this because it didn't appreciably simplify the analysis.
    $endgroup$
    – whuber
    yesterday















7












$begingroup$

Your example suggests that not only are the $n$ variables $X_1,X_2,ldots,X_n$ independent, they also have the same Normal distribution. Let its parameters be $mu$ (the mean) and $sigma^2$ (the variance) and suppose the subset consists of $k$ of these variables. We might as well index the variables so that $X_1,ldots, X_k$ are this subset.



The question asks to compute the chance that the sum of the first $k$ variables equals or exceeds the sum of the rest:



$$p_n,k(mu,sigma) = Pr(X_1+cdots+X_k ge X_k+1+cdots+X_n ) = Pr(Y le 0)$$



where



$$Y = -(X_1+cdots+X_k) + (X_k+1+cdots+X_n).$$



$Y$ is a linear combination of independent Normal variables and therefore has a Normal distribution--but which one? The laws of expectation and variance immediately tell us



$$E[Y] = -kmu + (n-k)mu = (n-2k)mu$$



and



$$operatornameVar(Y) = k sigma^2 + (n-k)sigma^2 = nsigma^2.$$



Therefore $$Z=fracY - (n-2k)musigmasqrtn$$ has a standard Normal distribution with distribution function $Phi,$ whence the answer is




$$p_n,k(mu,sigma) = Pr(Y le 0) = Prleft(Z le -frac(n-2k)musigmasqrtnright) = Phileft(-frac(n-2k)musigmasqrtnright).$$




In the question, $n=5,k=2,mu=10,$ and $sigma=3,$ whence



$$p_5,2(10,3) = Phileft(-frac(5-2(2))103sqrt10right)approx 0.0680186.$$




Generalization



Little needs to change in this analysis even when the $X_i$ have different normal distributions or are even correlated: you only need to assume they have an $n$-variate Normal distribution to assure their linear combination still has a Normal distribution. The calculations are carried out in the same way and result in a similar formula.




Check



A commenter suggested solving this with simulation. Although that wouldn't be a solution, it's a decent way to check a solution quickly. Thus, in R we might establish the inputs of the simulation in some arbitrary way as



n <- 5
k <- 2
mu <- 10
sigma <- 3
n.sim <- 1e6 # Simulation size
set.seed(17) # For reproducible results


and simulate such data and compare the sums with these two lines:



x <- matrix(rnorm(n*n.sim, mu, sigma), ncol=n)
p.hat <- mean(rowSums(x[, 1:k]) >= rowSums(x[, -(1:k)]))


The post-processing consists of finding the fraction of simulated datasets in which one sum exceeds the other and comparing that to the theoretical solution:



se <- sqrt(p.hat * (1-p.hat) / n.sim)
p <- pnorm(-(n-2*k)*mu / (sigma * sqrt(n)))
signif(c(Simulation=p.hat, Theory=p, `Z-score`=(p.hat-p)/se), 3)


The output in this case is




Simulation Theory Z-score 
0.0677 0.0680 -1.1900



The agreement is close and the small absolute z-score allows us to attribute the discrepancy to random fluctuations rather than any error in the theoretical derivation.






share|cite|improve this answer











$endgroup$












  • $begingroup$
    We can also assume without loss of generality that $sigma=1$; intuitively, we can calculate everything in terms of $frac musigma$
    $endgroup$
    – Acccumulation
    yesterday










  • $begingroup$
    @Acccumulation That's correct and it's a good way to proceed. Indeed, this fact follows immediately from observing that one can arbitrarily set the unit of measurement so that $sigma=1$ without changing the problem. I found it convenient not to have to explain this because it didn't appreciably simplify the analysis.
    $endgroup$
    – whuber
    yesterday













7












7








7





$begingroup$

Your example suggests that not only are the $n$ variables $X_1,X_2,ldots,X_n$ independent, they also have the same Normal distribution. Let its parameters be $mu$ (the mean) and $sigma^2$ (the variance) and suppose the subset consists of $k$ of these variables. We might as well index the variables so that $X_1,ldots, X_k$ are this subset.



The question asks to compute the chance that the sum of the first $k$ variables equals or exceeds the sum of the rest:



$$p_n,k(mu,sigma) = Pr(X_1+cdots+X_k ge X_k+1+cdots+X_n ) = Pr(Y le 0)$$



where



$$Y = -(X_1+cdots+X_k) + (X_k+1+cdots+X_n).$$



$Y$ is a linear combination of independent Normal variables and therefore has a Normal distribution--but which one? The laws of expectation and variance immediately tell us



$$E[Y] = -kmu + (n-k)mu = (n-2k)mu$$



and



$$operatornameVar(Y) = k sigma^2 + (n-k)sigma^2 = nsigma^2.$$



Therefore $$Z=fracY - (n-2k)musigmasqrtn$$ has a standard Normal distribution with distribution function $Phi,$ whence the answer is




$$p_n,k(mu,sigma) = Pr(Y le 0) = Prleft(Z le -frac(n-2k)musigmasqrtnright) = Phileft(-frac(n-2k)musigmasqrtnright).$$




In the question, $n=5,k=2,mu=10,$ and $sigma=3,$ whence



$$p_5,2(10,3) = Phileft(-frac(5-2(2))103sqrt10right)approx 0.0680186.$$




Generalization



Little needs to change in this analysis even when the $X_i$ have different normal distributions or are even correlated: you only need to assume they have an $n$-variate Normal distribution to assure their linear combination still has a Normal distribution. The calculations are carried out in the same way and result in a similar formula.




Check



A commenter suggested solving this with simulation. Although that wouldn't be a solution, it's a decent way to check a solution quickly. Thus, in R we might establish the inputs of the simulation in some arbitrary way as



n <- 5
k <- 2
mu <- 10
sigma <- 3
n.sim <- 1e6 # Simulation size
set.seed(17) # For reproducible results


and simulate such data and compare the sums with these two lines:



x <- matrix(rnorm(n*n.sim, mu, sigma), ncol=n)
p.hat <- mean(rowSums(x[, 1:k]) >= rowSums(x[, -(1:k)]))


The post-processing consists of finding the fraction of simulated datasets in which one sum exceeds the other and comparing that to the theoretical solution:



se <- sqrt(p.hat * (1-p.hat) / n.sim)
p <- pnorm(-(n-2*k)*mu / (sigma * sqrt(n)))
signif(c(Simulation=p.hat, Theory=p, `Z-score`=(p.hat-p)/se), 3)


The output in this case is




Simulation Theory Z-score 
0.0677 0.0680 -1.1900



The agreement is close and the small absolute z-score allows us to attribute the discrepancy to random fluctuations rather than any error in the theoretical derivation.

















$endgroup$











edited yesterday

























answered yesterday









whuber












  • We can also assume without loss of generality that $\sigma=1$; intuitively, we can calculate everything in terms of $\frac{\mu}{\sigma}$.
    – Acccumulation
    yesterday










  • @Acccumulation That's correct and it's a good way to proceed. Indeed, this fact follows immediately from observing that one can arbitrarily set the unit of measurement so that $\sigma=1$ without changing the problem. I found it convenient not to have to explain this because it didn't appreciably simplify the analysis.
    – whuber
    yesterday
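The rescaling described in these comments can be checked numerically. This is only a sketch: the helper name `p_ratio` is hypothetical, and it simply rewrites the answer's formula so that it depends on $\mu$ and $\sigma$ through their ratio alone.

```r
# Hypothetical helper: the answer's formula written in terms of mu / sigma only.
p_ratio <- function(n, k, ratio) pnorm(-(n - 2 * k) * ratio / sqrt(n))

p_original <- pnorm(-(5 - 2 * 2) * 10 / (3 * sqrt(5)))  # mu = 10, sigma = 3
p_rescaled <- p_ratio(5, 2, 10 / 3)                      # sigma = 1, mu = 10/3

print(all.equal(p_original, p_rescaled))
```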

































