
iDev : Statistical Formulas For Programmers


Here is a list of statistical formulas that programmers can use in day-to-day work:
  1. Formulas For Reporting Averages
    1. Corrected Standard Deviation
    2. Standard Error of the Mean
    3. Confidence Interval Around the Mean
    4. Two-Sample T-Test
  2. Formulas For Reporting Proportions
    1. Confidence Interval of a Bernoulli Parameter
    2. Multinomial Confidence Intervals
    3. Chi-Squared Test
  3. Formulas For Reporting Count Data
    1. Standard Deviation of a Poisson Distribution
    2. Confidence Interval Around the Poisson Parameter
    3. Conditional Test of Two Poisson Parameters
  4. Formulas For Comparing Distributions
    1. Comparing an Empirical Distribution to a Known Distribution
    2. Comparing Two Empirical Distributions
    3. Comparing Three or More Empirical Distributions
  5. Formulas For Drawing a Trend Line
    1. Slope of a Best-Fit Trend Line
    2. Standard Error of the Slope
    3. Confidence Interval Around the Slope

1. Formulas For Reporting Averages

One of the first programming lessons in any language is to compute an average. But rarely does anyone stop to ask: what does the average actually tell us about the underlying data?

1.1 CORRECTED STANDARD DEVIATION

The standard deviation is a single number that reflects how spread out the data actually is. It should be reported alongside the average (unless the user will be confused).
$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}$$
Where:
  • $N$ is the number of observations
  • $x_i$ is the value of the $i$th observation
  • $\bar{x}$ is the average value of the $x_i$
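Here's a minimal Python sketch of the corrected standard deviation (the helper name and sample data are just illustrative; Python's statistics.stdev computes the same quantity):
```python
import math

def corrected_std(xs):
    """Corrected (sample) standard deviation with the N - 1 denominator."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

# Matches statistics.stdev([2, 4, 4, 4, 5, 5, 7, 9])
print(corrected_std([2, 4, 4, 4, 5, 5, 7, 9]))
```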

1.2 STANDARD ERROR OF THE MEAN

From a statistical point of view, the "average" is really just an estimate of an underlying population mean. That estimate has uncertainty that is summarized by the standard error.
$$SE = \frac{s}{\sqrt{N}}$$

1.3 CONFIDENCE INTERVAL AROUND THE MEAN

A confidence interval reflects the set of statistical hypotheses that won't be rejected at a given significance level. So the confidence interval around the mean reflects all possible values of the mean that can't be rejected by the data. It is a multiple of the standard error added to and subtracted from the mean.
$$CI = \bar{x} \pm t_{\alpha/2}\,SE$$
Where:
  • $\alpha$ is the significance level, typically 5% (one minus the confidence level)
  • $t_{\alpha/2}$ is the $1-\alpha/2$ quantile of a t-distribution with $N-1$ degrees of freedom
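Here's a rough sketch of the standard error and confidence interval together, assuming SciPy is available for the t-distribution quantile (the function name and sample numbers are illustrative):
```python
import math
from statistics import mean, stdev
from scipy.stats import t

def mean_confidence_interval(xs, alpha=0.05):
    """Return (mean, lower, upper) for a (1 - alpha) confidence interval."""
    n = len(xs)
    x_bar = mean(xs)
    se = stdev(xs) / math.sqrt(n)             # standard error of the mean
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)   # t quantile with N - 1 degrees of freedom
    return x_bar, x_bar - t_crit * se, x_bar + t_crit * se

print(mean_confidence_interval([12.1, 14.3, 13.8, 12.9, 15.0]))
```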

1.4 TWO-SAMPLE T-TEST

A two-sample t-test can tell whether two groups of observations differ in their mean.
The test statistic is given by:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$
The hypothesis of equal means is rejected if $|t|$ exceeds the $(1-\alpha/2)$ quantile of a t-distribution with degrees of freedom equal to:
$$df = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\left(s_1^2/n_1\right)^2/(n_1-1) + \left(s_2^2/n_2\right)^2/(n_2-1)}$$
Where $\bar{x}_1, \bar{x}_2$ are the two sample means, $s_1^2, s_2^2$ the two sample variances, and $n_1, n_2$ the two sample sizes.
You can see a demonstration of these concepts in Evan's Awesome Two-Sample T-Test.
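Below is a hedged sketch of this (Welch's) two-sample t-test built directly from the two formulas above; SciPy is assumed for the t-distribution tail, and scipy.stats.ttest_ind(equal_var=False) can serve as a cross-check:
```python
import math
from statistics import mean, variance
from scipy.stats import t

def welch_t_test(sample1, sample2):
    """Return (t statistic, degrees of freedom, two-sided p-value)."""
    n1, n2 = len(sample1), len(sample2)
    v1, v2 = variance(sample1), variance(sample2)   # sample variances (N - 1 denominator)
    t_stat = (mean(sample1) - mean(sample2)) / math.sqrt(v1 / n1 + v2 / n2)
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    p_value = 2 * t.sf(abs(t_stat), df)             # two-sided p-value
    return t_stat, df, p_value

# Cross-check: scipy.stats.ttest_ind(sample1, sample2, equal_var=False)
print(welch_t_test([5.1, 4.9, 6.0, 5.4], [6.2, 6.8, 5.9, 7.1]))
```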

2. Formulas For Reporting Proportions

It's common to report the relative proportions of binary outcomes or categorical data, but in general these are meaningless without confidence intervals and tests of independence.

2.1 CONFIDENCE INTERVAL OF A BERNOULLI PARAMETER

A Bernoulli parameter is the proportion underlying a binary-outcome event (for example, the percent of the time a coin comes up heads). The confidence interval is given by:
$$CI = \left.\left(p + \frac{z_{\alpha/2}^2}{2N} \pm z_{\alpha/2}\sqrt{\frac{p(1-p) + z_{\alpha/2}^2/4N}{N}}\right)\right/\left(1 + \frac{z_{\alpha/2}^2}{N}\right)$$
Where:
  • $p$ is the observed proportion of interest
  • $z_{\alpha/2}$ is the $(1-\alpha/2)$ quantile of a normal distribution
This formula can also be used as a sorting criterion.
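Here's a sketch of this (Wilson score) interval in Python, assuming SciPy for the normal quantile; the function name and example counts are illustrative:
```python
import math
from scipy.stats import norm

def wilson_interval(successes, n, alpha=0.05):
    """Confidence interval for a Bernoulli proportion (Wilson score interval)."""
    p = successes / n
    z = norm.ppf(1 - alpha / 2)
    center = p + z * z / (2 * n)
    spread = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    denom = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

# e.g. 8 heads out of 10 flips
print(wilson_interval(8, 10))
```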

2.2 MULTINOMIAL CONFIDENCE INTERVALS

If you have more than two categories, a multinomial confidence interval supplies upper and lower confidence limits on all of the category proportions at once. The formula is nearly identical to the preceding one.
$$CI = \left.\left(p_j + \frac{z_{\alpha/2}^2}{2N} \pm z_{\alpha/2}\sqrt{\frac{p_j(1-p_j) + z_{\alpha/2}^2/4N}{N}}\right)\right/\left(1 + \frac{z_{\alpha/2}^2}{N}\right)$$
Where:
  • $p_j$ is the observed proportion of the $j$th category

2.3 CHI-SQUARED TEST

Pearson's chi-squared test can detect whether the distribution of row counts seems to differ across columns (or vice versa). It is useful when comparing two or more sets of category proportions.
The test statistic, called $X^2$, is computed as:
$$X^2 = \sum_{i=1}^{n}\sum_{j=1}^{m}\frac{\left(O_{i,j} - E_{i,j}\right)^2}{E_{i,j}}$$
Where:
  • $n$ is the number of rows
  • $m$ is the number of columns
  • $O_{i,j}$ is the observed count in row $i$ and column $j$
  • $E_{i,j}$ is the expected count in row $i$ and column $j$
The expected count is given by:
$$E_{i,j} = \frac{\left(\sum_{k=1}^{n} O_{k,j}\right)\left(\sum_{l=1}^{m} O_{i,l}\right)}{N}$$
Where $N$ is the total number of observations. A statistical dependence exists if $X^2$ is greater than the $(1-\alpha)$ quantile of a $\chi^2$ distribution with $(m-1)\times(n-1)$ degrees of freedom.
You can see a 2x2 demonstration of these concepts in Evan's Awesome Chi-Squared Test.
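Here's an illustrative sketch of the test on a small table of counts; SciPy is assumed for the chi-squared quantile, and scipy.stats.chi2_contingency provides the same computation ready-made:
```python
from scipy.stats import chi2

def chi_squared_test(table, alpha=0.05):
    """table is a list of rows of observed counts; returns (X^2, p-value, dependence?)."""
    n_rows, n_cols = len(table), len(table[0])
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    x2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            expected = row_sums[i] * col_sums[j] / total   # E_ij = row total * column total / N
            x2 += (table[i][j] - expected) ** 2 / expected
    df = (n_rows - 1) * (n_cols - 1)
    return x2, chi2.sf(x2, df), x2 > chi2.ppf(1 - alpha, df)

# e.g. a 2x2 table of outcome counts for two groups
print(chi_squared_test([[20, 30], [35, 15]]))
```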

3. Formulas For Reporting Count Data

If the incoming events are independent, their counts are well-described by a Poisson distribution. A Poisson distribution takes a parameter λ, which is the distribution's mean, that is, the average arrival rate of events per unit time.

3.1. STANDARD DEVIATION OF A POISSON DISTRIBUTION

The standard deviation of Poisson data usually doesn't need to be explicitly calculated. Instead it can be inferred from the Poisson parameter:
$$\sigma = \sqrt{\lambda}$$
This fact can be used to read an unlabeled sales chart, for example.

3.2. CONFIDENCE INTERVAL AROUND THE POISSON PARAMETER

The confidence interval around the Poisson parameter represents the set of arrival rates that can't be rejected by the data. It can be inferred from a single data point of $c$ events observed over $t$ time periods with the following formula:
$$CI = \left(\frac{\gamma^{-1}(\alpha/2,\, c)}{t},\; \frac{\gamma^{-1}(1-\alpha/2,\, c+1)}{t}\right)$$
Where:
  • $\gamma^{-1}(p, c)$ is the inverse of the lower incomplete gamma function
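A sketch in Python, assuming the article's $\gamma^{-1}$ corresponds to SciPy's scipy.special.gammaincinv (the inverse of the regularized lower incomplete gamma function):
```python
from scipy.special import gammaincinv

def poisson_rate_interval(c, t, alpha=0.05):
    """Confidence interval for the arrival rate, given c events over t time periods."""
    # With no events observed, the lower bound is 0 by convention.
    lower = gammaincinv(c, alpha / 2) / t if c > 0 else 0.0
    upper = gammaincinv(c + 1, 1 - alpha / 2) / t
    return lower, upper

# e.g. 12 events observed over 4 time periods
print(poisson_rate_interval(12, 4))
```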

3.3. CONDITIONAL TEST OF TWO POISSON PARAMETERS

From a statistical point of view, 5 events is indistinguishable from 7 events. Before reporting in bright red text that one count is greater than another, it's best to perform a test of the two Poisson means.
The p-value is given by:
$$p = 2 \times \frac{c!}{t^c} \times \min\left\{\sum_{i=0}^{c_1}\frac{t_1^i\, t_2^{\,c-i}}{i!\,(c-i)!},\; \sum_{i=c_1}^{c}\frac{t_1^i\, t_2^{\,c-i}}{i!\,(c-i)!}\right\}$$
Where:
  • Observation 1 consists of $c_1$ events over $t_1$ time periods
  • Observation 2 consists of $c_2$ events over $t_2$ time periods
  • $c = c_1 + c_2$ and $t = t_1 + t_2$
You can see a demonstration of these concepts in Evan's Awesome Poisson Means Test.
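Here's a direct, small-count sketch of the p-value formula above (math.factorial keeps it faithful to the expression; large counts would call for a log-space implementation):
```python
import math

def poisson_means_test(c1, t1, c2, t2):
    """Two-sided conditional test of two Poisson rates; returns the p-value."""
    c, t = c1 + c2, t1 + t2

    def tail(lo, hi):
        return sum(t1 ** i * t2 ** (c - i) / (math.factorial(i) * math.factorial(c - i))
                   for i in range(lo, hi + 1))

    p_value = 2 * math.factorial(c) / t ** c * min(tail(0, c1), tail(c1, c))
    return min(p_value, 1.0)   # the doubled tail can slightly exceed 1

# Observation 1: 5 events over 2 time periods; observation 2: 7 events over 2 time periods
print(poisson_means_test(5, 2, 7, 2))
```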

4. Formulas For Comparing Distributions

If you want to test whether groups of observations come from the same (unknown) distribution, or if a single group of observations comes from a known distribution, you'll need a Kolmogorov-Smirnov test. A K-S test will test the entire distribution for equality, not just the distribution mean.

4.1. COMPARING AN EMPIRICAL DISTRIBUTION TO A KNOWN DISTRIBUTION

The simplest version is a one-sample K-S test, which compares a sample of $n$ points having an observed cumulative distribution function $F$ to a known distribution with c.d.f. $G$. The test statistic is:
$$D_n = \sup_x \left|F(x) - G(x)\right|$$
In plain English, $D_n$ is the largest absolute difference between the two c.d.f.s over all values of $x$.
The critical value of $D_n$ is given by $K_\alpha/\sqrt{n}$, where $K_\alpha$ is the value of $x$ that solves:
$$1-\alpha = \frac{\sqrt{2\pi}}{x}\sum_{k=1}^{\infty}\exp\left(-\frac{(2k-1)^2\pi^2}{8x^2}\right)$$
The critical value must be solved for iteratively, e.g. by Newton's method. If only the p-value is needed, it can be computed directly by solving the above equation for $\alpha$.
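Rather than solving the series by hand, a sketch can lean on SciPy: scipy.special.kolmogi inverts the limiting Kolmogorov distribution for the critical value, and scipy.special.kolmogorov gives the asymptotic p-value (scipy.stats.kstest wraps all of this):
```python
import math
from scipy.special import kolmogorov, kolmogi
from scipy.stats import norm

def ks_one_sample(sample, cdf, alpha=0.05):
    """One-sample Kolmogorov-Smirnov test against a known c.d.f."""
    n = len(sample)
    xs = sorted(sample)
    # D_n: largest gap between the empirical step function and the known c.d.f.
    d_n = max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))
    critical = kolmogi(alpha) / math.sqrt(n)        # critical value K_alpha / sqrt(n)
    p_value = kolmogorov(math.sqrt(n) * d_n)        # asymptotic p-value
    return d_n, critical, p_value

# e.g. test a small sample against a standard normal c.d.f.
print(ks_one_sample([0.2, -1.1, 0.7, 1.5, -0.3, 0.9], norm.cdf))
```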

4.2. COMPARING TWO EMPIRICAL DISTRIBUTIONS

The two-sample version is similar, except the test statistic is given by:
$$D_{n_1,n_2} = \sup_x \left|F_1(x) - F_2(x)\right|$$
Where $F_1$ and $F_2$ are the empirical c.d.f.s of the two samples, having $n_1$ and $n_2$ observations, respectively. The critical value of the test statistic is $K_\alpha/\sqrt{n_1 n_2/(n_1+n_2)}$, with the same value of $K_\alpha$ as above.
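A matching two-sample sketch, with the same SciPy helpers assumed (scipy.stats.ks_2samp is the ready-made alternative):
```python
import math
from scipy.special import kolmogorov, kolmogi

def ks_two_sample(sample1, sample2, alpha=0.05):
    """Two-sample Kolmogorov-Smirnov test: returns (D, critical value, approximate p-value)."""
    n1, n2 = len(sample1), len(sample2)

    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    # The supremum is attained at one of the observed data points.
    d = max(abs(ecdf(sample1, x) - ecdf(sample2, x)) for x in sample1 + sample2)
    scale = math.sqrt(n1 * n2 / (n1 + n2))
    return d, kolmogi(alpha) / scale, kolmogorov(scale * d)

print(ks_two_sample([1.2, 0.8, 2.5, 1.9], [2.1, 3.0, 2.8, 1.7, 2.6]))
```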

4.3. COMPARING THREE OR MORE EMPIRICAL DISTRIBUTIONS

A k-sample extension of Kolmogorov-Smirnov was described by J. Kiefer in a 1959 paper. The test statistic is:
$$T = \sup_x \sum_{j=1}^{k} n_j \left(F_j(x) - \bar{F}(x)\right)^2$$
Where $\bar{F}$ is the empirical c.d.f. of the combined samples. The critical value of $T$ is $a^2$, where $a$ solves:
$$1-\alpha = \frac{4}{\Gamma(h/2)\,2^{h/2}\,a^h}\sum_{n=1}^{\infty}\frac{\left(\gamma_{(h-2)/2,\,n}\right)^{h-2}\exp\left[-\gamma_{(h-2)/2,\,n}^2/(2a^2)\right]}{\left[J_{h/2}\left(\gamma_{(h-2)/2,\,n}\right)\right]^2}$$
Where:
  • $h = k - 1$
  • $J_{h/2}$ is a Bessel function of the first kind with order $h/2$
  • $\gamma_{(h-2)/2,\,n}$ is the $n$th zero of $J_{(h-2)/2}$
To compute the critical value, this equation must also be solved iteratively. When $k=2$, the equation reduces to a two-sample Kolmogorov-Smirnov test. The case of $k=4$ can also be reduced to a simpler form, but for other values of $k$, the equation cannot be reduced.

5. Formulas For Drawing a Trend Line

Trend lines (or best-fit lines) can be used to establish a relationship between two variables and predict future values.

5.1. SLOPE OF A BEST-FIT LINE

The slope of a best-fit (least squares) line is:
$$m = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N}(x_i - \bar{x})^2}$$
Where:
  • $\{x_1, \ldots, x_N\}$ is the independent variable with sample mean $\bar{x}$
  • $\{y_1, \ldots, y_N\}$ is the dependent variable with sample mean $\bar{y}$
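A minimal sketch of the slope formula (scipy.stats.linregress reports the same slope, if SciPy is available):
```python
def best_fit_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator

print(best_fit_slope([1, 2, 3, 4, 5], [2.1, 4.3, 5.9, 8.2, 9.8]))
```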

5.2. STANDARD ERROR OF THE SLOPE

The standard error around the estimated slope is:
$$SE = \frac{\sqrt{\sum_{i=1}^{N}\left(y_i - \bar{y} - m(x_i - \bar{x})\right)^2/(N-2)}}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}}$$

5.3. CONFIDENCE INTERVAL AROUND THE SLOPE

The confidence interval is constructed as:
$$CI = m \pm t_{\alpha/2}\,SE$$
Where:
  • $\alpha$ is the significance level, typically 5% (one minus the confidence level)
  • $t_{\alpha/2}$ is the $1-\alpha/2$ quantile of a t-distribution with $N-2$ degrees of freedom
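Here's a combined sketch of the standard error (5.2) and confidence interval (5.3) around the slope, assuming SciPy for the t quantile; names and sample data are illustrative:
```python
import math
from scipy.stats import t

def slope_confidence_interval(xs, ys, alpha=0.05):
    """Return (slope, standard error, confidence interval) for a least-squares fit."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    residual_ss = sum((y - y_bar - m * (x - x_bar)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(residual_ss / (n - 2)) / math.sqrt(sxx)
    t_crit = t.ppf(1 - alpha / 2, df=n - 2)
    return m, se, (m - t_crit * se, m + t_crit * se)

print(slope_confidence_interval([1, 2, 3, 4, 5], [2.1, 4.3, 5.9, 8.2, 9.8]))
```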
Reference: Evan Miller, "Statistical Formulas for Programmers".
Keep Coding :)
Thanks :)
