statistics.rst
author Oleksandr Gavenko <gavenkoa@gmail.com>
Fri, 18 Mar 2016 20:42:18 +0200

============
 Statistics
============
.. contents::
   :local:

.. role:: def
   :class: def

Markov inequality
=================

:def:`Markov inequality`: if :math:`X` is a nonnegative random variable then
:math:`P(X ≥ a) ≤ E[X]/a` for all :math:`a > 0`.
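
As a quick sanity check, the bound can be verified exactly on a small example.
The sketch below assumes :math:`X` is one roll of a fair six-sided die (an
illustrative choice, not part of the notes above); ``markov_check`` is a
hypothetical helper name.

```python
from fractions import Fraction

# Assumed example: X is the outcome of one roll of a fair six-sided die.
outcomes = range(1, 7)
prob = Fraction(1, 6)

def markov_check(a):
    """Return (P(X >= a), E[X]/a) for the fair die."""
    tail = sum(prob for x in outcomes if x >= a)
    mean = sum(x * prob for x in outcomes)
    return tail, mean / a

tail, bound = markov_check(5)
print(tail, bound)  # 1/3 7/10
```

Here the true tail probability is 1/3 while the bound is 7/10, which shows that
the inequality holds but can be loose.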

Chebyshev inequality
====================

:def:`Chebyshev inequality`: if :math:`X` is a random variable with mean
:math:`μ` and variance :math:`σ²` then

.. math::

  P(|X-μ| ≥ c) ≤ σ²/c²

for all :math:`c > 0`.
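
The same kind of exact check works for this bound. A minimal sketch, again
assuming a fair six-sided die as the random variable; ``chebyshev_check`` is a
hypothetical helper name.

```python
from fractions import Fraction

# Assumed example: X is the outcome of one roll of a fair six-sided die.
outcomes = range(1, 7)
prob = Fraction(1, 6)

mu = sum(x * prob for x in outcomes)                # mean, 7/2
var = sum((x - mu) ** 2 * prob for x in outcomes)   # variance, 35/12

def chebyshev_check(c):
    """Return (P(|X - mu| >= c), var / c^2) for the fair die."""
    tail = sum(prob for x in outcomes if abs(x - mu) >= c)
    return tail, var / c ** 2

tail, bound = chebyshev_check(2)
print(tail, bound)  # 1/3 35/48
```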

Central limit theorem
=====================

:def:`Central limit theorem`: let :math:`X_1, ..., X_n, ...` be a sequence of
independent identically distributed random variables with common mean :math:`μ`
and variance :math:`σ²` and let:

.. math::

  Z_n = ((∑_{1≤i≤n} X_i) - n·μ) / (σ·sqrt(n))

Then the CDF of :math:`Z_n` converges to the standard normal CDF:

.. math::

  Φ(z) = 1/sqrt(2·π)·∫_{(-∞;z]} exp(-x²/2) 𝑑x

  lim_{n → ∞} P(Z_n ≤ z) = Φ(z)
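
The convergence can be observed numerically. The sketch below assumes the
:math:`X_i` are fair coin flips (Bernoulli with :math:`p = 0.5`), so that
:math:`P(Z_n ≤ z)` can be computed exactly from the binomial distribution and
compared with :math:`Φ(z)`. The names ``phi`` and ``cdf_z_n`` are hypothetical.

```python
import math

def phi(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cdf_z_n(n, z, p=0.5):
    """Exact P(Z_n <= z) when the X_i are Bernoulli(p) (assumed example)."""
    mu = p
    sigma = math.sqrt(p * (1.0 - p))
    # Z_n <= z is the same event as sum(X_i) <= n*mu + z*sigma*sqrt(n).
    k_max = math.floor(n * mu + z * sigma * math.sqrt(n))
    k_max = min(max(k_max, -1), n)
    return sum(math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(0, k_max + 1))

# The gap between the exact CDF and phi(z) shrinks as n grows.
for n in (10, 100, 1000):
    print(n, round(cdf_z_n(n, 1.0), 4), round(phi(1.0), 4))
```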

Null hypothesis
===============

:def:`Null hypothesis` is a statement that the phenomenon being studied produces
no effect or makes no difference, that is, the assumption that the apparent
effect is actually due to chance.

p-value
=======

:def:`p-value` is the probability, assuming the null hypothesis is true, of
observing an effect at least as extreme as the apparent effect.

  https://en.wikipedia.org/wiki/P-value
    Wikipedia page

Significance level
==================

If the p-value is less than or equal to the chosen :def:`significance level`
(:math:`α`), the test suggests that the observed data are inconsistent with the
null hypothesis, so the null hypothesis should be rejected.
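
To make the decision rule concrete, here is a minimal sketch for an assumed
experiment: a coin is flipped 20 times and shows 15 heads, and the null
hypothesis is that the coin is fair. The function name ``two_sided_p_value``
and the numbers are illustrative only.

```python
from math import comb

def two_sided_p_value(k, n):
    """Exact p-value for k heads in n flips under H0: the coin is fair.

    Sums the probabilities of all outcomes at least as far from n/2 as k.
    """
    observed = abs(k - n / 2)
    extreme = sum(comb(n, j) for j in range(n + 1)
                  if abs(j - n / 2) >= observed)
    return extreme / 2 ** n

alpha = 0.05                         # chosen significance level
p_value = two_sided_p_value(15, 20)
print(round(p_value, 4))             # 0.0414
print("reject H0" if p_value <= alpha else "fail to reject H0")
```

With these numbers the p-value is below :math:`α = 0.05`, so the null
hypothesis is rejected; with 14 heads it would not be.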

Hypothesis testing
==================

:def:`Hypothesis testing` is the process of interpreting the statistical
significance of a result with respect to a given null hypothesis, based on the
p-value observed on a sample and the chosen significance level.

After hypothesis testing is finished we either reject the null hypothesis or
fail to reject it due to lack of enough evidence or ...

Hypothesis testing only takes into account:

* that effect might be due to chance; that is, the difference might appear in a
  random sample, but not in the general population

But it doesn't cover the cases:

* The effect might be real; that is, a similar difference would be seen in the
  general population.
* The apparent effect might be due to a biased sampling process, so it would not
  appear in the general population.
* The apparent effect might be due to measurement errors.

Asymptotic approximation
========================

The CLT says that the distribution of the sample mean is approximated by a
normal distribution.

With a large enough number of samples the approximation is quite good.

So during hypothesis testing the researcher usually assumes that it is safe to
replace the unknown distribution of the mean of independent and identically
distributed samples with this approximation.

For a really small number of samples the Student distribution is used instead
of the normal distribution. But again it means that the researcher made an
assumption, and you may not agree with it, so it is your right to reject any
subsequent decision based on a "wrong" assumption.
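
How much the approximation matters can be seen by comparing an exact p-value
with its normal approximation. The sketch below assumes a fair-coin null
hypothesis and uses a plain normal approximation without continuity correction;
it does not implement the Student distribution. The function names are
hypothetical.

```python
import math
from math import comb

def exact_p_value(k, n):
    """Exact two-sided p-value for k heads in n flips of a fair coin."""
    observed = abs(k - n / 2)
    extreme = sum(comb(n, j) for j in range(n + 1)
                  if abs(j - n / 2) >= observed)
    return extreme / 2 ** n

def normal_p_value(k, n):
    """Two-sided p-value from the CLT approximation (no continuity correction)."""
    # Under H0 the number of heads has mean n/2 and variance n/4.
    z = abs(k - n / 2) / math.sqrt(n / 4)
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

# Small sample versus large sample, same relative question.
for k, n in ((15, 20), (550, 1000)):
    print(n, round(exact_p_value(k, n), 4), round(normal_p_value(k, n), 4))
```

For 20 flips the two p-values differ noticeably, while for 1000 flips they are
close, which is the sense in which the assumption becomes safer with more
samples.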

Type I error
============

:def:`Type I error` is the incorrect rejection of a true null hypothesis (a
*false positive*).

The type I error rate is at most :math:`α` (the significance level).

The p-value of a test is the smallest significance level at which the null
hypothesis would be rejected, so rejecting whenever the p-value is at most
:math:`α` keeps the false positive risk at most :math:`α`.
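
The bound on the type I error rate can be computed exactly for a small test.
A minimal sketch, assuming 20 flips of a coin, the null hypothesis that the
coin is fair, and the rule "reject when the two-sided p-value is at most
:math:`α`"; the function names are hypothetical.

```python
from math import comb

def two_sided_p_value(k, n):
    """Exact two-sided p-value for k heads in n flips of a fair coin."""
    observed = abs(k - n / 2)
    extreme = sum(comb(n, j) for j in range(n + 1)
                  if abs(j - n / 2) >= observed)
    return extreme / 2 ** n

def type_i_error_rate(n, alpha):
    """P(reject H0 | H0 is true) for the rule 'reject when p-value <= alpha'."""
    rejected = [k for k in range(n + 1) if two_sided_p_value(k, n) <= alpha]
    return sum(comb(n, k) for k in rejected) / 2 ** n

rate = type_i_error_rate(20, 0.05)
print(round(rate, 4))  # 0.0414
```

Because the number of heads is discrete, the actual rate (about 0.041) is
strictly below the nominal :math:`α = 0.05` here.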

Type II error
=============

:def:`Type II error` is failing to reject a false null hypothesis (a *false
negative*).

The probability of a type II error is usually denoted :math:`β`.

Power
=====

:def:`Power` is the probability of rejecting the null hypothesis when it is
false. So the power is :math:`1-β`.
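
Power depends on what the truth actually is, so it has to be computed for a
specific alternative. The sketch below assumes a 20-flip coin test with the
null hypothesis "fair coin" and a true heads probability of 0.7; both numbers
and the helper names are illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k heads in n flips) when the heads probability is p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def power(n, true_p, rejection_region):
    """P(reject H0 | true heads probability is true_p)."""
    return sum(binom_pmf(k, n, true_p) for k in rejection_region)

# Assumed test: 20 flips, H0 "fair coin", reject when heads <= 5 or heads >= 15
# (the exact two-sided rejection region at alpha = 0.05).
region = [k for k in range(21) if k <= 5 or k >= 15]

pw = power(20, 0.7, region)   # power against the alternative p = 0.7
beta = 1 - pw                 # type II error probability for that alternative
print(round(pw, 3), round(beta, 3))
```

Evaluating ``power(20, 0.5, region)`` instead gives the type I error rate of
the same test, since in that case the null hypothesis is true.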

Confidence interval
===================

:def:`Confidence interval`

  https://en.wikipedia.org/wiki/Confidence_interval
    Wikipedia page


Question
========

What to do with the null hypothesis in classical inference?

I successfully shirked stat classes 10 years ago (reading the night before
actually helped me to pass the exam), and now that I am taking several Coursera
stat classes I have difficulty understanding the **null hypothesis**. Somehow,
with unclear intuition, I passed the quizzes, but I want to understand the
subject.

Suppose we have a population and sample some data from it. A reasonable
question: does some property of the sample give evidence that the same property
holds for the population?

A statistic is a real number that can be derived from a population or a sample.
The classical example is the mean value.

We ask whether it is statistically significant that the statistic of the
population is close to the statistic of the sample.