Frequency wordlists use several statistics.
--- a/www/HACKING.rst Tue Nov 08 16:11:20 2016 +0200
+++ b/www/HACKING.rst Tue Nov 08 16:38:28 2016 +0200
@@ -212,8 +212,35 @@
Word lists
==========
+Frequency wordlists use several statistics:
+
+* number of word occurrences in the corpus, usually marked by ``F``
+* adjusted number of occurrences per 1,000,000 words in the corpus, usually
+  marked by ``U``
+* Standard Frequency Index (SFI), a logarithmic rescaling of ``U``:
+
+  .. math:: SFI = 40 + 10 \log_{10}(U)
+
+  === ================
+  SFI Freq
+  === ================
+  90  1 per 10
+  80  1 per 100
+  70  1 per 1,000
+  60  1 per 10,000
+  50  1 per 100,000
+  40  1 per 1,000,000
+  30  1 per 10,000,000
+  === ================
+* dispersion of the word's frequency across documents in the corpus, usually
+  marked by ``D``
+
+Sorting numerically on the first column, in descending order::
+
+ $ sort -k 1nr,2 <$IN >$OUT
+
OANC frequency wordlist
-=======================
+-----------------------
The Open American National Corpus (OANC) is a roughly 15 million word subset of
the ANC Second Release that is unrestricted in terms of usage and