www/HACKING.rst
changeset 636 bc521aba85bc
parent 635 445ee650a9ba
child 642 c1032aea6265
--- a/www/HACKING.rst	Tue Nov 08 16:11:20 2016 +0200
+++ b/www/HACKING.rst	Tue Nov 08 16:38:28 2016 +0200
@@ -212,8 +212,35 @@
 Word lists
 ==========
 
+Frequency wordlists use several statistics:
+
+* number of word occurrences in the corpus, usually marked by ``F``
+* adjusted number of occurrences per 1,000,000 words in the corpus, usually
+  marked by ``U``
+* Standard Frequency Index (SFI), a logarithmic rescaling of ``U``:
+
+  .. math:: SFI = 40 + 10 \log_{10}(U)
+
+  ===  ================
+  SFI       Freq
+  ===  ================
+  90   1 per 10
+  80   1 per 100
+  70   1 per 1000
+  60   1 per 10,000
+  50   1 per 100,000
+  40   1 per 1,000,000
+  30   1 per 10,000,000
+  ===  ================
+* dispersion of word frequency across documents in the corpus, usually marked
+  by ``D``
+
+Sorting numerically, in descending order, on the first column::
+
+  $ sort -k 1,1nr <"$IN" >"$OUT"
+
 OANC frequency wordlist
-=======================
+-----------------------
 
 The Open American National Corpus (OANC) is a roughly 15 million word subset of
 the ANC Second Release that is unrestricted in terms of usage and