# HG changeset patch
# User Oleksandr Gavenko
# Date 1478615908 -7200
# Node ID bc521aba85bc254c035ef2b4010d39797430a9bc
# Parent 445ee650a9ba72e5c7e81b6116b6d605b797edef
Frequency wordlists use several statistics.

diff -r 445ee650a9ba -r bc521aba85bc www/HACKING.rst
--- a/www/HACKING.rst Tue Nov 08 16:11:20 2016 +0200
+++ b/www/HACKING.rst Tue Nov 08 16:38:28 2016 +0200
@@ -212,8 +212,35 @@
 Word lists
 ==========
 
+Frequency wordlists use several statistics:
+
+* number of word occurrences in corpus, usually marked by ``F``
+* adjusted number of occurrences per 1.000.000 in corpus, usually marked by
+  ``U``
+* Standard Frequency Index (SFI) is a:
+
+  .. math:: SFI = 40 + 10 * log_10(U)
+
+  === ================
+  SFI Freq
+  === ================
+  90  1 per 10
+  80  1 per 100
+  70  1 per 1000
+  60  1 per 10.000
+  50  1 per 100.000
+  40  1 per 1.000.000
+  30  1 per 10.000.000
+  === ================
+* deviation of word frequency across documents in corpus, usually marked by
+  ``D``
+
+Sorting numerically on first= column::
+
+  $ sort -k 1nr,2 <$IN >$OUT
+
 OANC frequency wordlist
-=======================
+-----------------------
 
 The Open American National Corpus (OANC) is a roughly 15 million word subset
 of the ANC Second Release that is unrestricted in terms of usage and