www/HACKING.rst
changeset 636 bc521aba85bc
parent 635 445ee650a9ba
child 642 c1032aea6265
635:445ee650a9ba 636:bc521aba85bc
   210                 DEBII — Dictionary Editor and Browser
   210                 DEBII — Dictionary Editor and Browser
   211 
   211 
   212 Word lists
   212 Word lists
   213 ==========
   213 ==========
   214 
   214 
       
   215 Frequency wordlists use several statistics:
       
   216 
       
   217 * number of word occurrences in corpus, usually marked by ``F``
       
   218 * adjusted number of occurrences per 1.000.000 in corpus, usually marked by
       
   219   ``U``
       
   220 * Standard Frequency Index (SFI), a logarithmic rescaling of ``U``:
       
   221 
       
   222   .. math:: SFI = 40 + 10 \log_{10}(U)
       
   223 
       
   224   ===  ================
       
   225   SFI  Frequency
       
   226   ===  ================
       
   227   90   1 per 10
       
   228   80   1 per 100
       
   229   70   1 per 1000
       
   230   60   1 per 10.000
       
   231   50   1 per 100.000
       
   232   40   1 per 1.000.000
       
   233   30   1 per 10.000.000
       
   234   ===  ================
       
   235 * deviation of word frequency across documents in corpus, usually marked by
       
   236   ``D``
       
   237 
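The SFI formula and its table can be checked with a short Python sketch. The function names ``sfi`` and ``per_million`` are illustrative and not part of this project, and ``per_million`` computes only a plain per-million rate: it assumes no further adjustment is applied to obtain ``U``, which may differ from how a given wordlist derives it.

```python
import math


def per_million(occurrences: int, corpus_size: int) -> float:
    """Plain rate per 1,000,000 words (assumption: no further adjustment)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return occurrences * 1_000_000 / corpus_size


def sfi(u: float) -> float:
    """Standard Frequency Index for a per-million frequency U."""
    if u <= 0:
        raise ValueError("U must be positive; SFI is undefined otherwise")
    return 40 + 10 * math.log10(u)


if __name__ == "__main__":
    # Print a few values to compare against the SFI table by hand.
    for u in (100_000, 1_000, 1, 0.1):
        print(u, round(sfi(u), 2))
```

Under these assumptions, a word seen once per 1.000.000 words has ``U = 1`` and therefore ``SFI = 40``, and each tenfold change in frequency moves the SFI by 10.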
       
   238 Sorting numerically on the first column, largest value first::
       
   239 
       
   240   $ sort -k 1nr,2 <$IN >$OUT
       
   241 
   215 OANC frequency wordlist
   242 OANC frequency wordlist
   216 =======================
   243 -----------------------
   217 
   244 
   218 The Open American National Corpus (OANC) is a roughly 15 million word subset of
   245 The Open American National Corpus (OANC) is a roughly 15 million word subset of
   219 the ANC Second Release that is unrestricted in terms of usage and
   246 the ANC Second Release that is unrestricted in terms of usage and
   220 redistribution.
   247 redistribution.
   221 
   248