210 DEBII — Dictionary Editor and Browser |
210 DEBII — Dictionary Editor and Browser |
211 |
211 |
212 Word lists |
212 Word lists |
213 ========== |
213 ========== |
214 |
214 |
|
215 Frequency wordlists use several statistics: |
|
216 |
|
217 * number of word occurrences in corpus, usually marked by ``F`` |
|
218 * adjusted number of occurrences per 1.000.000 in corpus, usually marked by |
|
219 ``U`` |
|
220 * Standard Frequency Index (SFI), a logarithmic rescaling of ``U``: |
|
221 |
|
222 .. math:: SFI = 40 + 10 \log_{10}(U) |
|
223 |
|
224 === ================ |
|
225 SFI Freq |
|
226 === ================ |
|
227 90 1 per 10 |
|
228 80 1 per 100 |
|
229 70 1 per 1000 |
|
230 60 1 per 10.000 |
|
231 50 1 per 100.000 |
|
232 40 1 per 1.000.000 |
|
233 30 1 per 10.000.000 |
|
234 === ================ |
|
235 * deviation of word frequency across documents in corpus, usually marked by |
|
236 ``D`` |
|
237 |
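As a rough illustration of the SFI formula above, the following shell sketch computes the index from a per-million frequency ``U``. The function name ``sfi`` is hypothetical and not part of any tool mentioned here. It relies only on ``awk``, whose ``log`` is the natural logarithm, so the result is divided by ``log(10)``.

```shell
# sfi U: print the Standard Frequency Index for a per-million frequency U.
# Hypothetical helper for illustration; U must be greater than zero.
sfi() {
    # awk's log() is the natural logarithm, so divide by log(10) to get log10.
    LC_ALL=C awk -v u="$1" 'BEGIN { printf "%.1f\n", 40 + 10 * log(u) / log(10) }'
}

sfi 1    # U = 1 (1 per 1.000.000) prints 40.0
sfi 10   # U = 10 (1 per 100.000) prints 50.0
```

The two calls reproduce the SFI 40 and SFI 50 rows of the table.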
|
238 Sorting numerically on the first column, in descending order:: |
|
239 |
|
240 $ sort -k 1nr,2 <$IN >$OUT |
|
241 |
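A minimal sketch of the same sort applied to a tiny wordlist, assuming the count is in the first whitespace-separated column. The sample rows and the function name ``sort_by_count`` are made up for illustration. This variant writes the key as ``-k 1,1nr``, which limits the sort key to the first field only.

```shell
# sort_by_count: sort "count word" lines from stdin by count, highest first.
# Hypothetical wrapper; the key is restricted to field 1 (numeric, reversed).
sort_by_count() {
    LC_ALL=C sort -k 1,1nr
}

# Made-up sample data, not taken from any corpus.
printf '5 the\n12 of\n3 and\n' | sort_by_count
# 12 of
# 5 the
# 3 and
```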
215 OANC frequency wordlist |
242 OANC frequency wordlist |
216 ======================= |
243 ----------------------- |
217 |
244 |
218 The Open American National Corpus (OANC) is a roughly 15 million word subset of |
245 The Open American National Corpus (OANC) is a roughly 15 million word subset of |
219 the ANC Second Release that is unrestricted in terms of usage and |
246 the ANC Second Release that is unrestricted in terms of usage and |
220 redistribution. |
247 redistribution. |
221 |
248 |