www/HACKING.rst
changeset 635 445ee650a9ba
parent 634 4f97d314c5e5
child 636 bc521aba85bc
--- a/www/HACKING.rst	Tue Nov 08 15:57:20 2016 +0200
+++ b/www/HACKING.rst	Tue Nov 08 16:11:20 2016 +0200
@@ -209,6 +209,40 @@
   http://deb.fi.muni.cz/index.php
                 DEBII — Dictionary Editor and Browser
 
+Word lists
+==========
+
+OANC frequency wordlist
+=======================
+
+The Open American National Corpus (OANC) is a roughly 15 million word subset of
+the ANC Second Release that is unrestricted in terms of usage and
+redistribution.
+
+I downloaded OANC from http://www.anc.org/OANC/OANC-1.0.1-UTF8.zip
+
+Unpack only the ``.txt`` files and count lines, words, and bytes::
+
+  $ unzip OANC-1.0.1-UTF8.zip '*.txt'
+  $ cd OANC; find . -type f | xargs cat | wc
+  2090929 14586935 96737202
+
+I built the frequency list with AntConc:
+
+http://www.laurenceanthony.net/software/antconc/
+  A freeware corpus analysis toolkit for concordancing and text analysis.
+
+Then I manually removed one- and two-letter words, filtered out misspelled
+words with the ``en_US`` ``hunspell`` spell-checker, and merged word variants
+into their base forms using WordNet. See ``obsolete/oanc.py`` for details.
+
+http://www.anc.org/data/oanc/download/
+  OANC download page.
+
+http://www.anc.org/data/oanc/
+  OANC home page.
+
+
 Register gadict dictionaries for dictd under Debian
 ===================================================
 ::