Describe word list sources.
author Oleksandr Gavenko <gavenkoa@gmail.com>
Tue, 08 Nov 2016 17:39:22 +0200
changeset 642 c1032aea6265
parent 641 a49a091d8231
child 643 c2c32f45dde6
www/HACKING.rst
--- a/www/HACKING.rst	Tue Nov 08 17:31:11 2016 +0200
+++ b/www/HACKING.rst	Tue Nov 08 17:39:22 2016 +0200
@@ -269,6 +269,114 @@
 http://www.anc.org/data/oanc/
   OANC home page.
 
+https://en.wikipedia.org/wiki/Word_lists_by_frequency
+
+Useful word lists:
+
+
+https://en.wikipedia.org/wiki/Academic_Word_List
+  Academic Word List at Wikipedia.
+https://web.archive.org/web/20080212073904/http://language.massey.ac.nz/staff/awl/headwords.shtml
+  Academic Word List by Averil Coxhead, created in 2000 as an addition to the
+  GSL; it has 570 headwords.
+
+Obsolete or proprietary word lists:
+
+https://en.wikipedia.org/wiki/Basic_English
+  An 850-headword list created in 1930.
+
+General Service List
+--------------------
+
+The updated GSL (General Service List) was obtained from:
+
+http://jbauman.com/gsl.html
+  A 1995 revised version of the GSL with minor changes by John Bauman. He added
+  284 new headwords to the original 2000-word list by Michael West (1953).
+
+The first column gives the number of occurrences per 1,000,000 words of the
+Brown corpus, counted by word family.
+
+https://en.wikipedia.org/wiki/General_Service_List
+  General Service List at Wikipedia.
+http://jbauman.com/aboutgsl.html
+  About the General Service List by John Bauman.
+
+New General Service List
+------------------------
+
+The NGSL was obtained from:
+
+http://www.newgeneralservicelist.org/s/NGSL-101-by-band-qq9o.xlsx
+  Microsoft Excel (XLSX) file with headword, frequency and SFI.
+
+The first column gives the adjusted frequency per 1,000,000 words, counted by
+base word family.
+
+Academic Word List
+------------------
+
+The Academic Word List (AWL) was published in the Summer 2000 issue of
+TESOL Quarterly (v. 34, no. 2). It was developed by Averil Coxhead of Victoria
+University of Wellington, New Zealand. The AWL is a replacement for the
+University Word List (published by Paul Nation in 1984).
+
+The AWL (Academic Word List) was obtained from:
+
+https://web.archive.org/web/20081014065815/http://language.massey.ac.nz/staff/awl/download/awlheadwords.rtf
+  Original Academic Word List in RTF format.
+
+Each entry is a headword followed by its frequency level (from 1, the most
+frequent, to 10, the least frequent).
+
+New Academic Word List
+----------------------
+
+The frequency word list was obtained from:
+
+http://www.newacademicwordlist.org/s/NAWL_SFI.csv
+  CSV with columns ``Word,SFI,U,D``.
+
+The ``SFI`` and ``D`` columns were deleted and ``U`` and ``Word`` were swapped.
+The data was sorted by the ``U`` column (adjusted frequency per 1,000,000 words).
+
+The NAWL headword list with word variations was obtained from:
+
+http://www.laurenceanthony.net/software/antwordprofiler/
+  Laurence Anthony's AntWordProfiler home page.
+
+It was encoded in ``latin-1`` and has been recoded to ``utf-8`` (because of the
+``É`` character).
+
+See also:
+
+http://www.newacademicwordlist.org/
+  Home page.
+
+Special English word list
+-------------------------
+
+https://en.wikipedia.org/wiki/Special_English
+  Special English is a controlled version of the English language used by the
+  United States broadcasting service Voice of America (VOA). 1557 headwords.
+
+BNC+COCA wordlist
+-----------------
+
+Paul Nation prepared a frequency word list from combined BNC and COCA corpora:
+
+http://www.victoria.ac.nz/lals/about/staff/paul-nation
+  Paul Nation's home page and list download page.
+https://simple.wiktionary.org/wiki/Wiktionary:BNC_spoken_freq
+  About the list on Wikimedia.
+
+It has 25000 base words (each with its variations), split into chunks of 1000
+words.
+
+I got the list from:
+
+http://www.laurenceanthony.net/software/antwordprofiler/
+  Laurence Anthony's AntWordProfiler home page.
 
 Register gadict dictionaries for dictd under Debian
 ===================================================