# HG changeset patch
# User Oleksandr Gavenko
# Date 1478619562 -7200
# Node ID c1032aea6265c44463b8fe1ef842d8cd15e0b17f
# Parent  a49a091d8231ccad7a1149f5c19355535d46be4c
Describe word list sources.

diff -r a49a091d8231 -r c1032aea6265 www/HACKING.rst
--- a/www/HACKING.rst	Tue Nov 08 17:31:11 2016 +0200
+++ b/www/HACKING.rst	Tue Nov 08 17:39:22 2016 +0200
@@ -269,6 +269,114 @@
 
 http://www.anc.org/data/oanc/
   OANC home page.
+https://en.wikipedia.org/wiki/Word_lists_by_frequency
+
+Useful word lists:
+
+
+https://en.wikipedia.org/wiki/Academic_Word_List
+  Academic Word List at Wikipedia.
+https://web.archive.org/web/20080212073904/http://language.massey.ac.nz/staff/awl/headwords.shtml
+  Academic Word List by Averil Coxhead, created in 2000 as an addition to the
+  GSL; it has 570 headwords.
+
+Obsolete or proprietary word list:
+
+https://en.wikipedia.org/wiki/Basic_English
+  850 headword list created in 1930.
+
+General Service List
+--------------------
+
+The updated GSL (General Service List) was obtained from:
+
+http://jbauman.com/gsl.html
+  A 1995 revised version of the GSL with minor changes by John Bauman. He added
+  284 new headwords to Michael West's original 2000-word list from 1953.
+
+The first column represents the number of occurrences per 1,000,000 words of
+the Brown corpus, based on counting word families.
+
+https://en.wikipedia.org/wiki/General_Service_List
+  General Service List at Wikipedia.
+http://jbauman.com/aboutgsl.html
+  About the General Service List by John Bauman.
+
+New General Service List
+------------------------
+
+The NGSL was obtained from:
+
+http://www.newgeneralservicelist.org/s/NGSL-101-by-band-qq9o.xlsx
+  Microsoft Excel (XLSX) file with headword, frequency and SFI.
+
+The first column is the adjusted frequency per 1,000,000 words, counting
+base word families.
+
+Academic Word List
+------------------
+
+The Academic Word List (AWL) was published in the Summer 2000 issue of
+TESOL Quarterly (v. 34, no. 2). It was developed by Averil Coxhead, of Victoria
+University of Wellington, in New Zealand. The AWL is a replacement for the
+University Word List (published by Paul Nation in 1984).
+
+The AWL (Academic Word List) was obtained from:
+
+https://web.archive.org/web/20081014065815/http://language.massey.ac.nz/staff/awl/download/awlheadwords.rtf
+  Original Academic Word List in RTF format.
+
+Its structure is a headword followed by a frequency level (from 1 as most
+frequent to 10 as least frequent).
+
+New Academic Word List
+----------------------
+
+The frequency word list was obtained from:
+
+http://www.newacademicwordlist.org/s/NAWL_SFI.csv
+  CSV with columns ``Word,SFI,U,D``.
+
+The ``SFI`` and ``D`` columns were deleted and ``U`` and ``Word`` were swapped.
+Data was sorted by the ``U`` column (adjusted frequency per 1,000,000 words).
+
+The NAWL headword list with word variations was obtained from:
+
+http://www.laurenceanthony.net/software/antwordprofiler/
+  Laurence Anthony's AntWordProfiler home page.
+
+It is encoded in ``latin-1`` and was recoded into ``utf-8`` (because of the
+``É`` symbol).
+
+See also:
+
+http://www.newacademicwordlist.org/
+  Home page.
+
+Special English word list
+-------------------------
+
+https://en.wikipedia.org/wiki/Special_English
+  Special English is a controlled version of the English language used by the
+  United States broadcasting service Voice of America (VOA). 1557 headwords.
+
+BNC+COCA wordlist
+-----------------
+
+Paul Nation prepared a frequency word list from combined BNC and COCA corpora:
+
+http://www.victoria.ac.nz/lals/about/staff/paul-nation
+  Paul Nation's home page and list download page.
+https://simple.wiktionary.org/wiki/Wiktionary:BNC_spoken_freq
+  About the list on Wikimedia.
+
+It has 25000 basewords (each baseword comes with variations) split into
+chunks of 1000 words.
+
+I got the list from:
+
+http://www.laurenceanthony.net/software/antwordprofiler/
+  Laurence Anthony's AntWordProfiler home page.
 
 Register gadict dictionaries for dictd under Debian
 ===================================================