
.. -*- coding: utf-8 -*-

======================
 gadict HACKING guide
======================
.. contents::
   :local:

Versioning rules
================

We use a **major.minor** scheme.

Until we reach 5000 words **major** stays at 0; **minor** is bumped from time
to time.

Getting sources
===============

Cloning the repository::

  $ hg clone http://hg.defun.work/gadict gadict
  $ hg clone http://hg.code.sf.net/p/gadict/code gadict-hg

Pushing changes::

  $ hg push ssh://$USER@hg.defun.work/gadict
  $ hg push ssh://$USER@hg.code.sf.net/p/gadict/code
  $ hg push https://$USER:$PASS@hg.code.sf.net/p/gadict/code

Browsing sources online
=======================

  http://hg.defun.work/gadict
    hgweb at the home page.
  http://hg.code.sf.net/p/gadict/code
    hgweb at the old home page (still supported as a mirror).
  https://sourceforge.net/p/gadict/code/
    SourceForge Allura interface (not primary, a mirror).

Building project
================

The ``gadict`` project provides dictionaries encoded in a custom format. In
order to process them you need GNU Make, Python 2.7 and possibly other tools.

To produce dictionaries in the ``dictd`` format you need to install the
``dictd`` distribution with the ``dictfmt`` and ``dictzip`` utilities and run::

  $ make dict
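
Under the hood this roughly amounts to formatting the compiled output with
``dictfmt`` and compressing it with ``dictzip``, along these lines (a sketch
with hypothetical file names, assuming an intermediate C5 file; see
``Makefile`` for the actual commands)::

  $ dictfmt -c5 --utf8 -s "gadict en-ru+uk" gadict_en-ru+uk <gadict_en-ru+uk.c5
  $ dictzip gadict_en-ru+uk.dict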

To produce Anki decks install Anki v... on Linux or get the Anki sources at a
specific version (before the port to Python 3)::

  $ git clone https://github.com/dae/anki.git
  $ cd anki
  $ git checkout 15b349e3^

and put the path to the Anki source directory into ``Makefile.config``::

  ANKI_PY_DIR := $(HOME)/devel/anki

The build command to make Anki decks is::

  $ make anki

Dictionary source file format
=============================

The gadict project used the dictd C5 source file format in the past. The C5
format has several issues:

 * C5 is not a structured format, so producing other forms and converting to
   other formats is not possible.
 * C5 has no markup for links, nor any other markup.

Before that the project used the dictd TAB file format, which requires placing
an article on a single long line. That format is not meant for human editing
at all.

Other dictionary source file formats were considered as candidates, like TEI,
ISO standards, XDXF and MDF. XML-like formats are likewise not suited for
human editing. XML also lacks syntax locality: the whole file must be scanned
to validate a local change.

Note that the StarDict, ABBYY Lingvo, Babylon and dictd formats were not
considered because they are all about presentation, not structure. They are
target formats for compilation.

A fancy-looking analog of MDF + C5 was developed instead.

The beginning of the file describes dictionary information.

Articles are separated by ``\n__\n\n`` and each consists of two parts:

 * word variations with pronunciation
 * word translations, with supplementary information like part of speech,
   synonyms, antonyms and examples of usage

*Word variations* are:

* *number*: ``s`` - singular, ``pl`` - plural.
* *verb form*: ``v1`` - infinitive, ``v2`` - past tense, ``v3`` - past
  participle.
* *gender*: ``male`` or ``female``.
* *comparison*: ``comp`` - comparative or ``super`` - superlative.

*Parts of speech* (ordered by preference):

* ``v`` - verb
* ``n`` - noun
* ``pron`` - pronoun
* ``adv`` - adverb
* ``adj`` - adjective
* ``prep`` - preposition
* ``conj`` - conjunction
* ``num`` - numeral
* ``int`` - interjection
* ``abbr`` - abbreviation
* ``phr`` - phrase
* ``phr.v`` - phrasal verb
* ``contr`` - contraction
* ``prefix`` - word prefix

.. note:: I try to keep the word meanings in an article in the POS order above.

Each meaning may refer to topics, like:

* ``sci`` - about science
* ``body`` - part of body
* ``math`` - mathematics
* ``chem`` - chemicals
* ``bio`` - biology
* ``music``
* ``meal``, ``office``, etc
* ``size``, ``shape``, ``age``, ``color``
* ``archaic`` - old fashioned, no longer used

*Word relations* (ordered by preference):

* ``topic:`` - topics/tags
* ``ant:`` - antonyms
* ``syn:`` - synonyms
* ``hyper:`` - hypernyms
* ``hypo:`` - hyponyms
* ``rel:`` - related (see also) terms

Translations are marked by a lowercase ISO 639-1 code followed by the ``:``
(colon) character, like:

* ``en:`` - English
* ``ru:`` - Russian
* ``uk:`` - Ukrainian
* ``la:`` - Latin

Examples are marked by a lowercase ISO 639-1 code followed by the ``>``
(greater-than) character.

Explanations or glossary definitions are marked by a lowercase ISO 639-1 code
followed by the ``=`` (equals) character.

Pronunciation variants are marked by:

* ``Am`` - American
* ``Br`` - British
* ``Au`` - Australian
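
Putting the conventions above together, a hypothetical article could look
roughly like this (the layout is illustrative only; consult existing source
files in the repository for the authoritative conventions)::

  apple
  Br [ˈæpl]
  Am [ˈæpəl]

  n
  topic: meal
  hyper: fruit
  ru: яблоко
  uk: яблуко
  en= a round fruit with red or green skin and firm white flesh
  en> She bit into the apple.
  __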

The ``rare`` attribute on the first headword marks a word with low frequency.
SRS file writers skip entries marked as ``rare``. I found it convenient to
check frequency with:

https://books.google.com/ngrams/
  Google N-grams from books 1800-2010.

As the cut-off point I chose the word ``beseech``. All less frequent words
receive the ``rare`` marker.

C5 dictionary source file format
================================

In the past the dictd C5 file format was used as the source format. See::

  $ man 1 dictfmt

In short:

 * Headwords are preceded by 5 or more underscore characters ``_`` and a blank
   line.
 * An article may have several headwords; in that case they are placed on one
   line and separated by ``;<SPACE>``.
 * All text until the next headword is considered the definition.
 * Any leading ``@`` characters are stripped out, but the file is otherwise
   unchanged.
 * UTF-8 encoding is supported, at least by GoldenDict.
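
A minimal sketch of a C5 article following these rules (illustrative only)::

  _____

  apple; apples

  A round fruit with firm white flesh and red or green skin.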

The gadict project used the C5 format in the past but switched to its own
format.

TODO convention
===============

Entries or parts of text that are not completed are marked by keywords:

  TODO
    incomplete
  XXX
    urgent incomplete

The Makefile rule ``todo`` finds these occurrences in the sources::

  $ make todo
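
In essence the rule boils down to a ``grep`` over the sources, something like
this (a sketch; the file glob is an assumption, see ``Makefile`` for the
actual rule)::

  $ grep -n -E 'TODO|XXX' *.rst *.gadict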

World wide dictionary formats and standards
===========================================

http://en.wikipedia.org/wiki/Dictionary_writing_system
  Dictionary writing system.
http://www.sil.org/computing/shoebox/mdf.html
  Multi-Dictionary Formatter (MDF). It defines about 100 data field markers.
http://fieldworks.sil.org/flex/
  FieldWorks Language Explorer (or FLEx, for short) is designed to help field
  linguists perform many common language documentation and analysis tasks.
http://code.google.com/p/lift-standard/
  LIFT (Lexicon Interchange FormaT) is an XML format for storing lexical
  information, as used in the creation of dictionaries. It's not necessarily
  the format for your lexicon.
http://www.lexiquepro.com/
  Lexique Pro is an interactive lexicon viewer and editor, with hyperlinks
  between entries, category views, dictionary reversal, search, and export
  tools. It's designed to display your data in a user-friendly format so you
  can distribute it to others.
http://deb.fi.muni.cz/index.php
  DEBII - Dictionary Editor and Browser.

Linguistic sources
==================

Ukrainian linguistics corpora
-----------------------------

**National Corpus of the Russian Language**. It contains parallel
Russian-Ukrainian texts. Search is possible by keywords, grammatical function,
thesaurus properties and other properties.

http://www.ruscorpora.ru/search-para-uk.html
  Page for querying online.

**Corpus of the mova.info project**. It offers literal search and search by
word family.

http://www.mova.info/corpus.aspx
  Page for querying online.

Word lists
==========

Frequency wordlists use several statistics:

* number of word occurrences in the corpus, usually marked by ``F``
* adjusted number of occurrences per 1,000,000 words in the corpus, usually
  marked by ``U``
* Standard Frequency Index (SFI), defined as (a worked example follows this
  list):

  .. math:: SFI = 40 + 10 \log_{10}(U)

  ===  ================
  SFI  Freq
  ===  ================
  90   1 per 10
  80   1 per 100
  70   1 per 1,000
  60   1 per 10,000
  50   1 per 100,000
  40   1 per 1,000,000
  30   1 per 10,000,000
  ===  ================

* deviation of word frequency across documents in the corpus, usually marked
  by ``D``
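
As a sanity check of the definition against the table: a word occurring once
per 10,000 words has U = 100 per million, hence

.. math:: SFI = 40 + 10 \log_{10}(100) = 40 + 10 \cdot 2 = 60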

Sorting numerically, in descending order, on the first column::

  $ sort -k 1nr,2 <$IN >$OUT

OANC frequency wordlist
-----------------------

The Open American National Corpus (OANC) is a roughly 15 million word subset of
the ANC Second Release that is unrestricted in terms of usage and
redistribution.

I got the OANC from http://www.anc.org/OANC/OANC-1.0.1-UTF8.zip

Unpack only the ``.txt`` files and count the words::

  $ unzip OANC-1.0.1-UTF8.zip '*.txt'
  $ cd OANC; find . -type f | xargs cat | wc
  2090929 14586935 96737202

I built the frequency list with:

http://www.laurenceanthony.net/software/antconc/
  A freeware corpus analysis toolkit for concordancing and text analysis.

then manually removed single- and double-letter words, filtered out misspelled
words with the ``en_US`` ``hunspell`` spell-checker and merged word variations
into their base forms using WordNet. See details in ``obsolete/oanc.py``.
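
The spell-checking step can be reproduced with ``hunspell`` in filter mode,
roughly like this (a sketch with hypothetical file names; the exact processing
lives in ``obsolete/oanc.py``)::

  $ hunspell -d en_US -G <words.txt >words.correct.txt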

http://www.anc.org/data/oanc/download/
  OANC download page.

http://www.anc.org/data/oanc/
  OANC home page.

https://en.wikipedia.org/wiki/Word_lists_by_frequency
  Word lists by frequency at Wikipedia.

Useful word lists:

https://en.wikipedia.org/wiki/Academic_Word_List
  Academic Word List at Wikipedia.
https://web.archive.org/web/20080212073904/http://language.massey.ac.nz/staff/awl/headwords.shtml
  Academic Word List by Averil Coxhead, created in 2000 as an addition to the
  GSL; it has 570 headwords.

Obsolete or proprietary word lists:

https://en.wikipedia.org/wiki/Basic_English
  An 850-headword list created in 1930.

General Service List
--------------------

The updated GSL (General Service List) was obtained from:

http://jbauman.com/gsl.html
  A 1995 revised version of the GSL with minor changes by John Bauman. He
  added 284 new headwords to the original 2,000-word list created by Michael
  West in 1953.

The first column represents the number of occurrences per 1,000,000 words of
the Brown corpus, based on counting word families.

https://en.wikipedia.org/wiki/General_Service_List
  General Service List at Wikipedia.
http://jbauman.com/aboutgsl.html
  About the General Service List by John Bauman.

New General Service List
------------------------

The NGSL was obtained from:

http://www.newgeneralservicelist.org/s/NGSL-101-by-band-qq9o.xlsx
  Microsoft Excel file with headword, frequency and SFI.

The first column represents the adjusted frequency per 1,000,000 words,
counting base word families.

Academic Word List
------------------

The Academic Word List (AWL) was published in the Summer 2000 issue of the
TESOL Quarterly (v. 34, no. 2). It was developed by Averil Coxhead of Victoria
University of Wellington, New Zealand. The AWL is a replacement for the
University Word List (published by Paul Nation in 1984).

The AWL was obtained from:

https://web.archive.org/web/20081014065815/http://language.massey.ac.nz/staff/awl/download/awlheadwords.rtf
  Original Academic Word List in RTF format.

Its structure is a headword followed by a frequency level (from 1, most
frequent, to 10, least frequent).

New Academic Word List
----------------------

The frequency word list was obtained from:

http://www.newacademicwordlist.org/s/NAWL_SFI.csv
  CSV with columns ``Word,SFI,U,D``.

The ``SFI`` and ``D`` columns were deleted and the ``U`` and ``Word`` columns
were swapped. The data was sorted by the ``U`` column (adjusted frequency per
1,000,000 words).

The NAWL headword list with word variations was obtained from:

http://www.laurenceanthony.net/software/antwordprofiler/
  Laurence Anthony's AntWordProfiler home page.

It is encoded in ``latin-1`` and was recoded into ``utf-8`` (because of the
``É`` symbol).

See also:

http://www.newacademicwordlist.org/
  Home page.

Special English word list
-------------------------

https://en.wikipedia.org/wiki/Special_English
  Special English is a controlled version of the English language used by the
  United States broadcasting service Voice of America (VOA). It has 1,557
  headwords.

Business Service List
---------------------

The 1,700 words of BSL version 1.01 give up to 97% coverage of general
business English materials when combined with the 2,800 words of the NGSL.

Wordlist with variations was obtained from:

http://www.newgeneralservicelist.org/s/AWPngslbsl-twcg.zip
  In an AntWordProfiler-compatible format.

http://www.newgeneralservicelist.org/bsl-business-service-list/
  BSL home & download page.

TOEIC Service List
------------------

Based on a 1.5-million-word corpus of various TOEIC preparation materials, the
1,200 words of TSL version 1.1 give up to 99% coverage of TOEIC materials and
tests when combined with the 2,800 words of the NGSL.

Wordlist with variations was obtained from:

http://www.newgeneralservicelist.org/s/AWPngsltsl.zip
  In an AntWordProfiler-compatible format.

http://www.newgeneralservicelist.org/toeic-list/
  The TOEIC Service List home page.

BNC+COCA wordlist
-----------------

Paul Nation prepared a frequency wordlist from the combined BNC and COCA
corpora:

http://www.victoria.ac.nz/lals/about/staff/paul-nation
  Paul Nation's home page and list download page.
https://simple.wiktionary.org/wiki/Wiktionary:BNC_spoken_freq
  About the list on Wiktionary.

It has 25,000 base words (each base word comes with its variations), split
into chunks of 1,000 words.

I got the list from:

http://www.laurenceanthony.net/software/antwordprofiler/
  Laurence Anthony's AntWordProfiler home page.

Miscellaneous wordlists
-----------------------

The Dolch word list is a list of frequently used English words compiled by
Edward William Dolch. The list was prepared in 1936 and was originally published
in his book Problems in Reading in 1948. Dolch compiled the list based on
children's books of his era. The list contains 220 "service words". The
compilation excludes nouns, which comprise a separate 95-word list.

The Dolch wordlist is already covered by ``gadict``.

https://en.wikipedia.org/wiki/Dolch_word_list
  Wikipedia article with list itself.

The Leipzig-Jakarta list is a 100-word word list used by linguists to test the
degree of chronological separation of languages by comparing words that are
resistant to borrowing. The Leipzig-Jakarta list became available in 2009.

The Leipzig-Jakarta wordlist is already covered by ``gadict``.

https://en.wikipedia.org/wiki/Leipzig%E2%80%93Jakarta_list
  Wikipedia article with list itself.

The words in the Swadesh lists were chosen for their universal, culturally
independent availability in as many languages as possible. Swadesh's final list,
published in 1971, contains 100 terms.

The Swadesh wordlist is already covered by ``gadict``, except for some rare
words.

https://en.wikipedia.org/wiki/Swadesh_list
  Wikipedia article with the list itself.

Typing IPA chars in Emacs
=========================

To enter IPA characters use the IPA input method. To enable it type::

  C-u C-\ ipa <enter>

All characters from the Latin alphabet are typed as usual. To type special IPA
characters use the following key bindings (or read the help in Emacs via
``M-x describe-input-method`` or ``C-h I``).

For vowels::

  æ  ae
  ɑ  o| or A
  ɒ  |o  or /A
  ʊ  U
  ɛ  /3 or E
  ɔ  /c
  ə  /e
  ʌ  /v
  ɪ  I

For consonants::

  θ  th
  ð  dh
  ʃ  sh
  ʧ  tsh
  ʒ  zh or 3
  ŋ  ng
  ɡ  g
  ɹ  /r

Special characters::

  ː  : (colon)
  ˈ  ' (quote)
  ˌ  ` (back quote)

Alternatively use ``ipa-x-sampa`` or ``ipa-kirshenbaum`` input method (for help
type: ``C-h I ipa-x-sampa RET`` or ``C-h I ipa-kirshenbaum RET``).