www/HACKING.rst
author Oleksandr Gavenko <gavenkoa@gmail.com>
Wed, 21 Sep 2016 22:09:37 +0300
Add syntax to add related words. Add separators between ant/syn/rel in generated output.

.. -*- coding: utf-8 -*-

======================
 gadict HACKING guide
======================
.. contents::
   :local:

Versioning rules
================

We use a **major.minor** scheme.

Until the dictionary reaches 5000 words, **major** stays at 0. **minor** is
incremented from time to time.

Getting sources
===============

Cloning the repository::

  $ hg clone http://hg.defun.work/gadict gadict
  $ hg clone http://hg.code.sf.net/p/gadict/code gadict-hg

Pushing changes::

  $ hg push ssh://$USER@hg.defun.work/gadict
  $ hg push ssh://$USER@hg.code.sf.net/p/gadict/code
  $ hg push https://$USER:$PASS@hg.code.sf.net/p/gadict/code

Browsing sources online
=======================

  http://hg.defun.work/gadict
    hgweb at the project home page.
  http://hg.code.sf.net/p/gadict/code
    hgweb at the old home page (still maintained as a mirror).
  https://sourceforge.net/p/gadict/code/
    SourceForge Allura interface (a mirror, not the primary repository).

Dictionary file name convention
===============================

BNF form::

  FILE ::= "gadict_" NAME ".gadict"

``NAME`` may have the form ``ISOCODE "-" ISOCODE``, like ``en-ru``, where
``ISOCODE`` is a two-letter ISO 639-1 language code.

``NAME`` may also be a dictionary abbreviation name.

During dictionary compilation and release the ``".gadict"`` suffix is replaced
by the one appropriate for the target format, but the base name
``"gadict_" NAME`` should be preserved.
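
The naming rule above can be checked mechanically. The following is a minimal
sketch, not part of the project tooling: the function name
``parse_dict_filename`` is hypothetical, and it assumes that ISO codes in file
names are written in lowercase, as in the ``en-ru`` example.

```python
import re

# FILE ::= "gadict_" NAME ".gadict"
FILE_RE = re.compile(r"gadict_(?P<name>.+)\.gadict")
# NAME of the form ISOCODE "-" ISOCODE, e.g. "en-ru".
# Assumption: both codes are two lowercase ASCII letters.
LANG_PAIR_RE = re.compile(r"(?P<src>[a-z]{2})-(?P<dst>[a-z]{2})")


def parse_dict_filename(filename):
    """Return (base_name, name, lang_pair), or None if FILE does not match.

    lang_pair is a (source, target) tuple when NAME looks like a language
    pair and None when NAME is some other name, e.g. an abbreviation.
    """
    m = FILE_RE.fullmatch(filename)
    if m is None:
        return None
    name = m.group("name")
    pair = LANG_PAIR_RE.fullmatch(name)
    lang_pair = (pair.group("src"), pair.group("dst")) if pair else None
    # The base name is what must survive compilation and release.
    return "gadict_" + name, name, lang_pair
```

The returned base name is the part that stays the same when the suffix is
replaced for a target format.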

Dictionary source file format
=============================

The gadict project used the dictd C5 source file format in the past. The C5
format has several issues:

 * C5 is not a structured format, so producing other forms and converting to
   other formats is not possible.
 * C5 has no markup for links, nor for anything else.

Before that the project used the dictd TAB file format, which requires placing
each article on a single long line. That format is not suited to human editing
at all.

Other dictionary source file formats were considered, like TEI, ISO, XDXF and
MDF. XML-like formats are also not suited to human editing. XML also lacks
syntax locality: the full file has to be scanned to validate a local change.

Note that the StarDict, ABBYY Lingvo, Babylon and dictd formats were not
considered because they are all about presentation, not structure. They are
target formats for compilation.

Instead, a fancy-looking analog of MDF + C5 was developed.

The beginning of the file describes the dictionary information.

Articles are separated by ``\n__\n\n``. Each article consists of two parts:

 * word variations with pronunciation
 * word translations, with supplementary information like part of speech,
   synonyms, antonyms and examples of usage
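
As an illustration of the article separator, here is a minimal sketch that
splits source text into the leading dictionary information and the list of
articles. The function name ``split_articles`` is hypothetical, and the sketch
assumes that everything before the first separator is the header. It does not
try to split an article into its two parts.

```python
ARTICLE_SEPARATOR = "\n__\n\n"


def split_articles(text):
    """Split dictionary source text into (header, articles).

    Assumption: the chunk before the first separator is the dictionary
    information; every later chunk is one article, left unparsed.
    """
    header, *articles = text.split(ARTICLE_SEPARATOR)
    return header, articles
```

Splitting on the exact separator string keeps the check local: an edit inside
one article cannot change where the other articles begin.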

*Word variations* are:

* *number*: ``s`` - singular, ``pl`` - plural.
* *verb form*: ``v1`` - infinitive, ``v2`` - past tense,
  ``v3`` - past participle.
* *gender*: ``male`` or ``female``.
* *comparison*: ``comp`` - comparative or ``super`` - superlative.

*Parts of speech* are:

* ``n`` - noun
* ``pron`` - pronoun
* ``adj`` - adjective
* ``v`` - verb
* ``adv`` - adverb
* ``prep`` - preposition
* ``conj`` - conjunction
* ``num`` - numeral
* ``int`` - interjection
* ``abbr`` - abbreviation
* ``phr`` - phrase
* ``phr.v`` - phrasal verb
* ``contr`` - contraction
* ``prefix`` - word prefix

Each meaning may refer to topics, like:

* ``sci`` - about science
* ``body`` - part of body
* ``math`` - mathematics
* ``chem`` - chemicals
* ``bio`` - biology
* ``music``
* ``meal``, ``office``, etc
* ``size``, ``shape``, ``age``, ``color``

Synonyms are marked by ``syn:``, antonyms by ``ant:``, related (see also)
terms by ``rel:``, and topics/tags by ``topic:``.

Translations are marked by a lowercase ISO 639-1 code, like:

* ``en:`` - English
* ``ru:`` - Russian
* ``uk:`` - Ukrainian
* ``la:`` - Latin
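
The markers above lend themselves to a simple lookup. The sketch below is
illustrative only: it assumes that a marker starts its line and that the
marked text follows the colon, which this guide does not state. The names
``classify_line``, ``RELATION_MARKERS`` and ``TRANSLATION_MARKERS`` are
hypothetical.

```python
RELATION_MARKERS = {
    "syn": "synonym",
    "ant": "antonym",
    "rel": "related",
    "topic": "topic",
}
TRANSLATION_MARKERS = {
    "en": "English",
    "ru": "Russian",
    "uk": "Ukrainian",
    "la": "Latin",
}


def classify_line(line):
    """Return (kind, label, text) for a marked line, or None otherwise.

    Assumption: the line has the shape MARKER ":" TEXT.
    """
    marker, sep, rest = line.strip().partition(":")
    if not sep:
        return None
    if marker in RELATION_MARKERS:
        return ("relation", RELATION_MARKERS[marker], rest.strip())
    if marker in TRANSLATION_MARKERS:
        return ("translation", TRANSLATION_MARKERS[marker], rest.strip())
    return None
```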

Pronunciation variants are marked by:

* ``Am`` - American
* ``Br`` - British
* ``Au`` - Australian

C5 dictionary source file format
================================

The source files formerly used the dictd C5 file format. See::

  $ man 1 dictfmt

In short:

 * Headwords are preceded by 5 or more underscore characters ``_`` and a blank
   line.
 * An article may have several headwords; in that case they are placed on one
   line and separated by ``;<SPACE>``.
 * All text until the next headword is considered the definition.
 * Any leading ``@`` characters are stripped out, but the file is otherwise
   unchanged.
 * UTF-8 encoding is supported at least by GoldenDict.

The gadict project used the C5 format in the past but switched to its own format.
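
The C5 rules listed above can be sketched as a small reader. This is a rough
illustration, not a replacement for ``dictfmt``: the function name
``parse_c5`` is hypothetical, text before the first headword is skipped, and
the rule about leading ``@`` characters is not handled.

```python
import re

# A separator line: 5 or more underscores and nothing else.
SEPARATOR_RE = re.compile(r"^_{5,}\s*$")


def parse_c5(text):
    """Return a list of (headwords, definition) pairs from C5 source text."""
    entries = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        if not SEPARATOR_RE.match(lines[i]):
            i += 1  # text before the first headword is skipped
            continue
        i += 1
        # Skip the blank line(s) between the underscores and the headwords.
        while i < len(lines) and not lines[i].strip():
            i += 1
        if i >= len(lines):
            break
        # Several headwords share one line, separated by ";<SPACE>".
        headwords = lines[i].split("; ")
        i += 1
        # Everything up to the next separator is the definition.
        body = []
        while i < len(lines) and not SEPARATOR_RE.match(lines[i]):
            body.append(lines[i])
            i += 1
        entries.append((headwords, "\n".join(body).strip()))
    return entries
```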

TODO convention
===============

Entries or parts of text that are not complete are marked by keywords:

  TODO
    incomplete
  XXX
    urgent incomplete

The Makefile rule ``todo`` finds these occurrences in the sources::

  $ make todo
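
How the Makefile rule is implemented is not described here. As a rough
sketch of the same kind of search, and not the rule's actual recipe, the
hypothetical helper ``find_todos`` below reports every line that carries one
of the two keywords.

```python
import re

# Match the keywords as whole words only.
TODO_RE = re.compile(r"\b(TODO|XXX)\b")


def find_todos(text):
    """Yield (line_number, keyword, line) for each line with a keyword."""
    for number, line in enumerate(text.splitlines(), start=1):
        match = TODO_RE.search(line)
        if match:
            yield number, match.group(1), line
```

Only the first keyword on a line is reported, which is enough to locate the
line for editing.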

Worldwide dictionary formats and standards
==========================================

  http://en.wikipedia.org/wiki/Dictionary_writing_system
                Dictionary writing system
  http://www.sil.org/computing/shoebox/mdf.html
                Multi-Dictionary Formatter (MDF). It defines about 100 data
                field markers.
  http://fieldworks.sil.org/flex/
                FieldWorks Language Explorer (or FLEx, for short) is designed
                to help field linguists perform many common language
                documentation and analysis tasks.
  http://code.google.com/p/lift-standard/
                LIFT (Lexicon Interchange FormaT) is an XML format for storing
                lexical information, as used in the creation of dictionaries.
                It's not necessarily the format for your lexicon.
  http://www.lexiquepro.com/
                Lexique Pro is an interactive lexicon viewer and editor, with
                hyperlinks between entries, category views, dictionary
                reversal, search, and export tools. It's designed to display
                your data in a user-friendly format so you can distribute it
                to others.
  http://deb.fi.muni.cz/index.php
                DEBII — Dictionary Editor and Browser

Register gadict dictionaries for dictd under Debian
===================================================
::

  $ su
  $ cat >>/etc/dictd/dictd.order <<EOF
  gadict-dictabbr
  /home/user/usr/share/dictd/
  EOF
  $ dictdconfig --write
  $ /etc/init.d/dictd restart
  $ ^D
  $ dictdconfig --list
  $ dict -d gadict-dictabbr v

Typing IPA chars in Emacs
=========================

To enter IPA chars use the IPA input method. To enable it type::

  C-u C-\ ipa <enter>

All chars from the alphabet are typed as usual. To type special IPA chars use
the following key bindings (or read the help in Emacs via
``M-x describe-input-method`` or ``C-h I``).

For vowels::

  æ  ae
  ɑ  o| or A
  ɒ  |o  or /A
  ʊ  U
  ɛ  /3 or E
  ɔ  /c
  ə  /e
  ʌ  /v
  ɪ  I

For consonants::

  θ  th
  ð  dh
  ʃ  sh
  ʧ  tsh
  ʒ  zh or 3
  ŋ  ng
  ɡ  g
  ɹ  /r

Special chars::

  ː  : (colon)
  ˈ  ' (quote)
  ˌ  ` (back quote)

Alternatively, use the ``ipa-x-sampa`` or ``ipa-kirshenbaum`` input method (for
help type ``C-h I ipa-x-sampa RET`` or ``C-h I ipa-kirshenbaum RET``).