# HG changeset patch
# User Oleksandr Gavenko
# Date 1457891198 -7200
# Node ID f089cd68ea7bdfe074c0c7df1b041cb8a0713da0
# Parent  05df277da4046bc832b68b11baf2e3a7d09ce4d5
Dictionary source file format.

diff -r 05df277da404 -r f089cd68ea7b www/HACKING.rst
--- a/www/HACKING.rst	Sun Mar 13 18:43:54 2016 +0200
+++ b/www/HACKING.rst	Sun Mar 13 19:46:38 2016 +0200
@@ -52,6 +52,74 @@
 During dictionaries compilation and releases ``".gadict"`` suffix changed to
 appropriated but base name should be preserved as ``"gadict_" NAME``.
 
+Dictionary source file format
+=============================
+
+gadict project used the dictd C5 source file format in the past. C5 format has
+several issues:
+
+  * C5 is not a structured format, so producing other forms and converting to
+    other formats is not possible.
+  * C5 has no markup for links or for anything else.
+
+Before that the project used the dictd TAB file format, which requires placing
+an article on a single long line. That format is not for human editing at all.
+
+Other dictionary source file formats were considered, like TEI, ISO,
+xdxf, MDF. XML-like formats are also not for human editing. Also, XML lacks
+syntax locality: the full file must be scanned to validate local changes.
+
+Note that StarDict, ABBYY Lingvo, Babylon, dictd formats were not considered
+because they are all about presentation, not structure. They are target
+formats for compilation.
+
+A fancy-looking analog of MDF + C5 was developed instead.
+
+The beginning of the file describes dictionary information.
+
+Each article is separated by ``\n__\n\n`` and consists of two parts:
+
+  * word variations with pronunciation
+  * word translations, with supplementary information like part of speech,
+    synonyms, antonyms, examples of usage
+
+*Word variations* are:
+
+* *singularity* or *number*: ``s`` - singular, ``pl`` - plural.
+* *verb voice* or *verb tense*: ``v1`` - infinitive, ``v2`` - past tense,
+  ``v3`` - past participle.
+* *gender*: ``male`` or ``female``
+* *comparison*: ``comp`` - comparative or ``super`` - superlative
+
+*Parts of speech* are:
+
+* ``n`` - noun
+* ``pron`` - pronoun
+* ``adj`` - adjective
+* ``v`` - verb
+* ``adv`` - adverb
+* ``prep`` - preposition
+* ``conj`` - conjunction
+* ``int`` - interjection
+
+Each meaning may refer to topics, like:
+
+* ``sci`` - about science
+* ``body`` - part of body
+* ``math`` - mathematics
+* ``chem`` - chemicals
+* ``bio`` - biology
+* ``music``
+* ``meal``, ``office``, etc.
+* ``size``, ``shape``, ``age``, ``color``
+
+Synonyms are marked by ``syn:``, antonyms are marked by ``ant:``.
+
+Translations are marked by lowercase ISO 639-1 codes, like ``en:``, ``ru:``, ``uk:``.
+
+Pronunciation variants are marked by ``Am`` - American, ``Br`` - British,
+``Au`` - Australian.
+
 C5 dictionary source file format
 ================================
 
@@ -70,11 +138,7 @@
     unchanged.
   * UTF-8 encoding is supported at least by Goldendict.
 
-gadict project used C5 format in the past but switched to own format due to:
-
-  * C5 is not structural format. So producing another forms and conversion to
-    other formats is not possible.
-  * C5 have no markup for links neither for any other markups.
+gadict project used C5 format in the past but switched to its own format.
 
 TODO convention
 ===============
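
Reviewer note: the format described in the first hunk can be illustrated with a rough Python sketch. It is not the project's parser. Only the ``\n__\n\n`` separator and the ``syn:``, ``ant:`` and two-letter language markers come from the text above. The function names, the assumption that each marker starts its own line, and the layout of the sample articles are hypothetical.

```python
import re

# Separator quoted in the format description.
ARTICLE_SEPARATOR = "\n__\n\n"

# Assumption: a marker starts a line and its value follows on the same line.
# Covers syn:, ant: and two-letter lowercase ISO 639-1 codes (en:, ru:, uk:).
MARKER_RE = re.compile(r"^(syn|ant|[a-z]{2}):\s*(.*)$")


def split_articles(source):
    """Return (header, articles): the text before the first separator
    (dictionary information) and the list of non-empty articles."""
    header, *articles = source.split(ARTICLE_SEPARATOR)
    return header.strip(), [a.strip("\n") for a in articles if a.strip()]


def collect_markers(article):
    """Group the lines of one article by marker.

    Lines without a recognized marker are collected under the key None.
    """
    groups = {}
    for raw in article.splitlines():
        line = raw.strip()
        match = MARKER_RE.match(line)
        key, value = (match.group(1), match.group(2)) if match else (None, line)
        if value:
            groups.setdefault(key, []).append(value)
    return groups


# Hypothetical sample; the real layout of headword lines may differ.
sample = (
    "Example dictionary header\n"
    "\n__\n\n"
    "cat\n"
    "n\n"
    "syn: feline\n"
    "ru: кошка\n"
    "\n__\n\n"
    "big\n"
    "adj\n"
    "ant: small\n"
    "uk: великий\n"
)

header, articles = split_articles(sample)
for article in articles:
    print(collect_markers(article))
```

Splitting on the literal separator keeps every change local to one article, which seems to be the property the text finds missing in XML-like formats.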