.. -*- coding: utf-8 -*-
gadict HACKING guide
.. contents::
Versioning rules
We use a **major.minor** versioning scheme.
Until the dictionary reaches 5000 words, **major** stays 0. **minor** is
incremented from time to time.
Getting sources
Cloning repository::
$ hg clone gadict
$ hg clone gadict-hg
Pushing changes::
$ hg push ssh://$
$ hg push ssh://$
$ hg push https://$USER:$
Browsing sources online
hgweb at home page.
hgweb at old home page (but supported as mirror).
Sourceforge Allure interface (not primary, a mirror).
Dictionary file name convention
BNF form::
FILE ::= "gadict_" NAME ".gadict"
``NAME`` may have the form ``ISOCODE "-" ISOCODE``, like ``en-ru``, where
``ISOCODE`` is a two-letter ISO 639-1 language code.
``NAME`` may also be a dictionary abbreviation name.
During dictionary compilation and release the ``".gadict"`` suffix is replaced
with one appropriate to the target format, but the base name should be
preserved as ``"gadict_" NAME``.
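The naming rule can be checked mechanically. The following is a minimal Python sketch, not part of the gadict tooling. It assumes that an abbreviation name consists of ASCII letters, digits and hyphens (the guide does not define its alphabet). The helper names ``parse_gadict_filename`` and ``compiled_name`` and the ``.c5`` suffix in the usage note are hypothetical.

```python
import re

# FILE ::= "gadict_" NAME ".gadict"
# Assumption: NAME is a run of ASCII letters, digits and hyphens; the
# guide itself does not restrict the characters of an abbreviation name.
FILE_RE = re.compile(r"gadict_(?P<name>[A-Za-z0-9-]+)\.gadict")
# NAME of the form ISOCODE "-" ISOCODE, each a 2-letter lowercase code.
PAIR_RE = re.compile(r"(?P<src>[a-z]{2})-(?P<dst>[a-z]{2})")


def parse_gadict_filename(filename):
    """Return (name, (src, dst)) or (name, None); raise ValueError if invalid."""
    m = FILE_RE.fullmatch(filename)
    if m is None:
        raise ValueError("not a gadict source file name: %r" % filename)
    name = m.group("name")
    pair = PAIR_RE.fullmatch(name)
    langs = (pair.group("src"), pair.group("dst")) if pair else None
    return name, langs


def compiled_name(filename, suffix):
    """Swap the ".gadict" suffix for a target one, keeping "gadict_" NAME."""
    name, _ = parse_gadict_filename(filename)
    return "gadict_" + name + suffix
```

For example, ``compiled_name("gadict_en-ru.gadict", ".c5")`` keeps the base name and only changes the suffix. The sketch checks the shape of the language pair only; it does not verify that each code is a registered ISO 639-1 code.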
Dictionary source file format
The gadict project used the dictd C5 source file format in the past. The C5
format has several issues:
* C5 is not a structured format, so producing other forms and converting to
  other formats is not possible.
* C5 has no markup for links or for anything else.
Before that, the project used the dictd TAB file format, which requires placing
each article on a single long line. That format is not suited to human editing
at all.
Other dictionary source file formats were considered, such as TEI, ISO, xdxf
and MDF. XML-like formats are also not suited to human editing. XML also lacks
syntax locality: the full file must be scanned to validate a local change.
Note that the StarDict, ABBYY Lingvo, Babylon and dictd formats were not
considered because they are all about presentation rather than structure. They
are target formats for compilation.
A fancy-looking analog of MDF + C5 was developed instead.
The beginning of the file describes the dictionary information.
Each article is separated by ``\n__\n\n`` and consists of two parts:
* word variations with pronunciation
* word translations, with supplementary information like part of speech,
  synonyms, antonyms, examples of usage
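A minimal Python sketch of splitting a source text into its header and articles. It assumes the separator is exactly the string given above and that everything before the first separator is the dictionary information block. The function name ``split_articles`` and the sample text are hypothetical placeholders, not taken from the gadict sources.

```python
ARTICLE_SEPARATOR = "\n__\n\n"


def split_articles(text):
    """Split source text into (header, [article, ...])."""
    header, *articles = text.split(ARTICLE_SEPARATOR)
    # Drop empty chunks, e.g. one left behind by a trailing separator.
    return header, [a for a in articles if a.strip()]


# Placeholder content; real articles carry headwords and translations.
sample = "dictionary info\n__\n\nfirst article\n__\n\nsecond article\n"
header, articles = split_articles(sample)
```

Splitting on a fixed string keeps each article local: a change inside one article cannot affect how its neighbours are delimited.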
*Word variations* are:
* *singularity* or *number*: ``s`` - singular, ``pl`` - plural.
* *verb voice* or *verb tense*: ``v1`` - infinitive, ``v2`` - past tense,
  ``v3`` - past participle.
* *gender*: ``male`` or ``female``.
* *comparison*: ``comp`` - comparative or ``super`` - superlative.
*Parts of speech* are:
* ``v`` - verb
* ``n`` - noun
* ``pron`` - pronoun
* ``adv`` - adverb
* ``adj`` - adjective
* ``prep`` - preposition
* ``conj`` - conjunction
* ``num`` - numeral
* ``int`` - interjection
* ``abbr`` - abbreviation
* ``phr`` - phrase
* ``phr.v`` - phrasal verb
* ``contr`` - contraction
* ``prefix`` - word prefix
.. note:: I try to keep word meanings within an article in the POS order above.
Each meaning may refer to topics, like:
* ``sci`` - about science
* ``body`` - part of body
* ``math`` - mathematics
* ``chem`` - chemicals
* ``bio`` - biology
* ``music``
* ``meal``, ``office``, etc
* ``size``, ``shape``, ``age``, ``color``
* ``archaic`` - old fashioned, no longer used
Word relations:
* ``syn:`` - synonyms
* ``ant:`` - antonyms
* ``hyper:`` - hypernyms
* ``hypo:`` - hyponyms
* ``rel:`` - related (see also) terms
* ``topic:`` - topics/tags
A translation is marked by a lowercase ISO 639-1 code followed by a ``:``
(colon) character, such as:
* ``en:`` - English
* ``ru:`` - Russian
* ``uk:`` - Ukrainian
* ``la:`` - Latin
An example is marked by a lowercase ISO 639-1 code followed by a ``>``
(greater-than) character.
An explanation or glossary is marked by a lowercase ISO 639-1 code followed by
an ``=`` (equals) character.
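The three language markers can be told apart with a small classifier. This is a Python sketch under an assumption: a marker is a two-letter lowercase code followed directly by ``:``, ``>`` or ``=`` and then the text. Real gadict sources may allow other layouts, and the name ``classify`` is hypothetical.

```python
import re

# Assumption: optional indentation, a 2-letter lowercase code, one marker
# character, then the text.  Relation keys such as "syn:" do not match
# because they are longer than two letters.
MARKER_RE = re.compile(r"\s*(?P<lang>[a-z]{2})(?P<kind>[:>=])\s*(?P<text>.*)")
KINDS = {":": "translation", ">": "example", "=": "explanation"}


def classify(line):
    """Return (kind, lang, text) for a marked line, or None otherwise."""
    m = MARKER_RE.fullmatch(line.rstrip())
    if m is None:
        return None
    return KINDS[m.group("kind")], m.group("lang"), m.group("text")
```

With this sketch ``classify("en: house")`` yields a translation record, while ``classify("syn: home")`` returns ``None`` because ``syn`` is a relation key, not a language code.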
Pronunciation variants are marked by:
* ``Am`` - American
* ``Br`` - British
* ``Au`` - Australian
C5 dictionary source file format
The source file format used to be the dictd C5 file format. See::
$ man 1 dictfmt
* Headwords are preceded by 5 or more underscore characters ``_`` and a blank
  line.
* An article may have several headwords; in that case they are placed on one
  line and separated by ``;<SPACE>``.
* All text until the next headword is considered the definition.
* Any leading ``@`` characters are stripped out, but the file is otherwise
  unchanged.
* UTF-8 encoding is supported at least by GoldenDict.
The gadict project used the C5 format in the past but switched to its own
format.
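For reference, the C5 rules listed above can be read back with a few lines of Python. This is an illustrative sketch only, not the ``dictfmt`` implementation: it ignores the ``@`` stripping rule, and the name ``parse_c5`` is hypothetical.

```python
import re

# A line of 5 or more underscores followed by a blank line introduces
# the headword line of the next article.
C5_HEAD_RE = re.compile(r"^_{5,}[ \t]*\n[ \t]*\n", re.MULTILINE)


def parse_c5(text):
    """Return a list of (headwords, definition) pairs."""
    entries = []
    # The chunk before the first underscore line carries no article.
    for chunk in C5_HEAD_RE.split(text)[1:]:
        head_line, _, body = chunk.partition("\n")
        # Several headwords share one line, separated by "; ".
        headwords = [h.strip() for h in head_line.split("; ") if h.strip()]
        entries.append((headwords, body.strip()))
    return entries
```

The result is a flat list of text blobs, which illustrates the first issue named earlier: C5 keeps no structure inside a definition.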
TODO convention
Entries or parts of text that are not complete are marked by keywords:
urgent incomplete
The Makefile rule ``todo`` finds these occurrences in the sources::
$ make todo
World wide dictionary formats and standards
Dictionary writing system
Multi-Dictionary Formatter (MDF). It defines about 100 data
field markers.
FieldWorks Language Explorer (or FLEx, for short) is designed
to help field linguists perform many common language
documentation and analysis tasks.
LIFT (Lexicon Interchange FormaT) is an XML format for storing
lexical information, as used in the creation of dictionaries.
It's not necessarily the format for your lexicon.
Lexique Pro is an interactive lexicon viewer and editor, with
hyperlinks between entries, category views, dictionary
reversal, search, and export tools. It's designed to display
your data in a user-friendly format so you can distribute it
to others.
DEBII — Dictionary Editor and Browser
Word lists
Frequency wordlists use several statistics:
* number of word occurrences in corpus, usually marked by ``F``
* adjusted number of occurrences per 1.000.000 in corpus, usually marked by
  ``U``
* Standard Frequency Index (SFI), a logarithmic transform of ``U``:

  .. math:: SFI = 40 + 10 \cdot \log_{10}(U)

  === ================
  SFI Freq
  === ================
  90  1 per 10
  80  1 per 100
  70  1 per 1000
  60  1 per 10.000
  50  1 per 100.000
  40  1 per 1.000.000
  30  1 per 10.000.000
  === ================

* deviation of word frequency across documents in corpus, usually marked by
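The SFI formula is easy to evaluate directly. A minimal Python sketch follows; the function names are hypothetical, and ``per_million`` is a plain normalisation of a raw count, a simplification that is not the adjusted ``U`` a published wordlist would use.

```python
import math


def sfi(u):
    """Standard Frequency Index for u occurrences per million words."""
    if u <= 0:
        raise ValueError("U must be positive")
    return 40 + 10 * math.log10(u)


def per_million(count, corpus_size):
    """Raw occurrences per million words (unadjusted, an approximation)."""
    return count * 1_000_000 / corpus_size
```

A word seen once per million words gets ``sfi(1)``, which is 40, and each tenfold increase in frequency adds 10, matching the table rows.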
Sorting numerically on the first column::
$ sort -k 1nr,2 <$IN >$OUT
OANC frequency wordlist
The Open American National Corpus (OANC) is a roughly 15 million word subset of
the ANC Second Release that is unrestricted in terms of usage and
redistribution.
I've got OANC from link:
After unpacking only ``.txt`` files::
$ unzip '*.txt'
$ cd OANC; find . -type f | xargs cat | wc
2090929 14586935 96737202
I built the frequency list with:
A freeware corpus analysis toolkit for concordancing and text analysis.
Then I manually removed one- and two-letter words, filtered out misspelled
words with the ``en_US`` ``hunspell`` spell-checker, and merged word variations
into their base form using WordNet. See details in ``obsolete/``.
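The counting step of such a pipeline can be approximated in a few lines of Python. This is a rough sketch, not the tool or procedure used for gadict: it only counts words and drops one- and two-letter ones, and it leaves out the spell-checking and WordNet lemmatisation steps. The tokenising pattern and the name ``frequency_list`` are assumptions.

```python
import re
from collections import Counter

# Assumption: a word is a run of ASCII letters, optionally with one
# internal apostrophe part (e.g. "don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def frequency_list(text, min_length=3):
    """Return (count, word) pairs, most frequent first."""
    words = WORD_RE.findall(text.lower())
    counts = Counter(w for w in words if len(w) >= min_length)
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(((n, w) for w, n in counts.items()),
                  key=lambda pair: (-pair[0], pair[1]))
```

Putting the count first in each pair mirrors a wordlist layout in which the first column is numeric.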
OANC download page.
OANC home page.
Register gadict dictionaries for dictd under Debian
$ su
$ cat >>/etc/dictd/dictd.order <<EOF
$ dictdconfig --write
$ /etc/init.d/dictd restart
$ ^D
$ dictdconfig --list
$ dict -d gadict-dictabbr v
Typing IPA chars in Emacs
To enter IPA characters, use the ``ipa`` input method. To enable it, type::
C-u C-\ ipa <enter>
Letters of the ordinary alphabet are typed as usual. To type special IPA
characters, use the following key bindings (or read the help in Emacs with
``M-x describe-input-method`` or ``C-h I``).
For vowels::
æ ae
ɑ o| or A
ɒ |o or /A
ʊ U
ɛ /3 or E
ɔ /c
ə /e
ʌ /v
ɪ I
For consonants::
θ th
ð dh
ʃ sh
ʧ tsh
ʒ zh or 3
ŋ ng
ɡ g
ɹ /r
Special chars::
ː : (colon)
ˈ ' (quote)
ˌ ` (back quote)
Alternatively use ``ipa-x-sampa`` or ``ipa-kirshenbaum`` input method (for help
type: ``C-h I ipa-x-sampa RET`` or ``C-h I ipa-kirshenbaum RET``).