Bitext Lexical Data Resources are the most comprehensive and consistent set of
language data resources in the world, with support for more than 100 languages and dialects.


Download a file with the specifications of 25 of our languages.

Example: Specifications for Finnish (FI) Language Data

Inflectional forms list: includes all the standard inflectional forms for nouns, verbs, adjectives, pre- and postpositions, conjunctions, etc. Each form is annotated with the lemma (root form), POS, and morphological attributes (voice, tense, mood, person, number, possessive-person, possessive-number, case, degree, connegative).

Derivational forms list: includes all the standard derivational forms, such as verbs derived from nouns, nouns derived from verbs, adjectives derived from nouns, nouns with possessive forms, and comparatives and superlatives of adjectives and adverbs. Each form is annotated with the lemma (root form), POS, and morphological attributes (voice, tense, mood, person, number, possessive-person, possessive-number, case, degree, connegative).

Extended forms list: extends the inflectional and derivational forms lists to cover additional morphological phenomena, such as common productive suffixes. Each form is annotated with the lemma (root form), POS, and morphological attributes (voice, tense, mood, person, number, possessive-person, possessive-number, case, degree, connegative).

Named Entities forms list: covers named entities, including person names, places, companies, and organizations. Each form is annotated with the lemma (root form), POS, and morphological attributes (voice, tense, mood, person, number, possessive-person, possessive-number, case, degree, connegative, and entity-type).
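As an illustration of how such annotated forms might be handled in code, here is a minimal Python sketch. The `FormEntry` class, its field names, the tag values ("NOUN", "PROPN", "inessive", "place") and the `index_by_form` helper are all assumptions made for this example; they do not describe the actual file format or tag set of the delivered data.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FormEntry:
    """One annotated surface form (hypothetical layout, not the delivered file format)."""
    form: str                      # surface form as it appears in text
    lemma: str                     # root form
    pos: str                       # part of speech (tag names are assumed)
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. case, number
    entity_type: Optional[str] = None  # only set for named entities


def index_by_form(entries: List[FormEntry]) -> Dict[str, List[FormEntry]]:
    """Group entries by lowercased surface form so ambiguous forms keep every reading."""
    index: Dict[str, List[FormEntry]] = defaultdict(list)
    for entry in entries:
        index[entry.form.lower()].append(entry)
    return dict(index)


# Two Finnish examples: a plural inessive noun and a place name in the inessive.
entries = [
    FormEntry("taloissa", "talo", "NOUN", {"number": "plural", "case": "inessive"}),
    FormEntry("Helsingissä", "Helsinki", "PROPN",
              {"number": "singular", "case": "inessive"}, entity_type="place"),
]

index = index_by_form(entries)
for reading in index["taloissa"]:
    print(reading.lemma, reading.pos, reading.attributes)
```

Keeping a list of readings per form, rather than a single entry, leaves room for forms that map to more than one lemma or attribute set.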

Frequency indication: gives the relative frequency in the given language of each word in the above lists. The relative frequency is expressed on a scale of 0-255, or on another scale on request.
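A 0-255 scale fits in a single byte per word. The sketch below shows one possible way to map raw corpus counts onto such a scale using a logarithmic transform. The function name `scale_frequencies`, the log scaling, and the sample counts are assumptions for illustration only; the source does not state how the frequency values are derived.

```python
import math
from typing import Dict


def scale_frequencies(counts: Dict[str, int], levels: int = 256) -> Dict[str, int]:
    """Map raw counts to integers in 0..levels-1 on a log scale (assumed method)."""
    if not counts:
        return {}
    max_log = math.log1p(max(counts.values()))
    if max_log == 0:
        # Every count is zero, so every word gets the lowest level.
        return {word: 0 for word in counts}
    top = levels - 1
    return {
        word: round(top * math.log1p(count) / max_log)
        for word, count in counts.items()
    }


# Made-up counts, used only to exercise the function.
sample_counts = {"ja": 1_000_000, "talo": 5_000, "harvinainen": 0}
scaled = scale_frequencies(sample_counts)
print(scaled)
```

The most frequent word always maps to the top of the scale and a zero count maps to 0; passing a different `levels` value produces another range.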

Offensive language flag: includes information per word indicating if the word might be considered offensive in certain contexts.

Volume of Language Data

  • Total number of lemmas: 70,000
  • Total number of forms: 80 million
    • Verbs: 25,000,000 forms (31%)
    • Nouns: 45,000,000 forms (56%)
    • Adjectives: 10,000,000 forms (12%)
    • Other: 40,000 forms (<1%)
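The shares listed above can be recomputed from the form counts. The following minimal sketch uses only the figures from the list; the variable names are arbitrary. The per-category counts sum to 80,040,000, which the list rounds to 80 million.

```python
# Form counts per category, taken from the list above.
form_counts = {
    "Verbs": 25_000_000,
    "Nouns": 45_000_000,
    "Adjectives": 10_000_000,
    "Other": 40_000,
}

total_forms = sum(form_counts.values())

# Share of each category as a percentage of all forms.
shares = {category: 100 * count / total_forms for category, count in form_counts.items()}

for category, share in shares.items():
    # Shares below one percent are shown as "<1%" instead of rounding to zero.
    label = "<1%" if 0 < share < 1 else f"{round(share)}%"
    print(f"{category}: {form_counts[category]:,} forms ({label})")
```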