ticclat.ticclat_schema module
SQLAlchemy schema of the TICCLAT database.
Contains all the tables of the database and their connections, defined as SQLAlchemy declarative_base subclasses.
Many of the tables defined here are based on an INT lexicon database created in the IMPACT project (https://ivdnt.org/images/stories/onderzoek_en_onderwijs/publicaties/impact/impact_lexicon_structure.pdf). See https://github.com/TICCLAT/docs/blob/master/database_design.md for more information about the database design.
Building on this, TICCLAT adds tables for:
- links between wordforms
- morphological paradigm groups of wordforms
- anagram hashes from TICCL
- spelling variants from TICCL
- identifiers linking wordforms to external sources such as the WNT, MNW, and INT.
class ticclat.ticclat_schema.Anahash(**kwargs)
Bases: sqlalchemy.ext.declarative.api.Base
Table for storing anahashes.
The anahashes in this table have no direct relation to the wordforms; those links are tracked in the wordforms table. This was done so that the anahashes table can be searched efficiently, e.g. for ranges in anahash “space”.
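The separation described above can be sketched with the standard-library sqlite3 module instead of the SQLAlchemy models. This is a simplified illustration only: the column types, constraints, and anahash values are assumptions made for the example, and the two helper functions are hypothetical and not part of ticclat.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, assumed layout: anahash values live in their own table.
    CREATE TABLE anahashes (
        anahash_id INTEGER PRIMARY KEY,
        anahash    INTEGER NOT NULL UNIQUE
    );
    -- The link to an anahash is stored on the wordform side.
    CREATE TABLE wordforms (
        wordform_id INTEGER PRIMARY KEY,
        wordform    TEXT NOT NULL,
        anahash_id  INTEGER REFERENCES anahashes (anahash_id)
    );
    -- Made-up values, not real TICCL anahashes.
    INSERT INTO anahashes (anahash_id, anahash) VALUES
        (1, 100), (2, 250), (3, 900);
    INSERT INTO wordforms (wordform_id, wordform, anahash_id) VALUES
        (1, 'abc', 1), (2, 'cab', 1), (3, 'xyz', 3);
""")


def anahashes_in_range(conn, low, high):
    """Return all anahash values in [low, high] without touching wordforms."""
    rows = conn.execute(
        "SELECT anahash FROM anahashes "
        "WHERE anahash BETWEEN ? AND ? ORDER BY anahash",
        (low, high),
    )
    return [row[0] for row in rows]


def wordforms_for_anahash(conn, value):
    """Follow the link stored in the wordforms table for one anahash value."""
    rows = conn.execute(
        "SELECT w.wordform FROM wordforms AS w "
        "JOIN anahashes AS a ON a.anahash_id = w.anahash_id "
        "WHERE a.anahash = ? ORDER BY w.wordform",
        (value,),
    )
    return [row[0] for row in rows]
```

The range query only reads the small anahashes table; the wordforms are joined in afterwards, and only for the hashes of interest.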
- anahash
- anahash_id
class ticclat.ticclat_schema.Corpus(**kwargs)
Bases: sqlalchemy.ext.declarative.api.Base
Table for storing corpus metadata.
- corpus_documents
- corpus_id
- name
class ticclat.ticclat_schema.Document(**kwargs)
Bases: sqlalchemy.ext.declarative.api.Base
Table for storing document metadata.
- document_corpora
- document_id
- document_wordforms
- editor
- encoding
- language
- other_languages
- parent_document
- persistent_id
- pub_year
- publisher
- publishing_location
- region
- spelling
- text_type
- title
- word_count
- year_from
- year_to
class ticclat.ticclat_schema.ExternalLink(**kwargs)
Bases: sqlalchemy.ext.declarative.api.Base
Table for storing ids from external sources of wordforms.
Used for linking wordforms to external sources, such as the WNT, MNW, INT.
- external_link_id
- source_id
- source_name
- wordform_id
class ticclat.ticclat_schema.Lexicon(**kwargs)
Bases: sqlalchemy.ext.declarative.api.Base
Table for storing lexicon metadata.
- vocabulary (bool): if True, all words in this lexicon are (supposed to be) valid words; if False, some are misspelled.
- lexicon_id
- lexicon_name
- lexicon_wordform_links
- lexicon_wordforms
- vocabulary
class ticclat.ticclat_schema.MorphologicalParadigm(**kwargs)
Bases: sqlalchemy.ext.declarative.api.Base
Table for storing information about morphological paradigms of wordforms.
The paradigms are determined according to Reynaert’s method (to be published).
- V
- W
- X
- Y
- Z
- paradigm_id
- word_type_code
- word_type_number
- wordform_id
class ticclat.ticclat_schema.TextAttestation(document, wordform, frequency)
Bases: sqlalchemy.ext.declarative.api.Base
Table for storing text attestations.
A text attestation entry is defined in the INT schema as the occurrence and frequency of wordforms in documents.
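As a rough illustration of what an attestation records, the sketch below counts wordforms per document and yields (document, wordform, frequency) triples, mirroring the constructor arguments. The function name `attestation_rows` is hypothetical, and whitespace tokenization is an assumption made for the example; it is not how TICCLAT ingests corpora.

```python
from collections import Counter


def attestation_rows(documents):
    """Yield (document_id, wordform, frequency) triples.

    `documents` maps a document id to its text. Tokens are split on
    whitespace, which is a simplifying assumption for this sketch.
    """
    for document_id, text in documents.items():
        counts = Counter(text.split())
        for wordform, frequency in sorted(counts.items()):
            yield (document_id, wordform, frequency)


docs = {1: "de kat en de hond", 2: "de hond"}
rows = list(attestation_rows(docs))
```

Each triple corresponds to one row: a wordform occurs in a document with a given frequency, and the same wordform gets a separate row for every document it occurs in.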
- attestation_id
- document_id
- frequency
- ta_document
- ta_wordform
- wordform_id
class ticclat.ticclat_schema.TicclatVariant(**kwargs)
Bases: sqlalchemy.ext.declarative.api.Base
Table for storing spelling variants of words, ingested from TICCL.
- frequency
- levenshtein_distance
- ticclat_variant_id
- wordform
- wordform_source
- wordform_source_id
class ticclat.ticclat_schema.Wordform(**kwargs)
Bases: sqlalchemy.ext.declarative.api.Base
Table for storing wordforms and associated anahashes.
- anahash
- anahash_id
- link(wordform)
  Add WordformLinks between self and another wordform, and vice versa.
  The WordformLinks are added only if the link does not yet exist.
  Inputs:
  - wordform (Wordform): Wordform that is related to Wordform self.
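The "both directions, only if missing" behaviour can be sketched without a database by treating links as a set of (from, to) id pairs. This is an illustration under that assumption; the standalone `link` function below is hypothetical and is not the method's actual implementation.

```python
def link(links, wf_from, wf_to):
    """Add (wf_from, wf_to) and (wf_to, wf_from) to `links` if missing.

    `links` is a set of (from_id, to_id) pairs standing in for the
    WordformLink table. Returns the number of links actually added.
    """
    added = 0
    for pair in ((wf_from, wf_to), (wf_to, wf_from)):
        if pair not in links:
            links.add(pair)
            added += 1
    return added


links = set()
first = link(links, 1, 2)   # both directions are new
second = link(links, 2, 1)  # both directions already exist
```

Calling the function a second time, in either direction, leaves the set unchanged, which is the property the description above asks for.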
- link_spelling_correction(corr, lexicon)
  Add a spelling correction WordformLink.
  This method sets the booleans that indicate which Wordforms are correct (according to the lexicon).
  Inputs:
  - corr (Wordform): A correction candidate of Wordform self.
  - lexicon (Lexicon): The Lexicon that contains the WordformLink.
- link_with_metadata(wf_to, wf_from_correct, wf_to_correct, lexicon)
  Add WordformLinks with metadata.
  Adds a WordformLink between self and another wordform, and vice versa, if these links are not yet in the database. Also adds a WordformLinkSource that records the Lexicon and which Wordforms are correct according to that Lexicon. No duplicate WordformLinkSources are added.
  TODO: add UniqueConstraint on (wf_from (self), wf_to, lexicon)?
  Inputs:
  - wf_to (Wordform): Wordform that self will be linked to (and vice versa).
  - wf_from_correct (boolean): True if Wordform self is correct according to the lexicon, False otherwise.
  - wf_to_correct (boolean): True if Wordform wf_to is correct according to the lexicon, False otherwise.
  - lexicon (Lexicon): The Lexicon that contains the WordformLink.
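The deduplication of link sources can be sketched in plain Python. The sketch assumes that a source is identified by the tuple (wf_from, wf_to, lexicon), mirroring the tuple named in the TODO; it records a source for the self-to-wf_to direction only, since the description does not say whether the reverse direction gets its own source. The standalone function and its `links` and `sources` containers are hypothetical stand-ins for the database tables.

```python
def link_with_metadata(links, sources, wf_from, wf_to,
                       wf_from_correct, wf_to_correct, lexicon):
    """Link two wordform ids both ways and record the link's source.

    `links` is a set of (from_id, to_id) pairs; `sources` is a dict keyed
    by (from_id, to_id, lexicon). Returns True if a new source record was
    added and False if that source already existed.
    """
    # Sets ignore duplicates, so existing links are left untouched.
    links.add((wf_from, wf_to))
    links.add((wf_to, wf_from))

    key = (wf_from, wf_to, lexicon)
    if key in sources:
        return False
    sources[key] = {
        "wordform_from_correct": wf_from_correct,
        "wordform_to_correct": wf_to_correct,
    }
    return True


links, sources = set(), {}
added = link_with_metadata(links, sources, 1, 2, False, True, "lexicon-a")
again = link_with_metadata(links, sources, 1, 2, False, True, "lexicon-a")
other = link_with_metadata(links, sources, 1, 2, False, True, "lexicon-b")
```

The same pair of wordforms can carry several sources, one per lexicon, while repeating an ingest from the same lexicon adds nothing. In a relational schema, a unique constraint on the same tuple would enforce this at the database level.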
- wf_lexica
- wordform
- wordform_documents
- wordform_id
- wordform_lowercase
class ticclat.ticclat_schema.WordformFrequencies(**kwargs)
Bases: sqlalchemy.ext.declarative.api.Base
Materialized view containing overall frequencies of wordforms.
The data in this table can be used to filter wordforms on frequency. This is necessary, because there is a lot of noise in the wordforms table, and this makes aggregating over all wordforms expensive.
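The idea of precomputing overall frequencies once and filtering on them afterwards can be sketched with the standard-library sqlite3 module. SQLite has no materialized views, so the sketch emulates one with `CREATE TABLE ... AS SELECT`. The table names `text_attestations` and `wordform_frequency`, the sample data, and the assumption that overall frequency is the sum of per-document attestation frequencies are all illustrative; this page does not state how the real view is populated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wordforms (
        wordform_id INTEGER PRIMARY KEY,
        wordform    TEXT NOT NULL
    );
    CREATE TABLE text_attestations (
        attestation_id INTEGER PRIMARY KEY,
        document_id    INTEGER NOT NULL,
        wordform_id    INTEGER NOT NULL REFERENCES wordforms (wordform_id),
        frequency      INTEGER NOT NULL
    );
    INSERT INTO wordforms (wordform_id, wordform) VALUES
        (1, 'de'), (2, 'hond'), (3, 'hnod');
    INSERT INTO text_attestations (document_id, wordform_id, frequency) VALUES
        (1, 1, 40), (2, 1, 60), (1, 2, 7), (2, 3, 1);

    -- Emulated materialized view: aggregate once, store the result.
    CREATE TABLE wordform_frequency AS
        SELECT w.wordform_id, w.wordform, SUM(t.frequency) AS frequency
        FROM wordforms AS w
        JOIN text_attestations AS t ON t.wordform_id = w.wordform_id
        GROUP BY w.wordform_id, w.wordform;
""")


def frequent_wordforms(conn, min_frequency):
    """Return wordforms whose precomputed frequency is at least min_frequency."""
    rows = conn.execute(
        "SELECT wordform FROM wordform_frequency "
        "WHERE frequency >= ? ORDER BY frequency DESC, wordform",
        (min_frequency,),
    )
    return [row[0] for row in rows]
```

Queries against the stored aggregate avoid repeating the join and `GROUP BY` over every attestation, so rare and noisy wordforms can be filtered out cheaply.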
- frequency
- wordform
- wordform_id
class ticclat.ticclat_schema.WordformLink(wf1, wf2)
Bases: sqlalchemy.ext.declarative.api.Base
Table for storing links between wordforms.
- linked_from
- linked_to
- wordform_from
- wordform_to
class ticclat.ticclat_schema.WordformLinkSource(wflink, wf_from_correct, wf_to_correct, lexicon)
Bases: sqlalchemy.ext.declarative.api.Base
Table for storing the sources of links between wordforms.
Wordform links are given by lexica (dictionaries, spelling correction lists, etc.). This table records which lexicon a given link between wordforms was originally ingested from.
- anahash_difference
- ld
- lexicon_id
- source_x_wordform_link_id
- wfls_lexicon
- wfls_wflink
- wordform_from
- wordform_from_correct
- wordform_to
- wordform_to_correct