ticclat.utils module
Non-database related utility functions for TICCLAT.
-
ticclat.utils.anahash_df(wfreq, alphabet_file)
Get anahash values for word frequency data.
The result can be used to add anahash values to the database (ticclat.dbutils.bulk_add_anahashes) and connect wordforms to anahash values (ticclat.dbutils.connect_anahases_to_wordforms).
- Inputs:
- wfreq (pandas DataFrame): DataFrame containing word frequency data (the result of ticclat.dbutils.get_word_frequency_df)
- alphabet_file (str): path to the TICCL alphabet file to use
Returns: pandas DataFrame with the word forms as index and the anahash values as a column.
-
ticclat.utils.chunk_df(df, batch_size=1000)
Generator that yields roughly equal-sized chunks of a pandas DataFrame.
- Inputs:
- df (DataFrame): the DataFrame to be chunked
- batch_size (int, default 1000): the approximate number of records in each chunk
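The chunking behaviour described above can be sketched as follows. This is a minimal illustration, not the TICCLAT implementation: the name `chunk_df_sketch` is hypothetical, and the rule used to pick the number of chunks is an assumption.

```python
import pandas as pd


def chunk_df_sketch(df, batch_size=1000):
    """Yield roughly equal-sized, consecutive slices of df (hypothetical helper)."""
    # Assumption: aim for len(df) // batch_size chunks, but always at least one.
    n_chunks = max(1, len(df) // batch_size)
    # Spread the rows so that chunk sizes differ by at most one.
    base, extra = divmod(len(df), n_chunks)
    start = 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        yield df.iloc[start:start + size]
        start += size


# Ten rows with batch_size=3 are split into three chunks.
demo_df = pd.DataFrame({"wordform": [f"w{i}" for i in range(10)]})
chunk_sizes = [len(chunk) for chunk in chunk_df_sketch(demo_df, batch_size=3)]
```

With this rule a chunk can be slightly larger than `batch_size`, which matches the "approximate" wording of the parameter description.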
-
ticclat.utils.chunk_json_lines(file_handle, batch_size=1000)
Read a JSON file and yield lines in batches.
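A minimal sketch of batched reading, assuming the file holds one JSON object per line and each batch is a list of parsed objects. The name `chunk_json_lines_sketch` is hypothetical, and the batch type is an assumption, not taken from the TICCLAT source.

```python
import io
import json
from itertools import islice


def chunk_json_lines_sketch(file_handle, batch_size=1000):
    """Yield lists of at most batch_size objects parsed from a JSON lines file."""
    # Parse lazily: one JSON document per non-empty line.
    objects = (json.loads(line) for line in file_handle if line.strip())
    while True:
        batch = list(islice(objects, batch_size))
        if not batch:
            return
        yield batch


# An in-memory file stands in for a real file handle.
demo_file = io.StringIO('{"wordform": "a"}\n{"wordform": "b"}\n{"wordform": "c"}\n')
batches = list(chunk_json_lines_sketch(demo_file, batch_size=2))
```

Only one batch is held in memory at a time, so the approach also works for files that are much larger than the available memory.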
-
ticclat.utils.get_temp_file()
Create a temporary file and its file handle.
Returns: File handle of the temporary file.
-
ticclat.utils.iterate_wf(lst)
Generator that yields {‘wordform’: value} for all values in lst.
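The behaviour described above fits in a few lines. The sketch below uses the hypothetical name `iterate_wf_sketch` and is an illustration rather than the library code.

```python
def iterate_wf_sketch(lst):
    """Wrap every value in lst in a {'wordform': value} dictionary."""
    for wordform in lst:
        yield {"wordform": wordform}


records = list(iterate_wf_sketch(["huis", "huys"]))
```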
-
ticclat.utils.morph_iterator(morph_paradigms_per_wordform, mapping)
Generator that yields dicts of morphological paradigm code components plus wordform_id in the database.
- Inputs:
- morph_paradigms_per_wordform: dictionary with wordforms (keys) and lists (values) of dictionaries of code components (return values of split_component_code).
- mapping: iterable of named tuples / dictionaries that contain the result of a query on the wordforms table, i.e. fields ‘wordform’ and ‘wordform_id’.
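One way to read the inputs above is sketched below. It makes several assumptions: the rows in `mapping` are dictionaries, wordforms without paradigm codes are skipped, and the component keys `part_a` and `part_b` are invented placeholders. The real keys are whatever split_component_code returns, and the name `morph_iterator_sketch` is hypothetical.

```python
def morph_iterator_sketch(morph_paradigms_per_wordform, mapping):
    """Yield one dict per code-component dict, extended with its wordform_id."""
    for row in mapping:
        # Assumption: rows are dict-like; named tuples would need row.wordform.
        components = morph_paradigms_per_wordform.get(row["wordform"], [])
        for component in components:
            yield {**component, "wordform_id": row["wordform_id"]}


# Hypothetical component keys; the real keys come from split_component_code.
paradigms = {"loop": [{"part_a": "1", "part_b": "2"}]}
mapping = [
    {"wordform": "loop", "wordform_id": 7},
    {"wordform": "huis", "wordform_id": 8},
]
rows = list(morph_iterator_sketch(paradigms, mapping))
```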
-
ticclat.utils.preprocess_wordforms(wfs, columns=None)
Clean wordforms in dataframe wfs.
Strips whitespace, replaces underscores with asterisks (misc character) and spaces with underscores.
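The cleaning steps can be illustrated on a single string. The sketch assumes the steps run in the order listed, so that underscores created from spaces are not turned into asterisks afterwards. The library function works on DataFrame columns; `clean_wordform_sketch` is a hypothetical per-string helper.

```python
def clean_wordform_sketch(wordform):
    """Apply the three cleaning steps to a single string (hypothetical helper)."""
    cleaned = wordform.strip()           # 1. strip surrounding whitespace
    cleaned = cleaned.replace("_", "*")  # 2. original underscores become asterisks
    cleaned = cleaned.replace(" ", "_")  # 3. spaces become underscores
    return cleaned


cleaned_example = clean_wordform_sketch("  ten_minste drie ")
```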
-
ticclat.utils.read_json_lines(file_handle)
Generator that reads a dictionary per line from a file.
This can be used when doing mass inserts (i.e., inserts not using the ORM) into the database. The data that will be inserted is written to file (using write_json_lines), so it can be read and inserted into the database without using a lot of memory.
- Inputs:
- file_handle: File handle of the file containing the data, one dictionary (JSON) object per line
Returns: iterator over the lines in the input file
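A minimal sketch of such a reader, assuming one JSON object per line and that blank lines are skipped. The name `read_json_lines_sketch` and the `frequency` field in the demo data are hypothetical.

```python
import io
import json


def read_json_lines_sketch(file_handle):
    """Lazily yield one dictionary per line of a JSON lines file."""
    for line in file_handle:
        if line.strip():  # Assumption: blank lines are skipped.
            yield json.loads(line)


# An in-memory file stands in for a real file handle.
demo_file = io.StringIO(
    '{"wordform": "a", "frequency": 3}\n{"wordform": "b", "frequency": 1}\n'
)
first_record = next(read_json_lines_sketch(demo_file))
```

Because the generator parses one line at a time, memory use stays flat regardless of file size, which is the point made in the description above.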
-
ticclat.utils.read_ticcl_variants_file(fname)
Return dataframe containing data in TICCL variants file.
-
ticclat.utils.split_component_code(code, wordform)
Split morphological paradigm code into its components.
Morphological paradigm codes in Reynaert’s encoding scheme consist of 8 subcomponents. These are returned as separate entries of a dictionary from this function.
-
ticclat.utils.timeit(method)
Decorator for timing methods.
Can be used for benchmarking queries.
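A timing decorator of this kind might look like the sketch below. The name `timeit_sketch`, the output format, and the use of `print` are assumptions; the library's decorator may report timings differently.

```python
import functools
import time


def timeit_sketch(method):
    """Print the wall-clock duration of every call to the decorated callable."""
    @functools.wraps(method)
    def timed(*args, **kwargs):
        start = time.perf_counter()
        result = method(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Assumption: report via print; the real decorator may log differently.
        print(f"{method.__name__}: {elapsed_ms:.2f} ms")
        return result
    return timed


@timeit_sketch
def count_wordforms(wordforms):
    # Stand-in for a database query that is being benchmarked.
    return len(wordforms)


n_wordforms = count_wordforms(["huis", "huys", "huijs"])
```

`functools.wraps` keeps the name and docstring of the decorated function, so the timing output and any later introspection still refer to `count_wordforms`.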
-
ticclat.utils.write_json_lines(file_handle, generator)
Write a sequence of dictionaries to file, one dictionary per line.
This can be used when doing mass inserts (i.e., inserts not using the ORM) into the database. The data that will be inserted is written to file, so it can be read (using read_json_lines) without using a lot of memory.
- Inputs:
- file_handle: File handle of the file to save the data to
- generator (generator): Generator that produces objects to write to file
Returns: the number of records written. Return type: int
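The writing side can be sketched in the same spirit. The name `write_json_lines_sketch` is hypothetical, and the exact JSON serialisation settings used by the library are not known from this page.

```python
import io
import json


def write_json_lines_sketch(file_handle, generator):
    """Write each object as one JSON line and return how many were written."""
    n_written = 0
    for obj in generator:
        file_handle.write(json.dumps(obj))
        file_handle.write("\n")
        n_written += 1
    return n_written


# An in-memory buffer stands in for a real file handle.
buffer = io.StringIO()
n_records = write_json_lines_sketch(
    buffer, ({"wordform": w} for w in ["huis", "huys"])
)
lines = buffer.getvalue().splitlines()
```

Since the input is consumed one object at a time, the full data set never needs to exist in memory, only the generator that produces it.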