ticclat.utils module

Non-database related utility functions for TICCLAT.

ticclat.utils.anahash_df(wfreq, alphabet_file)

Get anahash values for word frequency data.

The result can be used to add anahash values to the database (ticclat.dbutils.bulk_add_anahashes) and to connect wordforms to anahash values (ticclat.dbutils.connect_anahashes_to_wordforms).

Inputs:
wfreq (pandas DataFrame): DataFrame containing word frequency data (the result of ticclat.dbutils.get_word_frequency_df)

alphabet_file (str): path to the TICCL alphabet file to use

Returns: pandas DataFrame with the word forms as index and the anahash values as a column.

ticclat.utils.chunk_df(df, batch_size=1000)

Generator that yields approximately equal-sized chunks from a pandas DataFrame.

Inputs:
df (DataFrame): the DataFrame to be chunked

batch_size (int, default 1000): the approximate number of records in each chunk

ticclat.utils.chunk_json_lines(file_handle, batch_size=1000)

Read a JSON file and yield lines in batches.
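
A minimal sketch of batched reading, not the library's implementation. It assumes the file holds one JSON object per line and that each batch is a list of parsed objects; the name chunk_json_lines_sketch and the sample data are made up for illustration.

```python
import io
import json


def chunk_json_lines_sketch(file_handle, batch_size=1000):
    """Yield lists of at most batch_size parsed JSON objects (assumed behaviour)."""
    batch = []
    for line in file_handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


# In-memory stand-in for a JSON lines file.
data = io.StringIO('{"wordform": "a"}\n{"wordform": "b"}\n{"wordform": "c"}\n')
batches = list(chunk_json_lines_sketch(data, batch_size=2))
```

Batching like this keeps memory use bounded by batch_size rather than by the size of the file.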

ticclat.utils.count_lines(file_handle)

Count the number of lines in a file. Based on https://stackoverflow.com/q/845058/1199693
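
A minimal sketch of counting lines on an open file handle, in the spirit of the linked Stack Overflow question. Rewinding the handle afterwards is an assumption made here for convenience, and count_lines_sketch is a hypothetical name, not the library function.

```python
import io


def count_lines_sketch(file_handle):
    """Count lines by iterating over the handle once, then rewind it."""
    n = sum(1 for _ in file_handle)
    file_handle.seek(0)  # assumption: callers want to read the file afterwards
    return n


handle = io.StringIO("one\ntwo\nthree\n")
n_lines = count_lines_sketch(handle)
```

Iterating over the handle reads one line at a time, so the whole file never has to fit in memory.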

ticclat.utils.get_temp_file()

Create a temporary file and its file handle.

Returns: file handle of the temporary file.

ticclat.utils.iterate_wf(lst)

Generator that yields {'wordform': value} for all values in lst.
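
The described behaviour can be sketched in a few lines. This is an illustrative re-implementation under the name iterate_wf_sketch, with made-up sample wordforms, not the library source.

```python
def iterate_wf_sketch(lst):
    """Wrap every value in a one-key dictionary, lazily."""
    for wf in lst:
        yield {"wordform": wf}


rows = list(iterate_wf_sketch(["huis", "huys"]))
```

Dictionaries of this shape are a convenient input for bulk inserts that expect one mapping per row.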

ticclat.utils.json_line(obj)

Convert an object obj to a string containing a line of JSON.

ticclat.utils.morph_iterator(morph_paradigms_per_wordform, mapping)

Generator that yields dicts of morphological paradigm code components plus wordform_id in the database.

Inputs:
morph_paradigms_per_wordform: dictionary with wordforms (keys) and lists (values) of dictionaries of code components (return values of split_component_code).

mapping: iterable of named tuples / dictionaries that contain the result of a query on the wordforms table, i.e. fields 'wordform' and 'wordform_id'.
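
A rough sketch of how the two inputs might be combined, assuming the mapping rows are plain dictionaries. The component keys "c1" and "c2" are hypothetical placeholders (the real keys come from split_component_code), and morph_iterator_sketch is not the library function.

```python
def morph_iterator_sketch(morph_paradigms_per_wordform, mapping):
    """Yield one dict per paradigm code, extended with the wordform_id."""
    for row in mapping:
        # Assumption: rows are dictionaries with 'wordform' and 'wordform_id'.
        wordform = row["wordform"]
        for components in morph_paradigms_per_wordform.get(wordform, []):
            record = dict(components)  # copy so the input is not mutated
            record["wordform_id"] = row["wordform_id"]
            yield record


# Hypothetical component keys and values, for illustration only.
paradigms = {
    "lopen": [{"c1": "x", "c2": "y"}],
    "liep": [{"c1": "x", "c2": "z"}],
}
mapping = [
    {"wordform": "lopen", "wordform_id": 10},
    {"wordform": "liep", "wordform_id": 11},
]
records = list(morph_iterator_sketch(paradigms, mapping))
```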

ticclat.utils.preprocess_wordforms(wfs, columns=None)

Clean wordforms in dataframe wfs.

Strips whitespace, replaces underscores with asterisks (misc character) and spaces with underscores.
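
The cleaning steps can be illustrated on plain strings. This sketch assumes the replacements run in the order the sentence lists them (underscores first, then spaces), which matters because the second step introduces new underscores; clean_wordform is a hypothetical helper, and the real function operates on DataFrame columns.

```python
def clean_wordform(wf):
    """Apply the three cleaning steps to a single string."""
    wf = wf.strip()            # remove leading and trailing whitespace
    wf = wf.replace("_", "*")  # existing underscores become asterisks first
    wf = wf.replace(" ", "_")  # then spaces become underscores
    return wf


cleaned = [clean_wordform(w) for w in ["  in der daad ", "snake_case"]]
```

Reversing the two replacements would turn the original spaces into asterisks as well, so the order is part of the behaviour.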

ticclat.utils.read_json_lines(file_handle)

Generator that reads one dictionary per line from a file.

This can be used when doing mass inserts (i.e., inserts not using the ORM) into the database. The data that will be inserted is written to file (using write_json_lines), so it can be read and inserted into the database without using a lot of memory.

Inputs:
file_handle: file handle of the file containing the data, one dictionary (JSON object) per line

Returns: iterator over the lines in the input file

ticclat.utils.read_ticcl_variants_file(fname)

Return a DataFrame containing the data in a TICCL variants file.

ticclat.utils.set_logger(level='INFO')

Configure logging format and level.

ticclat.utils.split_component_code(code, wordform)

Split morphological paradigm code into its components.

Morphological paradigm codes in Reynaert’s encoding scheme consist of 8 subcomponents. This function returns them as separate entries of a dictionary.

ticclat.utils.timeit(method)

Decorator for timing methods.

Can be used for benchmarking queries.

Source: https://medium.com/pythonhive/fa04cb6bb36d
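
A minimal sketch of a timing decorator of this kind. The printed message format and the use of time.perf_counter are assumptions, and timeit_sketch and slow_sum are illustrative names rather than part of the library.

```python
import functools
import time


def timeit_sketch(method):
    """Print how long each call to the wrapped callable takes."""
    @functools.wraps(method)  # keep the wrapped function's name and docstring
    def timed(*args, **kwargs):
        start = time.perf_counter()
        result = method(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{method.__name__} took {elapsed:.4f} s")
        return result
    return timed


@timeit_sketch
def slow_sum(n):
    return sum(range(n))


total = slow_sum(1000)
```

The wrapper returns the original result unchanged, so the decorator can be added to or removed from a query function without touching its callers.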

ticclat.utils.write_json_lines(file_handle, generator)

Write a sequence of dictionaries to file, one dictionary per line.

This can be used when doing mass inserts (i.e., inserts not using the ORM) into the database. The data that will be inserted is written to file, so it can be read (using read_json_lines) without using a lot of memory.

Inputs:
file_handle: file handle of the file to save the data to

generator (generator): generator that produces objects to write to file

Returns: the number of records written.

Return type: int
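
The write-then-read workflow described for mass inserts can be sketched as a round trip through a temporary file. The functions below are illustrative re-implementations with a _sketch suffix, not the library source; they assume json.dumps for serialization and a text-mode temporary file.

```python
import json
import tempfile


def json_line_sketch(obj):
    """Serialize obj as a single line of JSON."""
    return json.dumps(obj) + "\n"


def write_json_lines_sketch(file_handle, generator):
    """Write one JSON object per line and return the number written."""
    n = 0
    for obj in generator:
        file_handle.write(json_line_sketch(obj))
        n += 1
    return n


def read_json_lines_sketch(file_handle):
    """Lazily yield one parsed object per line."""
    for line in file_handle:
        yield json.loads(line)


with tempfile.TemporaryFile(mode="w+") as handle:
    wordforms = ({"wordform": w} for w in ["huis", "huys"])
    written = write_json_lines_sketch(handle, wordforms)
    handle.seek(0)  # rewind before reading back
    restored = list(read_json_lines_sketch(handle))
```

Because both sides are generators, only one record is held in memory at a time; the list() call here exists only to inspect the result.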