HAWKS Package

This page is automatically generated using sphinx-apidoc and is meant as a quick, searchable reference only (as opposed to a guide). The list is purposefully not complete. If a particular item (e.g. a function) is missing, see the source code for further information.

In general, only hawks.generator and hawks.plotting should be interacted with directly.

hawks.analysis

Defines the clustering algorithms and handles running them. Primarily used for analysis and instance space generation.

hawks.analysis.analyse_datasets(generator=None, datasets=None, label_sets=None, cluster_subset=None, feature_subset=None, seed=None, source='HAWKS', prev_df=None, clustering=True, feature_space=True, save=True, save_folder=None, filename='dataset_analysis')

Function to analyze the datasets, either by their problem_features, clustering algorithm performance, or both.

Parameters:
  • generator (BaseGenerator, optional) – HAWKS generator instance (that contains datasets). Defaults to None.
  • datasets (list, optional) – A list of the datasets to be examined. Defaults to None.
  • label_sets (list, optional) – A list of labels that match the list of datasets. Defaults to None.
  • cluster_subset (list, optional) – A list of clustering algorithms to use. Defaults to None, where all default clustering algorithms (specified in define_cluster_algs()) are used.
  • feature_subset (list, optional) – A list of problem features to use. Defaults to None, where all problem features (specified in problem_features) are used.
  • seed (int, optional) – Random seed number. Defaults to None, where it is randomly selected.
  • source (str, optional) – Name of the set of datasets. Useful for organizing/analyzing/plotting results. Defaults to “HAWKS”.
  • prev_df (DataFrame, optional) – A previous DataFrame to which the results are added. Defaults to None, creating a blank DataFrame.
  • clustering (bool, optional) – Whether to run clustering algorithms on the datasets or not. Defaults to True.
  • feature_space (bool, optional) – Whether to run the problem features on the datasets or not. Defaults to True.
  • save (bool, optional) – Whether to save the results or not. Defaults to True.
  • save_folder (str, pathlib.Path, optional) – Where to save the results. Defaults to None, where the location of the BaseGenerator is used. If no BaseGenerator instance was given, create a folder in the working directory.
  • filename (str, optional) – Name of the CSV file to be saved. Defaults to “dataset_analysis”.
Returns:

2-element tuple containing:

DataFrame: DataFrame with results for each dataset.

pathlib.Path: The path to the folder where the results are saved.

Return type:

(tuple)

hawks.analysis.define_cluster_algs(seed)

Defines some default clustering algorithms. Currently uses four simple algorithms: average-linkage, GMM, K-Means++, and single-linkage.

Parameters:seed (int) – Random seed given to the algorithms. int is generally fine, but depends on the algorithm implementation.
Returns:A dict where each key is the name of the algorithm, with "class" as a callable to create (and fit) the model, any "kwargs" it needs, and "k_multiplier" if anything other than the true number of clusters is desired.
Return type:dict

Todo

Extend functionality for arbitrary clustering algorithms

hawks.analysis.determine_num_clusters(col_name, alg_kwargs, multiplier, labels)

Function to extract the number of clusters for the dataset (requires labels, this isn’t an estimation process).

Parameters:
  • col_name (str) – Name of the algorithm.
  • alg_kwargs (dict) – Arguments for the clustering algorithm.
  • multiplier (float) – Multiplier for the number of clusters.
  • labels (list) – The labels for this dataset. Can be a list or numpy.ndarray.
Raises:

KeyError – Incorrect algorithm name given.

Returns:

The algorithm’s arguments with the cluster number added.

Return type:

dict
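As a rough illustration of the step described above, the sketch below counts the distinct labels, scales that count by the multiplier, and stores it in a copy of the algorithm's arguments. The function name add_num_clusters, the K_ARG_NAMES mapping, and the keyword names in it are assumptions made for this example; they are not taken from the HAWKS source.

```python
# Hypothetical mapping from algorithm name to the keyword that holds the cluster count
K_ARG_NAMES = {"K-Means++": "n_clusters", "GMM": "n_components"}


def add_num_clusters(col_name, alg_kwargs, multiplier, labels):
    """Return a copy of alg_kwargs with the (scaled) true cluster count added."""
    if col_name not in K_ARG_NAMES:
        # Mirrors the documented KeyError for an unknown algorithm name
        raise KeyError(f"{col_name} is not a known algorithm")
    true_k = len(set(labels))  # works for a list or a 1-D array of labels
    new_kwargs = dict(alg_kwargs)  # avoid mutating the caller's dict
    new_kwargs[K_ARG_NAMES[col_name]] = max(1, int(round(true_k * multiplier)))
    return new_kwargs


example_kwargs = add_num_clusters("GMM", {"random_state": 0}, 2.0, [0, 0, 1, 2])
```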

hawks.analysis.run_clustering(datasets, label_sets, config_nums, alg_dict, df, source)

Function to actually run the clustering algorithms and add results to the DataFrame.

Parameters:
  • datasets (list) – A list of the datasets to be examined.
  • label_sets (list) – A list of labels that match the list of datasets.
  • config_nums (list) – A list of the config numbers (only relevant for HAWKS, not external datasets). Allows linking of datasets to parameter configuration.
  • alg_dict (dict) – Dictionary of the clustering algorithms. Defined in define_cluster_algs().
  • df (DataFrame) – DataFrame to add the results to.
  • source (str) – Name of the set of datasets.
Returns:

DataFrame with the clustering results.

Return type:

DataFrame

hawks.analysis.run_feature_space(datasets, label_sets, config_nums, feature_dict, df, source)

Function to actually run the problem features on the datasets and add results to the DataFrame.

Parameters:
  • datasets (list) – A list of the datasets to be examined.
  • label_sets (list) – A list of labels that match the list of datasets.
  • config_nums (list) – A list of the config numbers (only relevant for HAWKS, not external datasets). Allows linking of datasets to parameter configuration.
  • feature_dict (dict) – Dictionary of the problem features to be used.
  • df (DataFrame) – DataFrame to add the results to.
  • source (str) – Name of the set of datasets.
Returns:

DataFrame with the problem feature results.

Return type:

DataFrame

hawks.cluster

Defines the Cluster class, which represents a single cluster. Contains properties of the cluster (size, mean, covariance, data point values etc.).

Responsible for the methods defining the mutation of a cluster.

class hawks.cluster.Cluster(size)

Bases: object

Class for properties about a single cluster. Methods for manipulating a single cluster (such as mutation), and the initialization, go here.

Variables:
  • id_value (itertools.count()) – Unique ID value for each cluster.
  • global_rng (RandomState) – The global RandomState instance used as a common RNG.
  • num_dims (int) – Number of dimensions.
  • num_clusters (int) – Number of clusters.
  • cluster_sizes (list) – List of cluster sizes.
  • initial_mean_upper (float) – Upper range to sample the means from.
  • initial_cov_upper (float) – Upper range to sample the variances from.
  • size (int) – The size of the cluster (number of data points).
gen_initial_cov(method='eigen')

Generates the initial (axis-aligned) covariance matrix.

Parameters:method (str, optional) – Method to generate the covariance. Defaults to “eigen”.
gen_initial_mean()

Generate the initial mean vector for the Cluster().

initial_cluster_setup()

Sets up a Cluster instance

mean = None

Mean of the cluster (Gaussian)

Type:array
mutate_cov_haar(power)

Mutation operator for the covariance (Haar operator)

Parameters:power (float) – The power to reduce (when <1) the rotation matrix to avoid too large a change in the covariance. Behaviour when >1 is undocumented, and probably bad.
mutate_mean_random(scale, dims, **kwargs)

Random mutation operator for the mean

Parameters:
  • scale (float) – Width of the Gaussian sampled from to shift the mean
  • dims (str) – Option to test each dimension separately for mutation, or mutate in all simultaneously
Raises:

ValueError – If a valid option is not supplied for dims

Returns:

Returns the new mean vector

Return type:

list
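A minimal sketch of a random mean mutation of this kind is given below, using only the standard library. The option names "each" and "all" for dims, the prob argument, and the function name mutate_mean are assumptions for illustration; the valid options in HAWKS are defined in the source code.

```python
import random


def mutate_mean(mean, scale, dims="each", prob=0.5, rng=None):
    """Shift a mean vector by Gaussian noise of width `scale`."""
    rng = rng or random.Random()
    if dims == "each":
        # Test every dimension separately; only some of them are shifted
        return [m + rng.gauss(0.0, scale) if rng.random() < prob else m
                for m in mean]
    if dims == "all":
        # Shift all dimensions at once
        return [m + rng.gauss(0.0, scale) for m in mean]
    raise ValueError(f"dims must be 'each' or 'all', got {dims!r}")


new_mean = mutate_mean([0.0, 1.0, 2.0], scale=0.5, dims="all",
                       rng=random.Random(42))
```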

sample_values()

Samples values from the defined distribution (Gaussian with the instance’s mean, cov, and size attributes)

set_seed()

Generate and set the random seed.

set_state()

Set the random state using the pre-defined seed number for this cluster. Organised like this so we can reset the state to sample, using our static seed number.
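The idea of resetting the state before sampling can be sketched as follows: because the RNG is re-seeded with the same static seed each time, resampling an unchanged cluster gives identical values. The SeededSampler class is hypothetical and reduced to one dimension; it only illustrates the pattern, not the Cluster implementation.

```python
import random


class SeededSampler:
    """Toy 1-D stand-in for a cluster with a static per-cluster seed."""

    def __init__(self, mean, std, size, seed):
        self.mean, self.std, self.size = mean, std, size
        self.seed = seed  # fixed once, reused for every resample
        self.rng = random.Random()

    def set_state(self):
        # Reset the RNG to the state defined by the static seed
        self.rng.seed(self.seed)

    def sample_values(self):
        self.set_state()
        return [self.rng.gauss(self.mean, self.std) for _ in range(self.size)]


sampler = SeededSampler(mean=0.0, std=1.0, size=5, seed=123)
first = sampler.sample_values()
second = sampler.sample_values()  # same values as `first`
```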

classmethod setup_variables(dataset_obj, ga_params)

Shares variables from the Dataset class that are needed here.

Parameters:
  • dataset_obj (Dataset) – Dataset instance for this run
  • ga_params (dict) – GA parameters from the main config

hawks.constraints

All functions in this script are scraped via the inspect module when checking the arguments given in the config under the ‘constraints’ object/subdict.

A single value should be returned for any constraint.

hawks.constraints.eigenval_ratio(indiv)

Calculate the eigenvalue ratio (or amount of eccentricity). This is the ratio between the largest and smallest eigenvalues of the diagonal covariance matrix.

Parameters:indiv (Genotype) – A single individual (i.e. a dataset).
Returns:The ratio of the largest to smallest eigenvalues.
Return type:float
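For a diagonal covariance matrix the eigenvalues are simply the diagonal entries, so the ratio for one cluster can be sketched as below. The function name diag_eigenval_ratio is an assumption, and how HAWKS combines the per-cluster values into a single value for an individual is not shown here.

```python
def diag_eigenval_ratio(cov):
    """Ratio of largest to smallest eigenvalue of a diagonal covariance."""
    # The eigenvalues of a diagonal matrix are its diagonal entries
    eigenvalues = [cov[i][i] for i in range(len(cov))]
    return max(eigenvalues) / min(eigenvalues)


ratio = diag_eigenval_ratio([[4.0, 0.0], [0.0, 1.0]])
```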
hawks.constraints.overlap(indiv)

Calculate the amount of overlap (the percentage of points whose nearest neighbour is in a different cluster).

Parameters:indiv (Genotype) – A single individual (i.e. a dataset).
Returns:The percentage of overlap between clusters.
Return type:float
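The overlap definition above can be sketched with a brute-force nearest-neighbour search. The function name nearest_neighbour_overlap is hypothetical, and returning a fraction in [0, 1] rather than a value scaled to 100 is an assumption; a quadratic loop like this is only suitable for small datasets.

```python
import math


def nearest_neighbour_overlap(points, labels):
    """Fraction of points whose nearest neighbour has a different label."""
    mismatches = 0
    for i, point in enumerate(points):
        # Index of the closest other point (Euclidean distance)
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j]),
        )
        if labels[nearest] != labels[i]:
            mismatches += 1
    return mismatches / len(points)


separated = nearest_neighbour_overlap(
    [(0, 0), (0, 1), (5, 5), (5, 6)], [0, 0, 1, 1]
)
mixed = nearest_neighbour_overlap(
    [(0, 0), (0.1, 0), (5, 5), (5.1, 5)], [0, 1, 0, 1]
)
```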

hawks.dataset

Defines the Dataset class, which handles general properties of the dataset that is being evolved (and is consistent across individuals in the population).

class hawks.dataset.Dataset(num_examples, num_clusters, num_dims, equal_clusters, min_clust_size)

Bases: object

Class for properties of the dataset as a whole (such as the size and number of dimensions).

Separation may be useful for potential future development.

Variables:
  • num_examples (int) – Total size of the dataset.
  • num_clusters (int) – Number of clusters.
  • num_dims (int) – Number of dimensions.
  • equal_clusters (bool) – If the clusters should be equally sized.
  • min_clust_size (int) – Minimum size (number of data points) a cluster should have.
  • global_rng (RandomState) – The global RandomState instance used as a common RNG.

hawks.ga

Handles everything related to the GA itself. This is mainly setting up DEAP, processing the GA-specific arguments, and defining all the relevant aspects of the evolution (such as the parental selection, environmental selection etc.).

hawks.ga.create_toolbox(objective_dict, dataset_obj, ga_params)

Function to create the toolbox, calling the relevant selection functions based on the parameters.

Parameters:
  • objective_dict (dict) – Dictionary with the objective function and its arguments.
  • dataset_obj (Dataset) – Dataset instance for this run.
  • ga_params (dict) – GA parameters from the config.
Returns:

DEAP toolbox.

Return type:

Toolbox

hawks.ga.deap_setup(objective_dict, dataset_obj, ga_params)

Setup the DEAP toolbox.

Parameters:
  • objective_dict (dict) – Dictionary with the objective function and its arguments.
  • dataset_obj (Dataset) – Dataset instance for this run.
  • ga_params (dict) – GA parameters from the config.
Returns:

DEAP toolbox.

Return type:

Toolbox

hawks.ga.evaluate_indiv(indiv, objective_dict)

Wrapper function for calculating the individual’s fitness. See Objective for implementation details.

Parameters:
  • indiv (Genotype) – A single individual (i.e. a dataset).
  • objective_dict (dict) – Dictionary with the objective function and its arguments.
Returns:

Objective/fitness values.

Return type:

tuple

hawks.ga.generation(pop, toolbox, constraint_params, cxpb)

Function to execute each generation. The order of the different elements is controlled here; the actual nature of each component depends on what is in the toolbox.

Parameters:
  • pop (list) – The population of individuals.
  • toolbox (Toolbox) – DEAP toolbox.
  • constraint_params (dict) – Constraint parameters from the config.
  • cxpb (float) – Crossover probability.
Returns:

The population after this generation.

Return type:

list

hawks.ga.main_setup(objective_dict, dataset_obj, ga_params, constraint_params)

Central function to setup DEAP and the GA.

Parameters:
  • objective_dict (dict) – Dictionary with the objective function and its arguments.
  • dataset_obj (Dataset) – Dataset instance for this run.
  • ga_params (dict) – GA parameters from the config.
  • constraint_params (dict) – Constraint parameters from the config.
Returns:

2-element tuple containing:

Toolbox: DEAP toolbox.

list: The initialized population.

Return type:

tuple

hawks.ga.select_crossover(toolbox, ga_params)

Function to select the crossover operator. New operators (added to Genotype) need to be specified here to be reachable. Current options (given in the config) are:

  • “cluster” (swap mean and covariance together)
  • “dv” (swap mean and covariance independently)
Parameters:
  • toolbox (Toolbox) – DEAP toolbox.
  • ga_params (dict) – GA parameters from the config.
Returns:

DEAP toolbox.

Return type:

Toolbox

hawks.ga.select_environ_func(toolbox, ga_params)

Function to select the environmental selection method. New methods (added in this module) need to be specified here to be reachable. Current options (given in the config) are:

  • “sr” or “stochastic ranking” (stochastic ranking)
Parameters:
  • toolbox (Toolbox) – DEAP toolbox.
  • ga_params (dict) – GA parameters from the config.
Returns:

DEAP toolbox.

Return type:

Toolbox

hawks.ga.select_mutation(toolbox, dataset_obj, ga_params)

Function to select the mutation operator. New operators (added to Cluster) need to be specified here to be reachable. Current options (given in the config) are:

  • “random”
Parameters:
  • toolbox (Toolbox) – DEAP toolbox.
  • dataset_obj (Dataset) – Dataset instance for this run.
  • ga_params (dict) – GA parameters from the config.
Returns:

DEAP toolbox.

Return type:

Toolbox

hawks.ga.select_parent_func(toolbox, ga_params)

Function to select the parental selection method. New methods (added in this module) need to be specified here to be reachable. Current options (given in the config) are:

  • “binary” or “tournament” (ranking-based tournament selection)
  • “tournament-fitness” (fitness-based tournament selection)
Parameters:
  • toolbox (Toolbox) – DEAP toolbox.
  • ga_params (dict) – GA parameters from the config.
Returns:

DEAP toolbox.

Return type:

Toolbox
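As an illustration of ranking-based binary tournament selection, the sketch below draws two distinct individuals at random and keeps the one with the better (lower) rank. The function name binary_tournament and the use of plain rank lists are assumptions; HAWKS registers its selection functions on the DEAP toolbox.

```python
import random


def binary_tournament(ranks, num_parents, rng=None):
    """Return the indices of the winners of `num_parents` binary tournaments."""
    rng = rng or random.Random()
    parents = []
    for _ in range(num_parents):
        # Two distinct competitors; the lower rank wins
        a, b = rng.sample(range(len(ranks)), 2)
        parents.append(a if ranks[a] <= ranks[b] else b)
    return parents


winners = binary_tournament([1, 0], num_parents=5, rng=random.Random(0))
```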

hawks.ga.stochastic_ranking(pop, ga_params)

Implementation of the stochastic ranking.

Parameters:
  • pop (list) – The population of individuals.
  • ga_params (dict) – GA parameters from the config.
Returns:

Sorted population

Return type:

list
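For readers unfamiliar with stochastic ranking, a commonly described formulation is a bubble-sort-style pass in which adjacent individuals are compared by fitness if both are feasible (or with probability pf otherwise), and by penalty in the remaining cases. The sketch below assumes minimization, plain lists of fitness and penalty values, and the name stochastic_rank; whether HAWKS follows these exact details is not confirmed by this page.

```python
import random


def stochastic_rank(fitness, penalty, pf=0.45, rng=None):
    """Return indices ordered from best to worst (minimization)."""
    rng = rng or random.Random()
    n = len(fitness)
    order = list(range(n))
    for _ in range(n):
        swapped = False
        for j in range(n - 1):
            a, b = order[j], order[j + 1]
            both_feasible = penalty[a] == 0 and penalty[b] == 0
            if both_feasible or rng.random() < pf:
                swap = fitness[a] > fitness[b]  # compare on the objective
            else:
                swap = penalty[a] > penalty[b]  # compare on the violation
            if swap:
                order[j], order[j + 1] = b, a
                swapped = True
        if not swapped:
            break  # no swaps in a full sweep: the ranking is stable
    return order


feasible_order = stochastic_rank([3.0, 1.0, 2.0], [0, 0, 0])
```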

hawks.generator

Defines the overarching Generator class for HAWKS. Intended to be the outward-facing class that users interact with, pulling everything else together.

class hawks.generator.BaseGenerator(config, any_saving, multi_config)

Bases: object

Base class for the generator, providing a set of common functionality required for all future classes/functionality. Not all attributes are shown here, but those that are not are derived from the config.

Variables:
  • config (dict) – The full config.
  • any_saving (bool) – Whether any of the saving options have been specified. This is automatically determined.
  • multi_config (bool) – Whether the config specifies multiple sets of parameters or not. This is automatically determined.
  • stats (DataFrame) – A DataFrame that records stats during the runs for easy analysis.
  • datasets (list) – A list of datasets (extracted arrays from the Genotypes). Created in get_best_dataset().
  • label_sets (list) – A list of labels (extracted arrays from the Genotypes). Created in get_best_dataset().
  • population (list) – An easy reference to the most recent population of individuals.
  • base_folder (Path) – The path to the root folder for this run of HAWKS. Constructed using the folder_name in the config. If one isn’t given, uses the datetime.
  • config_list (list) – A list of the unique configs. Useful for multi_config, where each combination of parameters gets a single config which is stored in this list.
  • best_each_run (list) – A list, which contains a sub-list for each config. In this sub-list, the best individual from each run is stored. For single-config runs, flattening may be needed.
static _check_multiconfig(config)

Check if a list exists in the config i.e. it defines a set of parameters. Switches to multi_config mode if so.

Parameters:config (dict) – HAWKS config.
Returns:True if a set of parameters is found.
Return type:bool

See also

Multi-config example for further information of usage.
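The check described for _check_multiconfig can be sketched as a recursive search for any list-valued entry in a nested dict. The function name has_list_value and the config keys in the example are hypothetical and are not the real HAWKS config schema.

```python
def has_list_value(config):
    """Return True if any (nested) value in the config is a list."""
    for value in config.values():
        if isinstance(value, dict):
            if has_list_value(value):
                return True
        elif isinstance(value, list):
            return True
    return False


# Hypothetical configs, used only to exercise the check
single = {"dataset": {"num_clusters": 5}, "ga": {"num_gens": 50}}
multi = {"dataset": {"num_clusters": [5, 10]}, "ga": {"num_gens": 50}}
```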

create_folders()

Create the folder(s) necessary for saving. This function is only called if any saving option is switched on.

create_individual()

Hacky function to do the bare minimum needed to create some individuals. The initial population is generated and we yield from that. Useful for debugging, or just playing with a single dataset/individual.

full_config = None

Keep the full config

get_best_dataset(return_config=False, reset=False)

Function for extracting the data and labels of the best dataset for every run per config.

A list of the datasets (numpy arrays) and a list of the labels are returned. If specified, a list of the associated configs is also returned.

Note that these lists are flattened. In the single config case, the list of datasets will be num_runs long. In the multi_config case, a list of length num_runs * len(self.config_list) will be returned.

If the datasets or label_sets have not already been extracted then they are extracted. If this needs to be updated, there is a flag to reset this and extract again.

Parameters:
  • return_config (bool) – Whether the config should be returned.
  • reset (bool) – Whether to re-initialize the attributes (useful for when interacting with a run).
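The flattening mentioned above can be pictured with a nested list that has one sub-list per config and one entry per run. The helper flatten_runs and the placeholder strings are assumptions for illustration; the real entries are individuals, not strings.

```python
from itertools import chain


def flatten_runs(best_each_run):
    """Flatten a list of per-config sub-lists into one list."""
    return list(chain.from_iterable(best_each_run))


# 3 configs x 2 runs, with strings standing in for the best individuals
nested = [["c0_r0", "c0_r1"], ["c1_r0", "c1_r1"], ["c2_r0", "c2_r1"]]
flat = flatten_runs(nested)
num_runs, num_configs = 2, len(nested)
```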
get_config()

Return the full config, useful in an interactive setting

get_stats()

Return the stats (DataFrame), useful in an interactive setting

static load_default_config()

Loads the default config JSON file (which is used to fill in any gaps in the provided config).

Returns:The full config.
Return type:dict
plot_best_indivs(cmap='inferno', fig_format='pdf', save=False, show=True, remove_axis=False, fig_title=None, nrows=None, ncols=None)

Plot the best individuals from each run, for each config. A separate plot is made for each config, with the best from each run plotted together.

Does the processing required to pass onto plot_datasets().

Parameters:
  • cmap (str, optional) – The colourmap from matplotlib to use. Defaults to “inferno”.
  • fig_format (str, optional) – The format to save the plot in, usually either “png” or “pdf”. Defaults to “pdf”.
  • save (bool, optional) – Save the plot. Defaults to False.
  • show (bool, optional) – Show the plot. Defaults to True.
  • remove_axis (bool, optional) – Whether to remove the axis to just show the clusters. Defaults to False.
  • fig_title (str, optional) – Figure title. Defaults to None.
  • nrows (int, optional) – Number of rows for plt.subplots, calculated if None. Defaults to None.
  • ncols (int, optional) – Number of columns for plt.subplots, calculated if None. Defaults to None.
Raises:

ValueError – If there is no best dataset found. The generator may not have been run yet.

plot_datasets(datasets, cmap='inferno', fig_format='png', save=False, show=True, remove_axis=False, filename=None, fig_title=None, nrows=None, ncols=None, folder=None, **kwargs)

Plot a set of datasets.

Parameters:
  • datasets (list) – A list of individuals (Genotype) to be plotted.
  • cmap (str) – The colourmap from matplotlib to use. Defaults to “inferno”.
  • fig_format (str) – The format to save the plot in, usually either “png” or “pdf”. Defaults to “png”.
  • save (bool) – Save the plot. Defaults to False.
  • show (bool) – Show the plot. Defaults to True.
  • remove_axis (bool) – Whether to remove the axis to just show the clusters. Defaults to False.
  • filename (str) – Filename (constructed if None). Defaults to None.
  • fig_title (str) – Figure title. Defaults to None.
  • nrows (int) – Number of rows for plt.subplots, calculated if None. Defaults to None.
  • ncols (int) – Number of columns for plt.subplots, calculated if None. Defaults to None.
save_config(config=None, folder=None, filename=None)

Save the config file for future reproduction.

Parameters:
  • config (dict, optional) – HAWKS config. Defaults to None, where the config is taken from the instance.
  • folder (str, pathlib.Path, optional) – The folder to save the config in. Defaults to None, where the directory of the config file/experiment script/working directory is used.
  • filename (str, optional) – Filename for the config. Defaults to None, where the experiment/folder name from the config is used instead.
set_global_rng(num_seed)

Sets the global RandomState instance for the run, allowing reproducibility.

Parameters:num_seed (int) – The seed used to initialize the RNG.
class hawks.generator.SingleObjective(config, any_saving, multi_config)

Bases: hawks.generator.BaseGenerator

Class specific for optimizing to a single objective. This is currently the main mode, and is used to optimize the datasets towards a given silhouette width (according to the constraints and other parameters).

animate(record_stats=False, plot_pop=True, **kwargs)

Function to animate a run of HAWKS (showing how the datasets evolve). An example of this can be found in the README. Produces a series of PNGs, and creates a gif using ImageMagick.

Parameters:
  • record_stats (bool, optional) – Whether the results of the run should be recorded (and therefore can be saved, depending on the config). Defaults to False.
  • plot_pop (bool, optional) – Whether to plot the whole population. If False, just plots the best individual. Defaults to True.
Raises:

ValueError – Animation cannot be run for a multi_config; only a single set of parameters is permitted.

run()

The main run function for the generator.

run_step()

Run function that contains the actual code, yielding after each run, if desired.

Yields:SingleObjective – The generator instance at the time, allowing inspection of the process.
hawks.generator.create_generator(config=None)

Function to create a generator (of the relevant sub-class of BaseGenerator) to be used to generate datasets. This is preferable to calling the class directly, particularly in future versions of HAWKS.

Parameters:

config (dict, str, Path, optional) – A dictionary or path to a JSON file with the parameters for HAWKS. Defaults to None, whereby the defaults are used.

Raises:
  • FileNotFoundError – If a path-like object is given, but the file cannot be found.
  • TypeError – If an object is given that cannot be interpreted into a config file (i.e. not a dict or a path-like object).
  • ValueError – If an incorrect mode for HAWKS is provided.
Returns:

The initialized generator instance.

Return type:

BaseGenerator
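The accepted config types and the documented exceptions suggest input handling along the following lines. This is only a sketch under assumptions: the function read_config is hypothetical, it returns an empty dict where HAWKS would fall back to its default config, and it does not create a generator.

```python
import json
from pathlib import Path


def read_config(config=None):
    """Turn a dict, a path to a JSON file, or None into a config dict."""
    if config is None:
        return {}  # stand-in for "use the defaults"
    if isinstance(config, dict):
        return dict(config)
    if isinstance(config, (str, Path)):
        path = Path(config)
        if not path.is_file():
            raise FileNotFoundError(f"No config file found at {path}")
        with path.open() as handle:
            return json.load(handle)
    raise TypeError(f"Cannot interpret {type(config).__name__} as a config")


from_dict = read_config({"folder_name": "demo"})
```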

hawks.genotype

Defines the Genotype class, representing a single individual. Given a light wrapper by DEAP, but retains all functionality. Essentially a list of hawks.cluster.Cluster objects.

Handles the overall array of the data, as the individual clusters have views into the hawks.genotype.Genotype.all_values. Calculates the constraints for an individual if required. Also handles the mutation (calling the defined method of the hawks.cluster.Cluster class) and crossover.

class hawks.genotype.Genotype(clusters)

Bases: list

Wrapper class for the genotype/individual. A list of Cluster objects, handling operations on the individual-level (such as crossover). Contains the actual data array, which the Clusters view into.

Variables:
  • all_values (numpy.ndarray) –
  • constraints (dict) – Actual dictionary of the constraints and their violation.
  • constraints_dict (dict) – Container dictionary for the constraint functions we calculate. Scraped from constraints.
  • cxpb (float) – Crossover probability.
  • feasible (bool) – Whether the individual is feasible (i.e. no penalty).
  • global_rng (RandomState) –
  • labels (list) – A list (or array) of the true labels.
  • mutpb_mean (float) – Mutation probability for the mean.
  • mutpb_cov (float) – Mutation probability for the covariance.
  • penalty (float) – Constraint penalty violation.
  • positions (list) – A list of tuples that provide the indices for the start and end of each cluster (allows for partial modification of arrays).
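One way to picture the positions attribute is as cumulative offsets derived from the cluster sizes, so that each cluster owns a contiguous slice of the main array. The helper cluster_positions is hypothetical, and treating the end index as exclusive (Python slice convention) is an assumption.

```python
from itertools import accumulate


def cluster_positions(cluster_sizes):
    """Return (start, end) index pairs for contiguous clusters."""
    ends = list(accumulate(cluster_sizes))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


positions = cluster_positions([3, 2, 5])
# Each pair can slice a shared list/array, e.g. all_values[start:end]
```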
calc_constraints(constraint_params)

Calculate the constraints for a given individual.

Parameters:constraint_params (dict) – Dictionary for the constraint parameters (from the config).
create_views()

Create the views from each cluster’s values attribute into the main array (all_values) for this individual.

mutation(mut_mean_func, mut_cov_func)

Generic mutation function.

Arguments given must be functions for mutating the mean and covariance. Can use functools.partial() to freeze other arguments if need be - see select_mutation() for options.

Parameters:
  • mut_mean_func (Callable) – The mutation operator function for the mean.
  • mut_cov_func (Callable) – The mutation operator function for the covariance.
recalc_constraints(constraint_params)

Recalculate constraints only if a cluster has changed.

Parameters:constraint_params (dict) – Dictionary for the constraint parameters (from the config).
static reconst_values(parent1, parent2, index)

Reconstructs the values of the individual clusters, useful for when the view has become disentangled by moving Cluster objects between Genotype instances.

Parameters:
  • parent1 (Genotype) – The first parent.
  • parent2 (Genotype) – The second parent.
  • index (int) – The cluster that needs reconstruction
recreate_single_view(index)

Recreate a single cluster’s view (rather than all).

Parameters:index (int) – The index of the cluster
recreate_views()

Recreate the numpy array views to ensure values update when the individual Clusters are changed.

resample_values()

Loop over the genotype and resample the values for any cluster that has been changed.

save_clusters(folder, fname)

Save the dataset (values and corresponding labels) defined by this individual, given a folder and filename. Saves as a CSV via numpy.savetxt().

Parameters:
  • folder (str, pathlib.Path) – The path to the folder where the datasets should be saved.
  • fname (str) – The filename to use.
classmethod validate_constraints(constraint_params)

Ensure that the constraints in the config match what is available.

Parameters:constraint_params (dict) – Dictionary for the constraint parameters (from the config).
static xover_cluster(parent1, parent2, mixing_ratio=0.5)

Uniform crossover with one probability test for the mean and covariance together.

Parameters:
  • parent1 (Genotype) – The first parent.
  • parent2 (Genotype) – The second parent.
  • mixing_ratio (float, optional) – The probability of mixing the parents. Defaults to 0.5.
Returns:2-element tuple containing the two mixed parents of type Genotype.
Return type:tuple
static xover_genes(parent1, parent2, mixing_ratio=0.5)

Uniform crossover with separate probability tests for the mean and covariance.

Parameters:
  • parent1 (Genotype) – The first parent.
  • parent2 (Genotype) – The second parent.
  • mixing_ratio (float, optional) – The probability of mixing the parents. Defaults to 0.5.
Returns:2-element tuple containing the two mixed parents of type Genotype.
Return type:tuple
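The difference between the two crossover operators can be sketched on plain data: one probability test per cluster that swaps the mean and covariance together, versus a separate test for each. The function uniform_crossover, its together flag, and the dict-based clusters are assumptions for illustration; the real operators act on Genotype and Cluster objects and also have to keep the array views consistent.

```python
import random


def uniform_crossover(parent1, parent2, mixing_ratio=0.5, together=True, rng=None):
    """Uniform crossover over lists of {'mean': ..., 'cov': ...} dicts."""
    rng = rng or random.Random()
    child1 = [dict(cluster) for cluster in parent1]  # copy, leave parents intact
    child2 = [dict(cluster) for cluster in parent2]
    for c1, c2 in zip(child1, child2):
        if together:
            # One test: mean and covariance move as a unit
            if rng.random() < mixing_ratio:
                c1["mean"], c2["mean"] = c2["mean"], c1["mean"]
                c1["cov"], c2["cov"] = c2["cov"], c1["cov"]
        else:
            # Separate tests: mean and covariance may move independently
            for gene in ("mean", "cov"):
                if rng.random() < mixing_ratio:
                    c1[gene], c2[gene] = c2[gene], c1[gene]
    return child1, child2


p1 = [{"mean": [0, 0], "cov": [1, 1]}, {"mean": [5, 5], "cov": [2, 2]}]
p2 = [{"mean": [9, 9], "cov": [3, 3]}, {"mean": [7, 7], "cov": [4, 4]}]
swapped1, swapped2 = uniform_crossover(p1, p2, mixing_ratio=1.0)
```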

hawks.io

Handles loading of datasets or previous runs of HAWKS (by creating a BaseGenerator object).

hawks.io.load_datasets(folder_path, glob_filter='*.csv', labels_last_column=True, labels_filename=False, custom_func=None, **kwargs)

Function to load datasets from an external source. The path to the folder is given, and by default all .csvs are used. The labels for the data can be specified as a separate file, or final column of the data.

Any extra kwargs are passed to numpy.loadtxt(), which loads the data in.

Parameters:
  • folder_path (str, pathlib.Path) – The path to the folder that is to be loaded from.
  • glob_filter (str, optional) – Filter to select a subset of files. Defaults to “*.csv”.
  • labels_last_column (bool, optional) – If the labels are in the last column of the data or not. Defaults to True.
  • labels_filename (bool, optional) – If the labels are in a separate file (with ‘labels’ in the filename). Defaults to False.
  • custom_func (Callable, optional) – Custom function for processing the data directly. Must return filenames, datasets, and corresponding labels. Useful for special cases (which is most datasets, as formats are rarely consistent). Defaults to None.
Returns:

A 3-element tuple containing:

filenames (list): A list of the filenames for each loaded file.

datasets (list): A list of the loaded datasets.

label_sets (list): A list of the labels.

Return type:

tuple
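To make the labels-in-last-column convention concrete, the sketch below reads every matching CSV in a folder with the standard library and splits off the final column as labels. The function load_csv_folder is a hypothetical stand-in written for this example; HAWKS itself loads the data via numpy.loadtxt(), and this sketch assumes purely numeric files with integer labels.

```python
import csv
import tempfile
from pathlib import Path


def load_csv_folder(folder_path, glob_filter="*.csv"):
    """Return (filenames, datasets, label_sets) for CSVs with labels last."""
    filenames, datasets, label_sets = [], [], []
    for path in sorted(Path(folder_path).glob(glob_filter)):
        with path.open(newline="") as handle:
            rows = [[float(x) for x in row] for row in csv.reader(handle) if row]
        filenames.append(path.name)
        datasets.append([row[:-1] for row in rows])  # all but the last column
        label_sets.append([int(row[-1]) for row in rows])  # last column
    return filenames, datasets, label_sets


# Small demonstration using a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "toy.csv").write_text("1.0,2.0,0\n3.0,4.0,1\n")
    filenames, datasets, label_sets = load_csv_folder(tmp)
```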

hawks.io.load_folder(folder_path)

Creates a BaseGenerator object from a folder (that was previously created by the generator).

Parameters:
  • folder_path (str, pathlib.Path) – The path to the folder that is to be loaded from.
  • glob_filter (str, optional) – Filter to select a subset of files. Defaults to “*.csv”.
Returns:

A generator of the subclass specified in the config.

Return type:

BaseGenerator

hawks.objectives

Defines the Objective class, and its subclasses. Anything that will be used in the fitness function should be implemented here as a relevant class.

Class hierarchy is set up for expansions to more objectives that can be selected from.

class hawks.objectives.Objective

Bases: abc.ABC

Overall wrapper class for the objectives, defining the mandatory methods.

Variables:weight (float) – The objective weight, where -1 is minimization and 1 is maximization.
static eval_objective(indiv)

Evaluates the objective on an individual/solution

classmethod set_kwargs(kwargs_dict)

Used to set the arguments for the objective from the config

classmethod set_objective_attr(indiv, value)

Use setattr with the name of the objective for results saving

class hawks.objectives.ClusterIndex

Bases: hawks.objectives.Objective

For handling shared computation of more cluster indices if that is expanded. There is method to this madness.

class hawks.objectives.Silhouette

Bases: hawks.objectives.ClusterIndex

Class to calculate the silhouette width. See the source code for computation.

Variables:
  • target (float) – The target value of the silhouette width to optimize the datasets towards.
  • method (str, optional) – The method to use for calculating the silhouette width. Either "own" or "sklearn". Defaults to “own”, which is recommended.
static eval_objective(indiv)

Evaluates the objective on an individual/solution

hawks.plotting

Defines all the functions for plotting, allowing easier generation of results. Flexible, general functions remain a constant issue with plotting, so for more complex plots some tweaking may be needed.

Examples of these functions can be found in the Plotting guide.

hawks.plotting.clean_graph(ax, clean_props=None)

Helper function to clean up graphs. Primarily designed for my own usage and preferences, but it should be more broadly useful. This function is called by many of the plots, though it can be called directly if working interactively.

Parameters:
  • ax (matplotlib.axes) – The axis object to be cleaned.
  • clean_props (dict, optional) – A dictionary of options to use. Defaults to None, which uses the defaults specified in this function.
Returns:

The cleaned axis object.

Return type:

matplotlib.axes

hawks.plotting.cluster_alg_ranking(df, significance=0.05, show=False, save_folder=None, filename='alg-ranking')

Produce critical difference (CD) diagrams (see this paper for further details).

Parameters:
  • df (pandas.DataFrame) – The DataFrame with results to plot from.
  • significance (float, optional) – The significance level to use for the statistical test. Defaults to 0.05.
  • show (bool, optional) – Whether to show the plot or not. Defaults to False.
  • save_folder (str, pathlib.Path, optional) – The folder to save the plot in. Defaults to None.
  • filename (str, optional) – The filename to use for saving. Defaults to “alg-ranking”.
hawks.plotting.convergence_plot(stats_df, y='fitness_silhouette', xlabel=None, ylabel=None, cmap='inferno', show=True, fpath=None, legend_type='brief', clean_props=None, ci=None, **kwargs)

Function to show the convergence of, for example, the fitness (though any column can be used). Generic wrapper function for seaborn.lineplot() (with error bars).

Parameters:
  • stats_df (pandas.DataFrame) – The DataFrame with results to plot from.
  • y (str, optional) – The column of the DataFrame for the y-axis. Defaults to “fitness_silhouette”.
  • xlabel (str, optional) – The xlabel to add to the plot. Defaults to None.
  • ylabel (str, optional) – The ylabel to add to the plot. Defaults to None.
  • cmap (str, optional) – A colourmap for the plots. See here for options. Defaults to “inferno”.
  • show (bool, optional) – Whether to show the plot or not. Defaults to True.
  • fpath (str, pathlib.Path, optional) – The location to save the plot. Defaults to None.
  • legend_type (str, optional) – The type of legend to use, governed by seaborn. Defaults to “brief”.
  • clean_props (dict, optional) – The properties to use for cleaning the plot (calls clean_graph()). Defaults to None.
  • ci (int, str, optional) – The confidence interval to use when calculating the estimated error. Can pass ‘sd’ instead to use the standard deviation. Defaults to None.
hawks.plotting.cov_ellipse(cov, q=None, nsig=None)

Creates an ellipse for a given covariance (with a given significance level).

Parameters:
  • cov (numpy.ndarray) – The covariance matrix.
  • q (float, optional) – The amount of variance to account for. Defaults to None.
  • nsig (int, optional) – The number of confidence intervals to use. Defaults to None.
Returns:

A 3-element tuple containing:

width (float): The width of the ellipse.

height (float): The height of the ellipse.

rotation (float): The rotation of the ellipse.

Return type:

tuple
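
One common way to derive such an ellipse from a 2x2 covariance matrix is sketched below, using only the standard library. This is not the HAWKS source: the function name, the interpretation of nsig as a number of standard deviations, the closed-form chi-squared quantile for q, and the rotation being in degrees are all assumptions for this example.

```python
import math


def ellipse_from_cov(cov, q=None, nsig=None):
    """Width, height and rotation (in degrees) for a symmetric 2x2 covariance."""
    if (q is None) == (nsig is None):
        raise ValueError("Specify exactly one of q or nsig")
    if q is not None:
        # Chi-squared quantile with 2 degrees of freedom (closed form).
        r2 = -2.0 * math.log(1.0 - q)
    else:
        r2 = float(nsig) ** 2

    (a, b), (_, c) = cov  # cov = [[a, b], [b, c]]
    # Eigenvalues of a symmetric 2x2 matrix.
    mean = (a + c) / 2.0
    spread = math.hypot((a - c) / 2.0, b)
    lam_major = mean + spread
    lam_minor = max(mean - spread, 0.0)  # guard against tiny negative round-off

    width = 2.0 * math.sqrt(r2 * lam_major)
    height = 2.0 * math.sqrt(r2 * lam_minor)
    # Angle between the x-axis and the major axis of the ellipse.
    rotation = math.degrees(0.5 * math.atan2(2.0 * b, a - c))
    return width, height, rotation


axis_aligned = ellipse_from_cov([[4.0, 0.0], [0.0, 1.0]], nsig=1)
tilted = ellipse_from_cov([[2.0, 1.0], [1.0, 2.0]], nsig=2)
```

The three values are in the form that matplotlib.patches.Ellipse expects for its width, height and angle arguments.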

hawks.plotting.create_boxplot(df, x, y, cmap='viridis', xlabel=None, ylabel=None, fpath=None, show=False, fig_format='pdf', clean_props=None, hatching=False, remove_xticks=False, remove_legend=False, **kwargs)

General function for plotting boxplots (wrapper for seaborn.boxplot()).

Parameters:
  • df (pandas.DataFrame) – The DataFrame with results to plot from.
  • x (str) – The name of the column in the DataFrame to use as the x-axis.
  • y (str) – The name of the column in the DataFrame to use as the y-axis.
  • cmap (str, optional) – A colourmap for the plots. See here for options. Defaults to “viridis”.

  • xlabel (str, optional) – The xlabel to add to the plot. Defaults to None.
  • ylabel (str, optional) – The ylabel to add to the plot. Defaults to None.
  • fpath (str, pathlib.Path, optional) – The location to save the plot. Defaults to None.
  • show (bool, optional) – Whether to show the plot or not. Defaults to False.
  • fig_format (str, optional) – The file format to save the plot in. Defaults to “pdf”.
  • clean_props (dict, optional) – The properties to use for cleaning the plot (calls clean_graph()). Defaults to None.
  • hatching (bool, optional) – Whether to add hatches to the boxes for better visualization. Defaults to False.
  • remove_xticks (bool, optional) – Whether to remove the tick labels on the x-axis. Defaults to False.
  • remove_legend (bool, optional) – Whether to remove the legend. Defaults to False.
hawks.plotting.instance_space(df, color_highlight, marker_highlight=None, show=True, save_folder=None, seed=None, filename='instance_space', cmap='inferno', legend_type='brief', clean_props=None, plot_data=True, plot_components=False, save_pca=False, feat_label_placement=None, **kwargs)

Function to create the instance space. For information on usage, see Instance space.

Parameters:
  • df (pandas.DataFrame) – The DataFrame with results to plot from.
  • color_highlight (str) – The name of the column in the DataFrame to differentiate using colour.
  • marker_highlight (str, optional) – The name of the column in the DataFrame to differentiate using different markers. Defaults to None.
  • show (bool, optional) – Whether to show the plot or not. Defaults to True.
  • save_folder (str, pathlib.Path, optional) – The folder to save the results in. Defaults to None.
  • seed (int, optional) – The random seed to pass to the clustering algorithms or problem features (if needed). Defaults to None.
  • filename (str, optional) – The filename to save the plot as. Defaults to “instance_space”.
  • cmap (str, optional) – A colourmap for the plots. See here for options. Defaults to “inferno”.

  • legend_type (str, optional) – The type of legend to use, governed by seaborn. Defaults to “brief”.
  • clean_props (dict, optional) – The properties to use for cleaning the plot (calls clean_graph()). Defaults to None.
  • plot_data (bool, optional) – Whether the datasets should be plotted or not. Defaults to True.
  • plot_components (bool, optional) – Whether the components of the projection should be plotted or not. Defaults to False.
  • save_pca (bool, optional) – Whether to save the PCA object (for future use). Defaults to False.
  • feat_label_placement (dict, optional) – For customization of the location of the feature labels. Useful when plot_components = True, and the labels need manual tweaking. Defaults to None.
hawks.plotting.plot_cluster(ax, cluster, color, add_patch=True, add_data=True, patch_color=None, hatch=None, pca=None, patch_alpha=0.25, point_size=20)

Function to plot a single cluster.

Parameters:
  • ax (matplotlib.axes) – The axis object to use.
  • cluster (Cluster) – A single cluster to plot.
  • color (list) – The colour to be used for this cluster. The type can vary based on what was used to generate the colours, but must be accepted by matplotlib.
  • add_patch (bool, optional) – Whether to add an ellipse that shows the true cluster boundary. Defaults to True.
  • add_data (bool, optional) – Whether to add the actual data/samples for the cluster. Defaults to True.
  • patch_color (list, optional) – The colour to use for the cluster boundary (must be in an accepted format for matplotlib). Defaults to None.
  • hatch (str, optional) – The hatch pattern to apply to the ellipse, if any. Defaults to None.
  • pca (sklearn.decomposition.PCA, optional) – The PCA object, used to transform the data if it exists. Defaults to None.
  • patch_alpha (float, optional) – Transparency of the ellipse. Defaults to 0.25.
  • point_size (int, optional) – Size of the data points. Defaults to 20.
Returns:

The axis object with the cluster added.

Return type:

matplotlib.axes

hawks.plotting.plot_indiv(indiv, ax=None, multiple=False, save=False, show=True, fpath=None, cmap='inferno', fig_format='png', global_seed=None, remove_axis=False, **kwargs)

Function to plot a single individual. Sequentially calls hawks.plotting.plot_cluster(). PCA is applied if the data has more than 2 dimensions.

Parameters:
  • indiv (Genotype) – A single individual to be plotted.
  • ax (matplotlib.axes, optional) – The axis object to use. Defaults to None, where it is created.
  • multiple (bool, optional) – Whether multiple plots are being made (i.e. adding a subplot onto a larger plot). Defaults to False.
  • save (bool, optional) – Whether to save the plot or not. Defaults to False.
  • show (bool, optional) – Whether to show the plot or not. Defaults to True.
  • fpath (str, pathlib.Path, optional) – The path to save the plot in. Defaults to None.
  • cmap (str, optional) – A colourmap for the plots. See here for options. Defaults to “inferno”.

  • fig_format (str, optional) – Whether to save the plot as a “png” or as a “pdf”. Defaults to “png”.
  • global_seed (int, optional) – Seed used for PCA if the data has more than 2 dimensions. Defaults to None.
  • remove_axis (bool, optional) – Whether to remove the axis lines (and just show the data). Defaults to False.
hawks.plotting.plot_pop(indivs, nrows=None, ncols=None, fpath=None, cmap='inferno', fig_format='png', global_seed=None, save=False, show=True, remove_axis=False, fig_title=None, **kwargs)

Plotting a population of individuals. Wrapper function for plot_indiv().

The nrows and ncols options allow for specification of the layout of the plotting grid. If not given, it’s made as square as possible.

Parameters:
  • indivs (list) – A list of individuals (Genotype) to be plotted.
  • nrows (int, optional) – Number of rows for the plots. Defaults to None.
  • ncols (int, optional) – Number of columns for the plots. Defaults to None.
  • fpath (str, pathlib.Path, optional) – The path to save the plot in. Defaults to None.
  • cmap (str, optional) – A colourmap for the plots. See here for options. Defaults to “inferno”.

  • fig_format (str, optional) – Whether to save the plot as a “png” or as a “pdf”. Defaults to “png”.
  • global_seed (int, optional) – Seed used for PCA if the data has more than 2 dimensions. Defaults to None.
  • save (bool, optional) – Whether to save the plot or not. Defaults to False.
  • show (bool, optional) – Whether to show the plot or not. Defaults to True.
  • remove_axis (bool, optional) – Whether to remove the axis lines (and just show the data). Defaults to False.
  • fig_title (str, optional) – Adds a title to the figure if desired. Defaults to None.
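
To make the "as square as possible" behaviour concrete, here is one plausible way to pick the grid when nrows and/or ncols are omitted. The helper name and the exact rounding rule are assumptions; HAWKS may choose the layout differently.

```python
import math


def square_grid(n_plots, nrows=None, ncols=None):
    """Pick a (nrows, ncols) grid that fits n_plots, filling in missing values."""
    if n_plots < 1:
        raise ValueError("n_plots must be at least 1")
    if nrows is None and ncols is None:
        # Start from the square root so the grid is close to square.
        ncols = math.ceil(math.sqrt(n_plots))
        nrows = math.ceil(n_plots / ncols)
    elif nrows is None:
        nrows = math.ceil(n_plots / ncols)
    elif ncols is None:
        ncols = math.ceil(n_plots / nrows)
    return nrows, ncols


default_layout = square_grid(10)          # neither dimension given
fixed_rows = square_grid(10, nrows=2)     # only the number of rows given
```

With ten individuals and no layout given, this rule leaves two cells of the grid empty, which a plotting function would typically hide.
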
hawks.plotting.save_plot(fig, fpath, fig_format)

Save the given matplotlib.figure.Figure.

Parameters:
  • fig (Figure) – The figure to be saved.
  • fpath (str, pathlib.Path) – The path (full path if a folder other than the working directory is to be used) to save the figure.
  • fig_format (str) – The format of the figure to save in, either ‘png’ or ‘pdf’.
hawks.plotting.scatter_plot(df, x, y, cmap='inferno', show=True, fpath=None, clean_props=None, legend_type='full', **kwargs)

Generic wrapper function for seaborn.scatterplot().

Parameters:
  • df (pandas.DataFrame) – The DataFrame with results to plot from.
  • x (str) – The name of the column in the DataFrame to use as the x-axis.
  • y (str) – The name of the column in the DataFrame to use as the y-axis.
  • cmap (str, optional) – A colourmap for the plots. See here for options. Defaults to “inferno”.

  • show (bool, optional) – Whether to show the plot or not. Defaults to True.
  • fpath (str, pathlib.Path, optional) – The location to save the plot. Defaults to None.
  • clean_props (dict, optional) – The properties to use for cleaning the plot (calls clean_graph()). Defaults to None.
  • legend_type (str, optional) – The type of legend to use, governed by seaborn. Defaults to “full”.
hawks.plotting.scatter_prediction(data, preds, seed=None, ax=None, cmap='inferno', show=False, fpath=None, fig_format='png', **kwargs)

Function to plot the predictions (useful for working with external clustering algorithms).

Parameters:
  • data (numpy.ndarray) – The dataset to be plotted.
  • preds (list, or numpy.ndarray) – The predicted labels for the given dataset.
  • seed (int, optional) – The seed used to initialize the RNG. Defaults to None.
  • ax (matplotlib.axes, optional) – The axis object to use. Defaults to None, where it is created.
  • cmap (str, optional) – A colourmap for the plots. See here for options. Defaults to “inferno”.

Returns:

The axis with the plot added.

Return type:

matplotlib.axes

hawks.problem_features

Defines the problem features for use in the analysis and instance space. All functions in this script are scraped via the inspect module. See the source code for implementation details.

Todo

Standardize format with e.g. a wrapper class.

hawks.utils

Functions to help with error handling and generally support everything else.

hawks.utils.df_to_csv(df, path, filename)

Save a pandas.DataFrame as a CSV file.

Parameters:
hawks.utils.get_date()

Used to get and format the current date, to name folders when no name is given.
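
A date-based folder name of this kind can be produced with the standard library alone, as sketched below. The helper name and the format string are assumptions; the format that HAWKS actually uses is not documented here.

```python
from datetime import datetime


def dated_folder_name(now=None, fmt="%Y_%m_%d-%H%M%S"):
    """Format a timestamp so it can be used as a folder name."""
    # The format string above is hypothetical, chosen to sort chronologically.
    now = datetime.now() if now is None else now
    return now.strftime(fmt)


example_name = dated_folder_name(datetime(2020, 1, 2, 3, 4, 5))
```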

hawks.utils.get_key_paths(d, key_paths=None, param_lists=None, acc=None)

Used to traverse a config and identify where multiple parameters are given.

Parameters:
  • d (dict) – Config dictionary.
  • key_paths (list) – The list of keys to the relevant part of the config.
  • param_lists (list) – The list of multiple parameters specified in the config.
  • acc (list) – Tracker for the keys.
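
The traversal described above can be sketched as a small recursive function. This is an illustration rather than the HAWKS source: the function name, the rule that a list value marks multiple parameters, and the config keys in the example are all assumptions.

```python
def find_multi_params(d, key_paths=None, param_lists=None, acc=None):
    """Collect the key path and values of every list found in a nested config."""
    key_paths = [] if key_paths is None else key_paths
    param_lists = [] if param_lists is None else param_lists
    acc = [] if acc is None else acc
    for key, value in d.items():
        if isinstance(value, dict):
            # Recurse, extending the tracker with the key just followed.
            find_multi_params(value, key_paths, param_lists, acc + [key])
        elif isinstance(value, list):
            # Assumed convention: a list means several values to try.
            key_paths.append(acc + [key])
            param_lists.append(value)
    return key_paths, param_lists


# Hypothetical config; these keys are made up for the example.
config = {
    "dataset": {"num_clusters": [5, 10], "num_dims": 2},
    "ga": {"mut_prob": [0.1, 0.5], "num_gens": 50},
}
paths, values = find_multi_params(config)
```

Each entry in paths lines up with the entry at the same index in values, so the two lists can be zipped together when expanding the config.
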
hawks.utils.set_key_path(d, key_path, v)

Used to set the parameter of a multi-config to a single, given value.

Parameters:
  • d (dict) – Config dictionary.
  • key_path (list) – The list of keys to the relevant part of the config.
  • v – The value to be inserted into the config. The type depends on the value.
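
A minimal sketch of setting a value by key path, and of using it to expand a multi-config into single configs, is given below. The helper name, the config keys, and the use of itertools.product to enumerate combinations are assumptions made for illustration.

```python
import copy
import itertools


def set_by_path(d, key_path, v):
    """Set d[k1][k2]...[kn] = v in place, given key_path = [k1, k2, ..., kn]."""
    node = d
    for key in key_path[:-1]:
        node = node[key]
    node[key_path[-1]] = v


# Hypothetical multi-config with two parameters that take several values.
multi_config = {
    "dataset": {"num_clusters": [5, 10], "num_dims": 2},
    "ga": {"mut_prob": [0.1, 0.5]},
}
key_paths = [["dataset", "num_clusters"], ["ga", "mut_prob"]]
param_lists = [[5, 10], [0.1, 0.5]]

configs = []
for combo in itertools.product(*param_lists):
    single = copy.deepcopy(multi_config)  # leave the multi-config untouched
    for key_path, value in zip(key_paths, combo):
        set_by_path(single, key_path, value)
    configs.append(single)
```

Copying before setting matters here: without the deep copy, every single config would share (and overwrite) the same nested dictionaries.
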
hawks.utils.translate_method(input_method)

Removal of whitespace/miscellaneous characters to smooth out method names.

Parameters:
  • input_method (str) – The name of the method to adjust.
Returns:

The cleaned method name.

Return type:

str
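
As an illustration of this kind of clean-up, the sketch below lower-cases a method name and drops everything that is not a letter or digit. The helper name, the lower-casing, and the exact set of characters removed are assumptions; the real function may treat names differently.

```python
import re


def normalise_method_name(input_method):
    """Lower-case a method name and strip whitespace and punctuation."""
    return re.sub(r"[^a-z0-9]", "", input_method.lower())


cleaned = normalise_method_name("K-Means ++")
```

Normalising in this way lets differently written spellings of the same method name compare equal.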