2.2. drrc.analysis module

class AutomaticPostprocessing(root: Path)[source]

Bases: object

This class enables the automatic concatenation of all subtask outputs into a single file called “DataFrame.csv”. Because the file is written inside the respective job directory, the resulting path is still unique per job.

Initialise automatic concatenation

Parameters:

root – shallowest path from which to search for job directories, starting from the git root

Todo

  • make this work with relative paths from the git root

  • either use a root path or supply a yaml for single-job post-processing
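
A minimal usage sketch, assuming the class is importable from drrc.analysis and that the job directories live under the given root (both the import path and the directory layout are assumptions):

    from pathlib import Path

    from drrc.analysis import AutomaticPostprocessing  # import path assumed

    # Point at the shallowest directory that contains all job directories.
    post = AutomaticPostprocessing(Path("runs/cluster_output"))  # path hypothetical

    post.summary()                       # inspect only; touches no files
    post.auto_concatenate(Delete=False)  # write DataFrame.csv in each job directory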

summary() None[source]

Print summary of the automatic concatenation

Note

This function does not touch any of the files! Its main purpose is debugging and checking that AutomaticPostprocessing.auto_concatenate() behaves as expected.

auto_concatenate(Delete=False) None[source]

Automatically concatenate all raw output files

Parameters:

Delete – If True, then the raw output files are deleted after concatenation. Default is False.

The result is saved alongside the raw output in a file named DataFrame.csv

_concatenate(path: Path) DataFrame[source]

Concatenate a single cluster job

This takes all numbered .csv DataFrames and concatenates them into one. It checks whether the number of lines agrees with what the config expects and raises a warning if it does not.

Parameters:

path – path to output of cluster job

Returns:

pd.DataFrame which contains all data from a single job
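
A sketch of what this concatenation might look like; the numeric file naming, the sort key, and the expected row count are illustrative assumptions:

    import warnings
    from pathlib import Path

    import pandas as pd

    def concatenate_job(path: Path, expected_rows: int) -> pd.DataFrame:
        # Collect the numbered csv files ("0.csv", "1.csv", ...; naming assumed)
        # and sort numerically so that "10.csv" does not precede "2.csv".
        files = sorted(path.glob("[0-9]*.csv"), key=lambda p: int(p.stem))
        df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
        if len(df) != expected_rows:
            warnings.warn(f"{path}: expected {expected_rows} rows, found {len(df)}")
        return df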

_read_test_csv(path: Path, jobs: list) DataFrame[source]

Read a single csv file, test whether it has the expected number of results, and return the DataFrame.

Parameters:
  • path – path to csv file

  • jobs – list giving the number of jobs per task

Returns:

pd.DataFrame which contains all data from a single csv file

auto_statisticsgeneration() None[source]

Generate statistics from the concatenated DataFrame

_generate_statistics(df: DataFrame) DataFrame[source]

Generate statistics from a single concatenated DataFrame
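
As an illustration of the kind of aggregation such a statistics step could perform (the column names below are hypothetical):

    import pandas as pd

    def generate_statistics(df: pd.DataFrame) -> pd.DataFrame:
        # Summarise a hypothetical value column per parameter set.
        return df.groupby("parameter_set")["valid_time"].agg(["mean", "std", "max"])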

class AnalyseClusterRunBase(conf: Config)[source]

Bases: ABC

Base class for all analysis types we will run later.

This includes all basic functionality that will be used later, such as IO.

Initialize a cluster run object for analysis

Parameters:

conf (Config) – A config object that has previously been run on the cluster.

abstractmethod process() DataFrame[source]

Read output of cluster run as defined by self.conf

abstractmethod save() None[source]

Save processed data in a file
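
A hypothetical concrete subclass showing the intended extension pattern; the analysis logic and file paths are placeholders:

    import pandas as pd

    from drrc.analysis import AnalyseClusterRunBase  # import path assumed

    class DummyAnalysis(AnalyseClusterRunBase):

        def process(self) -> pd.DataFrame:
            # Read the raw cluster output described by self.conf (path hypothetical).
            return pd.read_csv("output/DataFrame.csv")

        def save(self) -> None:
            self.process().to_csv("analysis.csv", index=False)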

class HyperPostProcessing(conf: Config, data_name: str = 'score_', data_type: str = '.txt', num_cores: int = 2)[source]

Bases: AnalyseClusterRunBase

Post-processing of our hyperparameter scans

This class is meant to take in raw data and generate the desired DataFrame.

Initialize post-processing

Important

This class assumes that output files contain a numerical (zero-based) index between data_name and data_type!

Parameters:
  • conf (Config) – Config of the corresponding cluster run

  • data_dir (str | Path) – Path to raw data

  • out_dir (Path) – Path for saving dataframe

  • data_name (str) – File name, e.g. "score_"

  • data_type (str) – File extension, e.g. ".txt"

  • num_cores (int) – Number of cores available for processing
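
A hedged usage sketch; the construction of the Config is elided because it is defined elsewhere in drrc:

    from drrc.analysis import HyperPostProcessing  # import path assumed

    conf = ...  # a Config previously run on the cluster (construction elided)
    hpp = HyperPostProcessing(conf, data_name="score_", data_type=".txt", num_cores=4)

    df = hpp.process()  # one row per hyperparameter set
    hpp.save()          # write the resulting DataFrame to csv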

_get_raw_file(i: int) Path[source]

Generate the filename for the i-th raw data file

Parameters:

i (int) – Number of the file using zero-based indexing

Returns:

Path to requested file

Return type:

(str)

Todo

Make this method return a Path instead of str
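
Conceptually, the filename construction presumably reduces to the following sketch (the attribute names and the exact index-to-filename mapping are assumptions; note the str return that the Todo wants changed to Path):

    from pathlib import Path

    def get_raw_file(data_dir: Path, data_name: str, data_type: str, i: int) -> str:
        # e.g. data_name="score_", data_type=".txt", i=0 -> "score_0.txt"
        return str(data_dir / f"{data_name}{i}{data_type}")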

_generate_statistics(i: int, params: dict)[source]

Internal function that generates the statistics for self.process()

Warning

This function will not return any statistics if the numpy array contains any NaNs. NumPy provides NaN-aware functions (e.g. numpy.nanmean), but they are not used here.
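
For reference, NumPy's NaN-aware aggregations would sidestep this, e.g.:

    import numpy as np

    values = np.array([1.0, 2.0, np.nan, 4.0])

    np.mean(values)     # nan -- plain aggregations propagate NaN
    np.nanmean(values)  # 2.333... -- NaN-aware mean ignores the NaN
    np.nanstd(values)   # standard deviation over the non-NaN entries
    np.nanmax(values)   # 4.0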

process() DataFrame[source]

Extract Validtime data and seeds from .txt files stored in conf['Saving']['OutputDirectory'] into a pd.DataFrame.

To accelerate pre-processing of the raw data, this function runs on half the system’s cores by default.

Parameters:
  • conf (Config) – Config with ClusterRun information.

  • DataDir (Path) – Path to the data; may deviate from the path in conf depending on the user.

  • DataName (str) – Name of the files with Validtime data, up to the iterator number. Default "score_".

  • DataType (str) – File type of the files with Validtime data. Default ".txt".

  • NumberSeeds (int) – Number of seeds / different networks drawn. Default 10.

  • num_cores (int) – Number of CPU cores to use for the preprocessing. Default is half the system’s cores.

Returns:

Each row corresponds to a single set of hyperparameters. Statistics are given for the following:

mean_t

Mean valid time over all executions of the hyperparameter set

std_t

Standard deviation of mean_t over all executions of the hyperparameter set

max_t

Maximum valid time over all executions of the hyperparameter set

avg_std_seed

Average standard deviation per training seed

avg_std_data

Average standard deviation per training dataset

Return type:

(pd.DataFrame)
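
To illustrate the parallel preprocessing, a sketch using a worker pool; the worker function, file layout, and file count are assumptions:

    from multiprocessing import Pool

    import pandas as pd

    def read_one(i: int) -> pd.DataFrame:
        # Hypothetical worker: parse the i-th raw score file.
        return pd.read_csv(f"output/score_{i}.txt", sep=r"\s+", header=None)

    if __name__ == "__main__":
        with Pool(processes=4) as pool:              # num_cores
            frames = pool.map(read_one, range(100))  # number of files assumed
        df = pd.concat(frames, ignore_index=True)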

save() None[source]

Save dataframe to csv

fix_raw_data() None[source]

Fix bad raw output data and save it to a new file with prefix fn_mod

This function also modifies the expected filename.

Parameters:

fn_mod (str) – Prefix for the modified file. Default is "rf", such that, e.g., "score_0.txt" --> "rf-score_0.txt"
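
A sketch of the renaming convention described above; the actual content repair is unspecified and left as a placeholder:

    from pathlib import Path

    def fix_raw_file(path: Path, fn_mod: str = "rf") -> Path:
        # "score_0.txt" -> "rf-score_0.txt"
        fixed = path.with_name(f"{fn_mod}-{path.name}")
        text = path.read_text()
        # ... repair the malformed content here (details unspecified) ...
        fixed.write_text(text)
        return fixed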

_reformat_single_file(file_index: int)[source]

Takes a file index and creates the reformatted file.

Parameters:

file_index (int) – Index of the file to be reformatted, starting at 0.
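
Combined with the fix_raw_data sketch above, the per-file step presumably maps a zero-based index to its raw file before rewriting it (file layout assumed):

    from pathlib import Path

    def reformat_single_file(file_index: int) -> None:
        raw = Path("output") / f"score_{file_index}.txt"  # layout assumed
        fix_raw_file(raw)  # helper from the fix_raw_data sketch above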