2.2. drrc.analysis module
- class AutomaticPostprocessing(root: Path)[source]
Bases:
object
This class enables the automatic concatenation of all subtask outputs into a single file called “DataFrame.csv”. Since the file is written inside the respective job directory, its path remains unique per job.
Initialise automatic concatenation
- Parameters:
root – shallowest path from which to search for job directories, starting from the git root
Todo
make this work with relative paths from the git root
either use a root path or supply a YAML for single-job post-processing
- summary() → None [source]
Print summary of the automatic concatenation
Note
This function does not touch any of the files! Its main purpose is debugging / checking that
AutomaticPostprocessing.auto_concatenate()
behaves as expected.
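A dry-run report in this spirit can be sketched as follows. This is illustrative only: the directory layout is made up for the demo, and the report shape is not the class's actual output format.

```python
# Sketch of a non-destructive summary: list which numbered .csv files would
# be concatenated per job directory, without touching any of them.
import tempfile
from pathlib import Path

# Throwaway directory tree standing in for a real run root (hypothetical).
root = Path(tempfile.mkdtemp())
(root / "job_a").mkdir()
(root / "job_a" / "0.csv").write_text("x\n1\n")
(root / "job_a" / "1.csv").write_text("x\n2\n")

# Collect candidate files per job directory; read-only, nothing is modified.
report = {
    job.name: sorted(p.name for p in job.glob("*.csv"))
    for job in root.iterdir() if job.is_dir()
}
```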
- auto_concatenate(Delete=False) → None [source]
Automatically concatenate all raw output files
- Parameters:
Delete – If True, then the raw output files are deleted after concatenation. Default is False.
The result is then saved with the raw output in a file named
DataFrame.csv
- _concatenate(path: Path) → DataFrame [source]
Concatenate a single cluster job
This takes all numbered
.csv
DataFrames and concatenates them into one. It checks whether the number of lines agrees with what is expected from the config; if not, it raises a warning.
- Parameters:
path – path to output of cluster job
- Returns:
pd.DataFrame
which contains all data from a single job
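A minimal self-contained sketch of this concatenation step is shown below. The function name, the numbered file stems, and the `expected_rows` check (standing in for the value read from the config) are illustrative assumptions, not the actual implementation.

```python
# Sketch: concatenate all numbered .csv files of one job into a single
# DataFrame, warning when the row count deviates from the expected value.
import tempfile
import warnings
from pathlib import Path

import pandas as pd

def concatenate_job(path: Path, expected_rows: int) -> pd.DataFrame:
    """Concatenate numbered .csv parts in `path` (hypothetical helper)."""
    parts = sorted(path.glob("*.csv"), key=lambda p: int(p.stem))
    df = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
    if len(df) != expected_rows:
        warnings.warn(f"expected {expected_rows} rows, got {len(df)}")
    return df

# Demo on a throwaway job directory with two subtask outputs.
job_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"x": [1, 2]}).to_csv(job_dir / "0.csv", index=False)
pd.DataFrame({"x": [3]}).to_csv(job_dir / "1.csv", index=False)

combined = concatenate_job(job_dir, expected_rows=3)
combined.to_csv(job_dir / "DataFrame.csv", index=False)
```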
- class AnalyseClusterRunBase(conf: Config)[source]
Bases:
ABC
Base class for all analysis types we will run later.
This includes all basic functionality that will be used later, such as IO.
Initialize a cluster run object for analysis
- Parameters:
conf (Config) – A config object that has previously been run on the cluster.
- class HyperPostProcessing(conf: Config, data_name: str = 'score_', data_type: str = '.txt', num_cores: int = 2)[source]
Bases:
AnalyseClusterRunBase
Post-processing of our hyperparameter scans
This class is meant to take in raw data and generate the desired DataFrame
Initialize post-processing
Important
This class assumes that output files contain a numerical (1-based) index between
data_name
and
data_type
!
- Parameters:
- _get_raw_file(i: int) → Path [source]
Generate the filename for the i-th raw data file
- Parameters:
i (int) – Number of the file using zero-based indexing
- Returns:
Path to requested file
- Return type:
(str)
Todo
Make this method return a Path instead of str
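One plausible way to build these filenames, assuming the 1-based index noted above while the argument `i` stays zero-based, and returning a `Path` as the Todo suggests. The helper and directory names are hypothetical.

```python
# Sketch: map a zero-based counter i to the 1-based index embedded in the
# raw filename, i.e. data_name + (i + 1) + data_type.
from pathlib import Path

def get_raw_file(output_dir: Path, data_name: str, data_type: str, i: int) -> Path:
    """Hypothetical helper mirroring _get_raw_file's documented behaviour."""
    return output_dir / f"{data_name}{i + 1}{data_type}"

first = get_raw_file(Path("output"), "score_", ".txt", 0)
tenth = get_raw_file(Path("output"), "score_", ".txt", 9)
```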
- _generate_statistics(i: int, params: dict)[source]
Internal function to generate the statistics of
self.process()
Warning
This function will not return any statistics if the NumPy array contains any NaNs. NumPy provides NaN-aware functions for this, but they are not used at the moment.
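The NaN issue mentioned in the warning can be illustrated with plain versus NaN-aware NumPy reductions; the sample array is made up for the demo.

```python
# Plain reductions propagate NaN, so a single bad value poisons the result;
# the nan-aware variants (np.nanmean etc.) simply ignore it.
import numpy as np

times = np.array([1.0, 2.0, np.nan, 4.0])

plain_mean = np.mean(times)       # NaN: one missing value ruins the mean
nan_mean = np.nanmean(times)      # mean over the non-NaN entries only
has_nan = np.isnan(times).any()   # cheap check before computing statistics
```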
- process() → DataFrame [source]
Extract Validtime data and seeds from
.txt
files stored in
conf['Saving']['OutputDirectory']
into a
pd.DataFrame
.
To accelerate pre-processing of the raw data, this function runs on half the system’s cores by default.
- Parameters:
conf (Config) –
Config
with ClusterRun information.
DataDir (Path) – Passed to Data; might deviate from the Path in conf depending on the user.
DataName (str) – Name of files with Validtime data up to the iterator number. Default
score_
.
DataType (str) – File type of files with Validtime data. Default
.txt
.
NumberSeeds (int) – Number of seeds / different networks drawn. Default 10.
num_cores (int) – Number of CPU cores to use for the preprocessing. Default is half the system’s cores.
- Returns:
Each row corresponds to a single set of hyperparameters. Statistics are given for the following:
mean_t
Mean valid time over all executions of the hyperparameter set
std_t
Standard deviation of the valid time over all executions of the hyperparameter set
max_t
Maximum valid time over all executions of the hyperparameter set
avg_std_seed
Average standard deviation per training seed
avg_std_data
Average standard deviation per training dataset
- Return type:
(pd.DataFrame)
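The first three documented statistics (mean_t, std_t, max_t) can be sketched with a pandas groupby. The raw-data layout here (an `lr` hyperparameter column and a `valid_time` column) is invented for the demo; the real raw files are parsed from the `.txt` outputs described above.

```python
# Sketch: aggregate valid times per hyperparameter set into the documented
# mean_t / std_t / max_t columns of the result DataFrame.
import pandas as pd

# Fabricated raw data: two hyperparameter sets, three executions each.
raw = pd.DataFrame({
    "lr":         [0.1, 0.1, 0.1, 0.2, 0.2, 0.2],
    "valid_time": [10.0, 12.0, 14.0, 20.0, 22.0, 24.0],
})

# One row per hyperparameter set, statistics over its executions.
stats = raw.groupby("lr")["valid_time"].agg(
    mean_t="mean", std_t="std", max_t="max"
).reset_index()
```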