HoloClean

Noisy and erroneous data is a major bottleneck in analytics. Data cleaning and repairing account for about 60% of the work of data scientists. To address this bottleneck, we recently introduced HoloClean, a semi-automated data repairing framework that relies on statistical learning and inference to repair errors in structured data. In HoloClean, we build upon the paradigm of weak supervision and demonstrate how to leverage diverse signals, including user-defined heuristic rules (such as generalized data integrity constraints), to repair erroneous data.

HoloClean and Session classes

class holoclean.holoclean.HoloClean(**kwargs)

Main entry point for HoloClean, which creates a HoloClean Data Engine and initializes Spark.

class holoclean.holoclean.Session(holo_env, name='session')

The Session class controls the entire HoloClean pipeline.

add_denial_constraint(dc)

Adds a denial constraint, given as a string, to self.Denial_constraints.

Parameters: dc – string in denial constraint (DC) format
Returns: string array of DCs
compare_to_truth(truth_path)

Compares the repaired dataset to the ground truth and prints precision and recall.

Parameters: truth_path – path to the clean version of the dataset
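The comparison above can be illustrated with a minimal, library-independent sketch. It assumes that precision is the fraction of changed cells that now match the truth and that recall is the fraction of erroneous cells that were fixed; the source does not state how HoloClean computes these numbers. The function name precision_recall and the (row_id, attribute) cell keys are hypothetical and not part of the HoloClean API.

```python
def precision_recall(dirty, repaired, truth):
    """Score repairs cell by cell.

    Each argument maps a (row_id, attribute) key to a cell value.
    This is an illustrative definition, not HoloClean's own code.
    """
    # Cells whose original value differs from the ground truth.
    errors = {cell for cell, value in dirty.items() if truth[cell] != value}
    # Cells whose value was changed by the repair step.
    repairs = {cell for cell, value in repaired.items() if dirty[cell] != value}
    # Changed cells that now agree with the ground truth.
    correct = {cell for cell in repairs if repaired[cell] == truth[cell]}
    precision = len(correct) / len(repairs) if repairs else 0.0
    recall = len(correct) / len(errors) if errors else 0.0
    return precision, recall


# Made-up example data: two misspelled cities in the dirty table.
truth = {(1, "City"): "Chicago", (1, "Zip"): "60601",
         (2, "City"): "Chicago", (2, "Zip"): "60602",
         (3, "City"): "Chicago", (3, "Zip"): "60603"}
dirty = dict(truth)
dirty[(1, "City")] = "Chicgo"
dirty[(3, "City")] = "Cicago"

# One correct repair, one missed error, one wrong change to a clean cell.
repaired = dict(dirty)
repaired[(1, "City")] = "Chicago"
repaired[(2, "Zip")] = "60601"

precision, recall = precision_recall(dirty, repaired, truth)
print("precision=%.2f recall=%.2f" % (precision, recall))
```

In this example, one of the two changed cells is correct and one of the two erroneous cells is fixed, so both scores are 0.5.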
detect_errors(detector_list)

Separates cells that violate DCs from those that do not.

Parameters: detector_list – list of error detectors
Returns: two dataframes, the clean cells and the "don't know" cells
load_clean_data(file_path)

Loads pre-defined clean cells from a CSV file.

Parameters: file_path – path to the file
Returns: Spark dataframe of clean cells
load_data(file_path)

Loads a dataset from a file into the database.

Parameters: file_path – path to the data file
Returns: PySpark dataframe
load_denial_constraints(file_path)

Loads denial constraints from a line-separated text file.

Parameters: file_path – path to the DC file
Returns: string array of DCs
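As a rough illustration of the "line-separated" file layout, the sketch below reads one constraint string per line using only the standard library. The helper name read_denial_constraints is hypothetical, skipping blank lines is an assumption, and the sample constraint strings are placeholders rather than a confirmed DC syntax.

```python
import os
import tempfile


def read_denial_constraints(file_path):
    """Return one constraint string per non-empty line of the file.

    Hypothetical helper; it only mirrors the documented file layout.
    """
    with open(file_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Placeholder constraint strings, separated by a blank line on purpose.
sample = (
    "t1&t2&EQ(t1.Zip,t2.Zip)&IQ(t1.City,t2.City)\n"
    "\n"
    "t1&t2&EQ(t1.Phone,t2.Phone)&IQ(t1.Zip,t2.Zip)\n"
)

# Write the sample to a temporary file, read it back, then clean up.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

constraints = read_denial_constraints(tmp.name)
os.remove(tmp.name)

for number, constraint in enumerate(constraints):
    print(number, constraint)
```

The positions printed by enumerate correspond to the kind of list index that remove_denial_constraint(index) takes, assuming the constraints are kept in file order.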
load_dirty_data(file_path)

Loads pre-defined dirty cells from a CSV file.

Parameters: file_path – path to the file
Returns: Spark dataframe of dirty cells
remove_denial_constraint(index)

Removes the denial constraint at the given index.

Parameters: index – index in the list of DCs
Returns: string array of DCs
repair()

Repairs the initial data. The repair process includes pruning, featurization, and softmax.

Returns: repaired dataset
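The methods documented above might be chained as in the following sketch. It uses only the method names and parameters listed on this page, but the call order is an assumption, and run_pipeline and RecordingSession are hypothetical names introduced for illustration. RecordingSession is a stand-in that records calls so the sketch runs without HoloClean or Spark installed; with the real library, the session argument would presumably be a holoclean.holoclean.Session built from a HoloClean instance.

```python
def run_pipeline(session, data_path, dc_path, detector_list, truth_path=None):
    """Call the documented Session methods in one plausible order."""
    session.load_data(data_path)
    session.load_denial_constraints(dc_path)
    session.detect_errors(detector_list)
    repaired = session.repair()
    if truth_path is not None:
        # Optional evaluation against a clean copy of the dataset.
        session.compare_to_truth(truth_path)
    return repaired


class RecordingSession:
    """Hypothetical stand-in that only records which methods were called."""

    def __init__(self):
        self.calls = []

    def load_data(self, file_path):
        self.calls.append("load_data")

    def load_denial_constraints(self, file_path):
        self.calls.append("load_denial_constraints")

    def detect_errors(self, detector_list):
        self.calls.append("detect_errors")

    def repair(self):
        self.calls.append("repair")
        return "repaired dataset"

    def compare_to_truth(self, truth_path):
        self.calls.append("compare_to_truth")


# The file names below are placeholders, not files shipped with HoloClean.
session = RecordingSession()
result = run_pipeline(
    session, "data.csv", "constraints.txt", [], truth_path="clean.csv"
)
print(session.calls)
```

Passing the session in as an argument keeps the sketch independent of how the HoloClean and Session objects are constructed, which this page does not describe beyond their signatures.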