HoloClean
Noisy and erroneous data are a major bottleneck in analytics: data cleaning and repair account for about 60% of the work of data scientists. To address this bottleneck, we recently introduced HoloClean, a semi-automated data-repairing framework that relies on statistical learning and inference to repair errors in structured data. HoloClean builds on the paradigm of weak supervision and demonstrates how to leverage diverse signals, including user-defined heuristic rules (such as generalized data integrity constraints), to repair erroneous data.
HoloClean and Session classes
class holoclean.holoclean.HoloClean(**kwargs)
    Main entry point for HoloClean. Creates a HoloClean data engine and initializes Spark.
class holoclean.holoclean.Session(holo_env, name='session')
    Controls the entire HoloClean pipeline.
add_denial_constraint(dc)
    Adds a denial constraint, given as a string, to self.Denial_constraints.

    Parameters: dc – string in denial-constraint format
    Returns: string array of DCs
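HoloClean's exact denial-constraint string syntax is not documented in this section, but the semantics are standard: a denial constraint forbids any combination of tuples that jointly satisfies all of its predicates. As an illustrative sketch (not HoloClean code), the functional dependency zip → city can be read as the denial constraint "there are no two tuples t1, t2 with t1.zip = t2.zip and t1.city ≠ t2.city", checked here in plain Python over dict-based rows:

```python
from itertools import combinations

def violates_fd(rows, lhs, rhs):
    """Return pairs of row indices that jointly violate the FD lhs -> rhs.

    Illustrative checker over dicts; not HoloClean's API.
    """
    violations = []
    for (i, t1), (j, t2) in combinations(enumerate(rows), 2):
        # Two rows violate the DC if they agree on lhs but disagree on rhs.
        if t1[lhs] == t2[lhs] and t1[rhs] != t2[rhs]:
            violations.append((i, j))
    return violations

rows = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},   # typo: violates zip -> city
    {"zip": "94025", "city": "Menlo Park"},
]
print(violates_fd(rows, "zip", "city"))  # → [(0, 1)]
```

Violating tuple *pairs* are what a denial constraint identifies; how HoloClean then attributes the violation to individual cells is handled by its error detectors.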
compare_to_truth(truth_path)
    Compares the repaired dataset to the ground truth and prints precision and recall.

    Parameters: truth_path – path to the clean version of the dataset
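This section does not spell out how compare_to_truth scores repairs, but cell-level precision and recall are the standard metrics for repair quality. The sketch below is an assumption about that computation, with hypothetical dict-based inputs, not HoloClean's implementation:

```python
def repair_precision_recall(repairs, truth):
    """Cell-level precision/recall of proposed repairs against ground truth.

    `repairs` and `truth` map (row_id, attribute) -> value, where `truth`
    holds the correct values for the cells that were actually erroneous.
    Illustrative sketch only.
    """
    correct = sum(1 for cell, val in repairs.items() if truth.get(cell) == val)
    precision = correct / len(repairs) if repairs else 0.0  # repairs that were right
    recall = correct / len(truth) if truth else 0.0         # errors that got fixed
    return precision, recall

repairs = {(1, "city"): "Chicago", (2, "state"): "IL", (3, "zip"): "60608"}
truth = {(1, "city"): "Chicago", (2, "state"): "IN"}
# One of three repairs is correct (precision 1/3); it fixes one of the
# two true errors (recall 1/2).
print(repair_precision_recall(repairs, truth))
```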
detect_errors(detector_list)
    Separates cells that violate denial constraints from those that do not.

    Parameters: detector_list – list of error detectors
    Returns: clean dataframe and don't-know dataframe
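The clean/don't-know split can be illustrated in plain Python. In this sketch (an assumption about the semantics, not HoloClean's detector API), every cell of a row that participates in a violation of an FD-style constraint goes to the don't-know set, and all other cells are treated as clean:

```python
from itertools import combinations

def split_cells(rows, lhs, rhs):
    """Partition (row_index, attribute) cells into clean and don't-know lists.

    A cell lands in dont_know if its row participates in a violation of the
    FD lhs -> rhs; otherwise it is clean. Illustrative sketch only.
    """
    flagged = set()
    for (i, t1), (j, t2) in combinations(enumerate(rows), 2):
        if t1[lhs] == t2[lhs] and t1[rhs] != t2[rhs]:
            flagged.update({i, j})  # both rows are implicated in the violation
    clean, dont_know = [], []
    for i, row in enumerate(rows):
        for attr in row:
            (dont_know if i in flagged else clean).append((i, attr))
    return clean, dont_know

rows = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},
    {"zip": "94025", "city": "Menlo Park"},
]
clean, dont_know = split_cells(rows, "zip", "city")
print(dont_know)  # cells of rows 0 and 1, which jointly violate zip -> city
```

Note that a violation only tells us the two rows conflict, not which one is wrong, which is why the conflicting cells go to "don't know" rather than "dirty"; downstream statistical inference decides the actual repairs.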
load_clean_data(file_path)
    Loads pre-defined clean cells from a CSV file.

    Parameters: file_path – path to the file
    Returns: Spark dataframe of clean cells
load_data(file_path)
    Loads a dataset from a file into the database.

    Parameters: file_path – path to the data file
    Returns: PySpark dataframe
load_denial_constraints(file_path)
    Loads denial constraints from a line-separated text file.

    Parameters: file_path – path to the DC file
    Returns: string array of DCs
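The expected file layout is simply one constraint string per line. A minimal reader for that layout can be sketched as follows; the reader function and the constraint string are both illustrative placeholders, since neither HoloClean's internal loader nor its exact DC syntax is documented in this section:

```python
import os
import tempfile

def load_dcs(file_path):
    """Read one denial-constraint string per non-empty line. Illustrative reader."""
    with open(file_path) as f:
        return [line.strip() for line in f if line.strip()]

# Placeholder constraint string, not necessarily HoloClean's exact syntax:
dc_text = "t1&t2&EQ(t1.zip,t2.zip)&IQ(t1.city,t2.city)\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(dc_text)
    path = f.name
print(load_dcs(path))  # one DC string per line of the file
os.remove(path)
```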
load_dirty_data(file_path)
    Loads pre-defined dirty cells from a CSV file.

    Parameters: file_path – path to the file
    Returns: Spark dataframe of dirty cells
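The documented calls can be strung together into a pipeline. The sketch below uses only the signatures listed above, but it is an assumption-laden outline, not a guaranteed recipe: it requires a HoloClean installation with Spark, the import path may differ between versions, and the concrete error-detector classes to pass to detect_errors are not documented in this section. The import is deferred into the function body so the sketch can be read (and defined) without HoloClean installed; the function is only meaningful to call against a real installation.

```python
def run_pipeline(data_path, dc_path, truth_path):
    """End-to-end sketch of a HoloClean session using the calls documented above.

    Assumptions: HoloClean and Spark are installed, the module path
    holoclean.holoclean matches this documentation, and a suitable list of
    error detectors is supplied where `detector_list` is left empty below.
    """
    import holoclean  # assumed installed; deferred so the sketch imports cleanly

    holo = holoclean.holoclean.HoloClean()          # creates the engine, starts Spark
    session = holoclean.holoclean.Session(holo)     # controls the pipeline

    session.load_data(data_path)                    # dataset into the database
    dcs = session.load_denial_constraints(dc_path)  # one DC string per line

    # detect_errors takes a list of error detectors; the concrete detector
    # classes are not documented in this section, so the list is left empty
    # here as a placeholder.
    detector_list = []
    clean_df, dont_know_df = session.detect_errors(detector_list)

    session.compare_to_truth(truth_path)            # prints precision and recall
    return dcs, clean_df, dont_know_df
```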