Error detection

HoloClean learns to clean data by first splitting the cells of a dataset into two categories: clean and don't-know (dk for short). Two error detectors are provided: SqlDCErrorDetection, which uses denial constraints to make this split, and SqlnullErrorDetection, which labels all cells that have null values as don't-know cells. HoloClean also supports custom, user-defined error detection methods: create a new class that inherits from ErrorDetection and overrides the required methods.
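
As a rough sketch of that workflow, assuming session is an already-initialized HoloClean session with the dataset and denial constraints loaded (the setup calls are omitted here and may differ between versions), a detector produces the two splits like this:

    from holoclean.errordetection.sql_dcerrordetector import SqlDCErrorDetection

    # `session` is assumed to come from an existing HoloClean setup (not shown).
    detector = SqlDCErrorDetection(session)

    clean_cells = detector.get_clean_cells()  # cells with no detected problem
    dk_cells = detector.get_noisy_cells()     # "don't know" (potentially erroneous) cells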

ErrorDetector

class holoclean.errordetection.errordetector.ErrorDetection(holo_obj, dataset)[source]
This is the abstract base class for error detection;
every subclass must implement the
get_clean_cells and get_noisy_cells methods.

get_clean_cells()[source]
This method creates a dataframe with the (index, attribute) information for the clean cells.

Returns: dataframe of the clean cells

get_noisy_cells()[source]
This method creates a dataframe with the (index, attribute) information for the dk cells.

Returns: dataframe of the dk cells
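
To illustrate the extension point mentioned in the introduction, the following is a minimal, hypothetical subclass skeleton. The class name and the way the (index, attribute) dataframes would be built are assumptions, not part of the HoloClean API; only the constructor arguments and the two required methods come from the interface documented above.

    from holoclean.errordetection.errordetector import ErrorDetection

    class PlaceholderStringErrorDetection(ErrorDetection):
        # Hypothetical detector: would treat cells equal to "N/A" as don't-know.

        def __init__(self, holo_obj, dataset):
            super(PlaceholderStringErrorDetection, self).__init__(holo_obj, dataset)

        def get_noisy_cells(self):
            # Must return a dataframe of (index, attribute) pairs for the cells
            # this detector considers "don't know". The actual query against the
            # loaded dataset is omitted because it depends on the HoloClean
            # version and its data-access helpers.
            raise NotImplementedError

        def get_clean_cells(self):
            # Must return the complementary (index, attribute) pairs, i.e. every
            # cell not reported by get_noisy_cells.
            raise NotImplementedError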

SqlDCErrorDetection

class holoclean.errordetection.sql_dcerrordetector.SqlDCErrorDetection(session)[source]

This class is a subclass of the ErrorDetection class and returns the don't-know cells and clean cells based on the denial constraints.

get_clean_cells()[source]
Returns a dataframe with the (index, attribute) pairs of the clean cells.

Returns: spark dataframe

get_noisy_cells()[source]
Returns a dataframe with the (index, attribute) pairs of the noisy (dk) cells.

Returns: spark dataframe
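
Continuing the earlier sketch, both methods return Spark dataframes of (index, attribute) pairs, so their output can be inspected with the usual Spark dataframe operations (the exact column names are version-dependent and not shown here):

    dc_detector = SqlDCErrorDetection(session)  # `session` assumed as before

    dk_cells = dc_detector.get_noisy_cells()
    dk_cells.show(5)         # peek at a few cells flagged as don't-know
    print(dk_cells.count())  # number of cells involved in a denial constraint violation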

SqlnullErrorDetection

class holoclean.errordetection.sql_dcerrordetector.SqlnullErrorDetection(session)[source]

This class is a subclass of the ErrorDetection class and returns the don't-know cells and clean cells based on null values: every cell with a null value is marked as a don't-know cell.

get_clean_cells()[source]
Returns a dataframe with the (index, attribute) pairs of the clean cells.

Returns: spark dataframe

get_noisy_cells()[source]
Returns a dataframe with the (index, attribute) pairs of the noisy (dk) cells.

Returns: spark dataframe
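
Since both provided detectors expose the same interface, their results can be combined. The sketch below unions the don't-know cells from the denial-constraint detector and the null detector, assuming Spark 2.x dataframes with matching columns and the same session object as in the earlier examples:

    from holoclean.errordetection.sql_dcerrordetector import (
        SqlDCErrorDetection,
        SqlnullErrorDetection,
    )

    dc_detector = SqlDCErrorDetection(session)
    null_detector = SqlnullErrorDetection(session)

    # Treat a cell as "don't know" if it violates a denial constraint OR is null.
    dk_cells = dc_detector.get_noisy_cells() \
        .union(null_detector.get_noisy_cells()) \
        .dropDuplicates()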