openclean.data.load module

Collection of helper methods to load a dataset from CSV files.

openclean.data.load.dataset(filename: str, header: Optional[List[Union[str, histore.document.schema.Column]]] = None, delim: Optional[str] = None, compressed: Optional[bool] = None, typecast: Optional[openclean.profiling.datatype.convert.DatatypeConverter] = None, none_is: Optional[str] = None, encoding: Optional[str] = None) → pandas.core.frame.DataFrame

Read a pandas data frame from a CSV file. If the delimiter or compression is not specified, this function infers it from the file name. For now the inference follows a very basic pattern. Files that have ‘.tsv’ (or ‘.tsv.gz’) as their suffix are expected to be tab-delimited. Files that end with ‘.gz’ are expected to be gzip compressed.
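The suffix rules above can be sketched as follows. This is a minimal illustration, not openclean's own code: the helper name infer_delim_and_compression is hypothetical, and the fallback to a comma for any other suffix is an assumption that the text above does not state.

```python
from typing import Optional, Tuple


def infer_delim_and_compression(
    filename: str,
    delim: Optional[str] = None,
    compressed: Optional[bool] = None,
) -> Tuple[str, bool]:
    """Hypothetical helper that mirrors the documented suffix rules."""
    # Files that end with '.gz' are expected to be gzip compressed.
    if compressed is None:
        compressed = filename.endswith('.gz')
    # Files with a '.tsv' or '.tsv.gz' suffix are expected to be tab-delimited.
    if delim is None:
        is_tsv = filename.endswith('.tsv') or filename.endswith('.tsv.gz')
        # Assumption: every other suffix falls back to a comma.
        delim = '\t' if is_tsv else ','
    return delim, compressed
```

Explicitly passed values take precedence in this sketch, so a semicolon-delimited file named ‘data.csv.gz’ can still be described by passing delim=';' while the compression is inferred from the suffix.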

Returns a pandas DataFrame where the column names are instances of the identifiable Column class used by openclean.

Parameters

filename (string) – Path to the CSV file that is being read.
header (list of string or histore.document.schema.Column, default=None) – Optional header. If no header is given, the first row in the CSV file is assumed to contain the header information.
delim (string, default=None) – The column delimiter used in the CSV file.
compressed (bool, default=None) – Flag indicating if the file contents have been compressed using gzip.
typecast (openclean.profiling.datatype.convert.DatatypeConverter, default=None) – Optional type converter that is applied to all data rows.
none_is (string, default=None) – String that was used to encode None values in the input file. If given, all cell values that match the given string are substituted by None.
encoding (string, default=None) – The CSV file encoding, e.g. utf-8 or utf-16.

Return type

pd.DataFrame
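The header and none_is semantics described under Parameters can be illustrated with the standard library alone. The sketch below is not the library's implementation: read_rows is a hypothetical helper that works on an in-memory string, returns plain lists instead of a data frame, and leaves out type casting, compression, and the identifiable Column objects.

```python
import csv
import io
from typing import List, Optional, Tuple


def read_rows(
    text: str,
    header: Optional[List[str]] = None,
    delim: str = ',',
    none_is: Optional[str] = None,
) -> Tuple[List[str], List[list]]:
    """Hypothetical stand-in that mimics the header and none_is rules."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delim))
    # Without an explicit header, the first row is taken as the header.
    if header is None:
        header = rows.pop(0) if rows else []
    # Every cell that equals the none_is string is replaced by None.
    if none_is is not None:
        rows = [[None if value == none_is else value for value in row] for row in rows]
    return header, rows


# The first row becomes the header; 'NULL' cells become None.
columns, data = read_rows("id,name\n1,NULL\n2,Bob\n", none_is="NULL")
```

When a header is passed explicitly, this sketch treats every row in the input as data, which is the reading suggested by the description of the header parameter.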