I need to create a custom transformer to submit to a grader.

The grader passes a list of dictionaries, not a DataFrame, to the predict or predict_proba method of my estimator. This means that the model must work with both data types. For this reason, I need to provide a custom ColumnSelectTransformer to use instead of scikit-learn's own ColumnTransformer.
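
To make the two input types concrete, here is a small sketch of how pandas handles a list of dictionaries. The records are made up for illustration, and BEDCERT is an assumed column name; neither is taken from the grader's data.

```python
import pandas as pd

# Hypothetical records shaped like what a grader might pass to predict().
records = [
    {"BEDCERT": 100, "RESTOT": 80.0},
    {"BEDCERT": 60, "RESTOT": None},
]

# A list of dicts converts directly: keys become columns, each dict one row.
frame = pd.DataFrame(records)

# The None entry counts as a missing value in the resulting column.
n_missing = frame["RESTOT"].isnull().sum()
```

Converting at the top of transform therefore lets the rest of the method treat both input types the same way.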

This is my code for the custom transformer, which aims to drop null values in the given columns:


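import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class ColumnSelectTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # The grader may pass a list of dicts, so convert to a DataFrame first
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        X.dropna(inplace=True)
        # This is the line that raises the TypeError shown in the traceback
        return X[self.columns].values()

simple_features = Pipeline([
    ('cst', ColumnSelectTransformer(simple_cols)),
])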

However, I am unable to pass the following assertion tests:

assert data['RESTOT'].isnull().sum() > 0
assert not np.isnan(simple_features.fit_transform(data)).any()

Instead, I get a TypeError:

TypeError                                 Traceback (most recent call last)
<ipython-input-44-922f08231b1f> in <module>()
      1 assert not data['RESTOT'].isnull().sum() > 0
----> 2 assert not np.isnan(simple_features.fit_transform(data)).any()

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
    391                 return Xt
    392             if hasattr(last_step, 'fit_transform'):
--> 393                 return last_step.fit_transform(Xt, y, **fit_params)
    394             else:
    395                 return last_step.fit(Xt, y, **fit_params).transform(Xt)

/opt/conda/lib/python3.7/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    551         if y is None:
    552             # fit method of arity 1 (unsupervised transformation)
--> 553             return self.fit(X, **fit_params).transform(X)
    554         else:
    555             # fit method of arity 2 (supervised transformation)

<ipython-input-42-e20ea4310864> in transform(self, X)
     12             X = pd.DataFrame(X)
     13         X.dropna(inplace=True)
---> 14         return X[self.columns].values()
     16 simple_features = Pipeline([

TypeError: 'numpy.ndarray' object is not callable
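
The traceback points at `.values()`. On a DataFrame, `values` is a property that already returns a NumPy array, so the trailing parentheses try to call that array. Below is a minimal sketch of a corrected transformer, not an official solution. It also assumes, beyond what the question states, that missing values are filled by a separate `SimpleImputer` step rather than removed with `dropna`, since dropping rows inside `transform` changes the number of samples. The toy data and the BEDCERT column name are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class ColumnSelectTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Accept either a DataFrame or a list of dicts.
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        # .values is a property, so it takes no parentheses.
        return X[self.columns].values


# Hypothetical stand-ins for simple_cols and data from the question.
simple_cols = ["BEDCERT", "RESTOT"]
data = pd.DataFrame({"BEDCERT": [100, 60, 45], "RESTOT": [80.0, np.nan, 40.0]})

simple_features = Pipeline([
    ("cst", ColumnSelectTransformer(simple_cols)),
    # Assumption: fill NaNs with the column mean instead of dropping rows.
    ("imputer", SimpleImputer(strategy="mean")),
])

result = simple_features.fit_transform(data)
assert data["RESTOT"].isnull().sum() > 0
assert not np.isnan(result).any()

# A list of dicts, as the grader would pass, goes through the same path.
as_records = data.to_dict(orient="records")
assert simple_features.transform(as_records).shape == (3, 2)
```

Keeping the imputation in its own pipeline step leaves the selector free of side effects: it never mutates the caller's DataFrame, and the number of output rows always matches the number of input rows.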

Here is the actual data if anyone wants access.

mkdir -p ml-data
wget http://dataincubator-wqu.s3.amazonaws.com/mldata/providers-train.csv -nc -P ./ml-data
wget http://dataincubator-wqu.s3.amazonaws.com/mldata/providers-metadata.csv -nc -P ./ml-data

data = pd.read_csv('./ml-data/providers-train.csv', encoding='latin1')
