How to use the snorkel.labeling.LFAnalysis function in snorkel

To help you get started, we’ve selected a few snorkel examples based on popular ways LFAnalysis is used in public projects.

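Before digging into the project snippets below, here is a minimal, self-contained sketch of the typical workflow: apply a couple of labeling functions with PandasLFApplier, then inspect the resulting label matrix with LFAnalysis. The toy DataFrame, label values, and LF names are illustrative, not taken from the projects below.

import pandas as pd

from snorkel.labeling import LFAnalysis, PandasLFApplier, labeling_function

ABSTAIN, HAM, SPAM = -1, 0, 1


@labeling_function()
def lf_contains_free(x):
    # Vote SPAM if the word "free" appears, otherwise abstain.
    return SPAM if "free" in x.text.lower() else ABSTAIN


@labeling_function()
def lf_short_text(x):
    # Vote HAM for very short comments, otherwise abstain.
    return HAM if len(x.text.split()) < 4 else ABSTAIN


df = pd.DataFrame(
    {"text": ["free money now!!!", "nice video", "totally free, check it out"]}
)
lfs = [lf_contains_free, lf_short_text]
L = PandasLFApplier(lfs).apply(df)

# lf_summary reports Polarity, Coverage, Overlaps, and Conflicts for each LF.
print(LFAnalysis(L, lfs).lf_summary())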

github snorkel-team / snorkel / test / synthetic / test_synthetic_data.py
"""Test generated label matrix L for consistency with P, Y.

        This tests for consistency between the true conditional LF probabilities, P,
        and the empirical ones computed from L and Y, where P, L, and Y are generated
        by the generate_simple_label_matrix function.

        Parameters
        ----------
        k
            Cardinality
        decimal
            Number of decimals to check element-wise error, err < 1.5 * 10**(-decimal)
        """
        np.random.seed(123)
        P, Y, L = generate_simple_label_matrix(self.n, self.m, k)
        P_emp = LFAnalysis(L).lf_empirical_probs(Y, k=k)
        np.testing.assert_array_almost_equal(P, P_emp, decimal=decimal)
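The test above relies on generate_simple_label_matrix, but lf_empirical_probs itself only needs a label matrix and a gold label vector. A minimal sketch with hand-written inputs (the values are illustrative):

import numpy as np

from snorkel.labeling import LFAnalysis

# Two LFs over four data points; -1 denotes an abstain.
L = np.array([[0, -1], [1, 1], [0, 1], [-1, 0]])
Y = np.array([0, 1, 1, 0])

# One (k+1) x k conditional probability table per LF, with the first row of
# each table corresponding to abstains.
P_emp = LFAnalysis(L).lf_empirical_probs(Y, k=2)
print(P_emp.shape)  # (2, 3, 2)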
github snorkel-team / snorkel / test / labeling / test_analysis.py
def setUp(self) -> None:
    self.lfa = LFAnalysis(np.array(L))
    self.lfa_wo_abstain = LFAnalysis(np.array(L_wo_abstain))
    self.Y = np.array(Y)
github snorkel-team / snorkel / test / labeling / test_analysis.py
df = self.lfa.lf_summary(Y=None, est_weights=None)
df_expected = pd.DataFrame(
    {
        "Polarity": [[1, 2], [], [0, 2], [2], [0, 1], [0]],
        "Coverage": [3 / 6, 0, 3 / 6, 2 / 6, 2 / 6, 4 / 6],
        "Overlaps": [3 / 6, 0, 3 / 6, 1 / 6, 2 / 6, 4 / 6],
        "Conflicts": [3 / 6, 0, 2 / 6, 1 / 6, 2 / 6, 3 / 6],
    }
)
pd.testing.assert_frame_equal(df.round(6), df_expected.round(6))

est_weights = [1, 0, 1, 1, 1, 0.5]
names = list("abcdef")
lfs = [LabelingFunction(s, f) for s in names]
lfa = LFAnalysis(np.array(L), lfs)
df = lfa.lf_summary(self.Y, est_weights=est_weights)
df_expected = pd.DataFrame(
    {
        "j": [0, 1, 2, 3, 4, 5],
        "Polarity": [[1, 2], [], [0, 2], [2], [0, 1], [0]],
        "Coverage": [3 / 6, 0, 3 / 6, 2 / 6, 2 / 6, 4 / 6],
        "Overlaps": [3 / 6, 0, 3 / 6, 1 / 6, 2 / 6, 4 / 6],
        "Conflicts": [3 / 6, 0, 2 / 6, 1 / 6, 2 / 6, 3 / 6],
        "Correct": [1, 0, 1, 1, 1, 2],
        "Incorrect": [2, 0, 2, 1, 1, 2],
        "Emp. Acc.": [1 / 3, 0, 1 / 3, 1 / 2, 1 / 2, 2 / 4],
        "Learned Weight": [1, 0, 1, 1, 1, 0.5],
    }
).set_index(pd.Index(names))
pd.testing.assert_frame_equal(df.round(6), df_expected.round(6))
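The columns reported by lf_summary can also be computed one at a time; a minimal sketch, assuming the same module-level L and Y arrays used by the test fixtures above:

lfa = LFAnalysis(np.array(L))
print(lfa.lf_polarities())                       # unique non-abstain labels emitted by each LF
print(lfa.lf_coverages())                        # fraction of examples each LF labels
print(lfa.lf_overlaps())                         # fraction also labeled by at least one other LF
print(lfa.lf_conflicts())                        # fraction labeled differently by another LF
print(lfa.lf_empirical_accuracies(np.array(Y)))  # per-LF accuracy against the gold labels Y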
github snorkel-team / snorkel-tutorials / spouse / spouse_demo.py
    lf_familial_relationship,
    lf_family_left_window,
    lf_other_relationship,
    lf_distant_supervision,
    lf_distant_supervision_last_names,
]
applier = PandasLFApplier(lfs)

# %% {"tags": ["md-exclude-output"]}
from snorkel.labeling import LFAnalysis

L_dev = applier.apply(df_dev)
L_train = applier.apply(df_train)

# %%
LFAnalysis(L_dev, lfs).lf_summary(Y_dev)

# %% [markdown]
# ### Training the Label Model
#
# Now, we'll train a model of the LFs to estimate their weights and combine their outputs. Once the model is trained, we can combine the outputs of the LFs into a single, noise-aware training label set for our extractor.

# %% {"tags": ["md-exclude-output"]}
from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, Y_dev, n_epochs=5000, log_freq=500, seed=12345)
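Once fit, the label model's probabilistic outputs can be collapsed into training labels. A minimal sketch (not part of the tutorial excerpt), assuming the label_model and L_train defined above:

# Probabilistic labels per data point, and hard labels with ties or
# all-abstain rows mapped to -1 under the "abstain" policy.
probs_train = label_model.predict_proba(L=L_train)
preds_train = label_model.predict(L=L_train, tie_break_policy="abstain")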

# %% [markdown]
# ### Label Model Metrics
# Since our dataset is highly unbalanced (91% of the labels are negative), even a trivial baseline that always outputs negative can get a high accuracy. So we evaluate the label model using the F1 score and ROC-AUC rather than accuracy.
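A minimal sketch of that evaluation (not verbatim from the tutorial), assuming the L_dev and Y_dev arrays from the cells above:

from snorkel.analysis import metric_score
from snorkel.utils import probs_to_preds

# Score the label model's dev-set predictions with F1 and ROC-AUC.
probs_dev = label_model.predict_proba(L=L_dev)
preds_dev = probs_to_preds(probs_dev)
print(f"F1:      {metric_score(Y_dev, preds_dev, probs=probs_dev, metric='f1'):.3f}")
print(f"ROC-AUC: {metric_score(Y_dev, preds_dev, probs=probs_dev, metric='roc_auc'):.3f}")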
github snorkel-team / snorkel-tutorials / crowdsourcing / crowdsourcing_tutorial.py
# %% [markdown]
# Note that because our dev set is so small and our LFs are relatively sparse, many LFs will appear to have zero coverage.
# Fortunately, our label model learns weights for LFs based on their outputs on the training set, which is generally much larger.

# %%
from snorkel.labeling import LFAnalysis

LFAnalysis(L_dev, worker_lfs).lf_summary(Y_dev).sample(5)

# %% [markdown]
# So the crowd labels in general are quite good! But how much of our dev and training
# sets do they cover?

# %%
print(f"Training set coverage: {100 * LFAnalysis(L_train).label_coverage(): 0.1f}%")
print(f"Dev set coverage: {100 * LFAnalysis(L_dev).label_coverage(): 0.1f}%")

# %% [markdown]
# ### Additional labeling functions
#
# To improve coverage of the training set, we can mix the crowdworker labeling functions with labeling
# functions of other types.
# For example, we can use [TextBlob](https://textblob.readthedocs.io/en/dev/index.html), a tool that provides a pretrained sentiment analyzer. We run TextBlob on our tweets and create some simple LFs that threshold its polarity score, similar to what we did in the spam_tutorial.

# %%
from snorkel.labeling import labeling_function
from snorkel.preprocess import preprocessor
from textblob import TextBlob


@preprocessor(memoize=True)
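The body of the decorated preprocessor is not shown in the excerpt above; a minimal sketch of a TextBlob preprocessor plus a polarity-thresholding LF follows (the function names, field name, and threshold are illustrative assumptions, not the tutorial's exact code):

from textblob import TextBlob

from snorkel.labeling import labeling_function
from snorkel.preprocess import preprocessor

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


@preprocessor(memoize=True)
def textblob_scores(x):
    # Attach TextBlob's sentiment scores to the data point for downstream LFs.
    scores = TextBlob(x.text)
    x.polarity = scores.sentiment.polarity
    x.subjectivity = scores.sentiment.subjectivity
    return x


@labeling_function(pre=[textblob_scores])
def lf_polarity_positive(x):
    # Vote POSITIVE only when the polarity score is clearly positive.
    return POSITIVE if x.polarity > 0.3 else ABSTAIN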
github snorkel-team / snorkel-tutorials / spam / 01_spam_tutorial.py
@labeling_function(pre=[textblob_sentiment])
def textblob_subjectivity(x):
    return HAM if x.subjectivity >= 0.5 else ABSTAIN


# %% [markdown]
# Let's apply our LFs so we can analyze their performance.

# %% {"tags": ["md-exclude-output"]}
lfs = [textblob_polarity, textblob_subjectivity]

applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)

# %%
LFAnalysis(L_train, lfs).lf_summary()

# %% [markdown]
# **Again, these LFs aren't perfect—note that the `textblob_subjectivity` LF has fairly high coverage and could have a high rate of false positives. We'll rely on Snorkel's `LabelModel` to estimate the labeling function accuracies and reweight and combine their outputs accordingly.**
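The label model's learned weights can also be fed back into LFAnalysis to view them next to coverage and conflict statistics. A minimal sketch (not part of the tutorial excerpt), assuming the L_train matrix and lfs list above; the hyperparameters are illustrative:

from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=123)

# get_weights returns one learned weight per LF, which lf_summary shows in its
# "Learned Weight" column.
LFAnalysis(L_train, lfs).lf_summary(est_weights=label_model.get_weights())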

# %% [markdown]
# ## 3. Writing More Labeling Functions

# %% [markdown]
# If a single LF had high enough coverage to label our entire test dataset accurately, then we wouldn't need a classifier at all.
# We could just use that single simple heuristic to complete the task.
# But most problems are not that simple.
# Instead, we usually need to **combine multiple LFs** to label our dataset, both to increase the size of the generated training set (since we can't generate training labels for data points that no LF voted on) and to improve the overall accuracy of the training labels we generate by factoring in multiple different signals.
#
# In the following sections, we'll show just a few of the many types of LFs that you could write to generate a training dataset for this problem.

# %% [markdown]
github snorkel-team / snorkel-tutorials / spam / 01_spam_tutorial.py
@labeling_function()
def regex_check_out(x):
    return SPAM if re.search(r"check.*out", x.text, flags=re.I) else ABSTAIN


# %% [markdown]
# Again, let's generate our label matrices and see how we do.

# %% {"tags": ["md-exclude-output"]}
lfs = [check_out, check, regex_check_out]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)

# %%
LFAnalysis(L=L_train, lfs=lfs).lf_summary()

# %% [markdown]
# We've split the difference in `train` set coverage—this looks promising!
# Let's verify that we corrected our false positive from before.

# %% [markdown]
# To understand the coverage difference between `check` and `regex_check_out`, let's take a look at 10 data points from the `train` set.
# Remember: coverage isn't always good.
# Adding false positives will increase coverage.

# %%
buckets = get_label_buckets(L_train[:, 1], L_train[:, 2])
df_train.iloc[buckets[(SPAM, ABSTAIN)]].sample(10, random_state=1)

# %% [markdown]
# Most of these are SPAM, but a good number are false positives.
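get_label_buckets returns a dict keyed by tuples of votes, so other agreement patterns can be inspected the same way; a minimal sketch (the bucket chosen here is illustrative):

# Rows where both `check` (column 1) and `regex_check_out` (column 2) voted
# SPAM; .get guards against a bucket that happens to be empty.
both_spam = buckets.get((SPAM, SPAM), [])
df_train.iloc[both_spam].head()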