hub / github.com/ddbourgin/numpy-ml / plot_gt_freqs

Function plot_gt_freqs

numpy_ml/plots/ngram_plots.py:80–115

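Plots unigram token probabilities against frequency rank on log-log axes for three models trained on the corpus at fp: maximum likelihood (MLENGram), simple Good-Turing (GoodTuringNGram), and additive smoothing (AdditiveNGram, labeled "Laplace smoothing" in the legend). The figure is saved to img/rank_probs.png. Depends on NumPy and matplotlib.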
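For readers unfamiliar with the idea behind the Good-Turing curve, the following is a minimal sketch of the basic (unsmoothed) Turing estimates computed from raw token counts. It is not the implementation used by GoodTuringNGram: Simple Good-Turing additionally smooths the frequency-of-frequency values with a log-linear fit before applying them, which this sketch omits. The function name turing_estimates is hypothetical and chosen for illustration only.

```python
from collections import Counter


def turing_estimates(tokens):
    """Basic Turing estimates from a list of tokens (illustrative sketch only)."""
    counts = Counter(tokens)                   # r: times each species was seen
    total = sum(counts.values())               # N: total number of observations
    freq_of_freqs = Counter(counts.values())   # N_r: number of species seen r times

    # Probability mass reserved for unseen species: P0 = N_1 / N
    p_unseen = freq_of_freqs.get(1, 0) / total

    # Adjusted counts r* = (r + 1) * N_{r+1} / N_r, defined only where N_{r+1} > 0.
    # The gaps where N_{r+1} == 0 are what Simple Good-Turing smooths over.
    adjusted = {}
    for r, n_r in freq_of_freqs.items():
        n_next = freq_of_freqs.get(r + 1, 0)
        if n_next > 0:
            adjusted[r] = (r + 1) * n_next / n_r
    return p_unseen, adjusted


p_unseen, adjusted = turing_estimates("a a a b b c d".split())
```

On this toy input the highest observed count has no adjusted value, because no species was seen one more time than it. That gap is the practical reason the raw estimate is usually smoothed before use.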

plot_gt_freqs(fp)

Source from the content-addressed store, hash-verified

def plot_gt_freqs(fp):
    """
    Draws a scatterplot of the empirical frequencies of the counted species
    versus their Simple Good Turing smoothed values, in rank order. Depends on
    pylab and matplotlib.
    """
    MLE = MLENGram(1, filter_punctuation=False, filter_stopwords=False)
    MLE.train(fp, encoding="utf-8-sig")
    counts = dict(MLE.counts[1])

    GT = GoodTuringNGram(1, filter_stopwords=False, filter_punctuation=False)
    GT.train(fp, encoding="utf-8-sig")

    ADD = AdditiveNGram(1, 1, filter_punctuation=False, filter_stopwords=False)
    ADD.train(fp, encoding="utf-8-sig")

    tot = float(sum(counts.values()))
    freqs = dict([(token, cnt / tot) for token, cnt in counts.items()])
    sgt_probs = dict([(tok, np.exp(GT.log_prob(tok, 1))) for tok in counts.keys()])
    as_probs = dict([(tok, np.exp(ADD.log_prob(tok, 1))) for tok in counts.keys()])

    X, Y = np.arange(len(freqs)), sorted(freqs.values(), reverse=True)
    plt.loglog(X, Y, "k+", alpha=0.25, label="MLE")

    X, Y = np.arange(len(sgt_probs)), sorted(sgt_probs.values(), reverse=True)
    plt.loglog(X, Y, "r+", alpha=0.25, label="simple Good-Turing")

    X, Y = np.arange(len(as_probs)), sorted(as_probs.values(), reverse=True)
    plt.loglog(X, Y, "b+", alpha=0.25, label="Laplace smoothing")

    plt.xlabel("Rank")
    plt.ylabel("Probability")
    plt.legend()
    plt.tight_layout()
    plt.savefig("img/rank_probs.png")
    plt.close("all")
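The rank-ordered series that the function plots can be sketched without the library, using only the standard library. The sketch below is an assumption-laden simplification, not the numpy-ml code: it covers only the maximum-likelihood and additive estimates, it treats the vocabulary as exactly the observed tokens (AdditiveNGram may account for unknown or boundary tokens differently), and the names rank_ordered_probs and as_series are hypothetical. Unlike the np.arange(len(...)) call in the source, ranks here start at 1, because a rank of 0 has no finite position on a logarithmic axis.

```python
from collections import Counter


def rank_ordered_probs(tokens, k=1.0):
    """Return (ranks, probs) pairs for MLE and add-k unigram estimates."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab_size = len(counts)  # assumption: vocabulary == observed tokens only

    # Maximum-likelihood estimate: count / total
    mle = {tok: c / total for tok, c in counts.items()}
    # Additive (add-k) estimate: (count + k) / (total + k * |V|)
    additive = {
        tok: (c + k) / (total + k * vocab_size) for tok, c in counts.items()
    }

    def as_series(probs):
        ys = sorted(probs.values(), reverse=True)  # most probable token first
        xs = list(range(1, len(ys) + 1))           # ranks start at 1, not 0
        return xs, ys

    return as_series(mle), as_series(additive)


(mle_x, mle_y), (add_x, add_y) = rank_ordered_probs("a a a b b c d".split())
```

Each returned pair can be passed directly to a log-log plotting call. Sorting each model's values independently, as the source also does, means a given rank need not refer to the same token across curves.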

Callers

nothing calls this directly

Calls 7

train · Method · 0.95
log_prob · Method · 0.95
log_prob · Method · 0.95
MLENGram · Class · 0.90
GoodTuringNGram · Class · 0.90
AdditiveNGram · Class · 0.90
train · Method · 0.45

Tested by

no test coverage detected