Class SklearnEmbedder

bertopic/backend/_sklearn.py:5–68

class SklearnEmbedder(BaseEmbedder):
    """Scikit-Learn based embedding model.

    This component allows the usage of scikit-learn pipelines for generating document and
    word embeddings.

    Arguments:
        pipe: A scikit-learn pipeline that can `.transform()` text.

    Examples:
    Scikit-Learn is very flexible and it allows for many representations.
    A relatively simple pipeline is shown below.

    ```python
    from sklearn.pipeline import make_pipeline
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    from bertopic import BERTopic
    from bertopic.backend import SklearnEmbedder

    pipe = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(100)
    )

    sklearn_embedder = SklearnEmbedder(pipe)
    topic_model = BERTopic(embedding_model=sklearn_embedder)
    ```

    This pipeline first constructs a sparse representation based on TF-IDF and then
    makes it dense by applying SVD. Alternatively, you might also construct something
    more elaborate. As long as you construct a scikit-learn compatible pipeline, you
    should be able to pass it to BERTopic.

    !!! Warning
        One caveat to be aware of is that scikit-learn's base `Pipeline` class does not
        support the `.partial_fit()` API. If you have a pipeline that theoretically should
        be able to support online learning then you might want to explore
        the [scikit-partial](https://github.com/koaning/scikit-partial) project.
    """

    def __init__(self, pipe):
        super().__init__()
        self.pipe = pipe

    def embed(self, documents, verbose=False):
        """Embed a list of n documents/words into a two-dimensional
        matrix of embeddings.

        Arguments:
            documents: A list of documents or words to be embedded
            verbose: No-op variable that's kept around to keep the API consistent.
                     If you want to get feedback on training times, you should use
                     the sklearn API.

        Returns:
            Document/word embeddings with shape (n, m) with `n` documents/words
            that each have an embedding size of `m`
        """
        try:
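
The docstring mentions that a more elaborate pipeline is possible and that embeddings come back with shape (n, m). The following is a minimal sketch of that idea using only scikit-learn, so it runs without BERTopic installed. The toy corpus, the combination of word and character TF-IDF features, and the choice of 3 components are all illustrative assumptions, not something taken from the BERTopic source.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import Normalizer

# Toy corpus, purely illustrative.
docs = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
    "the dog slept on the sofa",
    "bond yields rose after the report",
]

# Concatenate word-level and character-level TF-IDF features,
# reduce them to a small dense matrix, then L2-normalize each row.
pipe = make_pipeline(
    make_union(
        TfidfVectorizer(analyzer="word"),
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    ),
    TruncatedSVD(n_components=3, random_state=0),
    Normalizer(),
)

# n documents in, an (n, m) matrix out; here m is the number of SVD components.
embeddings = pipe.fit_transform(docs)
print(embeddings.shape)
```

A pipeline built this way could then be wrapped as `SklearnEmbedder(pipe)` and passed to `BERTopic(embedding_model=...)`, exactly as in the docstring example above.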

Callers 1

select_backend · Function · 0.90

Calls

no outgoing calls

Tested by

no test coverage detected
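
The warning in the class docstring notes that scikit-learn's base `Pipeline` does not support `.partial_fit()`. As a rough illustration of what that means in practice, the sketch below checks for the method and then does online learning by hand, calling each step directly instead of going through a `Pipeline`. It does not use the scikit-partial project. The choice of `HashingVectorizer` and `IncrementalPCA`, the batch contents, and all variable names are assumptions made for this example.

```python
from sklearn.decomposition import IncrementalPCA
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import Pipeline

# The base Pipeline class does not define a partial_fit method.
print(hasattr(Pipeline, "partial_fit"))

# HashingVectorizer is stateless, so it needs no fitting at all.
vectorizer = HashingVectorizer(n_features=64)
# IncrementalPCA can be updated one batch at a time.
reducer = IncrementalPCA(n_components=2)

# Illustrative mini-batches; each batch must contain at least
# n_components samples for IncrementalPCA.partial_fit.
batches = [
    ["the cat sat on the mat", "dogs chase cats", "the dog slept on the sofa"],
    ["markets fell sharply", "investors sold shares", "bond yields rose"],
]

for batch in batches:
    # partial_fit expects dense input, so convert the sparse hash features.
    X = vectorizer.transform(batch).toarray()
    reducer.partial_fit(X)

# Embed unseen documents with the incrementally fitted reducer.
new_docs = ["cats and dogs", "shares and bonds"]
embeddings = reducer.transform(vectorizer.transform(new_docs).toarray())
print(embeddings.shape)
```

Because the steps are driven manually here, the result is not a `Pipeline` object and would need a thin wrapper exposing `.transform()` before it could stand in for `pipe`.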