hub / github.com/MaartenGr/BERTopic / SklearnEmbedder

Class SklearnEmbedder

bertopic/backend/_sklearn.py:5–68

Scikit-Learn based embedding model. This component allows the usage of scikit-learn pipelines for generating document and word embeddings.

Source from the content-addressed store, hash-verified

class SklearnEmbedder(BaseEmbedder):
    """Scikit-Learn based embedding model.

    This component allows the usage of scikit-learn pipelines for generating document and
    word embeddings.

    Arguments:
        pipe: A scikit-learn pipeline that can `.transform()` text.

    Examples:
    Scikit-Learn is very flexible and it allows for many representations.
    A relatively simple pipeline is shown below.

    ```python
    from sklearn.pipeline import make_pipeline
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    from bertopic import BERTopic
    from bertopic.backend import SklearnEmbedder

    pipe = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(100)
    )

    sklearn_embedder = SklearnEmbedder(pipe)
    topic_model = BERTopic(embedding_model=sklearn_embedder)
    ```

    This pipeline first constructs a sparse representation based on TF-IDF and then
    makes it dense by applying SVD. Alternatively, you might also construct something
    more elaborate. As long as you construct a scikit-learn compatible pipeline, you
    should be able to pass it to BERTopic.

    !!! Warning
        One caveat to be aware of is that scikit-learn's base `Pipeline` class does not
        support the `.partial_fit()`-API. If you have a pipeline that theoretically should
        be able to support online learning then you might want to explore
        the [scikit-partial](https://github.com/koaning/scikit-partial) project.
    """

    def __init__(self, pipe):
        super().__init__()
        self.pipe = pipe

    def embed(self, documents, verbose=False):
        """Embed a list of n documents/words into an (n, m) matrix of embeddings.

        Arguments:
            documents: A list of documents or words to be embedded
            verbose: No-op variable that's kept around to keep the API consistent.
                     If you want to get feedback on training times, you should use
                     the sklearn API.

        Returns:
            Document/word embeddings with shape (n, m) with `n` documents/words
            that each have an embedding size of `m`
        """
        try:
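
The listing above stops at `try:`, so the body of `embed` is not shown here. As a rough illustration of how a pipeline-backed embedder can return an `(n, m)` matrix for both documents and words, the following is a minimal sketch that uses only scikit-learn. The class name `PipelineEmbedderSketch`, the transform-then-fall-back-to-fit logic, and the toy corpus are assumptions made for illustration; they are not taken from the BERTopic source.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


class PipelineEmbedderSketch:
    """Hypothetical stand-in for illustration; not the BERTopic class."""

    def __init__(self, pipe):
        self.pipe = pipe

    def embed(self, documents, verbose=False):
        # `verbose` is accepted but unused, mirroring the no-op described above.
        try:
            # Succeeds once the pipeline has been fitted.
            return self.pipe.transform(documents)
        except NotFittedError:
            # Assumed fallback: fit and transform in one step on first use.
            return self.pipe.fit_transform(documents)


# Toy corpus (made up for this example).
docs = [
    "topic models group similar documents",
    "embeddings map documents to vectors",
    "scikit-learn pipelines chain transformers",
    "sparse features can be reduced with svd",
]

pipe = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
embedder = PipelineEmbedderSketch(pipe)

doc_embeddings = embedder.embed(docs)                   # fits, then transforms
word_embeddings = embedder.embed(["documents", "svd"])  # reuses the fitted pipeline

print(doc_embeddings.shape)
print(word_embeddings.shape)
```

Both calls return one row per input item and one column per SVD component, which matches the `(n, m)` shape described in the docstring. Words are embedded through the same fitted pipeline as documents, so a word that is absent from the fitted vocabulary would map to an all-zero row in this sketch.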

Callers 1

select_backend · Function · 0.90

Calls

no outgoing calls

Tested by

no test coverage detected