class SklearnEmbedder(BaseEmbedder):
    """Scikit-Learn based embedding model.

    This component allows the usage of scikit-learn pipelines for generating document and
    word embeddings.

    Arguments:
        pipe: A scikit-learn pipeline that can `.transform()` text.

    Examples:
    Scikit-Learn is very flexible and allows for many representations.
    A relatively simple pipeline is shown below.

    ```python
    from sklearn.pipeline import make_pipeline
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    from bertopic import BERTopic
    from bertopic.backend import SklearnEmbedder

    pipe = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(100)
    )

    sklearn_embedder = SklearnEmbedder(pipe)
    topic_model = BERTopic(embedding_model=sklearn_embedder)
    ```

    This pipeline first constructs a sparse representation based on TF-IDF and then
    makes it dense by applying SVD. Alternatively, you might also construct something
    more elaborate. As long as you construct a scikit-learn compatible pipeline, you
    should be able to pass it to BERTopic.
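
    A slightly more elaborate variant is sketched below, purely as an illustration.
    It combines word-level and character-level TF-IDF features before the SVD step
    and L2-normalizes the result. The toy corpus, the `n_components=3` setting, and
    the name `elaborate_pipe` are assumptions made for this example, not recommended
    defaults.

    ```python
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline, make_union
    from sklearn.preprocessing import Normalizer

    # Toy corpus, used only to demonstrate the shape of the output.
    docs = [
        "the cat sat on the mat",
        "dogs chase cats in the garden",
        "stock markets fell sharply today",
        "investors worry about interest rates",
        "the new phone has a great camera",
        "laptop batteries last longer this year",
    ]

    elaborate_pipe = make_pipeline(
        # Concatenate word-level and character n-gram TF-IDF features.
        make_union(
            TfidfVectorizer(),
            TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        ),
        # Reduce the sparse features to a small dense representation.
        TruncatedSVD(n_components=3, random_state=0),
        # Scale every embedding to unit L2 norm.
        Normalizer(),
    )

    embeddings = elaborate_pipe.fit_transform(docs)
    print(embeddings.shape)
    ```

    Such a pipeline can be wrapped with `SklearnEmbedder(elaborate_pipe)` in the same
    way as the simpler one.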

    !!! Warning
        One caveat to be aware of is that scikit-learn's base `Pipeline` class does not
        support the `.partial_fit()` API. If you have a pipeline that theoretically should
        be able to support online learning, then you might want to explore
        the [scikit-partial](https://github.com/koaning/scikit-partial) project.
    """

    def __init__(self, pipe):
        super().__init__()
        self.pipe = pipe

    def embed(self, documents, verbose=False):
        """Embed a list of n documents/words into a matrix of embeddings
        with one row per document/word.

        Arguments:
            documents: A list of documents or words to be embedded
            verbose: No-op variable that's kept around to keep the API consistent.
                     If you want to get feedback on training times, you should use
                     the sklearn API.

        Returns:
            Document/word embeddings with shape (n, m): `n` documents/words,
            each with an embedding size of `m`
        """
        try: