hub / github.com/MaartenGr/BERTopic / __init__

Method __init__

bertopic/_bertopic.py:145–315

BERTopic initialization.

(
        self,
        language: str = "english",
        top_n_words: int = 10,
        n_gram_range: Tuple[int, int] = (1, 1),
        min_topic_size: int = 10,
        nr_topics: Union[int, str] | None = None,
        low_memory: bool = False,
        calculate_probabilities: bool = False,
        seed_topic_list: List[List[str]] | None = None,
        zeroshot_topic_list: List[str] | None = None,
        zeroshot_min_similarity: float = 0.7,
        embedding_model=None,
        umap_model=None,
        hdbscan_model=None,
        vectorizer_model: CountVectorizer = None,
        ctfidf_model: TfidfTransformer = None,
        representation_model: BaseRepresentation = None,
        verbose: bool = False,
    )
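
As a minimal sketch of how this signature might be used, the snippet below collects a few constructor arguments in a dictionary and passes them to `BERTopic`. The helper name `build_topic_model` and the chosen values are illustrative assumptions, not part of the library. The import is deferred so the configuration can be inspected even where `bertopic` is not installed.

```python
from typing import Any, Dict

# Illustrative values only; every key is a keyword from the signature above.
config: Dict[str, Any] = {
    "language": "multilingual",
    "top_n_words": 10,
    "n_gram_range": (1, 2),
    "min_topic_size": 15,
    "nr_topics": "auto",
    "calculate_probabilities": False,
    "verbose": True,
}


def build_topic_model(settings: Dict[str, Any]):
    """Hypothetical helper: instantiate BERTopic from a settings dict."""
    # Deferred import so this module loads even where bertopic is absent.
    from bertopic import BERTopic

    return BERTopic(**settings)
```

With `bertopic` installed, `build_topic_model(config)` should return a model that can then be fitted with `fit_transform` on a list of document strings. Keeping the settings in a plain dictionary makes it easy to log or compare configurations between runs.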

Source from the content-addressed store, hash-verified

    def __init__(
        self,
        language: str = "english",
        top_n_words: int = 10,
        n_gram_range: Tuple[int, int] = (1, 1),
        min_topic_size: int = 10,
        nr_topics: Union[int, str] | None = None,
        low_memory: bool = False,
        calculate_probabilities: bool = False,
        seed_topic_list: List[List[str]] | None = None,
        zeroshot_topic_list: List[str] | None = None,
        zeroshot_min_similarity: float = 0.7,
        embedding_model=None,
        umap_model=None,
        hdbscan_model=None,
        vectorizer_model: CountVectorizer = None,
        ctfidf_model: TfidfTransformer = None,
        representation_model: BaseRepresentation = None,
        verbose: bool = False,
    ):
        """BERTopic initialization.

        Arguments:
            language: The main language used in your documents. The default sentence-transformers
                model for "english" is `all-MiniLM-L6-v2`. For a full overview of
                supported languages see bertopic.backend.languages. Select
                "multilingual" to load in the `paraphrase-multilingual-MiniLM-L12-v2`
                sentence-transformers model that supports 50+ languages.
                NOTE: This is not used if `embedding_model` is used.
            top_n_words: The number of words per topic to extract. Setting this
                too high can negatively impact topic embeddings as topics
                are typically best represented by at most 10 words.
            n_gram_range: The n-gram range for the CountVectorizer.
                Advised to keep high values between 1 and 3.
                More would likely lead to memory issues.
                NOTE: This param will not be used if you pass in your own
                CountVectorizer.
            min_topic_size: The minimum size of the topic. Increasing this value will lead
                to a lower number of clusters/topics and vice versa.
                It is the same parameter as `min_cluster_size` in HDBSCAN.
                NOTE: This param will not be used if you are using `hdbscan_model`.
            nr_topics: Specifying the number of topics will reduce the initial
                number of topics to the value specified. This reduction can take
                a while as each reduction in topics (-1) activates a c-TF-IDF
                calculation. If this is set to None, no reduction is applied. Use
                "auto" to automatically reduce topics using HDBSCAN.
                NOTE: Controlling the number of topics is best done by adjusting
                `min_topic_size` first before adjusting this parameter.
            low_memory: Sets UMAP low memory to True to make sure less memory is used.
                NOTE: This is only used in UMAP. For example, if you use PCA instead of UMAP
                this parameter will not be used.
            calculate_probabilities: Calculate the probabilities of all topics
                per document instead of the probability of the assigned
                topic per document. This could slow down the extraction
                of topics if you have many documents (> 100_000).
                NOTE: If false you cannot use the corresponding
                visualization method `visualize_probabilities`.
                NOTE: This is an approximation of topic probabilities
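
The NOTE lines in the docstring above say that some scalar settings go unused once a custom component is supplied. The sketch below encodes only those stated rules in plain Python. `OVERRIDES` and `ignored_settings` are hypothetical names for illustration, not BERTopic functions. `low_memory` is left out because the docstring ties it to whether the reducer is UMAP, which this sketch cannot determine.

```python
from typing import Any, Dict, List

# Scalar setting -> custom component that, per the docstring NOTEs, supersedes it.
OVERRIDES: Dict[str, str] = {
    "language": "embedding_model",
    "n_gram_range": "vectorizer_model",
    "min_topic_size": "hdbscan_model",
}


def ignored_settings(kwargs: Dict[str, Any]) -> List[str]:
    """Return scalar settings in kwargs that the docstring says go unused."""
    return sorted(
        setting
        for setting, component in OVERRIDES.items()
        if setting in kwargs and kwargs.get(component) is not None
    )


# object() stands in for a real HDBSCAN-like model in this sketch.
example = {"language": "english", "min_topic_size": 25, "hdbscan_model": object()}
print(ignored_settings(example))  # ['min_topic_size']
```

A check like this could run before constructing the model to flag arguments that would have no effect, for instance a `min_topic_size` passed alongside a custom `hdbscan_model`.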

Callers

nothing calls this directly

Calls 3

warning · Method · 0.80
set_level · Method · 0.80

Tested by

no test coverage detected