BERTopic initialization. Arguments: language: The main language used in your documents. The default sentence-transformers model for "english" is `all-MiniLM-L6-v2`. For a full overview of supported languages see bertopic.backend.langua
(
self,
language: str = "english",
top_n_words: int = 10,
n_gram_range: Tuple[int, int] = (1, 1),
min_topic_size: int = 10,
nr_topics: Union[int, str] | None = None,
low_memory: bool = False,
calculate_probabilities: bool = False,
seed_topic_list: List[List[str]] | None = None,
zeroshot_topic_list: List[str] | None = None,
zeroshot_min_similarity: float = 0.7,
embedding_model=None,
umap_model=None,
hdbscan_model=None,
vectorizer_model: CountVectorizer = None,
ctfidf_model: TfidfTransformer = None,
representation_model: BaseRepresentation = None,
verbose: bool = False,
)
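A minimal sketch of how a subset of these keyword arguments might be assembled before constructing the model. `build_bertopic_kwargs` and `make_topic_model` are hypothetical helpers, not part of BERTopic, and the sanity checks are assumptions based on the parameter descriptions rather than checks the library is known to perform. Only the `BERTopic` import and the keyword names come from the signature above.

```python
from typing import Any, Dict, Optional, Tuple, Union


def build_bertopic_kwargs(
    language: str = "english",
    top_n_words: int = 10,
    n_gram_range: Tuple[int, int] = (1, 1),
    min_topic_size: int = 10,
    nr_topics: Optional[Union[int, str]] = None,
    calculate_probabilities: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: collect and sanity-check constructor arguments."""
    low, high = n_gram_range
    # Assumed check: an n-gram range needs 1 <= low <= high.
    if not 1 <= low <= high:
        raise ValueError("n_gram_range must satisfy 1 <= low <= high")
    # Assumed check: the only string value the docstring mentions is "auto".
    if isinstance(nr_topics, str) and nr_topics != "auto":
        raise ValueError('nr_topics must be an int, "auto", or None')
    return {
        "language": language,
        "top_n_words": top_n_words,
        "n_gram_range": n_gram_range,
        "min_topic_size": min_topic_size,
        "nr_topics": nr_topics,
        "calculate_probabilities": calculate_probabilities,
    }


def make_topic_model(kwargs: Dict[str, Any]):
    """Hypothetical helper: build the model; requires bertopic to be installed."""
    from bertopic import BERTopic  # imported lazily so the sketch loads without it

    return BERTopic(**kwargs)


# Multilingual documents, unigrams and bigrams, larger minimum topic size.
kwargs = build_bertopic_kwargs(
    language="multilingual",
    n_gram_range=(1, 2),
    min_topic_size=20,
)
```

With `bertopic` installed, `make_topic_model(kwargs)` would return a model configured with these values. Arguments left out of `kwargs` keep the defaults shown in the signature.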
    """

    def __init__(
        self,
        language: str = "english",
        top_n_words: int = 10,
        n_gram_range: Tuple[int, int] = (1, 1),
        min_topic_size: int = 10,
        nr_topics: Union[int, str] | None = None,
        low_memory: bool = False,
        calculate_probabilities: bool = False,
        seed_topic_list: List[List[str]] | None = None,
        zeroshot_topic_list: List[str] | None = None,
        zeroshot_min_similarity: float = 0.7,
        embedding_model=None,
        umap_model=None,
        hdbscan_model=None,
        vectorizer_model: CountVectorizer = None,
        ctfidf_model: TfidfTransformer = None,
        representation_model: BaseRepresentation = None,
        verbose: bool = False,
    ):
        """BERTopic initialization.

        Arguments:
            language: The main language used in your documents. The default sentence-transformers
                      model for "english" is `all-MiniLM-L6-v2`. For a full overview of
                      supported languages see bertopic.backend.languages. Select
                      "multilingual" to load in the `paraphrase-multilingual-MiniLM-L12-v2`
                      sentence-transformers model that supports 50+ languages.
                      NOTE: This is not used if `embedding_model` is used.
            top_n_words: The number of words per topic to extract. Setting this
                         too high can negatively impact topic embeddings as topics
                         are typically best represented by at most 10 words.
            n_gram_range: The n-gram range for the CountVectorizer.
                          Advised to keep high values between 1 and 3.
                          More would likely lead to memory issues.
                          NOTE: This param will not be used if you pass in your own
                          CountVectorizer.
            min_topic_size: The minimum size of the topic. Increasing this value will lead
                            to a lower number of clusters/topics and vice versa.
                            It is the same parameter as `min_cluster_size` in HDBSCAN.
                            NOTE: This param will not be used if you are using `hdbscan_model`.
            nr_topics: Specifying the number of topics will reduce the initial
                       number of topics to the value specified. This reduction can take
                       a while as each reduction in topics (-1) activates a c-TF-IDF
                       calculation. If this is set to None, no reduction is applied. Use
                       "auto" to automatically reduce topics using HDBSCAN.
                       NOTE: Controlling the number of topics is best done by adjusting
                       `min_topic_size` first before adjusting this parameter.
            low_memory: Sets UMAP low memory to True to make sure less memory is used.
                        NOTE: This is only used in UMAP. For example, if you use PCA instead of UMAP
                        this parameter will not be used.
            calculate_probabilities: Calculate the probabilities of all topics
                                     per document instead of the probability of the assigned
                                     topic per document. This could slow down the extraction
                                     of topics if you have many documents (> 100_000).
                                     NOTE: If false you cannot use the corresponding
                                     visualization method `visualize_probabilities`.
                                     NOTE: This is an approximation of topic probabilities
nothing calls this directly
no test coverage detected
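The `n_gram_range` description warns that wide ranges are likely to cause memory issues. The sketch below illustrates why, counting distinct n-grams in a toy corpus as the upper bound grows. It is a simplified stand-in, assuming lowercase whitespace tokenization rather than CountVectorizer's actual tokenizer. `extract_ngrams` and `vocabulary_size` are hypothetical names used only for illustration.

```python
from typing import Iterable, List, Set, Tuple


def extract_ngrams(tokens: List[str], n_gram_range: Tuple[int, int]) -> List[str]:
    """Return every n-gram with low <= n <= high, joined by single spaces."""
    low, high = n_gram_range
    ngrams: List[str] = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the document.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


def vocabulary_size(docs: Iterable[str], n_gram_range: Tuple[int, int]) -> int:
    """Count distinct n-grams across all documents (whitespace tokens, lowercased)."""
    vocabulary: Set[str] = set()
    for doc in docs:
        vocabulary.update(extract_ngrams(doc.lower().split(), n_gram_range))
    return len(vocabulary)


docs = [
    "topic models group similar documents",
    "similar documents share similar words",
]

# The vocabulary grows with each increase of the upper bound.
for upper in (1, 2, 3):
    print((1, upper), vocabulary_size(docs, (1, upper)))
```

On this two-sentence corpus the vocabulary already grows from 7 to 14 to 20 entries. On a real corpus most longer n-grams occur only once, so each increase of the upper bound adds many more columns to the document-term matrix.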