hub / github.com/MaartenGr/BERTopic / __init__

Method __init__

bertopic/_bertopic.py:145–315

BERTopic initialization. Arguments: language: The main language used in your documents. The default sentence-transformers model for "english" is `all-MiniLM-L6-v2`. For a full overview of supported languages see bertopic.backend.langua

(
        self,
        language: str = "english",
        top_n_words: int = 10,
        n_gram_range: Tuple[int, int] = (1, 1),
        min_topic_size: int = 10,
        nr_topics: Union[int, str] | None = None,
        low_memory: bool = False,
        calculate_probabilities: bool = False,
        seed_topic_list: List[List[str]] | None = None,
        zeroshot_topic_list: List[str] | None = None,
        zeroshot_min_similarity: float = 0.7,
        embedding_model=None,
        umap_model=None,
        hdbscan_model=None,
        vectorizer_model: CountVectorizer = None,
        ctfidf_model: TfidfTransformer = None,
        representation_model: BaseRepresentation = None,
        verbose: bool = False,
    )
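
As a minimal sketch of how this signature might be called, the snippet below collects a few keyword arguments and passes them to the constructor. The argument values and the helper names `example_kwargs` and `build_model` are illustrative assumptions, not taken from the source. The `bertopic` import is deferred into `build_model` so the dictionary can be inspected without the package installed.

```python
from typing import Any, Dict


def example_kwargs() -> Dict[str, Any]:
    """Illustrative keyword arguments; the values are assumptions, not recommendations."""
    return {
        # "multilingual" selects the multilingual sentence-transformers model.
        "language": "multilingual",
        "top_n_words": 10,
        "n_gram_range": (1, 2),
        "min_topic_size": 20,
        # "auto" asks for automatic topic reduction; None would disable reduction.
        "nr_topics": "auto",
        "calculate_probabilities": False,
        "verbose": True,
    }


def build_model():
    """Construct a BERTopic instance (requires the bertopic package)."""
    from bertopic import BERTopic  # deferred import: only needed when called

    return BERTopic(**example_kwargs())
```

Every key in the dictionary matches a parameter name in the signature above, so the call relies only on documented arguments.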

Source from the content-addressed store, hash-verified

    """

    def __init__(
        self,
        language: str = "english",
        top_n_words: int = 10,
        n_gram_range: Tuple[int, int] = (1, 1),
        min_topic_size: int = 10,
        nr_topics: Union[int, str] | None = None,
        low_memory: bool = False,
        calculate_probabilities: bool = False,
        seed_topic_list: List[List[str]] | None = None,
        zeroshot_topic_list: List[str] | None = None,
        zeroshot_min_similarity: float = 0.7,
        embedding_model=None,
        umap_model=None,
        hdbscan_model=None,
        vectorizer_model: CountVectorizer = None,
        ctfidf_model: TfidfTransformer = None,
        representation_model: BaseRepresentation = None,
        verbose: bool = False,
    ):
        """BERTopic initialization.

        Arguments:
            language: The main language used in your documents. The default sentence-transformers
                      model for "english" is `all-MiniLM-L6-v2`. For a full overview of
                      supported languages see bertopic.backend.languages. Select
                      "multilingual" to load in the `paraphrase-multilingual-MiniLM-L12-v2`
                      sentence-transformers model that supports 50+ languages.
                      NOTE: This is not used if `embedding_model` is used.
            top_n_words: The number of words per topic to extract. Setting this
                         too high can negatively impact topic embeddings as topics
                         are typically best represented by at most 10 words.
            n_gram_range: The n-gram range for the CountVectorizer.
                          Advised to keep high values between 1 and 3.
                          More would likely lead to memory issues.
                          NOTE: This param will not be used if you pass in your own
                          CountVectorizer.
            min_topic_size: The minimum size of the topic. Increasing this value will lead
                            to a lower number of clusters/topics and vice versa.
                            It is the same parameter as `min_cluster_size` in HDBSCAN.
                            NOTE: This param will not be used if you are using `hdbscan_model`.
            nr_topics: Specifying the number of topics will reduce the initial
                       number of topics to the value specified. This reduction can take
                       a while as each reduction in topics (-1) activates a c-TF-IDF
                       calculation. If this is set to None, no reduction is applied. Use
                       "auto" to automatically reduce topics using HDBSCAN.
                       NOTE: Controlling the number of topics is best done by adjusting
                       `min_topic_size` first before adjusting this parameter.
            low_memory: Sets UMAP low memory to True to make sure less memory is used.
                        NOTE: This is only used in UMAP. For example, if you use PCA instead of UMAP
                        this parameter will not be used.
            calculate_probabilities: Calculate the probabilities of all topics
                                     per document instead of the probability of the assigned
                                     topic per document. This could slow down the extraction
                                     of topics if you have many documents (> 100_000).
                                     NOTE: If false you cannot use the corresponding
                                     visualization method `visualize_probabilities`.
                                     NOTE: This is an approximation of topic probabilities
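
Several NOTE lines in the docstring say that a convenience argument is not used once a custom component is supplied. The sketch below restates three of those rules as a lookup table. The names `OVERRIDDEN_BY` and `unused_arguments` are hypothetical helpers written for illustration and are not part of BERTopic. `low_memory` is left out because, per the docstring, whether it applies depends on the supplied model being UMAP rather than on the argument simply being set.

```python
from typing import Any, Dict, List

# Taken from the NOTE lines in the docstring: the key is a convenience
# argument, the value is the custom component that makes it unused.
OVERRIDDEN_BY = {
    "language": "embedding_model",
    "n_gram_range": "vectorizer_model",
    "min_topic_size": "hdbscan_model",
}


def unused_arguments(kwargs: Dict[str, Any]) -> List[str]:
    """Return the convenience arguments that the docstring says will be ignored.

    Hypothetical helper for illustration only; BERTopic does not expose it.
    """
    return sorted(
        name
        for name, component in OVERRIDDEN_BY.items()
        if name in kwargs and kwargs.get(component) is not None
    )


# min_topic_size is passed together with a custom hdbscan_model, so it is reported.
print(unused_arguments({"min_topic_size": 50, "hdbscan_model": object()}))
# prints ['min_topic_size']
```

A check like this could be run on a configuration dictionary before constructing the model to catch settings that would silently have no effect.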

Callers

nothing calls this directly

Calls 3

ClassTfidfTransformer · Class · 0.90

warning · Method · 0.80

set_level · Method · 0.80

Tested by

no test coverage detected
