hub / github.com/MaartenGr/BERTopic / topics_per_class

Method topics_per_class

bertopic/_bertopic.py:956–1033

Create topics per class. BERTopic must already be fitted before this method is called. From the fitted model, a c-TF-IDF representation is calculated for each class c. Each class-level representation is then averaged with the global c-TF-IDF representation to fine-tune it. Set global_tuning to False if the pure class-level representation is needed.
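The averaging step described above can be sketched without any dependencies. This is a minimal illustration, not the library's implementation: the helper names `l1_normalize` and `average_with_global` and the weights are hypothetical, and plain lists stand in for the sparse c-TF-IDF matrices that the method actually works with.

```python
from typing import List


def l1_normalize(row: List[float]) -> List[float]:
    """Scale a row so its absolute values sum to 1 (all-zero rows are returned unchanged)."""
    total = sum(abs(value) for value in row)
    return [value / total for value in row] if total else list(row)


def average_with_global(class_row: List[float], global_row: List[float]) -> List[float]:
    """Average the L1-normalized class-level and global weights of one topic."""
    local = l1_normalize(class_row)
    reference = l1_normalize(global_row)
    return [(g + c) / 2.0 for g, c in zip(reference, local)]


# Hypothetical weights for one topic over a three-word vocabulary.
global_row = [2.0, 1.0, 1.0]  # weights computed over all documents
class_row = [0.0, 3.0, 1.0]   # weights within one class; word 0 never occurs there

tuned = average_with_global(class_row, global_row)
print(tuned)  # [0.25, 0.5, 0.25]
```

Word 0 ends up with a non-zero weight even though it never occurs in the class. That is the effect the `global_tuning` argument exists to switch off.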

topics_per_class(
    self,
    docs: List[str],
    classes: Union[List[int], List[str]],
    global_tuning: bool = True,
) -> pd.DataFrame
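Because `docs` and `classes` are parallel lists (one class label per document), it can help to check them before calling the method. The sketch below assumes a hypothetical helper, `check_class_labels`, which is not part of BERTopic. It verifies that the lengths match and warns when the number of unique classes reaches the limit suggested in the docstring.

```python
import warnings
from typing import List, Sequence, Union


def check_class_labels(
    docs: Sequence[str],
    classes: Sequence[Union[int, str]],
    max_unique: int = 100,
) -> List[Union[int, str]]:
    """Return the unique class labels after basic sanity checks (hypothetical helper)."""
    if len(docs) != len(classes):
        raise ValueError(f"Got {len(docs)} documents but {len(classes)} class labels")
    # Sort by string form so mixed int/str labels do not raise a TypeError.
    unique = sorted(set(classes), key=str)
    if len(unique) >= max_unique:
        warnings.warn(
            f"{len(unique)} unique classes: a c-TF-IDF representation is computed "
            "for every class, so this may be slow"
        )
    return unique


docs = ["cheap flights to Rome", "new GPU benchmarks", "hotel deals in Paris"]
classes = ["travel", "tech", "travel"]
print(check_class_labels(docs, classes))  # ['tech', 'travel']
```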

Source from the content-addressed store, hash-verified

        return pd.DataFrame(topics_over_time, columns=["Topic", "Words", "Frequency", "Timestamp"])

    def topics_per_class(
        self,
        docs: List[str],
        classes: Union[List[int], List[str]],
        global_tuning: bool = True,
    ) -> pd.DataFrame:
        """Create topics per class.

        To create the topics per class, BERTopic needs to be already fitted once.
        From the fitted models, the c-TF-IDF representations are calculated at
        each class c. Then, the c-TF-IDF representations at class c are
        averaged with the global c-TF-IDF representations in order to fine-tune the
        local representations. This can be turned off if the pure representation is
        needed.

        Note:
            Make sure to use a limited number of unique classes (<100) as the
            c-TF-IDF representation will be calculated at each single unique class.
            Having a large number of unique classes can take some time to be calculated.

        Arguments:
            docs: The documents you used when calling either `fit` or `fit_transform`
            classes: The class of each document. This can be either a list of strings or ints.
            global_tuning: Fine-tune each topic representation for class c by averaging its c-TF-IDF matrix
                           with the global c-TF-IDF matrix. Turn this off if you want to prevent words in
                           topic representations that could not be found in the documents for class c.

        Returns:
            topics_per_class: A dataframe that contains the topic, words, and frequency of topics
                              for each class.

        Examples:
        ```python
        from bertopic import BERTopic
        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(docs)
        topics_per_class = topic_model.topics_per_class(docs, classes)
        ```
        """
        check_documents_type(docs)
        documents = pd.DataFrame({"Document": docs, "Topic": self.topics_, "Class": classes})
        global_c_tf_idf = normalize(self.c_tf_idf_, axis=1, norm="l1", copy=False)

        # For each unique timestamp, create topic representations
        topics_per_class = []
        for _, class_ in tqdm(enumerate(set(classes)), disable=not self.verbose):
            # Calculate c-TF-IDF representation for a specific timestamp
            selection = documents.loc[documents.Class == class_, :]
            documents_per_topic = selection.groupby(["Topic"], as_index=False).agg(
                {"Document": " ".join, "Class": "count"}
            )
            c_tf_idf, words = self._c_tf_idf(documents_per_topic, fit=False)

            # Fine-tune the timestamp c-TF-IDF representation based on the global c-TF-IDF representation
            # by simply taking the average of the two
            if global_tuning:
                c_tf_idf = normalize(c_tf_idf, axis=1, norm="l1", copy=False)
                c_tf_idf = (global_c_tf_idf[documents_per_topic.Topic.values + self._outliers] + c_tf_idf) / 2.0
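The per-class grouping in the loop above can be illustrated with a dependency-free sketch. It is a stand-in for the pandas `groupby`/`agg` call, not a copy of it: `join_docs_per_topic` and `global_row_index` are hypothetical names. The row lookup assumes that `self._outliers` is 1 when an outlier topic -1 exists and 0 otherwise, which the excerpt does not show.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def join_docs_per_topic(
    docs: List[str], topics: List[int], classes: List[str], class_: str
) -> Dict[int, Tuple[str, int]]:
    """For one class, map each topic to (space-joined documents, document count)."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for doc, topic, label in zip(docs, topics, classes):
        if label == class_:
            grouped[topic].append(doc)
    return {topic: (" ".join(texts), len(texts)) for topic, texts in sorted(grouped.items())}


def global_row_index(topic: int, has_outliers: bool) -> int:
    """Row of a topic in the global matrix, assuming row 0 holds topic -1 when outliers exist."""
    return topic + int(has_outliers)


docs = ["a b", "c d", "e f", "g h"]
topics = [-1, 0, 0, 1]
classes = ["x", "x", "x", "y"]

per_topic = join_docs_per_topic(docs, topics, classes, "x")
print(per_topic)  # {-1: ('a b', 1), 0: ('c d e f', 2)}
print([global_row_index(t, has_outliers=True) for t in per_topic])  # [0, 1]
```

Only topics that actually occur in the selected class appear in the result, which is why the global matrix is indexed by topic id rather than averaged row by row.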

Callers 1

test_class · Function · 0.80

Calls 3

_c_tf_idf · Method · 0.95

_extract_words_per_topic · Method · 0.95

check_documents_type · Function · 0.90

Tested by 1

test_class · Function · 0.64