hub / github.com/MaartenGr/BERTopic / topics_per_class

Method topics_per_class

bertopic/_bertopic.py:956–1033

Create topics per class. To create the topics per class, BERTopic needs to be already fitted once. From the fitted models, the c-TF-IDF representations are calculated at each class c. Then, the c-TF-IDF representations at class c are averaged with the global c-TF-IDF representations in order to fine-tune the local representations. This can be turned off if the pure representation is needed.
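The averaging step can be sketched on toy numbers. This is a minimal NumPy illustration, not the library's code: the source listing uses scikit-learn's `normalize` on the model's c-TF-IDF matrices, whereas the helper names `l1_normalize_rows` and `average_with_global`, and all the scores below, are made up for this example. The `outlier_offset` argument mirrors the `+ self._outliers` shift in the listing, which moves each topic id to its row in the global matrix when an outlier topic `-1` occupies row 0.

```python
import numpy as np


def l1_normalize_rows(matrix):
    """Scale each row so its absolute values sum to 1 (all-zero rows stay zero)."""
    matrix = np.asarray(matrix, dtype=float)
    row_sums = np.abs(matrix).sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # avoid division by zero for empty rows
    return matrix / row_sums


def average_with_global(local, global_, topic_ids, outlier_offset):
    """Average L1-normalized per-class scores with the matching global rows."""
    local_n = l1_normalize_rows(local)
    global_n = l1_normalize_rows(global_)
    # Topic id t lives at row t + outlier_offset of the global matrix.
    rows = np.asarray(topic_ids) + outlier_offset
    return (global_n[rows] + local_n) / 2.0


# Hypothetical global scores: rows are topics -1, 0, 1; columns are three words.
global_scores = [[1, 1, 2], [3, 1, 0], [0, 2, 2]]
# Hypothetical scores for one class in which only topics 0 and 1 occur.
local_scores = [[2, 0, 0], [0, 1, 3]]

tuned = average_with_global(local_scores, global_scores, topic_ids=[0, 1], outlier_offset=1)
print(tuned)
```

In this toy case the second word has a local score of 0 for topic 0 but a tuned score of 0.125, which is the effect the `global_tuning` description warns about: words absent from a class can still enter its representation through the global term.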

(
        self,
        docs: List[str],
        classes: Union[List[int], List[str]],
        global_tuning: bool = True,
    )
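Because `docs` and `classes` are parallel lists, it can help to check them before calling the method. The helper below is a hypothetical pre-flight check, not part of BERTopic: the name `check_classes` and the `max_unique` parameter are assumptions for this sketch, and the threshold of 100 simply follows the docstring's note about keeping the number of unique classes small.

```python
import warnings
from typing import List, Union


def check_classes(
    docs: List[str],
    classes: Union[List[int], List[str]],
    max_unique: int = 100,
) -> int:
    """Return the number of unique classes after basic sanity checks."""
    # One class label is expected per document.
    if len(docs) != len(classes):
        raise ValueError(
            f"Expected one class per document, got {len(docs)} docs and {len(classes)} classes."
        )
    n_unique = len(set(classes))
    # A c-TF-IDF representation is computed once per unique class, so many classes are slow.
    if n_unique >= max_unique:
        warnings.warn(f"{n_unique} unique classes; the computation may take a while.")
    return n_unique


n_classes = check_classes(["doc a", "doc b", "doc c"], ["news", "sports", "news"])
print(n_classes)
```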

Source from the content-addressed store, hash-verified

    def topics_per_class(
        self,
        docs: List[str],
        classes: Union[List[int], List[str]],
        global_tuning: bool = True,
    ) -> pd.DataFrame:
        """Create topics per class.

        To create the topics per class, BERTopic needs to be already fitted once.
        From the fitted models, the c-TF-IDF representations are calculated at
        each class c. Then, the c-TF-IDF representations at class c are
        averaged with the global c-TF-IDF representations in order to fine-tune the
        local representations. This can be turned off if the pure representation is
        needed.

        Note:
            Make sure to use a limited number of unique classes (<100) as the
            c-TF-IDF representation will be calculated at each single unique class.
            Having a large number of unique classes can take some time to be calculated.

        Arguments:
            docs: The documents you used when calling either `fit` or `fit_transform`
            classes: The class of each document. This can be either a list of strings or ints.
            global_tuning: Fine-tune each topic representation for class c by averaging its c-TF-IDF matrix
                           with the global c-TF-IDF matrix. Turn this off if you want to prevent words in
                           topic representations that could not be found in the documents for class c.

        Returns:
            topics_per_class: A dataframe that contains the topic, words, and frequency of topics
                              for each class.

        Examples:
        ```python
        from bertopic import BERTopic
        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(docs)
        topics_per_class = topic_model.topics_per_class(docs, classes)
        ```
        """
        check_documents_type(docs)
        documents = pd.DataFrame({"Document": docs, "Topic": self.topics_, "Class": classes})
        global_c_tf_idf = normalize(self.c_tf_idf_, axis=1, norm="l1", copy=False)

        # For each unique class, create topic representations
        topics_per_class = []
        for _, class_ in tqdm(enumerate(set(classes)), disable=not self.verbose):
            # Calculate c-TF-IDF representation for a specific class
            selection = documents.loc[documents.Class == class_, :]
            documents_per_topic = selection.groupby(["Topic"], as_index=False).agg(
                {"Document": " ".join, "Class": "count"}
            )
            c_tf_idf, words = self._c_tf_idf(documents_per_topic, fit=False)

            # Fine-tune the class c-TF-IDF representation based on the global c-TF-IDF representation
            # by simply taking the average of the two
            if global_tuning:
                c_tf_idf = normalize(c_tf_idf, axis=1, norm="l1", copy=False)
                c_tf_idf = (global_c_tf_idf[documents_per_topic.Topic.values + self._outliers] + c_tf_idf) / 2.0
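
The per-class selection and grouping in the listing can be tried on toy data. The sketch below reuses the same pandas calls (`loc` filtering, then `groupby(...).agg(...)`), but the documents, topics, and classes are invented, and the wrapper `docs_per_topic_for_class` is a hypothetical name introduced only for this example. In the real method the `Topic` column comes from the fitted model's `topics_` attribute.

```python
import pandas as pd

# Toy stand-ins for the fitted model's documents, topic assignments, and class labels.
docs = ["cats purr", "dogs bark", "cats nap", "stocks fall"]
topics = [0, 0, 0, 1]
classes = ["pets", "pets", "news", "news"]

documents = pd.DataFrame({"Document": docs, "Topic": topics, "Class": classes})


def docs_per_topic_for_class(documents: pd.DataFrame, class_) -> pd.DataFrame:
    """Join the documents of one class into a single string per topic and count them."""
    selection = documents.loc[documents.Class == class_, :]
    return selection.groupby(["Topic"], as_index=False).agg(
        {"Document": " ".join, "Class": "count"}
    )


# One small frame per class; each row is a topic that occurs in that class.
per_class = {c: docs_per_topic_for_class(documents, c) for c in sorted(set(classes))}
for name, frame in per_class.items():
    print(name)
    print(frame)
```

After the aggregation, the `Class` column holds the number of documents per topic within the class rather than the class label, and each joined `Document` string is what the per-class c-TF-IDF step receives as input.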

Callers 1

test_class · Function · 0.80

Calls 3

_c_tf_idf · Method · 0.95
check_documents_type · Function · 0.90

Tested by 1

test_class · Function · 0.64