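Create topics per class. BERTopic must already have been fitted before topics per class can be created. From the fitted model, a c-TF-IDF representation is calculated for each class c. The c-TF-IDF representation at class c is then averaged with the global c-TF-IDF representation in order to fine-tune the local representation. This can be turned off by setting `global_tuning=False` if the pure per-class representation is needed.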
topics_per_class(
self,
docs: List[str],
classes: Union[List[int], List[str]],
global_tuning: bool = True,
) -> pd.DataFrame
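
The effect of `global_tuning` can be illustrated with a minimal sketch. It is not BERTopic's implementation: it uses plain Python lists in place of the sparse c-TF-IDF matrices, the helper names `l1_normalize` and `tune_with_global` are hypothetical, and the weights are made-up numbers for a single topic over a three-word vocabulary. It only mirrors the two steps described above, L1-normalizing both representations and taking their average.

```python
from typing import List


def l1_normalize(row: List[float]) -> List[float]:
    """Scale a row so its absolute values sum to 1; all-zero rows are returned unchanged."""
    total = sum(abs(value) for value in row)
    return [value / total for value in row] if total else list(row)


def tune_with_global(local_row: List[float], global_row: List[float]) -> List[float]:
    """Average the L1-normalized class-level and global weights of one topic."""
    local_norm = l1_normalize(local_row)
    global_norm = l1_normalize(global_row)
    return [(g + l) / 2.0 for g, l in zip(global_norm, local_norm)]


# Hypothetical weights for one topic over a three-word vocabulary.
global_weights = [2.0, 1.0, 1.0]  # the topic across all documents
local_weights = [3.0, 0.0, 1.0]   # the same topic within a single class

tuned = tune_with_global(local_weights, global_weights)
print(tuned)
```

In this toy example the second word has a weight of zero within the class but a non-zero weight after averaging. That is the behaviour the `global_tuning` argument controls: with tuning enabled, words that never occur in the documents of class c can still appear in its topic representation.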
| 954 | return pd.DataFrame(topics_over_time, columns=["Topic", "Words", "Frequency", "Timestamp"]) |
| 955 | |
| 956 | def topics_per_class( |
| 957 | self, |
| 958 | docs: List[str], |
| 959 | classes: Union[List[int], List[str]], |
| 960 | global_tuning: bool = True, |
| 961 | ) -> pd.DataFrame: |
| 962 | """Create topics per class. |
| 963 | |
| 964 | To create the topics per class, BERTopic needs to be already fitted once. |
| 965 | From the fitted models, the c-TF-IDF representations are calculated at |
| 966 | each class c. Then, the c-TF-IDF representations at class c are |
| 967 | averaged with the global c-TF-IDF representations in order to fine-tune the |
| 968 | local representations. This can be turned off if the pure representation is |
| 969 | needed. |
| 970 | |
| 971 | Note: |
| 972 | Make sure to use a limited number of unique classes (<100) as the |
| 973 | c-TF-IDF representation will be calculated at each single unique class. |
| 974 | A large number of unique classes can take some time to calculate. |
| 975 | |
| 976 | Arguments: |
| 977 | docs: The documents you used when calling either `fit` or `fit_transform` |
| 978 | classes: The class of each document. This can be either a list of strings or ints. |
| 979 | global_tuning: Fine-tune each topic representation for class c by averaging its c-TF-IDF matrix |
| 980 | with the global c-TF-IDF matrix. Turn this off if you want to prevent words in |
| 981 | topic representations that could not be found in the documents for class c. |
| 982 | |
| 983 | Returns: |
| 984 | topics_per_class: A dataframe that contains the topic, words, and frequency of topics |
| 985 | for each class. |
| 986 | |
| 987 | Examples: |
| 988 | ```python |
| 989 | from bertopic import BERTopic |
| 990 | topic_model = BERTopic() |
| 991 | topics, probs = topic_model.fit_transform(docs) |
| 992 | topics_per_class = topic_model.topics_per_class(docs, classes) |
| 993 | ``` |
| 994 | """ |
| 995 | check_documents_type(docs) |
| 996 | documents = pd.DataFrame({"Document": docs, "Topic": self.topics_, "Class": classes}) |
| 997 | global_c_tf_idf = normalize(self.c_tf_idf_, axis=1, norm="l1", copy=False) |
| 998 | |
| 999 | # For each unique class, create topic representations |
| 1000 | topics_per_class = [] |
| 1001 | for _, class_ in tqdm(enumerate(set(classes)), disable=not self.verbose): |
| 1002 | # Calculate c-TF-IDF representation for a specific class |
| 1003 | selection = documents.loc[documents.Class == class_, :] |
| 1004 | documents_per_topic = selection.groupby(["Topic"], as_index=False).agg( |
| 1005 | {"Document": " ".join, "Class": "count"} |
| 1006 | ) |
| 1007 | c_tf_idf, words = self._c_tf_idf(documents_per_topic, fit=False) |
| 1008 | |
| 1009 | # Fine-tune the class-level c-TF-IDF representation based on the global c-TF-IDF representation |
| 1010 | # by simply taking the average of the two |
| 1011 | if global_tuning: |
| 1012 | c_tf_idf = normalize(c_tf_idf, axis=1, norm="l1", copy=False) |
| 1013 | c_tf_idf = (global_c_tf_idf[documents_per_topic.Topic.values + self._outliers] + c_tf_idf) / 2.0 |