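Create a hierarchy of topics. BERTopic must already be fitted before this method is called. The hierarchy is then calculated on the distance matrix of the c-TF-IDF or topic embeddings representation using `scipy.cluster.hierarchy.linkage`. Based on that hierarchy, the topic representation is calculated at each merge step. This is a local representation: only the chosen step is assumed to be merged, not all others, which typically improves the topic representation.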
(
self,
docs: List[str],
use_ctfidf: bool = True,
linkage_function: Callable[[csr_matrix], np.ndarray] | None = None,
distance_function: Callable[[csr_matrix], csr_matrix] | None = None,
)
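
The `distance_function` and `linkage_function` parameters accept arbitrary callables. The following is a minimal sketch of what such a pair could look like, run on a toy matrix rather than on a fitted model. The helper names `cosine_distance_condensed` and `ward_linkage` and the matrix values are made up for illustration; only the `scipy` calls are real library functions. The sketch assumes the condensed form described in the docstring is acceptable as a return value.

```python
import numpy as np
from scipy.cluster import hierarchy as sch
from scipy.sparse import csr_matrix
from scipy.spatial.distance import pdist, squareform


def cosine_distance_condensed(x):
    """Hypothetical distance function: cosine distances in condensed form."""
    dense = x.toarray() if hasattr(x, "toarray") else np.asarray(x)
    # pdist returns the upper triangle as a 1-D array of length n * (n - 1) / 2.
    # Clip guards against tiny negative values caused by floating-point rounding.
    return np.clip(pdist(dense, metric="cosine"), 0.0, None)


def ward_linkage(distances):
    """Hypothetical linkage function mirroring the documented default."""
    return sch.linkage(distances, "ward", optimal_ordering=True)


# Toy stand-in for a topic-by-term matrix (4 topics, 3 terms); values are invented.
toy_ctfidf = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]))

condensed = cosine_distance_condensed(toy_ctfidf)  # shape (6,)
square = squareform(condensed)                     # shape (4, 4), zero diagonal
Z = ward_linkage(condensed)                        # one row per merge

# Each row of Z is (cluster_a, cluster_b, distance, size of the merged cluster).
for cluster_a, cluster_b, distance, size in Z:
    print(int(cluster_a), int(cluster_b), round(float(distance), 4), int(size))

# With a fitted model, the same callables would be passed as:
# topic_model.hierarchical_topics(
#     docs,
#     linkage_function=ward_linkage,
#     distance_function=cosine_distance_condensed,
# )
```

Returning the condensed form avoids building the full square matrix; `squareform` converts between the two layouts if the square form is needed for inspection.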
````python
        return topics_per_class

    def hierarchical_topics(
        self,
        docs: List[str],
        use_ctfidf: bool = True,
        linkage_function: Callable[[csr_matrix], np.ndarray] | None = None,
        distance_function: Callable[[csr_matrix], csr_matrix] | None = None,
    ) -> pd.DataFrame:
        """Create a hierarchy of topics.

        To create this hierarchy, BERTopic needs to be already fitted once.
        Then, a hierarchy is calculated on the distance matrix of the c-TF-IDF or topic embeddings
        representation using `scipy.cluster.hierarchy.linkage`.

        Based on that hierarchy, we calculate the topic representation at each
        merged step. This is a local representation, as we only assume that the
        chosen step is merged and not all others which typically improves the
        topic representation.

        Arguments:
            docs: The documents you used when calling either `fit` or `fit_transform`
            use_ctfidf: Whether to calculate distances between topics based on c-TF-IDF embeddings. If False, the
                        embeddings from the embedding model are used.
            linkage_function: The linkage function to use. Default is:
                              `lambda x: sch.linkage(x, 'ward', optimal_ordering=True)`
            distance_function: The distance function to use on the c-TF-IDF matrix. Default is:
                               `lambda x: 1 - cosine_similarity(x)`.
                               You can pass any function that returns either a square matrix of
                               shape (n_samples, n_samples) with zeros on the diagonal and
                               non-negative values or condensed distance matrix of shape
                               (n_samples * (n_samples - 1) / 2,) containing the upper
                               triangular of the distance matrix.

        Returns:
            hierarchical_topics: A dataframe that contains a hierarchy of topics
                                 represented by their parents and their children

        Examples:
        ```python
        from bertopic import BERTopic
        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(docs)
        hierarchical_topics = topic_model.hierarchical_topics(docs)
        ```

        A custom linkage function can be used as follows:

        ```python
        from scipy.cluster import hierarchy as sch
        from bertopic import BERTopic
        topic_model = BERTopic()
        topics, probs = topic_model.fit_transform(docs)

        # Hierarchical topics
        linkage_function = lambda x: sch.linkage(x, 'ward', optimal_ordering=True)
        hierarchical_topics = topic_model.hierarchical_topics(docs, linkage_function=linkage_function)
        ```
        """
        check_documents_type(docs)
````