hub / github.com/MaartenGr/BERTopic / hierarchical_topics

Method hierarchical_topics

bertopic/_bertopic.py:1035–1202

Create a hierarchy of topics. BERTopic must already be fitted before this hierarchy can be created. The hierarchy is then calculated on the distance matrix of the c-TF-IDF or topic-embedding representation using `scipy.cluster.hierarchy.linkage`. Based on that hierarchy, the topic representation is calculated at each merge step.
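The distance-then-linkage step described above can be sketched on toy data. This is a minimal illustration, not BERTopic's internal code: the `topic_term` matrix is invented, and zeroing the diagonal and converting to condensed form with `squareform` are assumptions made here so that the input to `sch.linkage` is well-formed. The cosine distance and the `'ward'` linkage with `optimal_ordering=True` mirror the defaults quoted in the docstring.

```python
import numpy as np
from scipy.cluster import hierarchy as sch
from scipy.spatial.distance import squareform

# Toy stand-in for a topic-term (c-TF-IDF-like) matrix: 4 topics x 5 terms.
# The values are invented for illustration only.
topic_term = np.array([
    [1.0, 0.9, 0.0, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.8, 0.0],
    [0.0, 0.0, 0.9, 1.0, 0.2],
])

# Cosine distance = 1 - cosine similarity, computed on L2-normalised rows.
unit = topic_term / np.linalg.norm(topic_term, axis=1, keepdims=True)
distances = 1.0 - unit @ unit.T

# Assumption: clean up floating-point noise so the matrix is a valid
# distance matrix (zero diagonal, non-negative entries).
np.fill_diagonal(distances, 0.0)
distances = np.clip(distances, 0.0, None)

# Convert the square matrix to condensed form and build the linkage matrix.
condensed = squareform(distances, checks=False)
Z = sch.linkage(condensed, "ward", optimal_ordering=True)

# Each row of Z describes one merge: [cluster_a, cluster_b, distance, size].
print(Z)
```

Each of the `n - 1` rows in `Z` is one merge step; a per-merge topic representation would be computed for the cluster formed at that row.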

(
    self,
    docs: List[str],
    use_ctfidf: bool = True,
    linkage_function: Callable[[csr_matrix], np.ndarray] | None = None,
    distance_function: Callable[[csr_matrix], csr_matrix] | None = None,
)
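As an illustration of the `distance_function` parameter in the signature above, here is a hedged sketch of a custom distance function. The name `euclidean_condensed` and the toy matrix are hypothetical; the choice of Euclidean distance is arbitrary. According to the docstring, the function may return either a square distance matrix or a condensed one, and this sketch returns the condensed form via `scipy.spatial.distance.pdist`.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from scipy.spatial.distance import pdist


def euclidean_condensed(x):
    """Hypothetical distance function: condensed Euclidean distances between rows.

    Accepts a sparse matrix (such as a c-TF-IDF matrix) or a dense array
    (such as topic embeddings) and returns a 1-D array of length
    n * (n - 1) / 2.
    """
    dense = x.toarray() if issparse(x) else np.asarray(x)
    return pdist(dense, metric="euclidean")


# Toy sparse matrix standing in for a c-TF-IDF matrix (3 topics x 3 terms).
toy_ctfidf = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]))
condensed = euclidean_condensed(toy_ctfidf)
print(condensed.shape)

# With a fitted model, the function would be passed as:
# hierarchical_topics = topic_model.hierarchical_topics(
#     docs, distance_function=euclidean_condensed
# )
```

Densifying with `toarray()` is acceptable here because the number of topics is usually small; for a very wide vocabulary a sparse-aware metric may be preferable.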

Source from the content-addressed store, hash-verified

def hierarchical_topics(
    self,
    docs: List[str],
    use_ctfidf: bool = True,
    linkage_function: Callable[[csr_matrix], np.ndarray] | None = None,
    distance_function: Callable[[csr_matrix], csr_matrix] | None = None,
) -> pd.DataFrame:
    """Create a hierarchy of topics.

    To create this hierarchy, BERTopic needs to be already fitted once.
    Then, a hierarchy is calculated on the distance matrix of the c-TF-IDF or topic embeddings
    representation using `scipy.cluster.hierarchy.linkage`.

    Based on that hierarchy, we calculate the topic representation at each
    merged step. This is a local representation, as we only assume that the
    chosen step is merged and not all others which typically improves the
    topic representation.

    Arguments:
        docs: The documents you used when calling either `fit` or `fit_transform`
        use_ctfidf: Whether to calculate distances between topics based on c-TF-IDF embeddings. If False, the
                    embeddings from the embedding model are used.
        linkage_function: The linkage function to use. Default is:
                          `lambda x: sch.linkage(x, 'ward', optimal_ordering=True)`
        distance_function: The distance function to use on the c-TF-IDF matrix. Default is:
                           `lambda x: 1 - cosine_similarity(x)`.
                           You can pass any function that returns either a square matrix of
                           shape (n_samples, n_samples) with zeros on the diagonal and
                           non-negative values or condensed distance matrix of shape
                           (n_samples * (n_samples - 1) / 2,) containing the upper
                           triangular of the distance matrix.

    Returns:
        hierarchical_topics: A dataframe that contains a hierarchy of topics
                             represented by their parents and their children

    Examples:
    ```python
    from bertopic import BERTopic
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)
    hierarchical_topics = topic_model.hierarchical_topics(docs)
    ```

    A custom linkage function can be used as follows:

    ```python
    from scipy.cluster import hierarchy as sch
    from bertopic import BERTopic
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)

    # Hierarchical topics
    linkage_function = lambda x: sch.linkage(x, 'ward', optimal_ordering=True)
    hierarchical_topics = topic_model.hierarchical_topics(docs, linkage_function=linkage_function)
    ```
    """
    check_documents_type(docs)
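The `Returns` entry describes a dataframe of parents and their children. The following is a rough sketch of how a SciPy linkage matrix can be unrolled into such a table. The column names (`parent_id`, `child_left_id`, `child_right_id`, `distance`, `topics`) and the toy distances are hypothetical and are not taken from the BERTopic source, which is truncated above before that logic appears.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy as sch

# Toy condensed distances for 4 topics, in pair order
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3). Values are invented.
condensed = np.array([0.1, 0.9, 0.8, 0.9, 0.85, 0.2])
Z = sch.linkage(condensed, "ward", optimal_ordering=True)

n_topics = Z.shape[0] + 1
members = {i: [i] for i in range(n_topics)}  # cluster id -> original topics in it

rows = []
for step, (left, right, distance, _size) in enumerate(Z):
    left, right = int(left), int(right)
    parent = n_topics + step  # SciPy numbers new clusters n, n + 1, ...
    members[parent] = sorted(members[left] + members[right])
    rows.append({
        "parent_id": parent,           # hypothetical column names
        "child_left_id": left,
        "child_right_id": right,
        "distance": float(distance),
        "topics": members[parent],
    })

hierarchy = pd.DataFrame(rows)
print(hierarchy)
```

The `topics` column lists which original topics are merged under each parent, which is the set of documents a local topic representation for that merge step would be computed from.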

Callers 4

test_full_model · Function · 0.80
test_hierarchy · Function · 0.80
test_linkage · Function · 0.80
test_tree · Function · 0.80

Calls 8

_preprocess_text · Method · 0.95
get_topic · Method · 0.95
check_documents_type · Function · 0.90
validate_distance_matrix · Function · 0.90
get_unique_distances · Function · 0.90
transform · Method · 0.45

Tested by 4

test_full_model · Function · 0.64
test_hierarchy · Function · 0.64
test_linkage · Function · 0.64
test_tree · Function · 0.64