hub / github.com/MaartenGr/BERTopic / get_document_info

Method get_document_info

bertopic/_bertopic.py:1734–1824

Get information about the documents on which the topic model was trained, including the documents themselves, their respective topics, the name of each topic, the top n words of each topic, whether each document is a representative document, and the clustering probability if the cluster model supports it.

(
        self,
        docs: List[str],
        df: pd.DataFrame = None,
        metadata: Mapping[str, Any] | None = None,
    )
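The `metadata` parameter maps a column name to the values for that column. A minimal sketch of a pre-flight check follows, assuming (this is not confirmed by the source shown here) that every metadata column needs exactly one value per document. `check_metadata_lengths` is a hypothetical helper for illustration and is not part of BERTopic.

```python
from typing import Any, List, Mapping, Sequence


def check_metadata_lengths(docs: List[str], metadata: Mapping[str, Sequence[Any]]) -> None:
    """Raise ValueError if any metadata column is not aligned with docs.

    Hypothetical helper for illustration; it is not part of BERTopic.
    """
    for column, values in metadata.items():
        if len(values) != len(docs):
            raise ValueError(
                f"Column {column!r} has {len(values)} values for {len(docs)} documents"
            )


docs = ["first document", "second document", "third document"]
metadata = {"Topic_distribution": [0.7, 0.0, 0.8]}

# Passes silently because the column has one value per document.
check_metadata_lengths(docs, metadata)
```

Running a check like this before calling `get_document_info(docs, metadata=metadata)` surfaces a misaligned column with a message that names it.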

Source from the content-addressed store, hash-verified

def get_document_info(
    self,
    docs: List[str],
    df: pd.DataFrame = None,
    metadata: Mapping[str, Any] | None = None,
) -> pd.DataFrame:
    """Get information about the documents on which the topic was trained
    including the documents themselves, their respective topics, the name
    of each topic, the top n words of each topic, whether it is a
    representative document, and probability of the clustering if the cluster
    model supports it.

    There are also options to include other meta data, such as the topic
    distributions or the x and y coordinates of the reduced embeddings.

    Arguments:
        docs: The documents on which the topic model was trained.
        df: A dataframe containing the metadata and the documents on which
            the topic model was originally trained on.
        metadata: A dictionary with meta data for each document in the form
                  of column name (key) and the respective values (value).

    Returns:
        document_info: A dataframe with several statistics regarding
                       the documents on which the topic model was trained.

    Usage:

    To get the document info, you will only need to pass the documents on which
    the topic model was trained:

    ```python
    document_info = topic_model.get_document_info(docs)
    ```

    There are additionally options to include meta data, such as the topic
    distributions. Moreover, we can pass the original dataframe that contains
    the documents and extend it with the information retrieved from BERTopic:

    ```python
    from sklearn.datasets import fetch_20newsgroups

    # The original data in a dataframe format to include the target variable
    data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
    df = pd.DataFrame({"Document": data['data'], "Class": data['target']})

    # Add information about the percentage of the document that relates to the topic
    topic_distr, _ = topic_model.approximate_distribution(docs, batch_size=1000)
    distributions = [distr[topic] if topic != -1 else 0 for topic, distr in zip(topics, topic_distr)]

    # Create our documents dataframe using the original dataframe and meta data about
    # the topic distributions
    document_info = topic_model.get_document_info(docs, df=df,
                                                  metadata={"Topic_distribution": distributions})
    """
    check_documents_type(docs)
    if df is not None:
        document_info = df.copy()
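The list comprehension in the docstring example picks, for each document, the weight that its distribution assigns to the document's own topic, and falls back to 0 for outliers (topic -1). A small worked sketch with made-up numbers follows. It uses plain lists in place of the array returned by `approximate_distribution`, and the values of `topics` and `topic_distr` are illustrative assumptions, not real model output.

```python
# Assigned topic per document; -1 marks an outlier document.
topics = [0, -1, 1]

# One row per document, one weight per topic (illustrative values).
topic_distr = [
    [0.7, 0.3],
    [0.5, 0.5],
    [0.2, 0.8],
]

# Same expression as in the docstring: take the weight of the document's own
# topic, or 0 when the document is an outlier.
distributions = [
    distr[topic] if topic != -1 else 0
    for topic, distr in zip(topics, topic_distr)
]

print(distributions)  # [0.7, 0, 0.8]
```

The `topic != -1` guard matters because indexing with -1 would otherwise silently read the last topic's weight instead of signalling an outlier.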

Callers 1

test_full_model · Function · 0.80

Calls 3

get_topic_info · Method · 0.95

get_topic · Method · 0.95

check_documents_type · Function · 0.90

Tested by 1

test_full_model · Function · 0.64