hub / github.com/MaartenGr/BERTopic / get_document_info

Method get_document_info

bertopic/_bertopic.py:1734–1824

Get information about the documents on which the topic model was trained, including the documents themselves, their respective topics, the name of each topic, the top n words of each topic, whether each is a representative document, and the probability of the clustering if the cluster model supports it.

(
        self,
        docs: List[str],
        df: pd.DataFrame = None,
        metadata: Mapping[str, Any] | None = None,
    )
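The `metadata` argument maps a column name to the values for that column, one per document. The sketch below illustrates that shape in plain Python. It assumes each value list must have the same length as `docs`; that check and the helper name `validate_metadata` are hypothetical and not taken from the BERTopic source.

```python
from typing import Any, List, Mapping


def validate_metadata(docs: List[str], metadata: Mapping[str, Any]) -> None:
    """Hypothetical helper (not part of BERTopic): require one value per document."""
    for column, values in metadata.items():
        if len(values) != len(docs):
            raise ValueError(
                f"Column {column!r} has {len(values)} values for {len(docs)} documents"
            )


docs = ["first document", "second document", "third document"]

# Column name (key) -> per-document values (value), in the same order as `docs`
metadata = {"Source": ["blog", "forum", "blog"]}

validate_metadata(docs, metadata)  # lengths match, so nothing is raised
```

Running a check of this kind before calling `get_document_info` can surface a misaligned column early, on the assumption that the values are meant to line up with `docs` row by row.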

Source from the content-addressed store, hash-verified

    def get_document_info(
        self,
        docs: List[str],
        df: pd.DataFrame = None,
        metadata: Mapping[str, Any] | None = None,
    ) -> pd.DataFrame:
        """Get information about the documents on which the topic model was trained
        including the documents themselves, their respective topics, the name
        of each topic, the top n words of each topic, whether it is a
        representative document, and probability of the clustering if the cluster
        model supports it.

        There are also options to include other metadata, such as the topic
        distributions or the x and y coordinates of the reduced embeddings.

        Arguments:
            docs: The documents on which the topic model was trained.
            df: A dataframe containing the metadata and the documents on which
                the topic model was originally trained.
            metadata: A dictionary with metadata for each document in the form
                      of column name (key) and the respective values (value).

        Returns:
            document_info: A dataframe with several statistics regarding
                           the documents on which the topic model was trained.

        Usage:

        To get the document info, you will only need to pass the documents on which
        the topic model was trained:

        ```python
        document_info = topic_model.get_document_info(docs)
        ```

        There are additionally options to include metadata, such as the topic
        distributions. Moreover, we can pass the original dataframe that contains
        the documents and extend it with the information retrieved from BERTopic:

        ```python
        import pandas as pd
        from sklearn.datasets import fetch_20newsgroups

        # The original data in a dataframe format to include the target variable
        data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
        df = pd.DataFrame({"Document": data['data'], "Class": data['target']})

        # Add information about the percentage of the document that relates to the topic
        # (`docs` and `topics` are assumed to come from an earlier fit of the topic model)
        topic_distr, _ = topic_model.approximate_distribution(docs, batch_size=1000)
        distributions = [distr[topic] if topic != -1 else 0 for topic, distr in zip(topics, topic_distr)]

        # Create our documents dataframe using the original dataframe and metadata about
        # the topic distributions
        document_info = topic_model.get_document_info(docs, df=df,
                                                      metadata={"Topic_distribution": distributions})
        ```
        """
        check_documents_type(docs)
        if df is not None:
            document_info = df.copy()
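The list comprehension in the docstring picks, for each document, the weight of its assigned topic from that document's row of the topic distribution, and uses 0 for outliers (topic -1). The following sketch reproduces that step with made-up values standing in for the output of `approximate_distribution`; the numbers and the helper name `assigned_topic_weight` are illustrative assumptions, not part of BERTopic.

```python
from typing import List, Sequence


def assigned_topic_weight(
    topics: Sequence[int], topic_distr: Sequence[Sequence[float]]
) -> List[float]:
    """For each document, return the weight of its own topic; 0 for outliers (-1)."""
    return [
        distr[topic] if topic != -1 else 0
        for topic, distr in zip(topics, topic_distr)
    ]


# Made-up values: three documents, two topics (columns 0 and 1)
topics = [0, 1, -1]
topic_distr = [
    [0.7, 0.3],
    [0.2, 0.8],
    [0.0, 0.0],
]

distributions = assigned_topic_weight(topics, topic_distr)
print(distributions)  # [0.7, 0.8, 0]
```

The explicit `topic != -1` guard matters because indexing a Python sequence with -1 returns its last element, so without the guard an outlier document would silently receive the weight of the last topic.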

Callers 1

test_full_model · Function · 0.80

Calls 3

get_topic_info · Method · 0.95
get_topic · Method · 0.95
check_documents_type · Function · 0.90

Tested by 1

test_full_model · Function · 0.64