Get information about the documents on which the topic model was trained, including the documents themselves, their respective topics, the name of each topic, the top n words of each topic, whether each document is a representative document, and the clustering probability if the cluster model supports it.
get_document_info(
    self,
    docs: List[str],
    df: pd.DataFrame = None,
    metadata: Mapping[str, Any] | None = None,
)
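
The `metadata` argument in the signature above maps a column name to one value per document. The following is a minimal sketch of that shape using plain Python lists; the `topics` and `topic_distr` values are made-up stand-ins for a fitted model's topic assignments and per-document topic weights, and `assigned_topic_share` is a hypothetical helper, not part of BERTopic.

```python
# Hypothetical stand-ins for a fitted model's outputs (not real BERTopic results):
# one assigned topic per document, where -1 marks an outlier document,
# and one row of per-topic weights per document.
topics = [0, 2, -1, 1]
topic_distr = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.4, 0.2],
    [0.2, 0.5, 0.3],
]


def assigned_topic_share(topics, topic_distr):
    """Return, per document, the weight of its assigned topic (0 for outliers)."""
    return [
        distr[topic] if topic != -1 else 0
        for topic, distr in zip(topics, topic_distr)
    ]


distributions = assigned_topic_share(topics, topic_distr)

# Column name (key) -> one value per document (value).
metadata = {"Topic_distribution": distributions}
print(metadata)  # {'Topic_distribution': [0.7, 0.6, 0, 0.5]}
```

The explicit `topic != -1` check matters here: without it, an outlier's `-1` would index the last entry of its weight row instead of yielding 0.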
            )

    def get_document_info(
        self,
        docs: List[str],
        df: pd.DataFrame = None,
        metadata: Mapping[str, Any] | None = None,
    ) -> pd.DataFrame:
        """Get information about the documents on which the topic model was trained,
        including the documents themselves, their respective topics, the name
        of each topic, the top n words of each topic, whether each document is a
        representative document, and the clustering probability if the cluster
        model supports it.

        There are also options to include other metadata, such as the topic
        distributions or the x and y coordinates of the reduced embeddings.

        Arguments:
            docs: The documents on which the topic model was trained.
            df: A dataframe containing the metadata and the documents on which
                the topic model was originally trained.
            metadata: A dictionary with metadata for each document in the form
                      of column name (key) and the respective values (value).

        Returns:
            document_info: A dataframe with several statistics regarding
                           the documents on which the topic model was trained.

        Usage:

        To get the document info, you only need to pass the documents on which
        the topic model was trained:

        ```python
        document_info = topic_model.get_document_info(docs)
        ```

        You can also include metadata, such as the topic distributions, and pass
        the original dataframe that contains the documents so that it is
        extended with the information retrieved from BERTopic:

        ```python
        import pandas as pd
        from sklearn.datasets import fetch_20newsgroups

        # The original data in a dataframe format to include the target variable
        data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
        df = pd.DataFrame({"Document": data['data'], "Class": data['target']})

        # Assumes `topic_model` was already fitted on these documents
        docs = data['data']
        topics = topic_model.topics_

        # Add information about the percentage of the document that relates to the topic
        topic_distr, _ = topic_model.approximate_distribution(docs, batch_size=1000)
        distributions = [distr[topic] if topic != -1 else 0 for topic, distr in zip(topics, topic_distr)]

        # Create our documents dataframe using the original dataframe and metadata about
        # the topic distributions
        document_info = topic_model.get_document_info(
            docs, df=df, metadata={"Topic_distribution": distributions}
        )
        ```
        """
        check_documents_type(docs)
        if df is not None:
            document_info = df.copy()