Create topics over time. To create the topics over time, BERTopic needs to be already fitted once. From the fitted models, the c-TF-IDF representations are calculated at each timestamp t. Then, the c-TF-IDF representations at timestamp t are averaged with the global c-TF-IDF representations in order to fine-tune the local representations.
topics_over_time(
    self,
    docs: List[str],
    timestamps: Union[List[str], List[int]],
    topics: List[int] | None = None,
    nr_bins: int | None = None,
    datetime_format: str | None = None,
    evolution_tuning: bool = True,
    global_tuning: bool = True,
) -> pd.DataFrame
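
The two tuning flags both come down to averaging weight vectors, which a toy example can make concrete. The following is a minimal sketch in plain Python, not BERTopic's implementation: `average` and `tune_over_time` are hypothetical helpers, a single topic is represented by one small list of term weights per timestamp, and any normalization the library applies is left out. It also assumes that evolution tuning averages against the untuned vector at timestamp t-1, which the docstring does not specify.

```python
from typing import List, Optional

Vector = List[float]


def average(a: Vector, b: Vector) -> Vector:
    """Element-wise mean of two equally sized weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def tune_over_time(
    local: List[Vector],
    global_vec: Vector,
    evolution_tuning: bool = True,
    global_tuning: bool = True,
) -> List[Vector]:
    """Tune one topic's per-timestamp term weights (hypothetical helper).

    local[t] holds the weights computed only from the documents at
    timestamp t; global_vec holds the weights computed from all documents.
    """
    tuned: List[Vector] = []
    previous: Optional[Vector] = None
    for current in local:
        vec = list(current)
        # Evolution tuning: blend with the weights at timestamp t-1.
        # Assumption: the untuned vector at t-1 is used here.
        if evolution_tuning and previous is not None:
            vec = average(vec, previous)
        # Global tuning: blend with the weights over the whole corpus.
        if global_tuning:
            vec = average(vec, global_vec)
        tuned.append(vec)
        previous = current
    return tuned


# Weights for three terms at two timestamps, plus the global weights.
local_weights = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
global_weights = [0.5, 0.25, 0.25]

both = tune_over_time(local_weights, global_weights)
# both == [[0.75, 0.125, 0.125], [0.5, 0.25, 0.25]]

global_only = tune_over_time(local_weights, global_weights, evolution_tuning=False)
# global_only == [[0.75, 0.125, 0.125], [0.25, 0.375, 0.375]]
```

In `global_only`, the first term ends up with a weight of 0.25 at the second timestamp even though its local weight there is 0.0. This is the effect the `global_tuning` description refers to: terms absent from the documents at a timestamp can still enter that timestamp's representation through the global average.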
| 795 | return self |
| 796 | |
| 797 | def topics_over_time( |
| 798 | self, |
| 799 | docs: List[str], |
| 800 | timestamps: Union[List[str], List[int]], |
| 801 | topics: List[int] | None = None, |
| 802 | nr_bins: int | None = None, |
| 803 | datetime_format: str | None = None, |
| 804 | evolution_tuning: bool = True, |
| 805 | global_tuning: bool = True, |
| 806 | ) -> pd.DataFrame: |
| 807 | """Create topics over time. |
| 808 | |
| 809 | To create the topics over time, BERTopic needs to be already fitted once. |
| 810 | From the fitted models, the c-TF-IDF representations are calculated at |
| 811 | each timestamp t. Then, the c-TF-IDF representations at timestamp t are |
| 812 | averaged with the global c-TF-IDF representations in order to fine-tune the |
| 813 | local representations. |
| 814 | |
| 815 | Note: |
| 816 | Make sure to use a limited number of unique timestamps (<100), as the |
| 817 | c-TF-IDF representation is calculated at every unique timestamp. |
| 818 | A large number of unique timestamps can therefore take a long time to process. |
| 819 | Moreover, there are few use cases where you would want to compare |
| 820 | topic representations across more than 100 different timestamps. |
| 821 | |
| 822 | Arguments: |
| 823 | docs: The documents you used when calling either `fit` or `fit_transform` |
| 824 | timestamps: The timestamp of each document. This can be either a list of strings or ints. |
| 825 | If it is a list of strings, then the datetime format will be automatically |
| 826 | inferred. If it is a list of ints, then the documents will be ordered in |
| 827 | ascending order. |
| 828 | topics: A list of topics where each topic is related to a document in `docs` and |
| 829 | a timestamp in `timestamps`. You can use this to apply topics_over_time on |
| 830 | a subset of the data. Make sure that `docs`, `timestamps`, and `topics` |
| 831 | all correspond to one another and have the same size. |
| 832 | nr_bins: The number of bins you want to create for the timestamps. The left edge of each |
| 833 | interval is used as the timestamp. An additional column is created with the |
| 834 | entire interval. |
| 835 | datetime_format: The datetime format of the timestamps if they are strings, e.g. "%d/%m/%Y". |
| 836 | Set this to None to have the format detected automatically. |
| 837 | See strftime documentation for more information on choices: |
| 838 | https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior. |
| 839 | evolution_tuning: Fine-tune each topic representation at timestamp *t* by averaging its |
| 840 | c-TF-IDF matrix with the c-TF-IDF matrix at timestamp *t-1*. This creates |
| 841 | evolutionary topic representations. |
| 842 | global_tuning: Fine-tune each topic representation at timestamp *t* by averaging its c-TF-IDF matrix |
| 843 | with the global c-TF-IDF matrix. Turn this off to keep words that do not occur in |
| 844 | the documents at timestamp *t* out of the topic representations. |
| 845 | |
| 846 | Returns: |
| 847 | topics_over_time: A dataframe that contains the topic, words, and frequency of topic |
| 848 | at timestamp *t*. |
| 849 | |
| 850 | Examples: |
| 851 | The timestamps variable represents the timestamp of each document. If you have over |
| 852 | 100 unique timestamps, it is advised to bin the timestamps as shown below: |
| 853 | |
| 854 | ```python |