hub / github.com/MaartenGr/BERTopic / topics_over_time

Method topics_over_time

bertopic/_bertopic.py:797–954

Create topics over time. BERTopic must already be fitted before this method is called. From the fitted models, the c-TF-IDF representations are calculated at each timestamp t. The c-TF-IDF representations at timestamp t are then averaged with the global c-TF-IDF representations in order to fine-tune the local representations.

(
        self,
        docs: List[str],
        timestamps: Union[List[str], List[int]],
        topics: List[int] | None = None,
        nr_bins: int | None = None,
        datetime_format: str | None = None,
        evolution_tuning: bool = True,
        global_tuning: bool = True,
    )
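
The two tuning flags in the signature can be pictured with a small sketch. It is not BERTopic's implementation: the helper names (`l1_normalize`, `average`, `tune`), the toy vocabulary and weights, the L1 normalization step, and the order in which the two averages are applied are all assumptions made for illustration. Only the idea of averaging the representation at timestamp t with the one at t-1 (`evolution_tuning`) and with the global one (`global_tuning`) comes from the docstring.

```python
from typing import List, Optional

Vector = List[float]


def l1_normalize(vec: Vector) -> Vector:
    """Scale non-negative weights so they sum to 1 (assumed preprocessing step)."""
    total = sum(vec)
    return [v / total for v in vec] if total else list(vec)


def average(a: Vector, b: Vector) -> Vector:
    """Element-wise mean of two equally long weight vectors."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def tune(
    local: Vector,
    previous: Optional[Vector] = None,
    global_weights: Optional[Vector] = None,
) -> Vector:
    """Hypothetical helper: average a topic's weights at timestamp t with
    the already tuned weights at t-1 and/or the global weights."""
    tuned = l1_normalize(local)
    if previous is not None:  # evolution tuning (assumed to run first)
        tuned = average(tuned, previous)
    if global_weights is not None:  # global tuning
        tuned = average(tuned, l1_normalize(global_weights))
    return tuned


# Toy weights for one topic over the vocabulary ["price", "market", "vaccine"].
local_t = [3.0, 1.0, 0.0]         # "vaccine" does not occur at timestamp t
previous_t = [0.5, 0.5, 0.0]      # tuned weights from timestamp t-1
global_weights = [1.0, 1.0, 2.0]  # weights over all documents

without_global = tune(local_t)                            # [0.75, 0.25, 0.0]
evolved = tune(local_t, previous=previous_t)              # [0.625, 0.375, 0.0]
with_global = tune(local_t, global_weights=global_weights)  # [0.5, 0.25, 0.25]
```

In this toy case "vaccine" has zero weight at timestamp t but ends up with weight 0.25 after global averaging, which is the effect the `global_tuning` description warns about.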

Source from the content-addressed store, hash-verified

        return self

    def topics_over_time(
        self,
        docs: List[str],
        timestamps: Union[List[str], List[int]],
        topics: List[int] | None = None,
        nr_bins: int | None = None,
        datetime_format: str | None = None,
        evolution_tuning: bool = True,
        global_tuning: bool = True,
    ) -> pd.DataFrame:
        """Create topics over time.

        To create the topics over time, BERTopic needs to be already fitted once.
        From the fitted models, the c-TF-IDF representations are calculated at
        each timestamp t. Then, the c-TF-IDF representations at timestamp t are
        averaged with the global c-TF-IDF representations in order to fine-tune the
        local representations.

        Note:
            Make sure to use a limited number of unique timestamps (<100) as the
            c-TF-IDF representation will be calculated at each single unique timestamp.
            Having a large number of unique timestamps can take some time to be calculated.
            Moreover, there aren't many use-cases where you would like to see the difference
            in topic representations over more than 100 different timestamps.

        Arguments:
            docs: The documents you used when calling either `fit` or `fit_transform`
            timestamps: The timestamp of each document. This can be either a list of strings or ints.
                        If it is a list of strings, then the datetime format will be automatically
                        inferred. If it is a list of ints, then the documents will be ordered in
                        ascending order.
            topics: A list of topics where each topic is related to a document in `docs` and
                    a timestamp in `timestamps`. You can use this to apply topics_over_time on
                    a subset of the data. Make sure that `docs`, `timestamps`, and `topics`
                    all correspond to one another and have the same size.
            nr_bins: The number of bins you want to create for the timestamps. The left interval will
                     be chosen as the timestamp. An additional column will be created with the
                     entire interval.
            datetime_format: The datetime format of the timestamps if they are strings, eg "%d/%m/%Y".
                             Set this to None if you want to have it automatically detect the format.
                             See strftime documentation for more information on choices:
                             https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.
            evolution_tuning: Fine-tune each topic representation at timestamp *t* by averaging its
                              c-TF-IDF matrix with the c-TF-IDF matrix at timestamp *t-1*. This creates
                              evolutionary topic representations.
            global_tuning: Fine-tune each topic representation at timestamp *t* by averaging its c-TF-IDF matrix
                           with the global c-TF-IDF matrix. Turn this off if you want to prevent words in
                           topic representations that could not be found in the documents at timestamp *t*.

        Returns:
            topics_over_time: A dataframe that contains the topic, words, and frequency of topic
                              at timestamp *t*.

        Examples:
        The timestamps variable represents the timestamp of each document. If you have over
        100 unique timestamps, it is advised to bin the timestamps as shown below:

        ```python
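
As a rough picture of what `nr_bins` does, the sketch below groups integer timestamps into equal-width intervals and uses each interval's left edge as the binned timestamp, following the `nr_bins` description. It is not taken from the library: the function `bin_timestamps`, the left-closed intervals, and the sample years are hypothetical, and the library's own binning may treat interval edges differently.

```python
from typing import List, Tuple


def bin_timestamps(timestamps: List[float], nr_bins: int) -> List[Tuple[float, float]]:
    """Hypothetical helper: map each timestamp to the (left, right) edges
    of one of `nr_bins` equal-width intervals spanning the data."""
    if nr_bins < 1:
        raise ValueError("nr_bins must be at least 1")
    lo, hi = min(timestamps), max(timestamps)
    width = (hi - lo) / nr_bins
    if width == 0:  # all timestamps identical: a single degenerate interval
        return [(float(lo), float(hi))] * len(timestamps)
    intervals = []
    for ts in timestamps:
        # Clamp so the maximum value falls into the last interval.
        idx = min(int((ts - lo) / width), nr_bins - 1)
        intervals.append((lo + idx * width, lo + (idx + 1) * width))
    return intervals


years = [2000, 2001, 2004, 2005, 2009, 2010]
intervals = bin_timestamps(years, nr_bins=2)

# The left edge stands in for the timestamp; the full interval is kept alongside.
binned = [left for left, _ in intervals]
# binned == [2000.0, 2000.0, 2000.0, 2005.0, 2005.0, 2005.0]
```

Six distinct years collapse to two binned timestamps here, so a representation would only need to be computed twice rather than six times.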

Callers 3

test_full_model · Function · 0.80
test_dynamic · Function · 0.80
test_dynamic · Function · 0.80

Calls 5

_c_tf_idf · Method · 0.95
check_is_fitted · Function · 0.90
check_documents_type · Function · 0.90
warning · Method · 0.80

Tested by 3

test_full_model · Function · 0.64
test_dynamic · Function · 0.64
test_dynamic · Function · 0.64