hub / github.com/MaartenGr/BERTopic / topics_over_time

Method topics_over_time

bertopic/_bertopic.py:797–954

Create topics over time. BERTopic must already be fitted before this method is called. From the fitted model, the c-TF-IDF representations are calculated at each timestamp t. The c-TF-IDF representations at timestamp t are then averaged with the global c-TF-IDF representations in order to fine-tune the local representations.
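
The averaging described above can be sketched for a single topic in plain Python. This is an illustration only, not the library's implementation: the helper names (`l1_normalize`, `average`, `tune_topic_vectors`), the L1 normalization step, and the order in which the two tuning steps are applied are all assumptions made for the example.

```python
from typing import List, Optional

Vector = List[float]


def l1_normalize(vec: Vector) -> Vector:
    """Scale a vector so its absolute values sum to 1 (all-zero input is returned unchanged)."""
    total = sum(abs(v) for v in vec)
    return [v / total for v in vec] if total else list(vec)


def average(a: Vector, b: Vector) -> Vector:
    """Element-wise mean of two equal-length vectors."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def tune_topic_vectors(
    local_by_timestamp: List[Vector],
    global_vector: Vector,
    evolution_tuning: bool = True,
    global_tuning: bool = True,
) -> List[Vector]:
    """Tune one topic's term-weight vector at each timestamp, in chronological order.

    Assumption: evolution tuning is applied first and the "previous" vector is
    stored before global tuning. The real method may order these differently.
    """
    normalized_global = l1_normalize(global_vector)
    tuned: List[Vector] = []
    previous: Optional[Vector] = None
    for local in local_by_timestamp:
        current = l1_normalize(local)
        if evolution_tuning and previous is not None:
            # Blend with the representation at timestamp t-1.
            current = average(current, previous)
        previous = current
        if global_tuning:
            # Blend with the representation computed over all documents.
            current = average(current, normalized_global)
        tuned.append(current)
    return tuned


# Toy vocabulary of three terms; the topic uses terms 0 and 1 early, term 2 later.
local_vectors = [[1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
global_vec = [1.0, 1.0, 2.0]
both = tune_topic_vectors(local_vectors, global_vec)
neither = tune_topic_vectors(local_vectors, global_vec, evolution_tuning=False, global_tuning=False)
```

With both options off, terms that do not occur at a timestamp keep a weight of zero at that timestamp. With either option on, they can receive weight from the previous or the global representation.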

(
        self,
        docs: List[str],
        timestamps: Union[List[str], List[int]],
        topics: List[int] | None = None,
        nr_bins: int | None = None,
        datetime_format: str | None = None,
        evolution_tuning: bool = True,
        global_tuning: bool = True,
    )
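
Before calling the method, it can help to check the inputs against the constraints stated in its docstring: `docs`, `timestamps`, and `topics` should have the same size, and more than 100 unique timestamps is a reason to pass `nr_bins`. The helper below is a hypothetical pre-flight check, not part of BERTopic; its name, its `max_unique` parameter, and its return format are assumptions.

```python
from typing import List, Optional, Sequence, Union


def check_topics_over_time_inputs(
    docs: Sequence[str],
    timestamps: Sequence[Union[str, int]],
    topics: Optional[Sequence[int]] = None,
    nr_bins: Optional[int] = None,
    max_unique: int = 100,
) -> List[str]:
    """Return human-readable problems; an empty list means the inputs look consistent."""
    problems: List[str] = []
    if len(docs) != len(timestamps):
        problems.append(
            f"docs has {len(docs)} items but timestamps has {len(timestamps)}"
        )
    if topics is not None and len(topics) != len(docs):
        problems.append(
            f"topics has {len(topics)} items but docs has {len(docs)}"
        )
    n_unique = len(set(timestamps))
    if nr_bins is None and n_unique > max_unique:
        problems.append(
            f"{n_unique} unique timestamps (more than {max_unique}); consider passing nr_bins"
        )
    return problems


# Hypothetical usage, assuming `topic_model` is an already fitted BERTopic instance
# and `docs` / `timestamps` are the lists used for fitting (nr_bins=20 is arbitrary):
#
#     if not check_topics_over_time_inputs(docs, timestamps, nr_bins=20):
#         result = topic_model.topics_over_time(docs, timestamps, nr_bins=20)
```

Returning a list of messages instead of raising lets the caller decide whether a large number of unique timestamps is an error or only a performance warning.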

Source from the content-addressed store, hash-verified

795	        return self
796
797	    def topics_over_time(
798	        self,
799	        docs: List[str],
800	        timestamps: Union[List[str], List[int]],
801	        topics: List[int] | None = None,
802	        nr_bins: int | None = None,
803	        datetime_format: str | None = None,
804	        evolution_tuning: bool = True,
805	        global_tuning: bool = True,
806	    ) -> pd.DataFrame:
807	        """Create topics over time.
808
809	        To create the topics over time, BERTopic needs to be already fitted once.
810	        From the fitted models, the c-TF-IDF representations are calculate at
811	        each timestamp t. Then, the c-TF-IDF representations at timestamp t are
812	        averaged with the global c-TF-IDF representations in order to fine-tune the
813	        local representations.
814
815	        Note:
816	            Make sure to use a limited number of unique timestamps (<100) as the
817	            c-TF-IDF representation will be calculated at each single unique timestamp.
818	            Having a large number of unique timestamps can take some time to be calculated.
819	            Moreover, there aren't many use-cases where you would like to see the difference
820	            in topic representations over more than 100 different timestamps.
821
822	        Arguments:
823	            docs: The documents you used when calling either `fit` or `fit_transform`
824	            timestamps: The timestamp of each document. This can be either a list of strings or ints.
825	                        If it is a list of strings, then the datetime format will be automatically
826	                        inferred. If it is a list of ints, then the documents will be ordered in
827	                        ascending order.
828	            topics: A list of topics where each topic is related to a document in `docs` and
829	                    a timestamp in `timestamps`. You can use this to apply topics_over_time on
830	                    a subset of the data. Make sure that `docs`, `timestamps`, and `topics`
831	                    all correspond to one another and have the same size.
832	            nr_bins: The number of bins you want to create for the timestamps. The left interval will
833	                     be chosen as the timestamp. An additional column will be created with the
834	                     entire interval.
835	            datetime_format: The datetime format of the timestamps if they are strings, eg "%d/%m/%Y".
836	                             Set this to None if you want to have it automatically detect the format.
837	                             See strftime documentation for more information on choices:
838	                             https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.
839	            evolution_tuning: Fine-tune each topic representation at timestamp t by averaging its
840	                              c-TF-IDF matrix with the c-TF-IDF matrix at timestamp t-1. This creates
841	                              evolutionary topic representations.
842	            global_tuning: Fine-tune each topic representation at timestamp t by averaging its c-TF-IDF matrix
843	                           with the global c-TF-IDF matrix. Turn this off if you want to prevent words in
844	                           topic representations that could not be found in the documents at timestamp t.
845
846	        Returns:
847	            topics_over_time: A dataframe that contains the topic, words, and frequency of topic
848	                              at timestamp t.
849
850	        Examples:
851	        The timestamps variable represents the timestamp of each document. If you have over
852	        100 unique timestamps, it is advised to bin the timestamps as shown below:
853
854	        ```python
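
The docstring's own example is cut off in the excerpt above. As a rough illustration of what `datetime_format` and `nr_bins` describe, the sketch below parses string timestamps with an explicit format and assigns each one to an equal-width bin identified by its left edge. It uses only the standard library; the function names are hypothetical, and the library's own binning may handle interval edges differently.

```python
from datetime import datetime
from typing import List, Tuple


def parse_timestamps(values: List[str], datetime_format: str) -> List[datetime]:
    """Parse string timestamps using an explicit strptime format such as "%d/%m/%Y"."""
    return [datetime.strptime(value, datetime_format) for value in values]


def bin_timestamps(
    stamps: List[datetime], nr_bins: int
) -> List[Tuple[datetime, datetime]]:
    """Map each timestamp to the (left, right) edges of one of nr_bins equal-width bins.

    Simplification: bins are half-open [left, right), except that the maximum
    timestamp is placed in the last bin.
    """
    if nr_bins < 1:
        raise ValueError("nr_bins must be at least 1")
    if not stamps:
        return []
    start, end = min(stamps), max(stamps)
    width = (end - start) / nr_bins
    bins: List[Tuple[datetime, datetime]] = []
    for stamp in stamps:
        if width.total_seconds() == 0:
            index = 0  # all timestamps are identical
        else:
            index = min(int((stamp - start) / width), nr_bins - 1)
        left = start + index * width
        bins.append((left, left + width))
    return bins


raw = ["01/01/2020", "11/01/2020", "21/01/2020", "31/01/2020"]
stamps = parse_timestamps(raw, "%d/%m/%Y")
binned = bin_timestamps(stamps, nr_bins=3)
# Using the left edge as the representative timestamp reduces 4 unique values to 3.
left_edges = [left for left, _ in binned]
```

Reducing the number of unique timestamps this way matters because, as the Note in the docstring explains, a c-TF-IDF representation is calculated for every unique timestamp.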

Callers 3

test_full_model · Function · 0.80

test_dynamic · Function · 0.80

Calls 5

_c_tf_idf · Method · 0.95

_extract_words_per_topic · Method · 0.95

check_is_fitted · Function · 0.90

check_documents_type · Function · 0.90

warning · Method · 0.80

Tested by 3

test_full_model · Function · 0.64

test_dynamic · Function · 0.64