hub / github.com/MaartenGr/BERTopic / update_topics

Method update_topics

bertopic/_bertopic.py:1488–1596

Updates the topic representation by recalculating c-TF-IDF with the new parameters as defined in this function. When you have trained a model and viewed the topics and the words that represent them, you might not be satisfied with the representation. Perhaps you forgot to remove stop_words or you want to try out a different n_gram_range.

(
        self,
        docs: List[str],
        images: List[str] | None = None,
        topics: List[int] | None = None,
        top_n_words: int = 10,
        n_gram_range: Tuple[int, int] | None = None,
        vectorizer_model: CountVectorizer = None,
        ctfidf_model: ClassTfidfTransformer = None,
        representation_model: BaseRepresentation = None,
    )
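
As a rough illustration of the `topics` argument in the signature above, the sketch below builds a per-document topic list by remapping existing assignments, so that there is one topic id for each entry in `docs`. The documents, the original assignments, and the `merge_map` dictionary are hypothetical sample data, and the final call is left commented out because it needs a fitted `topic_model`.

```python
# Hypothetical documents and per-document topic assignments;
# -1 marks outlier documents.
docs = ["cats purr", "dogs bark", "stocks fell", "bonds rallied", "misc note"]
original_topics = [0, 1, 2, 3, -1]

# Hypothetical mapping: fold topic 1 into topic 0 and topic 3 into topic 2.
merge_map = {1: 0, 3: 2}

# Topics that are not in the mapping are kept unchanged.
my_updated_topics = [merge_map.get(t, t) for t in original_topics]

# There must be exactly one topic per document.
assert len(my_updated_topics) == len(docs)
print(my_updated_topics)

# With a fitted model, the new assignments would be passed as `topics`:
# topic_model.update_topics(docs, topics=my_updated_topics)
```

The method's docstring warns that custom assignments may lead to errors if topic reduction is applied afterwards, so a remapping like this is best kept as the last step in the pipeline.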

Source from the content-addressed store, hash-verified

        return similar_topics, similarity

    def update_topics(
        self,
        docs: List[str],
        images: List[str] | None = None,
        topics: List[int] | None = None,
        top_n_words: int = 10,
        n_gram_range: Tuple[int, int] | None = None,
        vectorizer_model: CountVectorizer = None,
        ctfidf_model: ClassTfidfTransformer = None,
        representation_model: BaseRepresentation = None,
    ):
        """Updates the topic representation by recalculating c-TF-IDF with the new
        parameters as defined in this function.

        When you have trained a model and viewed the topics and the words that represent them,
        you might not be satisfied with the representation. Perhaps you forgot to remove
        stop_words or you want to try out a different n_gram_range. This function allows you
        to update the topic representation after they have been formed.

        Arguments:
            docs: The documents you used when calling either `fit` or `fit_transform`
            images: The images you used when calling either `fit` or `fit_transform`
            topics: A list of topics where each topic is related to a document in `docs`.
                    Use this variable to change or map the topics.
                    NOTE: Using a custom list of topic assignments may lead to errors if
                          topic reduction techniques are used afterwards. Make sure that
                          manually assigning topics is the last step in the pipeline
            top_n_words: The number of words per topic to extract. Setting this
                         too high can negatively impact topic embeddings as topics
                         are typically best represented by at most 10 words.
            n_gram_range: The n-gram range for the CountVectorizer.
            vectorizer_model: Pass in your own CountVectorizer from scikit-learn
            ctfidf_model: Pass in your own c-TF-IDF model to update the representations
            representation_model: Pass in a model that fine-tunes the topic representations
                                  calculated through c-TF-IDF. Models from `bertopic.representation`
                                  are supported.

        Examples:
        In order to update the topic representation, you will need to first fit the topic
        model and extract topics from them. Based on these, you can update the representation:

        ```python
        topic_model.update_topics(docs, n_gram_range=(2, 3))
        ```

        You can also use a custom vectorizer to update the representation:

        ```python
        from sklearn.feature_extraction.text import CountVectorizer
        vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
        topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
        ```

        You can also use this function to change or map the topics to something else.
        You can update them as follows:

        ```python
        topic_model.update_topics(docs, my_updated_topics)

Callers 3

test_full_model · Function · 0.80
test_online_cv · Function · 0.80
test_update_topics · Function · 0.80

Calls 8

_update_topic_size · Method · 0.95
_c_tf_idf · Method · 0.95
_create_topic_vectors · Method · 0.95
check_documents_type · Function · 0.90
check_is_fitted · Function · 0.90
warning · Method · 0.80
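
Of the calls listed above, `_c_tf_idf` is the step that recomputes the term weights. The sketch below is a simplified, standard-library-only illustration of class-based TF-IDF. It assumes the commonly documented form: a term's frequency within a topic, multiplied by log(1 + A / f), where A is the average number of words per topic and f is the term's frequency across all topics. The helper name `c_tf_idf`, the whitespace tokenizer, and the sample data are hypothetical, and this is not BERTopic's implementation, which works on a bag-of-words matrix produced by a scikit-learn `CountVectorizer`.

```python
import math
from collections import Counter, defaultdict


def c_tf_idf(docs, topics, top_n_words=3):
    """Return the highest-weighted words per topic (simplified sketch)."""
    # Pool all documents of a topic into one bag of words per topic.
    counts = defaultdict(Counter)
    for doc, topic in zip(docs, topics):
        counts[topic].update(doc.lower().split())

    # f: frequency of each term across all topics.
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)

    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    top_words = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        weights = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in topic_counts.items()
        }
        # Sort by weight (descending), then alphabetically to break ties.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        top_words[topic] = ranked[:top_n_words]
    return top_words


# Hypothetical sample data: two topics that share the frequent word "the".
docs = ["the cat sat", "the cat purred", "the market fell", "the market rallied"]
topics = [0, 0, 1, 1]
result = c_tf_idf(docs, topics, top_n_words=2)
print(result)
```

In this toy data, "the" occurs in both topics and is down-weighted, so topic-specific words such as "cat" and "market" rank first. Changing the tokenization (for example, filtering stop words or using n-grams) changes the ranking without touching the topic assignments.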

Tested by 3

test_full_modelFunction · 0.64
test_online_cvFunction · 0.64
test_update_topicsFunction · 0.64