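Updates the topic representation by recalculating c-TF-IDF with the new parameters as defined in this function. When you have trained a model and viewed the topics and the words that represent them, you might not be satisfied with the representation. Perhaps you forgot to remove stop_words or you want to try out a different n_gram_range. This function allows you to update the topic representation after the topics have been formed.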
update_topics(
self,
docs: List[str],
images: List[str] | None = None,
topics: List[int] | None = None,
top_n_words: int = 10,
n_gram_range: Tuple[int, int] | None = None,
vectorizer_model: CountVectorizer = None,
ctfidf_model: ClassTfidfTransformer = None,
representation_model: BaseRepresentation = None,
)
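To make "recalculating c-TF-IDF with the new parameters" more concrete, here is a minimal pure-Python sketch of a class-based TF-IDF ranking. It is not the library's implementation. It assumes the weighting `tf(t, c) * log(1 + A / f(t))`, where `tf(t, c)` is the normalized frequency of term `t` in topic `c`, `f(t)` is the frequency of `t` across all topics, and `A` is the average number of terms per topic. The helper names `tokenize` and `top_words`, the toy documents, and the two-word stop list are hypothetical and exist only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text, stop_words=frozenset(), ngram_range=(1, 1)):
    """Lowercase, split on whitespace, drop stop words, and emit n-grams."""
    words = [w for w in text.lower().split() if w not in stop_words]
    lo, hi = ngram_range
    return [
        " ".join(words[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(words) - n + 1)
    ]


def top_words(docs, topics, top_n_words=3, **tokenizer_kwargs):
    """Rank terms per topic with tf(t, c) * log(1 + A / f(t)) (assumed form)."""
    # Merge all documents of a topic into one bag of terms per class.
    per_topic = defaultdict(Counter)
    for doc, topic in zip(docs, topics):
        per_topic[topic].update(tokenize(doc, **tokenizer_kwargs))

    # f(t): frequency of each term across all classes.
    total = Counter()
    for counts in per_topic.values():
        total.update(counts)
    # A: average number of terms per class.
    avg_terms = sum(total.values()) / len(per_topic)

    ranked = {}
    for topic, counts in per_topic.items():
        size = sum(counts.values()) or 1  # guard against an empty class
        scores = {
            term: (freq / size) * math.log(1 + avg_terms / total[term])
            for term, freq in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ordered = sorted(scores, key=lambda t: (-scores[t], t))
        ranked[topic] = ordered[:top_n_words]
    return ranked


# Toy corpus with one topic label per document (hypothetical data).
docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the stock market fell",
    "the market rallied on the news",
]
topics = [0, 0, 1, 1]

# Same documents and topic assignments, different vectorizer-like settings.
before = top_words(docs, topics)
after = top_words(docs, topics, stop_words={"the", "on"})
bigrams = top_words(docs, topics, stop_words={"the", "on"}, ngram_range=(2, 2))

print(before)
print(after)
print(bigrams)
```

In this toy data, "the" is among the top terms of topic 0 until the stop words are removed, and switching `ngram_range` yields two-word terms only. The topic assignments never change, only the words that describe each topic, which is the kind of update this function performs on a fitted model.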
    def update_topics(
        self,
        docs: List[str],
        images: List[str] | None = None,
        topics: List[int] | None = None,
        top_n_words: int = 10,
        n_gram_range: Tuple[int, int] | None = None,
        vectorizer_model: CountVectorizer = None,
        ctfidf_model: ClassTfidfTransformer = None,
        representation_model: BaseRepresentation = None,
    ):
        """Updates the topic representation by recalculating c-TF-IDF with the new
        parameters as defined in this function.

        When you have trained a model and viewed the topics and the words that represent them,
        you might not be satisfied with the representation. Perhaps you forgot to remove
        stop_words or you want to try out a different n_gram_range. This function allows you
        to update the topic representation after the topics have been formed.

        Arguments:
            docs: The documents you used when calling either `fit` or `fit_transform`
            images: The images you used when calling either `fit` or `fit_transform`
            topics: A list of topics where each topic is related to a document in `docs`.
                    Use this variable to change or map the topics.
                    NOTE: Using a custom list of topic assignments may lead to errors if
                          topic reduction techniques are used afterwards. Make sure that
                          manually assigning topics is the last step in the pipeline.
            top_n_words: The number of words per topic to extract. Setting this
                         too high can negatively impact topic embeddings as topics
                         are typically best represented by at most 10 words.
            n_gram_range: The n-gram range for the CountVectorizer.
            vectorizer_model: Pass in your own CountVectorizer from scikit-learn
            ctfidf_model: Pass in your own c-TF-IDF model to update the representations
            representation_model: Pass in a model that fine-tunes the topic representations
                                  calculated through c-TF-IDF. Models from `bertopic.representation`
                                  are supported.

        Examples:
        In order to update the topic representation, you will need to first fit the topic
        model and extract topics from it. Based on these, you can update the representation:

        ```python
        topic_model.update_topics(docs, n_gram_range=(2, 3))
        ```

        You can also use a custom vectorizer to update the representation:

        ```python
        from sklearn.feature_extraction.text import CountVectorizer
        vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
        topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
        ```

        You can also use this function to change or map the topics to something else.
        Pass the new assignments by keyword, since the second positional parameter is `images`:

        ```python
        topic_model.update_topics(docs, topics=my_updated_topics)