Calculate the number of documents in a corpus that contain a given term. @params : term, the term to search each document for, and corpus, a collection of documents. Each document should be separated by a newline. @returns : the number of documents in the corpus that contain the term you are searching for and the number of documents in the corpus.
(term: str, corpus: str)
def document_frequency(term: str, corpus: str) -> tuple[int, int]:
    """
    Calculate the number of documents in a corpus that contain a given term.

    @params : term, the term to search each document for, and corpus, a collection
              of documents. Each document should be separated by a newline.
    @returns : the number of documents in the corpus that contain the term you are
               searching for, and the total number of documents in the corpus
    @examples :
    >>> corpus = (
    ...     "This is the first document in the corpus.\\n"
    ...     "ThIs is the second document in the corpus.\\n"
    ...     "THIS is the third document in the corpus."
    ... )
    >>> document_frequency("first", corpus)
    (1, 3)
    """
    # Lowercase the corpus and delete every punctuation character.
    corpus_without_punctuation = corpus.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    docs = corpus_without_punctuation.split("\n")
    term = term.lower()
    # Substring match: a document counts if the term appears anywhere in it.
    return (len([doc for doc in docs if term in doc]), len(docs))


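Because `document_frequency` tests `term in doc` on the whole document string, it counts substring hits as well as whole words. The sketch below contrasts that behaviour with a whole-token variant. `document_frequency_whole_word` is a hypothetical helper written only for this illustration and is not part of the original module; the sample corpus is also made up.

```python
import string


def document_frequency(term: str, corpus: str) -> tuple[int, int]:
    """Substring-based count, mirroring the function above."""
    cleaned = corpus.lower().translate(str.maketrans("", "", string.punctuation))
    docs = cleaned.split("\n")
    term = term.lower()
    return (len([doc for doc in docs if term in doc]), len(docs))


def document_frequency_whole_word(term: str, corpus: str) -> tuple[int, int]:
    """Hypothetical variant: a document counts only if the term is a whole token."""
    cleaned = corpus.lower().translate(str.maketrans("", "", string.punctuation))
    docs = cleaned.split("\n")
    term = term.lower()
    # Split each document on whitespace and compare against complete tokens.
    return (sum(term in doc.split() for doc in docs), len(docs))


corpus = "The cat sat.\nConcatenate the strings.\nA dog barked."
print(document_frequency("cat", corpus))             # (2, 3): "concatenate" also matches
print(document_frequency_whole_word("cat", corpus))  # (1, 3)
```

Which behaviour is appropriate depends on how terms are tokenized elsewhere in the pipeline; the whole-token version is only one possible choice.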
def inverse_document_frequency(df: int, n: int, smoothing=False) -> float: