Creates a TF-IDF list of dictionaries for a corpus of docs. If you opt to dump the data, the dump uses the .tfidfpkl extension (a convention chosen for clarity). You can also regenerate a TF-IDF list over an old one by passing the old file's path.
(file_names=None, prev_file_path=None, dump_path=None)
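The summary above mentions dumping to a .tfidfpkl file and reloading an old one. The sketch below shows one way that round trip could look. It assumes the file holds a single pickled `(idf, tf_idf)` pair, which matches how the function unpacks `prev_file_path`. The helper names `dump_tfidf` and `load_tfidf` and the file name `corpus.tfidfpkl` are hypothetical and not part of the original module.

```python
import os
import pickle
import tempfile


def dump_tfidf(idf, tf_idf, dump_path, name="corpus.tfidfpkl"):
    """Hypothetical helper: pickle (idf, tf_idf) as one tuple inside dump_path."""
    os.makedirs(dump_path, exist_ok=True)
    out_path = os.path.join(dump_path, name)
    with open(out_path, "wb") as fh:
        pickle.dump((idf, tf_idf), fh)
    return out_path


def load_tfidf(prev_file_path):
    """Hypothetical helper: read the (idf, tf_idf) pair back from disk."""
    with open(prev_file_path, "rb") as fh:
        idf, tf_idf = pickle.load(fh)
    return idf, tf_idf


# Toy data, invented for illustration only.
idf = {"cat": 2, "dog": 1}
tf_idf = [{"cat": 0.0, "dog": 0.35}, {"cat": 0.0}]

with tempfile.TemporaryDirectory() as tmp:
    path = dump_tfidf(idf, tf_idf, tmp)
    loaded_idf, loaded_tf_idf = load_tfidf(path)
```

Only unpickle files you created yourself, because `pickle.load` can execute arbitrary code from an untrusted file.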
import math
import pickle

# TAG (used in the log message below) is a module-level constant defined outside this excerpt.


def find_tf_idf(file_names=None, prev_file_path=None, dump_path=None):
    """Create a TF-IDF list of dictionaries for a corpus of docs.

    If you opt to dump the data, the dump uses the .tfidfpkl extension
    (a convention chosen for clarity). You can also regenerate a TF-IDF list
    over an old one by passing the old file's path.

    @Args:
    --
    file_names     : paths of the files to process; many small files are preferable to one large file.
    prev_file_path : path of an old .tfidfpkl file, if available. (default=None)
    dump_path      : directory path where the generated lists are dumped. (default=None)

    @returns:
    --
    idf    : a dict of the unique words in the corpus, with their document frequencies as values.
    tf_idf : the generated TF-IDF list of dictionaries for the given docs.
    """
    if file_names is None:
        file_names = ["./../test/testdata"]
    tf_idf = []  # one word-count dict per doc (each line of a file is a doc in this case)
    idf = {}  # word -> number of docs that contain it

    # Extend an existing TF-IDF file with new docs (memory is now the biggest issue).
    if prev_file_path:
        print(TAG, "modifying over existing file.. @", prev_file_path)
        with open(prev_file_path, "rb") as prev_file:
            idf, tf_idf = pickle.load(prev_file)
        prev_doc_count = len(idf)  # number of entries in the loaded idf dict
        prev_corpus_length = len(tf_idf)  # number of docs in the loaded list

    for f in file_names:
        # Open in text mode: "rb" would yield bytes such as b'line-inside-the-doc'.
        with open(f, "r") as file1:
            # build a word-count dict for every doc
            for line in file1:
                words = line.split()
                word_count = {}
                # count the number of docs each word appears in
                for word in set(words):
                    idf[word] = idf.get(word, 0) + 1
                # count the occurrences of every word in this doc
                for word in words:
                    word_count[word] = word_count.get(word, 0) + 1
                tf_idf.append(word_count)

    # compute the final TF-IDF value for every word in every doc (each line is a doc in this case)
    for doc in tf_idf:
        for key in doc:
            true_idf = math.log(len(tf_idf) / idf[key])
            true_tf = doc[key] / len(doc)  # len(doc) is the number of distinct words in the doc
            doc[key] = true_tf * true_idf
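To make the weighting step concrete, here is a minimal sketch that applies the same arithmetic as the listing to a few in-memory lines instead of files. The helper name `tf_idf_for_lines` and the sample sentences are hypothetical. Like the listing, it divides each count by the number of distinct words in the doc and uses `log(N / df)` with no smoothing.

```python
import math
from collections import Counter


def tf_idf_for_lines(lines):
    """Hypothetical helper: treat each line as a doc and weight its words."""
    doc_freq = Counter()  # word -> number of docs that contain it
    docs = []  # one {word: count} dict per doc
    for line in lines:
        words = line.split()
        doc_freq.update(set(words))  # each doc counts a word at most once
        docs.append(dict(Counter(words)))

    n_docs = len(docs)
    for doc in docs:
        n_distinct = len(doc)  # same denominator as the listing
        for word, count in doc.items():
            # Overwriting an existing key while iterating is safe: the dict size is unchanged.
            doc[word] = (count / n_distinct) * math.log(n_docs / doc_freq[word])
    return dict(doc_freq), docs


doc_freq, weights = tf_idf_for_lines(["the cat sat", "the dog sat down", "the cat"])
```

A word that occurs in every doc, such as "the" here, gets a weight of zero because `log(N / N)` is zero. Words that occur in fewer docs get larger weights.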