hub / github.com/geekcomputers/Python / find_tf_idf

Function find_tf_idf

tf_idf_generator.py:73–160

Creates a TF-IDF list of dictionaries for a corpus of docs. If you choose to dump the data, you can provide a file path with the .tfidfpkl extension (a convention adopted for clarity). You can also regenerate a TF-IDF list that overwrites an old one by giving the old file's path.

(file_names=None, prev_file_path=None, dump_path=None)
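
The description mentions dumping the data to a .tfidfpkl file and reusing it later through prev_file_path. The following is a minimal sketch of that round trip, assuming the file holds a pickled (idf, tf_idf) tuple, which is what the listing unpacks on load. The helper names dump_tfidf and load_tfidf and the file name corpus.tfidfpkl are hypothetical and not part of the repository.

```python
import os
import pickle
import tempfile


def dump_tfidf(idf, tf_idf, path):
    """Pickle the (idf, tf_idf) pair to `path` (hypothetical helper)."""
    with open(path, "wb") as fh:
        pickle.dump((idf, tf_idf), fh)


def load_tfidf(path):
    """Load a previously dumped (idf, tf_idf) pair (hypothetical helper)."""
    with open(path, "rb") as fh:
        idf, tf_idf = pickle.load(fh)
    return idf, tf_idf


# Toy data: document frequencies plus one weight dict per document.
idf = {"hello": 2, "world": 1}
tf_idf = [{"hello": 0.0, "world": 0.35}, {"hello": 0.0}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.tfidfpkl")  # hypothetical file name
    dump_tfidf(idf, tf_idf, path)
    loaded_idf, loaded_tf_idf = load_tfidf(path)
```

Opening the file in a with block closes the handle even if unpickling fails. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.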

Source from the content-addressed store, hash-verified

import math
import pickle

# TAG is defined at module level elsewhere in tf_idf_generator.py (not shown in this excerpt).


def find_tf_idf(file_names=None, prev_file_path=None, dump_path=None):
    """Create a TF-IDF list of dictionaries for a corpus of docs.

    If you opt for dumping the data, you can provide a file path with the .tfidfpkl
    extension (a convention adopted for clarity). You can also regenerate a TF-IDF
    list that overwrites an old one by giving the old file's path.

    @Args:
    --
    file_names      : paths of the files to process; prefer many small files over one large file.
    prev_file_path  : path of an old .tfidfpkl file, if available. (default=None)
    dump_path       : directory path where the generated lists are dumped. (default=None)

    @returns:
    --
    idf     : a dict of unique words in the corpus, with their document frequency as values.
    tf_idf  : the generated tf-idf list of dictionaries for the given docs.
    """
    if file_names is None:
        file_names = ["./../test/testdata"]
    tf_idf = []  # holds one word-count dict per doc (here, each line of a file is a doc)
    idf = {}

    # Extend an existing tf-idf file with new docs (memory is now the biggest issue).
    if prev_file_path:
        print(TAG, "modifying over existing file.. @", prev_file_path)
        with open(prev_file_path, "rb") as prev_file:
            idf, tf_idf = pickle.load(prev_file)
        prev_doc_count = len(idf)  # number of unique words loaded
        prev_corpus_length = len(tf_idf)  # number of docs loaded

    for f in file_names:
        # Open in text mode: "rb" would yield bytes such as b'line-inside-the-doc'.
        with open(f, "r") as file1:
            # create a word-count dict for every doc
            for line in file1:
                word_count = {}
                words = line.split()
                # count the number of docs each word appears in
                for word in set(words):
                    idf[word] = idf.get(word, 0) + 1
                # count the occurrences of every word in this doc
                for word in words:
                    word_count[word] = word_count.get(word, 0) + 1
                tf_idf.append(word_count)

    # compute the final TF-IDF value of every word in every doc (here, a doc is a line)
    for doc in tf_idf:
        for key in doc:
            true_idf = math.log(len(tf_idf) / idf[key])
            true_tf = doc[key] / len(doc)
            doc[key] = true_tf * true_idf
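
To make the weighting in the listing concrete, here is a small self-contained sketch that applies the same arithmetic to two in-memory lines instead of files. The function name tf_idf_per_line and the sample sentences are illustrative assumptions. As in the listing, term frequency is divided by the number of distinct words in the doc, not by its total word count.

```python
import math


def tf_idf_per_line(lines):
    """Return (idf, tf_idf) for a list of text lines, one document per line."""
    idf = {}     # word -> number of documents containing it
    tf_idf = []  # one {word: count} dict per document, later turned into weights

    for line in lines:
        words = line.split()
        for word in set(words):
            idf[word] = idf.get(word, 0) + 1
        counts = {}
        for word in words:
            counts[word] = counts.get(word, 0) + 1
        tf_idf.append(counts)

    n_docs = len(tf_idf)
    for doc in tf_idf:
        for word in doc:
            # len(doc) is the number of distinct words, mirroring the listing.
            tf = doc[word] / len(doc)
            doc[word] = tf * math.log(n_docs / idf[word])
    return idf, tf_idf


idf, tf_idf = tf_idf_per_line(["apple banana apple", "banana cherry"])
print(idf)
print(tf_idf)
```

In this toy corpus "banana" occurs in both lines, so its weight is log(2/2) = 0 in each of them, while "apple" and "cherry" each occur in a single line and receive positive weights.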

Callers

nothing calls this directly

Calls 4

paint (Function) · 0.85
load (Method) · 0.45
append (Method) · 0.45
close (Method) · 0.45

Tested by

no test coverage detected