hub / github.com/geekcomputers/Python / find_tf_idf

Function find_tf_idf

tf_idf_generator.py:73–160

Creates a TF-IDF list of dictionaries for a corpus of documents. To dump the data, provide a path with the .tfidfpkl extension (a convention chosen for clarity). You can also regenerate a TF-IDF list that overrides an old one by passing the old file's path.

(file_names=None, prev_file_path=None, dump_path=None)
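The listing on this page unpacks the previous file as `idf, tf_idf = pickle.load(...)`, which suggests that a .tfidfpkl file holds a two-element pickle. The dumping code itself is not part of the excerpt, so the round trip below is only a sketch under that assumption; the file name `corpus.tfidfpkl` and the toy values are hypothetical.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the two structures the function works with.
idf = {"apple": 2, "pie": 1}  # word -> document frequency
tf_idf = [{"apple": 0.0, "pie": 0.35}, {"apple": 0.0}]  # one dict per doc

# Hypothetical file name; only the .tfidfpkl extension comes from the docstring.
dump_file = os.path.join(tempfile.mkdtemp(), "corpus.tfidfpkl")

# Assumed dump format: a single (idf, tf_idf) tuple.
with open(dump_file, "wb") as fh:
    pickle.dump((idf, tf_idf), fh)

# Mirrors the unpacking done for prev_file_path in the listing.
with open(dump_file, "rb") as fh:
    loaded_idf, loaded_tf_idf = pickle.load(fh)
```

Opening the files in `with` blocks closes them even if pickling fails, which the bare `open(...)` call in the listing does not guarantee.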

Source from the content-addressed store, hash-verified

def find_tf_idf(file_names=None, prev_file_path=None, dump_path=None):
    """Function to create a TF-IDF list of dictionaries for a corpus of docs.
    If you opt for dumping the data, you can provide a file_path with .tfidfpkl extension (standard made for better understanding)
    and also re-generate a new tfidf list which overrides an old one by mentioning its path.

    @Args:
    --
    file_names : paths of files to be processed; you can give many small files rather than one large file.
    prev_file_path : path of old .tfidfpkl file, if available. (default=None)
    dump_path : directory path where the generated lists are dumped. (default=None)

    @returns:
    --
    idf : a dict of unique words in corpus, with their document frequency as values.
    tf_idf : the generated tf-idf list of dictionaries for mentioned docs.
    """
    if file_names is None:
        file_names = ["./../test/testdata"]
    tf_idf = []  # will hold a dict of word_count for every doc (line in a doc in this case)
    idf = {}

    # this statement is useful for altering an existing tf-idf file and adding new docs to it. (## memory is now the biggest issue)
    if prev_file_path:
        print(TAG, "modifying over existing file.. @", prev_file_path)
        idf, tf_idf = pickle.load(open(prev_file_path, "rb"))
        prev_doc_count = len(idf)
        prev_corpus_length = len(tf_idf)

    for f in file_names:
        file1 = open(
            f, "r"
        )  # never use 'rb' for textual data, it creates something like {b'line-inside-the-doc'}

        # create word_count dict for all docs
        for line in file1:
            dict = {}
            # find the number of docs a word is in
            for i in set(line.split()):
                if i in idf:
                    idf[i] += 1
                else:
                    idf[i] = 1
            for word in line.split():
                # find the count of all words in every doc
                if word not in dict:
                    dict[word] = 1
                else:
                    dict[word] += 1
            tf_idf.append(dict)
        file1.close()

    # calculating final TF-IDF values for all words in all docs (line in a doc in this case)
    for doc in tf_idf:
        for key in doc:
            true_idf = math.log(len(tf_idf) / idf[key])
            true_tf = doc[key] / len(doc)
            doc[key] = true_tf * true_idf
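The scoring loops in the listing can be tried on a tiny in-memory corpus. The sketch below is not the repository's code: `tf_idf_from_lines` is a hypothetical helper that skips file handling and pickling and treats each string as one document, as the listing does with each line. It keeps the same arithmetic, including the detail that the term count is divided by `len(doc)`, the number of distinct words in the document, rather than by the total token count.

```python
import math
from collections import Counter


def tf_idf_from_lines(lines):
    """Hypothetical in-memory variant: each line is one document."""
    # Per-document word counts, like the `dict` built for every line.
    docs = [Counter(line.split()) for line in lines]

    # Document frequency: each document contributes 1 per distinct word.
    df = Counter()
    for doc in docs:
        df.update(doc.keys())

    n_docs = len(docs)
    scores = []
    for doc in docs:
        scores.append(
            {
                # tf uses len(doc), the count of distinct words in the doc.
                word: (count / len(doc)) * math.log(n_docs / df[word])
                for word, count in doc.items()
            }
        )
    return dict(df), scores


df, scores = tf_idf_from_lines(["a b a", "b c"])
```

With this input, `b` appears in both documents, so its IDF is `log(2 / 2) = 0` and its score is zero in each; `a` scores `(2 / 2) * log(2)` in the first document and `c` scores `(1 / 2) * log(2)` in the second.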

Callers

nothing calls this directly

Calls 4

paint (Function) · 0.85

load (Method) · 0.45

append (Method) · 0.45

close (Method) · 0.45

Tested by

no test coverage detected