hub / github.com/DedSecInside/TorBot / write_data

Function write_data

torbot/modules/nlp/gather_data.py:9–33

Writes the training data from the CSV file to a directory laid out according to the scikit-learn.datasets `load_files` specification: one subfolder per category inside a container folder, one text file per sample. Dataset source: https://www.kaggle.com/hetulmehta/website-classification
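To make the target layout concrete, the sketch below walks a directory organised as container/category/file and recovers the texts with their labels, using only the standard library. It is a simplified stand-in for the idea behind `load_files`, not scikit-learn's implementation; the names `read_labelled_files` and `demo`, and the sample categories, are hypothetical.

```python
import tempfile
from pathlib import Path


def read_labelled_files(container):
    """Return (texts, labels) read from container/<category>/<file>.txt.

    Hypothetical helper: the name of each immediate subdirectory is used
    as the label of every .txt file inside it.
    """
    texts, labels = [], []
    for category_dir in sorted(Path(container).iterdir()):
        if not category_dir.is_dir():
            continue  # ignore stray files next to the category folders
        for txt_path in sorted(category_dir.glob("*.txt")):
            texts.append(txt_path.read_text(encoding="utf-8"))
            labels.append(category_dir.name)
    return texts, labels


def demo():
    """Build a tiny layout in a scratch directory and read it back."""
    samples = [
        ("Travel", "0", "cheap flights"),
        ("Travel", "1", "hotel deals"),
        ("Sports", "2", "match results"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "training_data"
        for category, name, body in samples:
            (root / category).mkdir(parents=True, exist_ok=True)
            (root / category / f"{name}.txt").write_text(body, encoding="utf-8")
        return read_labelled_files(root)
```

With scikit-learn installed, `sklearn.datasets.load_files("training_data")` is the call that reads a layout of this shape.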

write_data()


def write_data():
    """
    Writes the training data from the csv file to a directory based on the
    scikit-learn.datasets `load_files` specification.

    dataset source: https://www.kaggle.com/hetulmehta/website-classification

    e.g.
    container_folder/
        category_1_folder/
            file_1.txt file_2.txt file_3.txt ... file_42.txt
        category_2_folder/
            file_43.txt file_44.txt ...
    """

    with open('website_classification.csv') as csvfile:
        website_reader = csv.reader(csvfile, delimiter=',')
        for row in website_reader:
            [id, website, content, category] = row
            if category != 'category':
                category = category.replace('/', '+')
                dir_name = f"training_data/{category}"
                Path(dir_name).mkdir(parents=True, exist_ok=True)
                with open(f'{dir_name}/{id}.txt', mode='w+') as txtfile:
                    txtfile.write(content)
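The function hard-codes both the input file name and the output folder, and it detects the header row by comparing the category cell with the literal 'category'. A possible variant is sketched below under a few assumptions: the CSV always begins with a header row, every row has exactly four columns in the order shown above, and the text is UTF-8. The name `write_data_to` and its parameters are hypothetical and do not exist in TorBot.

```python
import csv
from pathlib import Path


def write_data_to(csv_path, out_dir="training_data"):
    """Hypothetical variant of write_data that takes explicit paths.

    Returns the number of text files written.
    """
    out_root = Path(out_dir)
    written = 0
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(csv_path, newline="", encoding="utf-8") as csvfile:
        reader = csv.reader(csvfile)
        next(reader, None)  # assumption: the first row is always a header
        for row_id, _website, content, category in reader:
            # '/' would be treated as a path separator, so it is replaced
            # with '+' exactly as the original function does.
            category_dir = out_root / category.replace("/", "+")
            category_dir.mkdir(parents=True, exist_ok=True)
            (category_dir / f"{row_id}.txt").write_text(content, encoding="utf-8")
            written += 1
    return written
```

Renaming `id` to `row_id` avoids shadowing the built-in `id`, and returning a count gives callers a cheap sanity check against the number of rows in the CSV.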

Callers 2

gather_data.py · File · 0.85
classify · Function · 0.85

Calls

no outgoing calls

Tested by

no test coverage detected
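Because write_data reads and writes relative to the current working directory, one way to cover it is to run it inside a scratch directory. The sketch below copies the function body from the listing above (docstring omitted) so that it runs on its own; the helper `run_in_temp_dir` and the sample row are hypothetical, and a real test would import write_data from the module instead.

```python
import csv
import os
import tempfile
from pathlib import Path


def write_data():
    # Body copied from the listing above, docstring omitted.
    with open('website_classification.csv') as csvfile:
        website_reader = csv.reader(csvfile, delimiter=',')
        for row in website_reader:
            [id, website, content, category] = row
            if category != 'category':
                category = category.replace('/', '+')
                dir_name = f"training_data/{category}"
                Path(dir_name).mkdir(parents=True, exist_ok=True)
                with open(f'{dir_name}/{id}.txt', mode='w+') as txtfile:
                    txtfile.write(content)


def run_in_temp_dir(rows):
    """Hypothetical test helper.

    Writes `rows` to website_classification.csv in a scratch directory,
    runs write_data there, and returns {relative_path: text} for every
    file produced under training_data/.
    """
    previous_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)
        try:
            with open("website_classification.csv", "w", newline="") as f:
                csv.writer(f).writerows(rows)
            write_data()
            return {
                p.relative_to("training_data").as_posix(): p.read_text()
                for p in Path("training_data").rglob("*.txt")
            }
        finally:
            # Restore the working directory before the scratch dir is removed.
            os.chdir(previous_cwd)
```

A test can then pass a header row plus a few sample rows and compare the returned mapping with the expected paths and contents.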