def gen_node_data(
    rank, world_size, num_parts, id_lookup, ntid_ntype_map, schema_map
):
    """
    For this data processing pipeline, reading node files is not needed. All the needed information about
    the nodes can be found in the metadata JSON file. This function generates the nodes owned by a given
    process, using METIS partitions.

    Parameters:
    -----------
    rank : int
        rank of the process
    world_size : int
        total number of processes
    num_parts : int
        total number of partitions
    id_lookup : instance of class DistLookupService
        Distributed lookup service used to map global-nids to respective partition-ids and
        shuffle-global-nids
    ntid_ntype_map : dict
        a dictionary where keys are node-type ids (integers) and values are node-type names (strings).
    schema_map : dict
        dictionary formed by reading the input metadata JSON file for the input dataset.

    Note that, for the input graph files, the nodes of a particular node type are assumed to be
    split into `p` files (because `p` partitions are to be generated). On a similar note, edges of a particular
    edge type are split into `p` files as well.

    # assuming m node types are present in the input graph
    "num_nodes_per_chunk" : [
        [a0, a1, a2, ... a<p-1>],
        [b0, b1, b2, ... b<p-1>],
        ...
        [m0, m1, m2, ... m<p-1>]
    ]
    Here, each sub-list corresponds to a node type in the input graph and has `p` elements, for instance
    [a0, a1, ... a<p-1>], where each element is the number of nodes to be processed by one process during
    distributed partitioning.

    In addition to the above key-value pair for the nodes in the graph, the node features are captured in the
    "node_data" key-value pair. In this dictionary the keys are node-type names and each value is a dictionary
    that captures all the features present for that particular node type. This is shown in the following example:

    "node_data" : {
        "paper": {       # node type
            "feat": {    # feature key
                "format": {"name": "numpy"},
                "data": ["node_data/paper-feat-part1.npy", "node_data/paper-feat-part2.npy"]
            },
            "label": {   # feature key
                "format": {"name": "numpy"},
                "data": ["node_data/paper-label-part1.npy", "node_data/paper-label-part2.npy"]
            },
            "year": {    # feature key
                "format": {"name": "numpy"},
                "data": ["node_data/paper-year-part1.npy", "node_data/paper-year-part2.npy"]
            }
        }
    }
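The "num_nodes_per_chunk" layout described in the docstring can be illustrated with a small sketch. It is not the body of `gen_node_data`; the helper name `node_id_ranges_for_rank` is hypothetical. It assumes that the sub-list at index `ntid` belongs to node type `ntid`, that chunk `i` of every node type is handled by process `i`, and that type-local node IDs are assigned contiguously in chunk order. Under those assumptions, a prefix sum over each sub-list gives the half-open ID range a process would own.

```python
from itertools import accumulate


def node_id_ranges_for_rank(rank, ntid_ntype_map, schema_map):
    """Return {node_type_name: (start, end)} of type-local IDs for `rank`.

    Hypothetical helper for illustration only. Assumptions (not confirmed
    by the source): sub-list `ntid` of "num_nodes_per_chunk" belongs to node
    type `ntid`, chunk `i` is handled by process `i`, and IDs are contiguous
    in chunk order. Ranges are half-open: start inclusive, end exclusive.
    """
    ranges = {}
    for ntid, ntype in sorted(ntid_ntype_map.items()):
        counts = schema_map["num_nodes_per_chunk"][ntid]
        # Prefix sums: offsets[i] is the first type-local ID of chunk i.
        offsets = [0] + list(accumulate(counts))
        ranges[ntype] = (offsets[rank], offsets[rank + 1])
    return ranges


# Toy metadata: two node types, each split into p = 2 chunks.
schema_map = {"num_nodes_per_chunk": [[3, 2], [4, 4]]}
ntid_ntype_map = {0: "author", 1: "paper"}

rank0 = node_id_ranges_for_rank(0, ntid_ntype_map, schema_map)
rank1 = node_id_ranges_for_rank(1, ntid_ntype_map, schema_map)
```

With the toy metadata, process 1 would own "author" IDs 3 and 4 and "paper" IDs 4 through 7. Half-open ranges keep adjacent processes from overlapping and make the chunk size simply `end - start`.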