| 34 | |
| 35 | |
| 36 | class Setup: |
| 37 | |
| 38 | """Implements the download of sample files from an AWS S3 bucket. |
| 39 | |
| 40 | ``Setup`` implements the download of sample files from an AWS S3 bucket. Currently, there are samples |
| 41 | from eight domains: |
| 42 | |
| 43 | - AgreementsLarge (~80 sample contracts) |
| 44 | - Agreements (~15 sample employment agreements) |
| 45 | - UN-Resolutions-500 (500 United Nations Resolutions over ~2 years) |
| 46 | - Invoices (~40 invoice sample documents) |
| 47 | - FinDocs (~15 financial annual reports, earnings and 10Ks) |
| 48 | - AWS-Transcribe (~5 AWS-transcribe JSON files) |
| 49 | - SmallLibrary (~10 mixed document types for quick testing) |
| 50 | - Images (~3 images for OCR processing) |
| 51 | |
| 52 | The sample files are updated continuously. By calling ``Setup().load_sample_files(over_write=True)`` |
| 53 | you will get the newest version of the sample files. |
| 54 | |
| 55 | The sample files were prepared by LLMWare from public domain materials or created specifically for this purpose. |
| 56 | If you have any concerns about Personally Identifiable Information (PII), or the suitability of any material |
| 57 | we included, please contact us, for example by raising an issue on GitHub or by sending an email. |
| 58 | We reserve the right to withdraw documents at any time. |
| 59 | |
| 60 | Examples |
| 61 | ---------- |
| 62 | >>> import os |
| 63 | >>> from llmware.setup import Setup |
| 64 | >>> sample_files_path = Setup().load_sample_files() |
| 65 | >>> sample_files_path |
| 66 | '/home/user/llmware_data/sample_files' |
| 67 | >>> os.listdir(sample_files_path) |
| 68 | ['AWS-Transcribe', '.DS_Store', 'SmallLibrary', 'UN-Resolutions-500', 'Invoices', 'Images', 'AgreementsLarge', 'Agreements', 'FinDocs'] |
| 69 | |
| 70 | If you have called the function before and want the newest version of the sample files, |
| 71 | set ``over_write=True``. |
| 72 | >>> sample_files_path = Setup().load_sample_files(over_write=True) |
| 73 | """ |
| 74 | @staticmethod |
| 75 | def load_sample_files(over_write=False): |
| 76 | |
| 77 | """ Downloads sample document files from a non-restricted AWS S3 bucket. """ |
| 78 | |
| 79 | if not os.path.exists(LLMWareConfig.get_llmware_path()): |
| 80 | LLMWareConfig.setup_llmware_workspace() |
| 81 | |
| 82 | # not configurable - will pull into /sample_files under llmware_path |
| 83 | sample_files_path = os.path.join(LLMWareConfig.get_llmware_path(), "sample_files") |
| 84 | |
| 85 | if not os.path.exists(sample_files_path): |
| 86 | os.makedirs(sample_files_path, exist_ok=True) |
| 87 | else: |
| 88 | if not over_write: |
| 89 | logger.info(f"Setup - sample_files path already exists - {sample_files_path}") |
| 90 | return sample_files_path |
| 91 | |
| 92 | # pull from sample files bucket |
| 93 | logger.info("Setup - sample_files - downloading requested sample files from AWS S3 bucket - may take a minute.") |
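The path checks at the start of ``load_sample_files`` decide whether a download happens at all: the folder is created on first use, reused on later calls, and refreshed only when ``over_write=True``. The following is a minimal sketch of that decision in isolation. The helper name ``prepare_sample_dir`` and its ``(path, needs_download)`` return value are hypothetical and not part of llmware; a temporary directory stands in for the llmware workspace, and no download is performed.

```python
import os
import tempfile


def prepare_sample_dir(base_path, over_write=False):
    """Return (sample_files_path, needs_download) for a workspace at base_path.

    Hypothetical helper: it mirrors the path checks in the listing above
    but never contacts S3.
    """
    # Create the workspace itself if it is missing.
    os.makedirs(base_path, exist_ok=True)

    sample_files_path = os.path.join(base_path, "sample_files")

    if not os.path.exists(sample_files_path):
        # First call: create the folder and signal that a download is needed.
        os.makedirs(sample_files_path, exist_ok=True)
        return sample_files_path, True

    # Folder already exists: refresh only when the caller asks for it.
    return sample_files_path, over_write


# A temporary directory stands in for the llmware workspace (assumption).
with tempfile.TemporaryDirectory() as workspace:
    first = prepare_sample_dir(workspace)
    second = prepare_sample_dir(workspace)
    forced = prepare_sample_dir(workspace, over_write=True)
```

In this sketch the first call reports that a download is needed, the second call reuses the existing folder, and the third forces a refresh. One consequence of the check is that only the existence of the folder is tested, not its contents, so an empty or partially filled ``sample_files`` folder would also be treated as already present.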