| 842 | |
| 843 | |
| 844 | def load_imdb_dataset( |
| 845 | path='data', nb_words=None, skip_top=0, maxlen=None, test_split=0.2, seed=113, start_char=1, oov_char=2, |
| 846 | index_from=3 |
| 847 | ): |
| 848 | """Load IMDB dataset. |
| 849 | |
| 850 | Parameters |
| 851 | ---------- |
| 852 | path : str |
| 853 | The path that the data is downloaded to; defaults to ``data/imdb/``. |
| 854 | nb_words : int |
| 855 | Maximum number of words to keep, ranked by frequency. |
| 856 | skip_top : int |
| 857 | Number of most frequent words to ignore (they will appear as the oov_char value in the sequence data). |
| 858 | maxlen : int |
| 859 | Maximum sequence length. Any longer sequence will be truncated. |
| 860 | seed : int |
| 861 | Seed for reproducible data shuffling. |
| 862 | start_char : int |
| 863 | The start of a sequence will be marked with this character. Set to 1 because 0 is usually the padding character. |
| 864 | oov_char : int |
| 865 | Words that were cut out because of the nb_words or skip_top limit will be replaced with this character. |
| 866 | index_from : int |
| 867 | Index actual words with this index and higher. |
| 868 | |
| 869 | Examples |
| 870 | -------- |
| 871 | >>> X_train, y_train, X_test, y_test = tl.files.load_imdb_dataset( |
| 872 | ... nb_words=20000, test_split=0.2) |
| 873 | >>> print('X_train.shape', X_train.shape) |
| 874 | (20000,) [[1, 62, 74, ... 1033, 507, 27],[1, 60, 33, ... 13, 1053, 7]..] |
| 875 | >>> print('y_train.shape', y_train.shape) |
| 876 | (20000,) [1 0 0 ..., 1 0 1] |
| 877 | |
| 878 | References |
| 879 | ---------- |
| 880 | - `Modified from keras. <https://github.com/fchollet/keras/blob/master/keras/datasets/imdb.py>`__ |
| 881 | |
| 882 | """ |
| 883 | path = os.path.join(path, 'imdb') |
| 884 | |
| 885 | filename = "imdb.pkl" |
| 886 | url = 'https://s3.amazonaws.com/text-datasets/' |
| 887 | maybe_download_and_extract(filename, path, url) |
| 888 | |
| 889 | if filename.endswith(".gz"): |
| 890 | f = gzip.open(os.path.join(path, filename), 'rb') |
| 891 | else: |
| 892 | f = open(os.path.join(path, filename), 'rb') |
| 893 | |
| 894 | X, labels = cPickle.load(f) |
| 895 | f.close() |
| 896 | |
| 897 | np.random.seed(seed) |
| 898 | np.random.shuffle(X) |
| 899 | np.random.seed(seed) |
| 900 | np.random.shuffle(labels) |
| 901 |
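
The last four lines of the listing reseed NumPy's global generator before each shuffle, so `X` and `labels` receive the same permutation and stay aligned. The sketch below reproduces that idea on toy data and compares it with an alternative that draws one explicit permutation and applies it to both sequences. The toy data and the helper name `shuffle_in_unison` are illustrative assumptions, not part of the original function.

```python
import numpy as np


def shuffle_in_unison(X, labels, seed=113):
    """Return shuffled copies of X and labels that share one permutation.

    Hypothetical helper, shown only to illustrate the technique.
    """
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(X))
    return [X[i] for i in perm], [labels[i] for i in perm]


# Toy data: each "review" stores its own label as its first element,
# which makes alignment easy to verify after shuffling.
X = [[i, i + 100] for i in range(10)]
labels = list(range(10))

# Approach used in the listing: reseed before each in-place shuffle.
# This only keeps the pair aligned because both sequences have equal length.
X_a, labels_a = list(X), list(labels)
np.random.seed(113)
np.random.shuffle(X_a)
np.random.seed(113)
np.random.shuffle(labels_a)

# Alternative: one permutation applied to both sequences, no global state.
X_b, labels_b = shuffle_in_unison(X, labels)

aligned_a = all(x[0] == y for x, y in zip(X_a, labels_a))
aligned_b = all(x[0] == y for x, y in zip(X_b, labels_b))
print(aligned_a, aligned_b)
```

The explicit-permutation variant avoids mutating the global random state, which can matter when other code in the same process also relies on `np.random`.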
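
The docstring describes how `start_char`, `oov_char`, `index_from`, `nb_words` and `skip_top` shape each sequence, but the part of the function that applies them is not shown in the listing. The following is a minimal sketch of one plausible reading of those descriptions, not the library's implementation. The helper `encode_sequence` is hypothetical, and comparing the limits against the shifted indices is an assumption.

```python
def encode_sequence(word_ranks, nb_words=None, skip_top=0,
                    start_char=1, oov_char=2, index_from=3):
    """Hypothetical helper: apply the documented index parameters to one review.

    word_ranks is a list of integer word indices (lower means more frequent).
    """
    # Shift every index so that actual words start at index_from,
    # leaving the low indices free for padding, start and OOV markers.
    shifted = [w + index_from for w in word_ranks]
    if nb_words is None:
        # Assumption: no limit means every word is kept.
        nb_words = max(shifted, default=0) + 1
    # Replace indices outside [skip_top, nb_words) with the OOV marker.
    kept = [w if skip_top <= w < nb_words else oov_char for w in shifted]
    # Mark the start of the sequence.
    return [start_char] + kept


example = [1, 5, 40, 2]
print(encode_sequence(example))                          # [1, 4, 8, 43, 5]
print(encode_sequence(example, nb_words=20))             # [1, 4, 8, 2, 5]
print(encode_sequence(example, nb_words=20, skip_top=5)) # [1, 2, 8, 2, 5]
```

Reserving 0 for padding, 1 for the start marker and 2 for out-of-vocabulary words is why the default `index_from` is 3: real words never collide with the special values.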