Construct a Dask DataFrame from a Pandas DataFrame.
This splits an in-memory Pandas dataframe into several parts and constructs a dask.dataframe from those parts on which Dask.dataframe can operate in parallel. By default, the input dataframe will be sorted by the index to prod…
(data, npartitions=None, sort=True, chunksize=None)
def from_pandas(data, npartitions=None, sort=True, chunksize=None):
    """
    Construct a Dask DataFrame from a Pandas DataFrame

    This splits an in-memory pandas DataFrame into several parts and constructs
    a dask.dataframe from those parts, on which dask.dataframe can operate in
    parallel. By default, the input dataframe will be sorted by the index to
    produce cleanly divided partitions (with known divisions). To preserve the
    input ordering, make sure the input index is monotonically increasing. The
    ``sort=False`` option will also avoid reordering, but will not result in
    known divisions.

    Parameters
    ----------
    data : pandas.DataFrame or pandas.Series
        The DataFrame/Series with which to construct a Dask DataFrame/Series
    npartitions : int, optional, default 1
        The number of partitions of the index to create. Note that if there
        are duplicate values or insufficient elements in ``data.index``, the
        output may have fewer partitions than requested.
    chunksize : int, optional
        The desired number of rows per index partition to use. Note that
        depending on the size and index of the dataframe, actual partition
        sizes may vary.
    sort : bool, default True
        Sort the input by index first to obtain cleanly divided partitions
        (with known divisions). If False, the input will not be sorted, and
        all divisions will be set to None.

    Returns
    -------
    dask.DataFrame or dask.Series
        A dask DataFrame/Series partitioned along the index

    Examples
    --------
    >>> import pandas as pd
    >>> from dask.dataframe import from_pandas
    >>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
    ...                   index=pd.date_range(start='20100101', periods=6))
    >>> ddf = from_pandas(df, npartitions=3)
    >>> ddf.divisions  # doctest: +NORMALIZE_WHITESPACE
    (Timestamp('2010-01-01 00:00:00'),
     Timestamp('2010-01-03 00:00:00'),
     Timestamp('2010-01-05 00:00:00'),
     Timestamp('2010-01-06 00:00:00'))
    >>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
    >>> ddf.divisions  # doctest: +NORMALIZE_WHITESPACE
    (Timestamp('2010-01-01 00:00:00'),
     Timestamp('2010-01-03 00:00:00'),
     Timestamp('2010-01-05 00:00:00'),
     Timestamp('2010-01-06 00:00:00'))

    Raises
    ------
    TypeError
        If something other than a ``pandas.DataFrame`` or ``pandas.Series`` is
        passed in.
