Function from_pandas

dask/dataframe/dask_expr/_collection.py:4870–4966

Construct a Dask DataFrame from a Pandas DataFrame. This splits an in-memory Pandas dataframe into several parts and constructs a dask.dataframe from those parts on which Dask.dataframe can operate in parallel. By default, the input dataframe will be sorted by the index to produce cleanly-divided partitions (with known divisions).

(data, npartitions=None, sort=True, chunksize=None)

Source from the content-addressed store, hash-verified

def from_pandas(data, npartitions=None, sort=True, chunksize=None):
    """
    Construct a Dask DataFrame from a Pandas DataFrame

    This splits an in-memory Pandas dataframe into several parts and constructs
    a dask.dataframe from those parts on which Dask.dataframe can operate in
    parallel. By default, the input dataframe will be sorted by the index to
    produce cleanly-divided partitions (with known divisions). To preserve the
    input ordering, make sure the input index is monotonically-increasing. The
    ``sort=False`` option will also avoid reordering, but will not result in
    known divisions.

    Parameters
    ----------
    data : pandas.DataFrame or pandas.Series
        The DataFrame/Series with which to construct a Dask DataFrame/Series
    npartitions : int, optional, default 1
        The number of partitions of the index to create. Note that if there
        are duplicate values or insufficient elements in ``data.index``, the
        output may have fewer partitions than requested.
    chunksize : int, optional
        The desired number of rows per index partition to use. Note that
        depending on the size and index of the dataframe, actual partition
        sizes may vary.
    sort: bool, default True
        Sort the input by index first to obtain cleanly divided partitions
        (with known divisions). If False, the input will not be sorted, and
        all divisions will be set to None. Default is True.

    Returns
    -------
    dask.DataFrame or dask.Series
        A dask DataFrame/Series partitioned along the index

    Examples
    --------
    >>> from dask.dataframe import from_pandas
    >>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
    ...                   index=pd.date_range(start='20100101', periods=6))
    >>> ddf = from_pandas(df, npartitions=3)
    >>> ddf.divisions  # doctest: +NORMALIZE_WHITESPACE
    (Timestamp('2010-01-01 00:00:00'),
     Timestamp('2010-01-03 00:00:00'),
     Timestamp('2010-01-05 00:00:00'),
     Timestamp('2010-01-06 00:00:00'))
    >>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
    >>> ddf.divisions  # doctest: +NORMALIZE_WHITESPACE
    (Timestamp('2010-01-01 00:00:00'),
     Timestamp('2010-01-03 00:00:00'),
     Timestamp('2010-01-05 00:00:00'),
     Timestamp('2010-01-06 00:00:00'))

    Raises
    ------
    TypeError
        If something other than a ``pandas.DataFrame`` or ``pandas.Series`` is
        passed in.
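The docstring's notes on divisions and on getting fewer partitions than requested can be illustrated with a small pure-Python sketch. This is a simplified model and not Dask's actual partitioning code: the helper name `sketch_divisions` and its chunking rule (equal-sized chunks over the sorted index, with repeated boundaries merged) are assumptions made for illustration only.

```python
import math


def sketch_divisions(index_values, npartitions):
    """Hypothetical helper: approximate division boundaries for a sorted index.

    Returns a tuple of boundaries; the number of partitions it implies is
    len(result) - 1. This is an illustration, not Dask's implementation.
    """
    ordered = sorted(index_values)
    if not ordered:
        return ()
    # Aim for equal-sized chunks of the sorted index.
    chunk = math.ceil(len(ordered) / npartitions)
    # The first index value of each chunk is a candidate boundary.
    starts = [ordered[i] for i in range(0, len(ordered), chunk)]
    # Duplicate index values can yield repeated boundaries; merge them,
    # which is why fewer partitions than requested may come out.
    unique_starts = list(dict.fromkeys(starts))
    # The final boundary is the last index value (inclusive upper bound).
    return tuple(unique_starts) + (ordered[-1],)


print(sketch_divisions([1, 2, 3, 4, 5, 6], 3))  # (1, 3, 5, 6)
print(sketch_divisions([1, 1, 1, 1, 2, 2], 3))  # (1, 2, 2)
```

The first call mirrors the boundary pattern in the docstring example (six index values, three partitions, four boundaries). The second shows duplicate index values collapsing three requested partitions into two.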

Callers 15

df · Function · 0.90
test_series_resample · Function · 0.90
_maybe_from_pandas · Function · 0.90
test_to_parquet · Function · 0.90
test_partition_pruning · Function · 0.90
test_from_pandas · Function · 0.90
test_from_pandas_noargs · Function · 0.90
test_from_pandas_empty · Function · 0.90

Calls 9

has_parallel_type · Function · 0.90
new_collection · Function · 0.90
FromPandas · Class · 0.90
_BackendData · Class · 0.90
pyarrow_strings_enabled · Function · 0.90
any · Method · 0.45
isna · Method · 0.45
copy · Method · 0.45

Tested by 15

df · Function · 0.72
test_series_resample · Function · 0.72
test_to_parquet · Function · 0.72
test_partition_pruning · Function · 0.72
test_from_pandas · Function · 0.72
test_from_pandas_noargs · Function · 0.72
test_from_pandas_empty · Function · 0.72