
Function from_pandas

dask/dataframe/dask_expr/_collection.py:4870–4966

Construct a Dask DataFrame from a Pandas DataFrame. This splits an in-memory Pandas dataframe into several parts and constructs a dask.dataframe from those parts, which Dask can then operate on in parallel. By default, the input dataframe is sorted by the index to produce cleanly divided partitions (with known divisions).

(data, npartitions=None, sort=True, chunksize=None)


def from_pandas(data, npartitions=None, sort=True, chunksize=None):
    """
    Construct a Dask DataFrame from a Pandas DataFrame

    This splits an in-memory Pandas dataframe into several parts and constructs
    a dask.dataframe from those parts on which Dask.dataframe can operate in
    parallel.  By default, the input dataframe will be sorted by the index to
    produce cleanly-divided partitions (with known divisions).  To preserve the
    input ordering, make sure the input index is monotonically-increasing. The
    ``sort=False`` option will also avoid reordering, but will not result in
    known divisions.

    Parameters
    ----------
    data : pandas.DataFrame or pandas.Series
        The DataFrame/Series with which to construct a Dask DataFrame/Series
    npartitions : int, optional, default 1
        The number of partitions of the index to create. Note that if there
        are duplicate values or insufficient elements in ``data.index``, the
        output may have fewer partitions than requested.
    chunksize : int, optional
        The desired number of rows per index partition to use. Note that
        depending on the size and index of the dataframe, actual partition
        sizes may vary.
    sort: bool, default True
        Sort the input by index first to obtain cleanly divided partitions
        (with known divisions).  If False, the input will not be sorted, and
        all divisions will be set to None. Default is True.

    Returns
    -------
    dask.DataFrame or dask.Series
        A dask DataFrame/Series partitioned along the index

    Examples
    --------
    >>> from dask.dataframe import from_pandas
    >>> df = pd.DataFrame(dict(a=list('aabbcc'), b=list(range(6))),
    ...                   index=pd.date_range(start='20100101', periods=6))
    >>> ddf = from_pandas(df, npartitions=3)
    >>> ddf.divisions  # doctest: +NORMALIZE_WHITESPACE
    (Timestamp('2010-01-01 00:00:00'),
     Timestamp('2010-01-03 00:00:00'),
     Timestamp('2010-01-05 00:00:00'),
     Timestamp('2010-01-06 00:00:00'))
    >>> ddf = from_pandas(df.a, npartitions=3)  # Works with Series too!
    >>> ddf.divisions  # doctest: +NORMALIZE_WHITESPACE
    (Timestamp('2010-01-01 00:00:00'),
     Timestamp('2010-01-03 00:00:00'),
     Timestamp('2010-01-05 00:00:00'),
     Timestamp('2010-01-06 00:00:00'))

    Raises
    ------
    TypeError
        If something other than a ``pandas.DataFrame`` or ``pandas.Series`` is
        passed in.

Callers 15

df · Function · 0.90
test_series_resample · Function · 0.90
test_resample_has_correct_fill_value · Function · 0.90
test_resample_divisions_propagation · Function · 0.90
_maybe_from_pandas · Function · 0.90
test_to_parquet · Function · 0.90
test_partition_pruning · Function · 0.90
test_aggregate_rg_stats_to_file · Function · 0.90
test_tune_optimization_disabled · Function · 0.90
test_from_pandas · Function · 0.90
test_from_pandas_noargs · Function · 0.90
test_from_pandas_empty · Function · 0.90

Calls 9

has_parallel_type · Function · 0.90
_is_any_real_numeric_dtype · Function · 0.90
new_collection · Function · 0.90
FromPandas · Class · 0.90
_BackendData · Class · 0.90
pyarrow_strings_enabled · Function · 0.90
any · Method · 0.45
isna · Method · 0.45
copy · Method · 0.45

Tested by 15

df · Function · 0.72
test_series_resample · Function · 0.72
test_resample_has_correct_fill_value · Function · 0.72
test_resample_divisions_propagation · Function · 0.72
test_to_parquet · Function · 0.72
test_partition_pruning · Function · 0.72
test_aggregate_rg_stats_to_file · Function · 0.72
test_tune_optimization_disabled · Function · 0.72
test_from_pandas · Function · 0.72
test_from_pandas_noargs · Function · 0.72
test_from_pandas_empty · Function · 0.72
test_from_pandas_immutable · Function · 0.72
