Generates a geographical data nullity heatmap, which shows the distribution of missing data across geographic regions. The precise output depends on the inputs provided.
(df, x=None, y=None, coordinates=None, by=None, geometry=None, cutoff=None, histogram=False,
figsize=(25, 10), fontsize=8, inline=True)
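How the output depends on the inputs can be summarized in a short sketch. The helpers `select_mode` and `default_cutoff` are hypothetical names, not part of the library, and the column name `BOROUGH` is only an example. They restate the rules given in the function's docstring: `geometry` is ignored unless `by` is given, and `cutoff` falls back to the smaller of 50 and 5% of the dataset size. Whether the library rounds that value is not stated, so the sketch leaves it as is.

```python
def select_mode(by=None, geometry=None):
    """Return which kind of heatmap the documented rules imply (illustrative only)."""
    if by is None:
        # No geographical context: geometry is ignored and quadtree squares are used.
        return "quadtree"
    if geometry is None:
        # Grouping column only: convex hulls are computed for each point group.
        return "convex_hull"
    # Grouping column plus user-supplied geometries.
    return "geometry"


def default_cutoff(n_rows, cutoff=None):
    """Documented fallback: the smaller of 50 and 5% of the dataset size."""
    if cutoff is not None:
        return cutoff
    return min(50, 0.05 * n_rows)


print(select_mode())
print(select_mode(by="BOROUGH"))
print(select_mode(by="BOROUGH", geometry={"BRONX": object()}))
print(default_cutoff(200))
print(default_cutoff(100000))
```

Note that passing `geometry` without `by` yields the quadtree mode, matching the statement that `geometry` is ignored when `by` is not specified.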
| 581 | |
| 582 | |
| 583 | def geoplot(df, x=None, y=None, coordinates=None, by=None, geometry=None, cutoff=None, histogram=False, |
| 584 | figsize=(25, 10), fontsize=8, inline=True): |
| 585 | """ |
| 586 | Generates a geographical data nullity heatmap, which shows the distribution of missing data across geographic |
| 587 | regions. The precise output depends on the inputs provided. In increasing order of usefulness: |
| 588 | |
| 589 | * If no geographical context is provided, a quadtree is computed and nullities are rendered as abstract |
| 590 | geographical squares. |
| 591 | * If geographical context is provided in the form of a column of geographies (region, borough, ZIP code, |
| 592 | etc.) in the `DataFrame`, convex hulls are computed for each of the point groups and the heatmap is generated |
| 593 | within them. |
| 594 | * If geographical context is provided *and* a separate geometry is provided, a heatmap is generated for each |
| 595 | point group within this geography instead. |
| 596 | |
| 597 | :param df: The DataFrame whose completeness is being mapped. |
| 598 | :param x: The x variable: probably a coordinate (longitude), possibly some other floating point value. May be a |
| 599 | string (pointing to a column of df) or an iterable. |
| 600 | :param y: The y variable: probably a coordinate (latitude), possibly some other floating point value. May be a |
| 601 | string (pointing to a column of df) or an iterable. |
| 602 | :param coordinates: A coordinate tuple iterable, or column thereof in the given DataFrame. Either x and y |
| 603 | together, or coordinates alone, must be specified, but not both. |
| 604 | :param by: If you would like to aggregate your geometry by some geospatial attribute of the underlying DataFrame, |
| 605 | name that column here. |
| 606 | :param geometry: If you would like to provide your own geometries for your aggregation, instead of relying on |
| 607 | (functional, but not pretty) convex hulls, provide them here. This parameter is expected to be a dict or Series |
| 608 | of `shapely.Polygon` or `shapely.MultiPolygon` objects. It's ignored if `by` is not specified. |
| 609 | :param cutoff: If no aggregation is specified, this parameter sets the minimum number of observations to include in |
| 610 | each square. If not provided, set to 50 or 5% of the total size of the dataset, whichever is smaller. If `by` is |
| 611 | specified this parameter is ignored. |
| 612 | :param figsize: The size of the figure to display. This is a `matplotlib` parameter which defaults to (25, 10). |
| 613 | :param histogram: Whether or not to plot a histogram of data distributions below the map. Defaults to False. |
| 614 | :param fontsize: If `histogram` is True, this parameter specifies the size of the tick labels. Ignored if |
| 615 | `histogram` is False. Defaults to 8. |
| 616 | :param inline: Whether or not to display the figure inline. If False, the figure is not plotted and this |
| 617 | method returns it instead. |
| 618 | :return: If `inline` is False, the underlying `matplotlib.figure.Figure` object. Else, nothing. |
| 619 | """ |
| 620 | import shapely.geometry |
| 621 | import descartes |
| 622 | import matplotlib.cm |
| 623 | # We produce a coordinate column in-place in a function-local copy of the `DataFrame`. |
| 624 | # This seems specious, and sort of is, but is necessary because the internal `pandas` aggregation methods |
| 625 | # (`pd.core.groupby.DataFrameGroupBy.count` specifically) are optimized to run two orders of magnitude faster than |
| 626 | # user-defined external `groupby` operations. For example: |
| 627 | # >>> %time df.head(100000).groupby(lambda ind: df.iloc[ind]['LOCATION']).count() |
| 628 | # Wall time: 12.7 s |
| 629 | # >>> %time df.head(100000).groupby('LOCATION').count() |
| 630 | # Wall time: 96 ms |
| 631 | x_col = '__x' |
| 632 | y_col = '__y' |
| 633 | if x is not None and y is not None: |
| 634 | if isinstance(x, str) and isinstance(y, str): |
| 635 | x_col = x |
| 636 | y_col = y |
| 637 | else: |
| 638 | df['__x'] = x |
| 639 | df['__y'] = y |
| 640 | elif coordinates is not None: |
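When `x`, `y`, or `coordinates` may be array-like, comparing against `None` avoids the ambiguous truth-value error that NumPy arrays and pandas Series raise in a boolean context. Below is a minimal sketch of the input handling described in the docstring. It uses a plain dict of lists as a stand-in for the `DataFrame`; the `resolve_xy` helper, the column names, and the `(x, y)` ordering of coordinate tuples are assumptions for illustration, not the library's code.

```python
def resolve_xy(columns, x=None, y=None, coordinates=None):
    """Return (xs, ys) lists from either x and y, or coordinates, but not both.

    `columns` maps column names to lists (a stand-in for a DataFrame).
    """
    has_xy = x is not None and y is not None
    has_coords = coordinates is not None
    if has_xy == has_coords:
        raise ValueError("Specify either x and y, or coordinates, but not both.")
    if has_xy:
        # A string names a column; anything else is treated as an iterable of values.
        xs = columns[x] if isinstance(x, str) else x
        ys = columns[y] if isinstance(y, str) else y
        return list(xs), list(ys)
    # Coordinates: a column name or an iterable of (x, y) tuples (assumed order).
    pairs = columns[coordinates] if isinstance(coordinates, str) else coordinates
    pairs = list(pairs)
    return [p[0] for p in pairs], [p[1] for p in pairs]


data = {
    "LONGITUDE": [-73.9, -74.0],
    "LATITUDE": [40.7, 40.6],
    "LOCATION": [(-73.9, 40.7), (-74.0, 40.6)],
}
print(resolve_xy(data, x="LONGITUDE", y="LATITUDE"))
print(resolve_xy(data, coordinates="LOCATION"))
```

Both calls yield the same pair of lists, and supplying both forms at once (or neither) raises a `ValueError`.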
nothing calls this directly
no test coverage detected
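The quadtree fallback mentioned in the docstring (no `by` column, nullity rendered per square, `cutoff` as the minimum number of observations per square) can be illustrated with a small standalone sketch. This is not the library's implementation. The `quadtree_nullity` function is hypothetical, and the stopping rule (split only while every non-empty quadrant keeps at least `cutoff` points) is an assumption chosen to honor the documented minimum.

```python
def quadtree_nullity(points, is_null, cutoff):
    """Split the bounding box into squares and report nullity per square.

    points  : non-empty list of (x, y) tuples
    is_null : list of bools, True where the record's value is missing
    cutoff  : minimum number of observations allowed in a square
    Returns a list of ((xmin, ymin, xmax, ymax), nullity_fraction) leaves.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    leaves = []

    def split(idx, xmin, ymin, xmax, ymax):
        xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
        # Bucket the point indices by quadrant: (is_right, is_top).
        quadrants = {}
        for i in idx:
            key = (points[i][0] >= xmid, points[i][1] >= ymid)
            quadrants.setdefault(key, []).append(i)
        # Assumed rule: split only if the points spread over several quadrants
        # and each non-empty quadrant still holds at least `cutoff` points.
        if len(quadrants) > 1 and all(len(q) >= cutoff for q in quadrants.values()):
            for (right, top), q in quadrants.items():
                split(q,
                      xmid if right else xmin, ymid if top else ymin,
                      xmax if right else xmid, ymax if top else ymid)
        else:
            nullity = sum(is_null[i] for i in idx) / len(idx)
            leaves.append(((xmin, ymin, xmax, ymax), nullity))

    split(range(len(points)), min(xs), min(ys), max(xs), max(ys))
    return leaves


points = [(0, 0), (1, 1), (0, 1), (1, 0), (10, 10), (9, 9), (10, 9), (9, 10)]
is_null = [True, False, False, False, True, True, True, True]
for box, nullity in quadtree_nullity(points, is_null, cutoff=4):
    print(box, nullity)
```

With `cutoff=4` the two clusters end up in separate squares, each with its own nullity fraction. Raising `cutoff` to 5 prevents the split, so all eight points fall into a single square.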