Formats an :class:`~anndata.AnnData` score as a tidy DataFrame.
(
adata: anndata.AnnData,
match_gene_strand: bool = True,
include_extended_metadata: bool = True,
)
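To make the "tidy" (long-format) shape concrete, here is a minimal sketch that does not call `tidy_anndata` itself. It assumes a small score matrix with variants as rows and tracks as columns, and flattens it to one record per variant-track pair. The helper `to_long_format`, the example data, and the record keys are hypothetical; plain Python lists and dicts stand in for the AnnData object and the pandas DataFrame so that the sketch runs without either library.

```python
from itertools import product


def to_long_format(scores, variant_ids, track_names):
  """Flattens a (variants x tracks) score matrix into one record per pair."""
  if len(scores) != len(variant_ids):
    raise ValueError('Expected one row of scores per variant.')
  records = []
  for (i, variant_id), (j, track_name) in product(
      enumerate(variant_ids), enumerate(track_names)
  ):
    # Illustrative keys only; the real column names come from the library.
    records.append({
        'variant_id': variant_id,
        'track_name': track_name,
        'raw_score': scores[i][j],
    })
  return records


# Hypothetical example data: 2 variants scored against 3 tracks.
variant_ids = ['variant_a', 'variant_b']
track_names = ['track_1', 'track_2', 'track_3']
scores = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]

long_rows = to_long_format(scores, variant_ids, track_names)
for row in long_rows:
  print(row)
```

A 2 x 3 matrix becomes six records, one per variant-track pair. Gene-centric scoring would add a gene key to each record in the same way.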

def tidy_anndata(
    adata: anndata.AnnData,
    match_gene_strand: bool = True,
    include_extended_metadata: bool = True,
) -> pd.DataFrame:
  """Formats an :class:`~anndata.AnnData` score as a tidy DataFrame.

  This function converts the score output from an AnnData object into a
  long-format pandas DataFrame, where each row represents:

  - For non-gene-centric variant scorers: Score for a variant-track pair.
  - For non-gene-centric interval scorers: Score for an interval-track pair.
  - For gene-centric variant/interval scoring: Score for a
    variant/interval-gene-track combination.

  Args:
    adata: An AnnData object containing scores.
    match_gene_strand: If True (and using gene-centric scoring), rows with
      mismatched gene and track strands are removed.
    include_extended_metadata: If True, includes additional columns derived from
      metadata specific to the output type, such as biosample name and type,
      GTEx tissue, transcription factor, and histone mark, if available. If
      False, only includes the minimal metadata columns required to uniquely
      identify a track within a given output type: track_name and track_strand.

  Returns:
    A pandas DataFrame with one score per row. The DataFrame includes
    columns for variant ID (if applicable), scored interval, gene information
    (if applicable), output type, variant/interval scorer, track name,
    ontology term, assay type, track strand, and raw score. Additional metadata
    such as biosample name and type and GTEx tissue are also returned (where
    available). See :func:`full_path_to.tidy_scores` for more details on the
    returned columns.

  Raises:
    ValueError: If the input is not an AnnData object.
  """
  if not isinstance(adata, anndata.AnnData):
    raise ValueError('Invalid input type. Must be an AnnData object.')

  # Columns to include from the gene metadata. If the column did not
  # exist in the original metadata (or if the scores do not have a concept
  # of a gene), a column of all Nones is added.
  gene_columns = [
      'gene_id',
      'gene_name',
      'gene_type',
      'gene_strand',
      'junction_Start',
      'junction_End',
  ]
  if math.prod(adata.X.shape) == 0:
    # Scores are empty, so we return an empty dataframe.
    return pd.DataFrame()
  elif 'gene_id' in adata.obs and 'strand' in adata.obs:
    # Scores are for a gene-based scorer.
    obs = adata.obs.rename({'strand': 'gene_strand'}, axis=1)
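The two checks at the end of the excerpt can be tried on plain Python values. The sketch below is only an illustration: `classify_scores` and `rename_strand` are hypothetical helpers, a shape tuple stands in for `adata.X.shape`, and a list of column names stands in for `adata.obs` (for a pandas DataFrame, the `in` operator tests column labels). The branch that follows the gene-based case is not part of the excerpt, so it is only labelled here.

```python
import math


def classify_scores(shape, obs_columns):
  """Mirrors the two checks from the excerpt on plain Python values."""
  if math.prod(shape) == 0:
    # Any zero-length axis makes the product zero, so (0, 5) and (3, 0)
    # are both treated as empty.
    return 'empty'
  elif 'gene_id' in obs_columns and 'strand' in obs_columns:
    # Both columns must be present for the gene-based path.
    return 'gene-based'
  else:
    # The excerpt ends before this branch, so it is only labelled.
    return 'not shown in excerpt'


def rename_strand(obs_columns):
  """Renames 'strand' to 'gene_strand', leaving other names unchanged."""
  return ['gene_strand' if name == 'strand' else name for name in obs_columns]


print(classify_scores((0, 5), ['gene_id', 'strand']))
print(classify_scores((3, 5), ['gene_id', 'strand']))
print(classify_scores((3, 5), ['gene_id']))
print(rename_strand(['gene_id', 'strand', 'gene_name']))
```

Using the product of the shape covers both "no rows" and "no columns" in a single test, and the empty check runs before the column checks so an empty input never reaches the gene-based path.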