hub / github.com/google-deepmind/alphagenome / tidy_anndata

Function tidy_anndata

src/alphagenome/models/variant_scorers.py:653–778

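Formats an :class:`~anndata.AnnData` score as a tidy DataFrame. This function converts the score output from an AnnData object into a long-format pandas DataFrame, where each row represents:

- For non-gene-centric variant scorers: Score for a variant-track pair.
- For non-gene-centric interval scorers: Score for an interval-track pair.
- For gene-centric variant/interval scoring: Score for a variant/interval-gene-track combination.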

(
    adata: anndata.AnnData,
    match_gene_strand: bool = True,
    include_extended_metadata: bool = True,
)
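The long-format layout described above can be illustrated without the library itself. The following is a minimal sketch, assuming a plain NumPy score matrix and two pandas DataFrames as stand-ins for an AnnData object's `X`, `obs`, and `var`. The helper name `to_long`, the sample variant IDs, the track names, and the `raw_score` column name are all hypothetical and are not taken from the alphagenome source.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the parts of an AnnData object:
# rows of `scores` are variants, columns are tracks.
scores = np.array([[0.1, -0.2, 0.3],
                   [0.0, 0.5, -0.4]])
variant_meta = pd.DataFrame({'variant_id': ['chr1:100:A>G', 'chr1:200:C>T']})
track_meta = pd.DataFrame({
    'track_name': ['t0', 't1', 't2'],
    'track_strand': ['+', '-', '.'],
})


def to_long(scores, row_meta, col_meta):
    """Returns one row per (row, column) pair together with its score."""
    n_rows, n_cols = scores.shape
    # Repeat each row's metadata once per column ...
    rows = row_meta.loc[row_meta.index.repeat(n_cols)].reset_index(drop=True)
    # ... and stack the column metadata once per row.
    cols = pd.concat([col_meta] * n_rows, ignore_index=True)
    long_df = pd.concat([rows, cols], axis=1)
    # Row-major flattening matches the repeat/stack order used above.
    long_df['raw_score'] = scores.reshape(-1)
    return long_df


tidy = to_long(scores, variant_meta, track_meta)
```

Here `tidy` has six rows (two variants times three tracks), one score per row, which is the shape the summary describes for a non-gene-centric variant scorer.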

Source from the content-addressed store, hash-verified

def tidy_anndata(
    adata: anndata.AnnData,
    match_gene_strand: bool = True,
    include_extended_metadata: bool = True,
) -> pd.DataFrame:
  """Formats an :class:`~anndata.AnnData` score as a tidy DataFrame.

  This function converts the score output from an AnnData object into a
  long-format pandas DataFrame, where each row represents:

  - For non-gene-centric variant scorers: Score for a variant-track pair.
  - For non-gene-centric interval scorers: Score for an interval-track pair.
  - For gene-centric variant/interval scoring: Score for a
    variant/interval-gene-track combination.

  Args:
    adata: An AnnData object containing scores.
    match_gene_strand: If True (and using gene-centric scoring), rows with
      mismatched gene and track strands are removed.
    include_extended_metadata: If True, includes additional columns derived from
      metadata specific to the output type, such as biosample name and type,
      gtex tissue, transcription factor, and histone mark, if available. If
      False, only includes minimal metadata columns required to uniquely
      identify a track within a given output type: track_name and track_strand.

  Returns:
    A pandas DataFrame with one score per row. The DataFrame includes
    columns for variant ID (if applicable), scored interval, gene information
    (if applicable), output type, variant/interval scorer, track name,
    ontology term, assay type, track strand, and raw score. Additional metadata
    such as biosample name and type, gtex tissue are also returned (where
    available). See :func:`full_path_to.tidy_scores` for more details on the
    returned columns.

  Raises:
    ValueError: If the input is not an AnnData object.
  """
  if not isinstance(adata, anndata.AnnData):
    raise ValueError('Invalid input type. Must be an AnnData object.')

  # Columns to include from the gene metadata. If the column did not
  # exist in the original metadata (or if the scores do not have a concept
  # of a gene), a column of all Nones is added.
  gene_columns = [
      'gene_id',
      'gene_name',
      'gene_type',
      'gene_strand',
      'junction_Start',
      'junction_End',
  ]
  if math.prod(adata.X.shape) == 0:
    # Scores are empty, so we return an empty dataframe.
    return pd.DataFrame()
  elif 'gene_id' in adata.obs and 'strand' in adata.obs:
    # Scores are for a gene-based scorer.
    obs = adata.obs.rename({'strand': 'gene_strand'}, axis=1)
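The listing stops partway through the function, so the rest of the body is not shown here. As a rough illustration of the visible branch logic and of the `match_gene_strand` behaviour described in the docstring, here is a minimal sketch using plain pandas objects instead of AnnData. The helper names `prepare_obs` and `filter_matching_strands` are hypothetical, the fall-through branch is a guess, and the decision to keep unstranded tracks (`'.'`) is an assumption that the real function may not share.

```python
import math

import numpy as np
import pandas as pd


def prepare_obs(scores: np.ndarray, obs: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the empty check and strand rename."""
    if math.prod(scores.shape) == 0:
        # Any zero-length axis makes the product 0, so there is nothing to tidy.
        return pd.DataFrame()
    if 'gene_id' in obs and 'strand' in obs:
        # Gene-centric scores: rename so the gene strand cannot be confused
        # with the track strand once both appear in one table.
        return obs.rename({'strand': 'gene_strand'}, axis=1)
    # Assumption: non-gene-centric metadata passes through unchanged.
    return obs


def filter_matching_strands(df: pd.DataFrame) -> pd.DataFrame:
    """Keeps rows whose gene and track strands agree.

    Assumption: unstranded tracks ('.') are always kept.
    """
    unstranded = df['track_strand'] == '.'
    same_strand = df['track_strand'] == df['gene_strand']
    return df[unstranded | same_strand].reset_index(drop=True)


obs = pd.DataFrame({'gene_id': ['g1'], 'strand': ['+']})
renamed = prepare_obs(np.ones((1, 2)), obs)

pairs = pd.DataFrame({
    'gene_strand': ['+', '+', '-'],
    'track_strand': ['+', '-', '.'],
})
matched = filter_matching_strands(pairs)
```

In this sketch the `'+'`/`'-'` pair is dropped while the matching pair and the unstranded track remain. Using `math.prod` over the shape catches an empty matrix on either axis with a single comparison.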

Callers 1

tidy_scores · Function · 0.85

Calls 2

split · Method · 0.80
get · Method · 0.45

Tested by

no test coverage detected