hub / github.com/google-deepmind/alphagenome / tidy_anndata

Function tidy_anndata

src/alphagenome/models/variant_scorers.py:653–778

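Formats an :class:`~anndata.AnnData` score as a tidy DataFrame. This function converts the score output from an AnnData object into a long-format pandas DataFrame, where each row represents:

- For non-gene-centric variant scorers: Score for a variant-track pair.
- For non-gene-centric interval scorers: Score for an interval-track pair.
- For gene-centric variant/interval scoring: Score for a variant/interval-gene-track combination.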
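To make "long-format" concrete, here is a minimal sketch of turning a wide variant-by-track score matrix into one row per variant-track pair. It is not the library's implementation: it uses plain pandas (`DataFrame.melt`) rather than AnnData, and the variant IDs, track names, and score values are hypothetical.

```python
import pandas as pd

# Hypothetical wide score matrix: one row per variant, one column per track.
wide = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    index=pd.Index(['variant_1', 'variant_2'], name='variant_id'),
    columns=['track_a', 'track_b'],
)

# Long (tidy) format: one row per variant-track pair, one score per row.
tidy = wide.reset_index().melt(
    id_vars='variant_id',
    var_name='track_name',
    value_name='raw_score',
)
```

Each of the 2 x 2 matrix entries becomes its own row, so metadata about the variant and the track can sit in ordinary columns next to the score.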

(
    adata: anndata.AnnData,
    match_gene_strand: bool = True,
    include_extended_metadata: bool = True,
)
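The docstring says that `match_gene_strand=True` removes rows with mismatched gene and track strands when gene-centric scoring is used. The sketch below illustrates only that stated rule on a hypothetical tidy table. The helper `drop_strand_mismatches` and the sample data are assumptions for illustration, not the library's code. How the library treats unstranded tracks is not shown in this excerpt, so the sample uses only `+` and `-` strands.

```python
import pandas as pd

# Hypothetical gene-centric scores; names and values are made up.
scores = pd.DataFrame({
    'gene_name': ['GENE_A', 'GENE_A', 'GENE_B', 'GENE_B'],
    'gene_strand': ['+', '+', '-', '-'],
    'track_name': ['track_plus', 'track_minus', 'track_plus', 'track_minus'],
    'track_strand': ['+', '-', '+', '-'],
    'raw_score': [0.12, -0.03, 0.40, 0.07],
})


def drop_strand_mismatches(df: pd.DataFrame) -> pd.DataFrame:
  """Keeps only rows whose gene strand equals the track strand."""
  keep = df['gene_strand'] == df['track_strand']
  return df[keep].reset_index(drop=True)


matched = drop_strand_mismatches(scores)
```

In this sample, two of the four rows survive: the `+` gene paired with the `+` track, and the `-` gene paired with the `-` track.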

Source from the content-addressed store, hash-verified

def tidy_anndata(
    adata: anndata.AnnData,
    match_gene_strand: bool = True,
    include_extended_metadata: bool = True,
) -> pd.DataFrame:
  """Formats an :class:`~anndata.AnnData` score as a tidy DataFrame.

  This function converts the score output from an AnnData object into a
  long-format pandas DataFrame, where each row represents:

  - For non-gene-centric variant scorers: Score for a variant-track pair.
  - For non-gene-centric interval scorers: Score for an interval-track pair.
  - For gene-centric variant/interval scoring: Score for a
    variant/interval-gene-track combination.

  Args:
    adata: An AnnData object containing scores.
    match_gene_strand: If True (and using gene-centric scoring), rows with
      mismatched gene and track strands are removed.
    include_extended_metadata: If True, includes additional columns derived from
      metadata specific to the output type, such as biosample name and type,
      gtex tissue, transcription factor, and histone mark, if available. If
      False, only includes minimal metadata columns required to uniquely
      identify a track within a given output type: track_name and track_strand.

  Returns:
    A pandas DataFrame with one score per row. The DataFrame includes
    columns for variant ID (if applicable), scored interval, gene information
    (if applicable), output type, variant/interval scorer, track name,
    ontology term, assay type, track strand, and raw score. Additional metadata
    such as biosample name and type, gtex tissue are also returned (where
    available). See :func:`full_path_to.tidy_scores` for more details on the
    returned columns.

  Raises:
    ValueError: If the input is not an AnnData object.
  """
  if not isinstance(adata, anndata.AnnData):
    raise ValueError('Invalid input type. Must be an AnnData object.')

  # Columns to include from the gene metadata. If the column did not
  # exist in the original metadata (or if the scores do not have a concept
  # of a gene), a column of all Nones is added.
  gene_columns = [
      'gene_id',
      'gene_name',
      'gene_type',
      'gene_strand',
      'junction_Start',
      'junction_End',
  ]
  if math.prod(adata.X.shape) == 0:
    # Scores are empty, so we return an empty dataframe.
    return pd.DataFrame()
  elif 'gene_id' in adata.obs and 'strand' in adata.obs:
    # Scores are for a gene-based scorer.
    obs = adata.obs.rename({'strand': 'gene_strand'}, axis=1)
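The emptiness check in the source uses `math.prod(adata.X.shape) == 0`. The product of the dimensions is the total element count, so it is zero whenever any axis has length zero. A small standard-library sketch of that idea follows; the helper name `is_empty` is hypothetical and does not appear in the library.

```python
import math
from typing import Tuple


def is_empty(shape: Tuple[int, ...]) -> bool:
  """Returns True when an array of this shape holds no elements."""
  # The product of all dimensions equals the number of elements.
  return math.prod(shape) == 0


no_rows = is_empty((0, 5))     # Zero observations.
no_columns = is_empty((3, 0))  # Zero tracks.
has_scores = is_empty((3, 5))  # 15 elements, so not empty.
```

Checking the product covers both "no rows" and "no columns" with a single comparison, rather than testing each axis separately.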

Callers 1

tidy_scores · Function · 0.85

Calls 2

split · Method · 0.80

get · Method · 0.45

Tested by

no test coverage detected