LaminDB is an open-source data framework for biology. You can find out about some of its features in the documentation of the lamindb Python package.
This vignette shows how to use the laminr package to interact with LaminDB.
Initial setup
As part of the first-time setup, you will need to install laminr and the Python lamin-cli package, and set up an instance for first use.
install.packages("remotes")
remotes::install_github("laminlabs/laminr")
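The two lines above install only the R package. The following is a hedged sketch of the remaining two steps, run from R for convenience. It assumes that pip is on the PATH of the shell R launches, and that lamin-cli provides a lamin connect command accepting the instance slug. Neither command appears in this vignette, so check the lamin-cli documentation before relying on them.

```r
# Assumption: `pip` resolves to the Python environment you want laminr to use.
system2("pip", c("install", "lamin-cli"))

# Assumption: `lamin connect <owner>/<name>` registers the instance locally.
system2("lamin", c("connect", "laminlabs/cellxgene"))
```

Running the same two commands directly in a terminal is equivalent; system2() only saves a context switch.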
Connect to a LaminDB instance
This vignette uses the laminlabs/cellxgene instance, a LaminDB instance that interfaces with the CELLxGENE data.
You can connect to the instance using the connect R function.
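The call itself is not reproduced in this text. The following is a minimal sketch, assuming that connect() takes the instance slug as its first argument and that the result is the db object used in the rest of the vignette:

```r
library(laminr)

# Connect to the public CELLxGENE instance; the returned object exposes
# the registries as fields (db$Artifact, db$bionty, ...).
db <- connect("laminlabs/cellxgene")
```

This step needs network access and the lamin-cli setup described under "Initial setup".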
Printing the instance shows which registries are available, including Artifact, Collection, Feature, etc. Each of these registries has a corresponding Python class.
db
#> cellxgene
#> Core registries
#> $Run
#> $User
#> $Param
#> $ULabel
#> $Feature
#> $Storage
#> $Artifact
#> $Transform
#> $Collection
#> $FeatureSet
#> $ParamValue
#> $FeatureValue
#> Core link tables
#> runparamvalue
#> artifactulabel
#> collectionulabel
#> featuresetfeature
#> artifactfeatureset
#> artifactparamvalue
#> collectionartifact
#> artifactfeaturevalue
#> Additional modules
#> bionty
All of the ‘core’ registries are directly available from the db object, while registries from other modules can be accessed via db$<module_name>, e.g.:
db$bionty
#> bionty
#> Registries
#> $Gene
#> $Source
#> $Tissue
#> $Disease
#> $Pathway
#> $Protein
#> $CellLine
#> $CellType
#> $Organism
#> $Ethnicity
#> $Phenotype
#> $CellMarker
#> $DevelopmentalStage
#> $ExperimentalFactor
#> Link tables
#> artifactgene
#> artifacttissue
#> featuresetgene
#> artifactdisease
#> artifactpathway
#> artifactprotein
#> artifactcellline
#> artifactcelltype
#> artifactorganism
#> artifactethnicity
#> artifactphenotype
#> featuresetpathway
#> featuresetprotein
#> artifactcellmarker
#> featuresetcellmarker
#> artifactdevelopmentalstage
#> artifactexperimentalfactor
The bionty and other registries also have corresponding Python classes.
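To make the two access paths concrete, here is a short sketch. It assumes the connect() call used in this vignette and picks CellType from the bionty listing above purely as an illustration:

```r
library(laminr)

db <- connect("laminlabs/cellxgene")

# Core registries hang directly off the instance object.
artifact_registry <- db$Artifact

# Registries from an additional module are one level deeper: db$<module_name>$<Registry>.
cell_type_registry <- db$bionty$CellType

# Printing a registry lists its fields.
cell_type_registry
```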
Registry
A registry is used to query, store and manage data. For instance, the Artifact registry stores datasets and models as files, folders, or arrays.
You can see which fields a registry provides by printing the registry object:
db$Artifact
#> Artifact
#> Simple fields
#> id: AutoField
#> key: CharField
#> uid: CharField
#> hash: CharField
#> size: BigIntegerField
#> type: CharField
#> suffix: CharField
#> version: CharField
#> is_latest: BooleanField
#> n_objects: BigIntegerField
#> created_at: DateTimeField
#> updated_at: DateTimeField
#> visibility: SmallIntegerField
#> description: CharField
#> n_observations: BigIntegerField
#> Relational fields
#> run: run (many-to-one)
#> genes: gene (many-to-many)
#> storage: storage (many-to-one)
#> tissues: tissue (many-to-many)
#> ulabels: ulabel (many-to-many)
#> diseases: disease (many-to-many)
#> pathways: pathway (many-to-many)
#> proteins: protein (many-to-many)
#> organisms: organism (many-to-many)
#> transform: transform (many-to-one)
#> cell_lines: cellline (many-to-many)
#> cell_types: celltype (many-to-many)
#> created_by: user (many-to-one)
#> phenotypes: phenotype (many-to-many)
#> collections: collection (many-to-many)
#> ethnicities: ethnicity (many-to-many)
#> cell_markers: cellmarker (many-to-many)
#> feature_sets: featureset (many-to-many)
#> input_of_runs: run (many-to-many)
#> developmental_stages: developmentalstage (many-to-many)
#> experimental_factors: experimentalfactor (many-to-many)
You can fetch an Artifact by ID or UID. For example, Artifact KBW89Mf7IGcekja2hADu is an AnnData object containing myeloid cells.
artifact <- db$Artifact$get("KBW89Mf7IGcekja2hADu")
You can view its metadata by printing the object:
artifact
#> Artifact(uid='KBW89Mf7IGcekja2hADu', description='Myeloid compartment', key='cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad', storage_id=2, version='2024-07-01', _accessor='AnnData', id=3659, transform_id=22, size=691757462, is_latest=TRUE, created_by_id=1, type='dataset', _hash_type='md5-n', n_observations=51552, created_at='2024-07-12T12:34:10.345829+00:00', updated_at='2024-07-12T12:40:48.837026+00:00', run_id=27, suffix='.h5ad', visibility=1, _key_is_virtual=FALSE, hash='SZ5tB0T4YKfiUuUkAL09ZA')
Or get more detailed information by calling the $describe() method:
artifact$describe()
#> Artifact(uid='KBW89Mf7IGcekja2hADu', description='Myeloid compartment', key='cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad', storage_id=2, version='2024-07-01', _accessor='AnnData', id=3659, transform_id=22, size=691757462, is_latest=TRUE, created_by_id=1, type='dataset', _hash_type='md5-n', n_observations=51552, created_at='2024-07-12T12:34:10.345829+00:00', updated_at='2024-07-12T12:40:48.837026+00:00', run_id=27, suffix='.h5ad', visibility=1, _key_is_virtual=FALSE, hash='SZ5tB0T4YKfiUuUkAL09ZA')
#> Provenance
#> $storage = 's3://cellxgene-data-public'
#> $transform = 'Census release 2024-07-01 (LTS)'
#> $run = '2024-07-16T12:49:41.81955+00:00'
#> $created_by = 'sunnyosun'
You can access its fields as follows:
- artifact$id: 3659
- artifact$uid: KBW89Mf7IGcekja2hADu
- artifact$key: cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad
Or fetch data from related registries:
- artifact$storage: Storage(uid='oIYGbD74', root='s3://cellxgene-data-public', type='s3', created_at='2023-09-19T13:17:56.273068+00:00', updated_at='2023-10-16T15:04:08.998203+00:00', id=2, created_by_id=1, region='us-west-2')
- artifact$created_by: User(uid='kmvZDIX9', handle='sunnyosun', name='Sunny Sun', id=1, created_at='2023-09-19T12:02:50.765010+00:00', updated_at='2023-12-13T16:23:44.195541+00:00')
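The field and relation lookups above can be sketched as a single script. The relation names (storage, created_by) are taken from the "Relational fields" section of the Artifact registry printout. Reading a field such as root or handle from the related record with $ is an assumption, based on the records printing as regular registry objects:

```r
library(laminr)

db <- connect("laminlabs/cellxgene")
artifact <- db$Artifact$get("KBW89Mf7IGcekja2hADu")

# Simple fields are plain values.
artifact$id
artifact$uid
artifact$key

# Relational fields return records from the related registry.
storage <- artifact$storage
creator <- artifact$created_by

# Assumption: related records expose their own fields the same way.
storage$root
creator$handle
```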
Finally, for Artifact objects, you can download the data to a local cache with $cache() and load it into memory with $load():
artifact$cache()
artifact$load()
#> ℹ s3://cellxgene-data-public/cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad already exists at /home/runner/.cache/lamindb/cellxgene-data-public/cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad
#> AnnData object with n_obs × n_vars = 51552 × 36398
#> obs: 'donor_id', 'Predicted_labels_CellTypist', 'Majority_voting_CellTypist', 'Manually_curated_celltype', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
#> var: 'gene_symbols', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
#> uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'sex_ontology_term_id_colors', 'title'
#> obsm: 'X_umap'
Only S3 storage and AnnData accessors are supported at the moment. If additional storage and data accessors are desired, please open an issue on the laminr GitHub repository.
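As a closing sketch, the download and load steps can be combined with a quick look at the loaded object. This assumes that $load() returns an AnnData object whose obs slot behaves like a data frame; the cell_type column is one of the obs columns listed in the output above. The file is about 690 MB, so the first run takes a while.

```r
library(laminr)

db <- connect("laminlabs/cellxgene")
artifact <- db$Artifact$get("KBW89Mf7IGcekja2hADu")

# Download the file once; later calls reuse the cached copy.
artifact$cache()

# Read the cached .h5ad file into memory.
adata <- artifact$load()

# Assumption: `adata$obs` is a data frame of per-cell annotations.
cell_type_counts <- table(adata$obs$cell_type)
head(sort(cell_type_counts, decreasing = TRUE))
```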