LaminDB is an open-source data framework for biology. You can find out about some of its features in the documentation of the lamindb Python package.

This vignette will show you how to use the laminr package to interact with LaminDB.

Initial setup

As part of a first-time setup, you will need to install laminr and the Python lamin-cli package, and set up an instance for first use. The first two commands below run in a shell; the last two run in R.

pip install lamin-cli
lamin connect laminlabs/cellxgene
install.packages("remotes")
remotes::install_github("laminlabs/laminr")
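
Before continuing, it can help to confirm that both pieces are visible from R. The sketch below uses only base R; check_lamin_setup() is a hypothetical helper written for this illustration, not a laminr function, and it assumes the lamin command is expected on your PATH.

```r
# Hypothetical helper (not part of laminr): report whether the laminr
# package and the lamin command-line tool can be found from this R session.
check_lamin_setup <- function() {
  c(
    # TRUE if the laminr package is installed and loadable
    laminr_installed = requireNamespace("laminr", quietly = TRUE),
    # TRUE if an executable called "lamin" is found on the PATH
    lamin_cli_on_path = nzchar(unname(Sys.which("lamin")))
  )
}

print(check_lamin_setup())
```

If either entry is FALSE, revisit the corresponding installation command above.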

Connect to a LaminDB instance

This vignette uses the laminlabs/cellxgene instance, which is a LaminDB instance that interfaces with the CELLxGENE data.

You can connect to the instance using the connect R function:

library(laminr)

db <- connect("laminlabs/cellxgene")

By printing the instance, you can see which registries are available, including Artifact, Collection, Feature, etc. Each of these registries has a corresponding Python class.

db
#> cellxgene
#>   Core registries
#>     $Run
#>     $User
#>     $Param
#>     $ULabel
#>     $Feature
#>     $Storage
#>     $Artifact
#>     $Transform
#>     $Collection
#>     $FeatureSet
#>     $ParamValue
#>     $FeatureValue
#>   Core link tables
#>     runparamvalue
#>     artifactulabel
#>     collectionulabel
#>     featuresetfeature
#>     artifactfeatureset
#>     artifactparamvalue
#>     collectionartifact
#>     artifactfeaturevalue
#>   Additional modules
#>     bionty

All of the ‘core’ registries are directly available from the db object, while registries from other modules can be accessed via db$<module_name>, e.g.:

db$bionty
#> bionty
#>   Registries
#>     $Gene
#>     $Source
#>     $Tissue
#>     $Disease
#>     $Pathway
#>     $Protein
#>     $CellLine
#>     $CellType
#>     $Organism
#>     $Ethnicity
#>     $Phenotype
#>     $CellMarker
#>     $DevelopmentalStage
#>     $ExperimentalFactor
#>   Link tables
#>     artifactgene
#>     artifacttissue
#>     featuresetgene
#>     artifactdisease
#>     artifactpathway
#>     artifactprotein
#>     artifactcellline
#>     artifactcelltype
#>     artifactorganism
#>     artifactethnicity
#>     artifactphenotype
#>     featuresetpathway
#>     featuresetprotein
#>     artifactcellmarker
#>     featuresetcellmarker
#>     artifactdevelopmentalstage
#>     artifactexperimentalfactor

The bionty and other registries also have corresponding Python classes.

Registry

A registry is used to query, store and manage data. For instance, the Artifact registry stores datasets and models as files, folders, or arrays.

You can see which fields a registry provides, both simple and relational, by printing the registry object:

db$Artifact
#> Artifact
#>   Simple fields
#>     id: AutoField
#>     key: CharField
#>     uid: CharField
#>     hash: CharField
#>     size: BigIntegerField
#>     type: CharField
#>     suffix: CharField
#>     version: CharField
#>     is_latest: BooleanField
#>     n_objects: BigIntegerField
#>     created_at: DateTimeField
#>     updated_at: DateTimeField
#>     visibility: SmallIntegerField
#>     description: CharField
#>     n_observations: BigIntegerField
#>   Relational fields
#>     run: run (many-to-one)
#>     genes: gene (many-to-many)
#>     storage: storage (many-to-one)
#>     tissues: tissue (many-to-many)
#>     ulabels: ulabel (many-to-many)
#>     diseases: disease (many-to-many)
#>     pathways: pathway (many-to-many)
#>     proteins: protein (many-to-many)
#>     organisms: organism (many-to-many)
#>     transform: transform (many-to-one)
#>     cell_lines: cellline (many-to-many)
#>     cell_types: celltype (many-to-many)
#>     created_by: user (many-to-one)
#>     phenotypes: phenotype (many-to-many)
#>     collections: collection (many-to-many)
#>     ethnicities: ethnicity (many-to-many)
#>     cell_markers: cellmarker (many-to-many)
#>     feature_sets: featureset (many-to-many)
#>     input_of_runs: run (many-to-many)
#>     developmental_stages: developmentalstage (many-to-many)
#>     experimental_factors: experimentalfactor (many-to-many)

You can fetch an Artifact by ID or UID. For example, Artifact KBW89Mf7IGcekja2hADu is an AnnData object containing myeloid cells.

artifact <- db$Artifact$get("KBW89Mf7IGcekja2hADu")
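
When a UID comes from user input, the lookup may fail. The following is a minimal sketch of a defensive wrapper, assuming that $get() signals an ordinary R error when the record cannot be fetched; safe_get() is a hypothetical helper and not part of laminr.

```r
# Hypothetical helper (not part of laminr): return NULL instead of stopping
# when a record cannot be fetched.
# Assumption: registry$get() signals an R error on failure.
safe_get <- function(registry, id_or_uid) {
  tryCatch(
    registry$get(id_or_uid),
    error = function(e) {
      message("Could not fetch '", id_or_uid, "': ", conditionMessage(e))
      NULL
    }
  )
}

# With a live connection this would be called as:
# artifact <- safe_get(db$Artifact, "KBW89Mf7IGcekja2hADu")
```

Returning NULL lets calling code test the result with is.null() rather than handling the error itself.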

You can view its metadata by printing the object:

artifact
#> Artifact(uid='KBW89Mf7IGcekja2hADu', description='Myeloid compartment', key='cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad', storage_id=2, version='2024-07-01', _accessor='AnnData', id=3659, transform_id=22, size=691757462, is_latest=TRUE, created_by_id=1, type='dataset', _hash_type='md5-n', n_observations=51552, created_at='2024-07-12T12:34:10.345829+00:00', updated_at='2024-07-12T12:40:48.837026+00:00', run_id=27, suffix='.h5ad', visibility=1, _key_is_virtual=FALSE, hash='SZ5tB0T4YKfiUuUkAL09ZA')

Or get more detailed information by calling the $describe() method:

artifact$describe()
#> Artifact(uid='KBW89Mf7IGcekja2hADu', description='Myeloid compartment', key='cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad', storage_id=2, version='2024-07-01', _accessor='AnnData', id=3659, transform_id=22, size=691757462, is_latest=TRUE, created_by_id=1, type='dataset', _hash_type='md5-n', n_observations=51552, created_at='2024-07-12T12:34:10.345829+00:00', updated_at='2024-07-12T12:40:48.837026+00:00', run_id=27, suffix='.h5ad', visibility=1, _key_is_virtual=FALSE, hash='SZ5tB0T4YKfiUuUkAL09ZA')
#>   Provenance
#>     $storage = 's3://cellxgene-data-public'
#>     $transform = 'Census release 2024-07-01 (LTS)'
#>     $run = '2024-07-16T12:49:41.81955+00:00'
#>     $created_by = 'sunnyosun'

You can access its fields as follows:

  • artifact$id: 3659
  • artifact$uid: KBW89Mf7IGcekja2hADu
  • artifact$key: cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad

Or fetch data from related registries:

  • artifact$storage: Storage(uid='oIYGbD74', root='s3://cellxgene-data-public', type='s3', created_at='2023-09-19T13:17:56.273068+00:00', updated_at='2023-10-16T15:04:08.998203+00:00', id=2, created_by_id=1, region='us-west-2')
  • artifact$created_by: User(uid='kmvZDIX9', handle='sunnyosun', name='Sunny Sun', id=1, created_at='2023-09-19T12:02:50.765010+00:00', updated_at='2023-12-13T16:23:44.195541+00:00')
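
Field access and related-record access can be combined in ordinary R code. The sketch below collects a few values into a one-row data frame. artifact_summary() is a hypothetical helper, and it assumes that related records such as artifact$storage and artifact$created_by expose their own fields (root, handle) with the same $ syntax.

```r
# Hypothetical helper (not part of laminr): gather selected fields of an
# artifact, plus two fields from related records, into a one-row data frame.
# Assumption: related records expose their fields via `$` as well.
artifact_summary <- function(artifact) {
  data.frame(
    uid = artifact$uid,
    description = artifact$description,
    size_mib = round(artifact$size / 1024^2, 1), # size is reported in bytes
    n_observations = artifact$n_observations,
    storage_root = artifact$storage$root,
    created_by = artifact$created_by$handle,
    stringsAsFactors = FALSE
  )
}

# With the artifact fetched earlier:
# artifact_summary(artifact)
```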

Finally, for Artifact objects, you can download the data to a local cache using $cache(), or load it into memory using $load().

artifact$cache()
artifact$load()
#>  s3://cellxgene-data-public/cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad already exists at /home/runner/.cache/lamindb/cellxgene-data-public/cell-census/2024-07-01/h5ads/fe52003e-1460-4a65-a213-2bb1a508332f.h5ad
#> AnnData object with n_obs × n_vars = 51552 × 36398
#>     obs: 'donor_id', 'Predicted_labels_CellTypist', 'Majority_voting_CellTypist', 'Manually_curated_celltype', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'sex_ontology_term_id', 'tissue_ontology_term_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
#>     var: 'gene_symbols', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
#>     uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'sex_ontology_term_id_colors', 'title'
#>     obsm: 'X_umap'
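
Once loaded, the object can be explored with regular R tools. As a minimal sketch, the hypothetical helper count_obs() below tabulates one of the obs columns listed in the output above. It assumes the loaded object exposes obs as a data frame via adata$obs, as the anndata R package does.

```r
# Hypothetical helper (not part of laminr): count cells per category of one
# obs column, most frequent first.
# Assumption: adata$obs is a data frame with one row per cell.
count_obs <- function(adata, column) {
  stopifnot(column %in% colnames(adata$obs))
  sort(table(adata$obs[[column]]), decreasing = TRUE)
}

# With the artifact loaded above:
# adata <- artifact$load()
# head(count_obs(adata, "cell_type"))
```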

Only S3 storage and AnnData accessors are supported at the moment. If additional storage and data accessors are desired, please open an issue on the laminr GitHub repository.