Introduction
This vignette demonstrates a basic workflow for accessing and analysing single-cell RNA-seq data from the CELLxGENE repository using {laminr}. CZ CELLxGENE Discover is a standardised collection of scRNA-seq datasets, and LaminDB makes it easy to query and access data in this repository. We will go through the steps of finding and downloading a dataset using {laminr}, performing some simple analysis using {Seurat} and saving the results to your own LaminDB database.
Before we start
Before we begin, please take some time to check out the Getting Started vignette (vignette("laminr", package = "laminr")). In particular, make sure you have run the commands in the “Initial Setup” section.
Once that is done, we can load the {laminr} library:
library(laminr)
Connecting to LaminDB
The first thing we need to do is connect to the LaminDB database. For this tutorial, we will connect to a default instance (where we will store results) and to the CELLxGENE instance that we will search for datasets.
Connect to the default instance
We will start by connecting to your default LaminDB instance. You can set the default instance using the lamin CLI on the command line:
lamin connect <owner>/<name>
Once a default instance has been set, we can connect to it with {laminr}:
db <- connect()
#> ! schema module 'bionty' is not installed → no access to its labels & registries (resolve via `pip install bionty`)
#> → connected lamindb: laminlabs/cellxgene
db
#> cellxgene
#> Core registries
#> $Run
#> $User
#> $Param
#> $ULabel
#> $Feature
#> $Storage
#> $Artifact
#> $Transform
#> $Collection
#> $FeatureSet
#> $ParamValue
#> $FeatureValue
#> Additional modules
#> bionty
This gives us an object we can use to interact with the database.
Note that only the default instance can create new records. This tutorial assumes you have access to an instance where you have permission to add data.
Track data provenance
Before we start, we will track the code that is run in this notebook.
db$track("I8BlHXFXqZOG0000", path = "example_workflow.Rmd")
Tip: The ID should be obtained by running db$track(path = "example_workflow.Rmd") and copying the ID from the output.
Connect to the CELLxGENE instance
We can connect to other instances by providing a slug to the connect() function. Instances connected to in this way can be used to query data but cannot be used to make any changes. Let’s connect to the CELLxGENE instance:
cellxgene <- connect("laminlabs/cellxgene")
cellxgene
#> cellxgene
#> Core registries
#> $Run
#> $User
#> $Param
#> $ULabel
#> $Feature
#> $Storage
#> $Artifact
#> $Transform
#> $Collection
#> $FeatureSet
#> $ParamValue
#> $FeatureValue
#> Additional modules
#> bionty
Downloading a dataset
In Lamin, artifacts are objects that contain information (single-cell data, images, data frames etc.) as well as associated metadata. You can see what artifacts are available using the database instance object.
cellxgene$Artifact$df(limit = 5)
#> id suffix X_accessor n_objects visibility
#> 1 2846 tiledbsoma 290 1
#> 2 3665 tiledbsoma 330 1
#> 3 1270 .h5ad AnnData NA 1
#> 4 2840 .ipynb <NA> NA 0
#> 5 2842 .html <NA> NA 0
#> key
#> 1 cell-census/2023-12-15/soma
#> 2 cell-census/2024-07-01/soma
#> 3 cell-census/2023-07-25/h5ads/7a0a8891-9a22-4549-a55b-c2aca23c3a2a.h5ad
#> 4 <NA>
#> 5 <NA>
#> uid size hash
#> 1 FYMewVq5twKMDXVy0000 635848093433 Mfyw8VuqftX5REITfQH_yg
#> 2 FYMewVq5twKMDXVy0001 870700998221 bzrXBPNvitSVKvb3GG38_w
#> 3 tczTlSHFPOcAcBnfyxKA 1297573950 UlsVvBz9kMzn2r9RdoAAOg
#> 4 JIIPyQX5l9qELPl42d75 36297 gNdUkonYgQJP_Mi3xLzt_g
#> 5 Whyxwf3k2GjJwTPCl1FK 716529 BDGZac3qU3oLVFpO035Qhg
#> description n_observations is_latest X_hash_type
#> 1 Census 2023-12-15 68683222 FALSE md5-d
#> 2 Census 2024-07-01 115556140 TRUE md5-d
#> 3 Supercluster: Hippocampal CA1-3 74979 FALSE md5-n
#> 4 Source of transform G69jtgzKO0eJ6K79 NA FALSE md5
#> 5 Report of run UAAiLAi0BrLvlKnsuvP3 NA FALSE md5
#> type created_at X_key_is_virtual
#> 1 dataset 2024-07-12T12:12:16.091881+00:00 FALSE
#> 2 dataset 2024-07-16T12:52:01.424629+00:00 FALSE
#> 3 <NA> 2023-11-28T21:46:12.685907+00:00 FALSE
#> 4 <NA> 2024-01-29T08:32:13.311741+00:00 TRUE
#> 5 <NA> 2024-01-29T08:32:18.346499+00:00 TRUE
#> updated_at version
#> 1 2024-09-17T13:00:13.714256+00:00 2023-12-15
#> 2 2024-09-17T13:01:23.739635+00:00 2024-07-01
#> 3 2024-01-24T07:10:21.725547+00:00 2023-07-25
#> 4 2024-01-29T08:32:13.311792+00:00 0
#> 5 2024-01-30T09:12:06.027928+00:00 1
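The printed result looks like an ordinary data frame, so standard R subsetting should be enough to narrow it down. The sketch below does not query LaminDB: it rebuilds a few columns of the output above by hand (the object name `artifacts` is only for illustration, and the blank suffixes in the printout are assumed to be empty strings) and keeps the .h5ad entries.

```r
# Hand-copied subset of the columns printed above (not a live query)
artifacts <- data.frame(
  id = c(2846, 3665, 1270, 2840, 2842),
  suffix = c("", "", ".h5ad", ".ipynb", ".html"),
  description = c(
    "Census 2023-12-15",
    "Census 2024-07-01",
    "Supercluster: Hippocampal CA1-3",
    "Source of transform G69jtgzKO0eJ6K79",
    "Report of run UAAiLAi0BrLvlKnsuvP3"
  ),
  n_observations = c(68683222, 115556140, 74979, NA, NA)
)

# Keep only artifacts stored as .h5ad files
h5ad_artifacts <- artifacts[artifacts$suffix == ".h5ad", ]

# Case-insensitive keyword search in the description column
hippocampal <- artifacts[grepl("hippocampal", artifacts$description, ignore.case = TRUE), ]

h5ad_artifacts
```

With only five rows returned this is of limited use, and a larger limit would mean scanning a much bigger table by eye or by pattern.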
This is useful, but it’s not the nicest or easiest way to find a particular dataset. Instead, we will use the Lamin Hub website to find the data we want to load.
- Open a browser and go to https://lamin.ai/laminlabs/cellxgene
- On the top toolbar, click the “Artifacts” tab
- Use the search field and the filters to find a dataset you are interested in.
- We use the “Suffix” filter to find .h5ad files and search for “renal cell carcinoma”
- Select the entry for the dataset you want to load to open a page with more details
- Click the copy button at the top right; this copies a command that includes the ID for the artifact
Once we have the artifact ID, we can load information about the artifact, similar to what we see on the website. Notice that we use a slightly different command to what we copied from the website.
artifact <- cellxgene$Artifact$get("7dVluLROpalzEh8mNyxk")
artifact
#> Artifact(uid='7dVluLROpalzEh8mNyxk', description='Renal cell carcinoma, pre aPD1, kidney Puck_200727_12', key='cell-census/2023-12-15/h5ads/02faf712-92d4-4589-bec7-13105059cf86.h5ad', id=1742, run_id=22, hash='YNYuokfAoDFxdaRILjmU9w', size=13997860, suffix='.h5ad', storage_id=2, version='2023-12-15', _accessor='AnnData', is_latest=TRUE, transform_id=16, _hash_type='md5-n', created_at='2024-01-11T09:13:23.143694+00:00', created_by_id=1, updated_at='2024-01-24T07:17:47.009288+00:00', visibility=1, n_observations=17612, _key_is_virtual=FALSE)
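The metadata can be used to judge how large a file is before fetching it. As a small sketch, assuming the size field is reported in bytes, the value printed above converts to mebibytes as follows (the number is copied by hand rather than read from the artifact object):

```r
# `size` value copied from the printed artifact above (assumed to be in bytes)
size_bytes <- 13997860

# Convert bytes to mebibytes (1 MiB = 1024^2 bytes)
size_mib <- size_bytes / 1024^2
size_label <- sprintf("%.1f MiB", size_mib)
size_label
```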
So far we have only retrieved the metadata about this object. To download the data itself we need to run another command.
adata <- artifact$load()
adata
#> AnnData object with n_obs × n_vars = 17612 × 23254
#> obs: 'n_genes', 'n_UMIs', 'log10_n_UMIs', 'log10_n_genes', 'Cell_Type', 'cell_type_ontology_term_id', 'organism_ontology_term_id', 'tissue_ontology_term_id', 'assay_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'donor_id', 'is_primary_data', 'suspension_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
#> var: 'gene', 'n_beads', 'n_UMIs', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
#> uns: 'Cell_Type_colors', 'schema_version', 'title'
#> obsm: 'X_spatial'
This dataset has been stored as an AnnData object. In the next sections we will convert it to a Seurat object and perform some simple analysis.
Convert to Seurat
There are various approaches for converting between different single-cell objects, some of which are described in the Interoperability chapter of the Single-cell Best Practices book.
Because we already have the data loaded in memory, the simplest option is to extract the information we need and create a new Seurat object.
seurat <- SeuratObject::CreateSeuratObject(
  counts = Matrix::t(adata$X),
  meta.data = adata$obs
)
#> Warning: Data is of class dgRMatrix. Coercing to dgCMatrix.
seurat
#> An object of class Seurat
#> 23254 features across 17612 samples within 1 assay
#> Active assay: RNA (23254 features, 0 variable features)
#> 1 layer present: counts
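The transpose in the call above is needed because the AnnData object stores cells in rows (17612 × 23254), while the Seurat object reports features in rows (23254 features across 17612 samples). The toy example below illustrates the same reshaping with the {Matrix} package on made-up values. It also converts to column-compressed storage explicitly, which is the conversion the warning above says is applied automatically. The object names are for illustration only.

```r
library(Matrix)

# Toy cells x genes matrix, laid out like an AnnData X matrix (made-up values)
x <- Matrix::sparseMatrix(
  i = c(1, 1, 2, 3),
  j = c(1, 3, 2, 4),
  x = c(5, 1, 2, 7),
  dims = c(3, 4),
  dimnames = list(paste0("cell", 1:3), paste0("gene", 1:4))
)

# Store it in row-compressed form to mimic a dgRMatrix input
x <- methods::as(x, "RsparseMatrix")

# Transpose to genes x cells, then convert to column-compressed storage
counts <- methods::as(Matrix::t(x), "CsparseMatrix")

dim(counts)
rownames(counts)
```

Doing the coercion up front is optional, since the output above shows the object is created either way.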
Analysis
We could perform any normal analysis using {Seurat}, but as an example we will calculate marker genes for each of the annotated cell types. To make things a bit quicker, we only test the first 1000 genes, but if you have a few minutes you can get results for all features.
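To give a feel for what a marker test does, here is a rough base R sketch of a one-vs-rest Wilcoxon rank sum test for a single gene, the test that the {Seurat} messages in this section name as the default method. It is not the {Seurat} implementation: the expression values, the labels and the helper `one_vs_rest()` are made up for illustration.

```r
# Toy data: expression of one gene in 12 cells (made-up values)
expression <- c(5.1, 4.8, 5.5, 4.9, 0.2, 0.0, 0.4, 0.1, 0.3, 0.0, 0.5, 0.2)
cell_type <- rep(c("Tumor", "Myeloid", "Fibroblast"), each = 4)

# Hypothetical helper: test each cell type against all remaining cells
one_vs_rest <- function(expression, cell_type) {
  groups <- unique(cell_type)
  p_values <- vapply(groups, function(group) {
    in_group <- cell_type == group
    stats::wilcox.test(
      expression[in_group],
      expression[!in_group],
      exact = FALSE # normal approximation, because the toy data contains ties
    )$p.value
  }, numeric(1))
  data.frame(cluster = groups, p_val = p_values, row.names = NULL)
}

toy_markers <- one_vs_rest(expression, cell_type)
toy_markers
```

In this toy data the gene is clearly higher in the "Tumor" cells, so that comparison gives the smallest p-value. A real analysis repeats this for every gene and corrects for multiple testing.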
# Set cell identities to the provided cell type annotation
SeuratObject::Idents(seurat) <- "Cell_Type"
# Normalise the data
seurat <- Seurat::NormalizeData(seurat)
#> Normalizing layer: counts
# Test for marker genes
markers <- Seurat::FindAllMarkers(
  seurat,
  features = SeuratObject::Features(seurat)[1:1000]
)
#> Calculating cluster Epithelial
#> Calculating cluster Fibroblast
#> For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
#> (default method for FindMarkers) please install the presto package
#> --------------------------------------------
#> install.packages('devtools')
#> devtools::install_github('immunogenomics/presto')
#> --------------------------------------------
#> After installation of presto, Seurat will automatically use the more
#> efficient implementation (no further action necessary).
#> This message will be shown once per session
#> Calculating cluster Myeloid
#> Calculating cluster Tumor
#> Warning: The following tests were not performed:
#> Warning: When testing Epithelial versus all:
#> Cell group 1 has fewer than 3 cells
# The output is a data.frame
head(markers)
#> p_val avg_log2FC pct.1 pct.2 p_val_adj cluster
#> ENSG00000164283 1.030703e-89 2.7485040 0.205 0.048 2.396797e-85 Fibroblast
#> ENSG00000116016 3.606838e-38 2.0721038 0.152 0.051 8.387340e-34 Fibroblast
#> ENSG00000074800 5.097282e-25 -0.9810317 0.185 0.366 1.185322e-20 Fibroblast
#> ENSG00000112715 6.663398e-18 -1.1826785 0.078 0.202 1.549507e-13 Fibroblast
#> ENSG00000140416 1.844156e-17 -0.6994000 0.175 0.326 4.288400e-13 Fibroblast
#> ENSG00000125810 8.916133e-15 1.8102270 0.057 0.019 2.073358e-10 Fibroblast
#> gene
#> ENSG00000164283 ENSG00000164283
#> ENSG00000116016 ENSG00000116016
#> ENSG00000074800 ENSG00000074800
#> ENSG00000112715 ENSG00000112715
#> ENSG00000140416 ENSG00000140416
#> ENSG00000125810 ENSG00000125810
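Because the result is a data.frame, it can be filtered and sorted with ordinary R tools. The sketch below rebuilds the six rows shown above by hand (under the illustrative name `markers_head`, so it does not overwrite the real results) and keeps the up-regulated genes. The cut-offs of 0.05 for the adjusted p-value and 1 for the log2 fold change are arbitrary choices for this example, not recommendations.

```r
# Hand-copied from the head(markers) output above
markers_head <- data.frame(
  gene = c(
    "ENSG00000164283", "ENSG00000116016", "ENSG00000074800",
    "ENSG00000112715", "ENSG00000140416", "ENSG00000125810"
  ),
  avg_log2FC = c(2.7485040, 2.0721038, -0.9810317, -1.1826785, -0.6994000, 1.8102270),
  p_val_adj = c(2.396797e-85, 8.387340e-34, 1.185322e-20, 1.549507e-13, 4.288400e-13, 2.073358e-10),
  cluster = "Fibroblast"
)

# Keep significant, up-regulated genes (illustrative thresholds)
up_markers <- subset(markers_head, p_val_adj < 0.05 & avg_log2FC > 1)

# Sort by effect size, largest first
up_markers <- up_markers[order(up_markers$avg_log2FC, decreasing = TRUE), ]
up_markers
```

The same two lines of filtering and sorting should work on the full markers object, since it has the same columns.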
Store the results in LaminDB
Now that we have our results, we can save them to the LaminDB instance.
Render and upload the notebook
You can render this notebook to HTML:
- In RStudio, click the “Knit” button
- From the command line, run:
- Or use the rmarkdown package in R: rmarkdown::render("example_workflow.Rmd")
And then save it to your LaminDB instance using the lamin CLI: