We have already covered a lot of basic concepts and analysis algorithms, but as you may have noticed, the analyses so far were restricted to TCRαβ sequences, which is still somewhat limiting. Today we extend the scope a bit: we refine the TCR distance analysis while also handling γδ TCRs, and we gain at-scale computation with sparse data representations and parallelized, byte-compiled code. The reference is "TCR meta-clonotypes for biomarker discovery with tcrdist3: identification of public, HLA restricted SARS-CoV-2 associated TCR features"; this work builds directly on what came before, and there is a huge amount of material in this topic.
Let's take a look at the analysis framework.
1. Experimental antigen enrichment reveals TCRs with biochemically similar neighbors
Searching for identical TCRs within a repertoire - arising either from clonal expansion or convergent nucleotide encoding of amino acids in the CDR3 - is a common strategy for identifying functionally important receptors (and essentially the only practical one). However, in the absence of experimental enrichment procedures, observing T cells with identical amino acid TCR sequences across bulk samples is rare. For example, among 10,000 β-chain TCRs from an umbilical cord blood sample, fewer than 1% of the TCR amino acid sequences were observed more than once, including likely clonal expansions (disease really does drive antigen-specific TCR expansion, which is the core of this line of research).
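For reference, the exact-match search itself is a one-liner on a clones table; here is a minimal sketch (it reuses the dash.csv example file that the code below also reads, and assumes a cdr3_b_aa column holding CDR3β amino-acid sequences):
import pandas as pd
df = pd.read_csv("dash.csv")                      # any clones table with a CDR3β amino-acid column
counts = df['cdr3_b_aa'].value_counts()           # how often each exact amino-acid sequence occurs
repeated = counts[counts > 1]
print(f"{len(repeated)} of {len(counts)} unique CDR3β sequences are seen more than once")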
[Figure] TCR repertoire subsets obtained by single-cell sorting with peptide-MHC tetramers.
2. TCR biochemical neighborhood density is heterogeneous in antigen-enriched repertoires
We next investigated the proportion of unique TCRs with at least one biochemically similar neighbor among TCRs with the same putative antigen specificity. We and others have shown that a single peptide-MHC epitope is often recognized by many distinct TCRs with closely related amino acid sequences (the TCRs recognizing a given antigen are diverse - a many-to-one relationship, which complicates things). That is when we have to look for similarity between sequences (the TCR distance introduced earlier) to capture what they share. We observed the highest-density neighborhoods within repertoires that were sorted based on peptide-MHC tetramer binding (so the effect of antigen enrichment is clear). These observations suggest that biochemical neighborhood density is highly heterogeneous among TCRs and that it may depend on mechanisms of antigen recognition as well as receptor V(D)J recombination biases (which makes this hard to study with exact matching alone).
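The quantity being measured here is easy to compute once you have a pairwise distance matrix. Below is a minimal sketch on a toy matrix; swap in tr.pw_beta from the tcrdist3 code later in this post and a TCRdist radius of your choosing:
import numpy as np

def fraction_with_neighbor(pw, radius):
    """Fraction of TCRs with at least one *other* TCR within `radius` of them."""
    neighbors = (pw <= radius).sum(axis=1) - 1   # subtract the self-comparison on the diagonal
    return float((neighbors > 0).mean())

# Toy symmetric distance matrix standing in for a real tcrdist output
rng = np.random.default_rng(0)
toy = rng.integers(0, 150, size=(200, 200))
toy = np.minimum(toy, toy.T)
np.fill_diagonal(toy, 0)
print(fraction_with_neighbor(toy, radius=18))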
3. Meta-clonotype radius can be tuned to balance a biomarker's sensitivity and specificity
The utility of a TCR-based biomarker depends on the antigen specificity of its TCRs. A key constraint on distance-based clustering is the presence of similar TCR sequences that may lack the ability to recognize the target antigen (in plain terms, we need to define a similarity radius). To be useful, a meta-clonotype definition should be broad enough to capture multiple biochemically similar TCRs with shared antigen recognition, but not so broad as to include a high proportion of non-specific TCRs, which might be found in unenriched background repertoires that are largely antigen-naïve (the radius has to be just right). The difficulty is that the density of similar TCR "neighbors" is heterogeneous.
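To make the trade-off concrete, the tuning logic can be sketched with plain numpy. The two matrices below are random placeholders standing in for real distances; tcrdist3's rectangular distances, shown at the end of this post, provide the foreground-versus-background comparison in practice:
import numpy as np

rng = np.random.default_rng(0)
# Placeholders: pw_fg = pairwise distances within an antigen-enriched repertoire (n x n),
# rw_bg = rectangular distances from the same n TCRs to m antigen-naive background TCRs (n x m)
pw_fg = rng.integers(0, 120, size=(50, 50))
rw_bg = rng.integers(0, 120, size=(50, 2000))

for radius in (12, 18, 24, 36):
    fg_density = ((pw_fg <= radius).sum(axis=1) - 1).mean()   # neighbors among antigen-experienced TCRs
    bg_density = (rw_bg <= radius).sum(axis=1).mean()         # hits in the antigen-naive background
    print(radius, fg_density, bg_density)                     # pick a radius with high fg and low bg density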
An ideal radius-defined meta-clonotype would include a high density of TCRs in antigen-experienced individuals, indicative of shared antigen specificity, yet a low density of TCRs in an antigen-naïve background. The next step is to search for such antigen-associated TCR sequences. Let's look at the analysis code (tcrdist3).
Part 1 of the code: TCRdist
First, a look at the input data format - it is very similar to what comes out of a 10X analysis.
[Figure: example of the input table]
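As a quick check, you can inspect the table directly; a minimal sketch (the columns named in the comment are roughly the typical ingredients for the αβ workflow, but your file may carry extras):
import pandas as pd

# Peek at the input table. tcrdist3 expects one row per clone, with V/J gene
# calls and CDR3 amino-acid sequences per chain (columns such as v_a_gene,
# j_a_gene, cdr3_a_aa, v_b_gene, j_b_gene, cdr3_b_aa, plus subject and count).
df = pd.read_csv("dash.csv")
print(df.shape)
print(df.columns.tolist())
print(df.head())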
Now let's write the code, step by step.
Default parameters
"""
If you just want 'tcrdistances' using the pre-set default settings,
you can access the distance matrices:
tr.pw_alpha - alpha chain pairwise distance matrix
tr.pw_beta - beta chain pairwise distance matrix
tr.pw_cdr3_a_aa - cdr3 alpha chain distance matrix
tr.pw_cdr3_b_aa - cdr3 beta chain distance matrix
"""
import pandas as pd
from tcrdist.repertoire import TCRrep
df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            db_file = 'alphabeta_gammadelta_db.tsv')
# Distances are computed at construction because compute_distances defaults to True
tr.pw_alpha
tr.pw_beta
tr.pw_cdr3_a_aa
tr.pw_cdr3_b_aa
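A small usage sketch (assuming the default dense output, where tr.pw_beta is an n_clones x n_clones numpy array whose rows follow tr.clone_df): label the matrix and pull out the closest pair of distinct clones.
import numpy as np
import pandas as pd

# Wrap the beta-chain distance matrix in a labeled DataFrame
labels = tr.clone_df['cdr3_b_aa']
pw_beta = pd.DataFrame(tr.pw_beta, index=labels, columns=labels)

# Closest pair of distinct clones (mask the zero diagonal before taking the argmin)
masked = tr.pw_beta.astype(float)
np.fill_diagonal(masked, np.inf)
i, j = np.unravel_index(np.argmin(masked), masked.shape)
print(labels.iloc[i], labels.iloc[j], tr.pw_beta[i, j])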
Changing a single default parameter
"""
If you want 'tcrdistances' but with changes to some parameters -
for instance, changing the gap penalty on CDR3s to 5.
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            compute_distances = False,
            db_file = 'alphabeta_gammadelta_db.tsv')
# Override the CDR3 gap penalty on both chains, then trigger the computation
tr.kargs_a['cdr3_a_aa']['gap_penalty'] = 5
tr.kargs_b['cdr3_b_aa']['gap_penalty'] = 5
tr.compute_distances()
tr.pw_alpha
tr.pw_beta
Full manual control over the distance computation (this one asks a bit more of your coding skills)
"""
If you want 'tcrdistances' AND you want control over EVERY parameter.
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
df = pd.read_csv("dash.csv")
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            compute_distances = False,
            db_file = 'alphabeta_gammadelta_db.tsv')
metrics_a = {
    "cdr3_a_aa" : pw.metrics.nb_vector_tcrdist,
    "pmhc_a_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr2_a_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr1_a_aa" : pw.metrics.nb_vector_tcrdist}
metrics_b = {
    "cdr3_b_aa" : pw.metrics.nb_vector_tcrdist,
    "pmhc_b_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr2_b_aa" : pw.metrics.nb_vector_tcrdist,
    "cdr1_b_aa" : pw.metrics.nb_vector_tcrdist}
weights_a = {
    "cdr3_a_aa" : 3,
    "pmhc_a_aa" : 1,
    "cdr2_a_aa" : 1,
    "cdr1_a_aa" : 1}
weights_b = {
    "cdr3_b_aa" : 3,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1}
kargs_a = {
    'cdr3_a_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
                   'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 3, 'ctrim': 2, 'fixed_gappos': False},
    'pmhc_a_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
                   'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 0, 'ctrim': 0, 'fixed_gappos': True},
    'cdr2_a_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
                   'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 0, 'ctrim': 0, 'fixed_gappos': True},
    'cdr1_a_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
                   'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 0, 'ctrim': 0, 'fixed_gappos': True}}
kargs_b = {
    'cdr3_b_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
                   'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 3, 'ctrim': 2, 'fixed_gappos': False},
    'pmhc_b_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
                   'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 0, 'ctrim': 0, 'fixed_gappos': True},
    'cdr2_b_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
                   'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 0, 'ctrim': 0, 'fixed_gappos': True},
    'cdr1_b_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
                   'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 0, 'ctrim': 0, 'fixed_gappos': True}}
tr.metrics_a = metrics_a
tr.metrics_b = metrics_b
tr.weights_a = weights_a
tr.weights_b = weights_b
tr.kargs_a = kargs_a
tr.kargs_b = kargs_b
# compute_distances was set to False above, so the computation must be triggered explicitly
tr.compute_distances()
tr.pw_alpha
tr.pw_beta
Counting only mismatches
"""
If you want "tcrdistances" using a different metric.
Here we illustrate the use of a metric that uses the
Needleman-Wunsch algorithm to align sequences and then
counts the number of mismatching positions (pw.metrics.nw_hamming_metric).
This method doesn't rely on Numba, so it can run faster using multiple CPUs.
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
import multiprocessing
df = pd.read_csv("dash.csv")
df = df.head(100) # for faster testing
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            use_defaults = False,
            compute_distances = False,
            cpus = 1,
            db_file = 'alphabeta_gammadelta_db.tsv')
metrics_a = {
    "cdr3_a_aa" : pw.metrics.nw_hamming_metric,
    "pmhc_a_aa" : pw.metrics.nw_hamming_metric,
    "cdr2_a_aa" : pw.metrics.nw_hamming_metric,
    "cdr1_a_aa" : pw.metrics.nw_hamming_metric}
metrics_b = {
    "cdr3_b_aa" : pw.metrics.nw_hamming_metric,
    "pmhc_b_aa" : pw.metrics.nw_hamming_metric,
    "cdr2_b_aa" : pw.metrics.nw_hamming_metric,
    "cdr1_b_aa" : pw.metrics.nw_hamming_metric}
weights_a = {
    "cdr3_a_aa" : 1,
    "pmhc_a_aa" : 1,
    "cdr2_a_aa" : 1,
    "cdr1_a_aa" : 1}
weights_b = {
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1}
kargs_a = {
    'cdr3_a_aa' : {'use_numba': False},
    'pmhc_a_aa' : {'use_numba': False},
    'cdr2_a_aa' : {'use_numba': False},
    'cdr1_a_aa' : {'use_numba': False}}
kargs_b = {
    'cdr3_b_aa' : {'use_numba': False},
    'pmhc_b_aa' : {'use_numba': False},
    'cdr2_b_aa' : {'use_numba': False},
    'cdr1_b_aa' : {'use_numba': False}}
tr.metrics_a = metrics_a
tr.metrics_b = metrics_b
tr.weights_a = weights_a
tr.weights_b = weights_b
tr.kargs_a = kargs_a
tr.kargs_b = kargs_b
tr.compute_distances()
tr.pw_cdr3_b_aa
tr.pw_beta
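A short follow-up sketch (assuming tr.pw_cdr3_b_aa is a dense matrix aligned with tr.clone_df): with this metric each entry counts mismatched positions after alignment, so near-identical CDR3β pairs can be listed directly.
import numpy as np

# CDR3β pairs that differ at no more than one aligned position (i < j avoids self/duplicate pairs)
i, j = np.where(np.triu(tr.pw_cdr3_b_aa <= 1, k=1))
for a, b in zip(i, j):
    print(tr.clone_df['cdr3_b_aa'].iloc[a],
          tr.clone_df['cdr3_b_aa'].iloc[b],
          tr.pw_cdr3_b_aa[a, b])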
A custom distance metric
"""
If you want a tcrdistance, but you want to use your own metric.
(A valid metric takes two strings and returns a numerical distance).
def my_own_metric(s1,s2):
return Levenshtein.distance(s1,s2)
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
import multiprocessing
import Levenshtein  # pip install python-Levenshtein

def my_own_metric(s1, s2):
    """Any callable that takes two strings and returns a numerical distance will do."""
    return Levenshtein.distance(s1, s2)

df = pd.read_csv("dash.csv")
df = df.head(100)  # for faster testing
tr = TCRrep(cell_df = df,
            organism = 'mouse',
            chains = ['alpha','beta'],
            use_defaults = False,
            compute_distances = False,
            cpus = 1,
            db_file = 'alphabeta_gammadelta_db.tsv')
metrics_a = {
    "cdr3_a_aa" : my_own_metric,
    "pmhc_a_aa" : my_own_metric,
    "cdr2_a_aa" : my_own_metric,
    "cdr1_a_aa" : my_own_metric}
metrics_b = {
    "cdr3_b_aa" : my_own_metric,
    "pmhc_b_aa" : my_own_metric,
    "cdr2_b_aa" : my_own_metric,
    "cdr1_b_aa" : my_own_metric}
weights_a = {
    "cdr3_a_aa" : 1,
    "pmhc_a_aa" : 1,
    "cdr2_a_aa" : 1,
    "cdr1_a_aa" : 1}
weights_b = {
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1}
kargs_a = {
    'cdr3_a_aa' : {'use_numba': False},
    'pmhc_a_aa' : {'use_numba': False},
    'cdr2_a_aa' : {'use_numba': False},
    'cdr1_a_aa' : {'use_numba': False}}
kargs_b = {
    'cdr3_b_aa' : {'use_numba': False},
    'pmhc_b_aa' : {'use_numba': False},
    'cdr2_b_aa' : {'use_numba': False},
    'cdr1_b_aa' : {'use_numba': False}}
tr.metrics_a = metrics_a
tr.metrics_b = metrics_b
tr.weights_a = weights_a
tr.weights_b = weights_b
tr.kargs_a = kargs_a
tr.kargs_b = kargs_b
tr.compute_distances()
tr.pw_cdr3_b_aa
tr.pw_beta
I want tcrdistances, but I hate OOP
"""
If you don't want to use OOP, but you still want multi-CDR
tcrdistances on a single chain, using your own metric:
def my_own_metric(s1,s2):
    return Levenshtein.distance(s1,s2)
"""
import multiprocessing
import pandas as pd
import Levenshtein  # pip install python-Levenshtein
from tcrdist.rep_funcs import _pws, _pw

def my_own_metric(s1, s2):
    return Levenshtein.distance(s1, s2)

df = pd.read_csv("dash2.csv")
metrics_b = {
    "cdr3_b_aa" : my_own_metric,
    "pmhc_b_aa" : my_own_metric,
    "cdr2_b_aa" : my_own_metric,
    "cdr1_b_aa" : my_own_metric}
weights_b = {
    "cdr3_b_aa" : 1,
    "pmhc_b_aa" : 1,
    "cdr2_b_aa" : 1,
    "cdr1_b_aa" : 1}
kargs_b = {
    'cdr3_b_aa' : {'use_numba': False},
    'pmhc_b_aa' : {'use_numba': False},
    'cdr2_b_aa' : {'use_numba': False},
    'cdr1_b_aa' : {'use_numba': False}}
dmats = _pws(df = df,
             metrics = metrics_b,
             weights = weights_b,
             kargs = kargs_b,
             cpu = 1,
             uniquify = True,
             store = True)
print(dmats.keys())
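A brief usage note: with store = True the per-CDR matrices are kept under their column names, and the weighted combination is what the AIRR example further below accesses as dmats['tcrdist'] (check print(dmats.keys()) to confirm the exact keys in your version):
# The combined, weight-summed distance matrix (key assumed to be 'tcrdist',
# matching the AIRR example later in this post)
tcrdist_total = dmats['tcrdist']
print(tcrdist_total.shape)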
CDR3 only
"""
If you hate object-oriented programming and just want the functions,
no problem. Maybe you only care about the CDR3 on the beta chain.
def my_own_metric(s1,s2):
    return Levenshtein.distance(s1,s2)
"""
import multiprocessing
import pandas as pd
import Levenshtein  # pip install python-Levenshtein
from tcrdist.rep_funcs import _pws, _pw

def my_own_metric(s1, s2):
    return Levenshtein.distance(s1, s2)

df = pd.read_csv("dash2.csv")
dmat = _pw(metric = my_own_metric,
           seqs1 = df['cdr3_b_aa'].values,
           ncpus = 2,
           uniqify = True,
           use_numba = False)
I want tcrdistances but I want to keep my variable names
"""
You want a 'tcrdistance' but you don't want to bother with the tcrdist3 framework.
Note that the column names are completely arbitrary under this
framework, so one can directly compute a tcrdist on an
AIRR, MIXCR, VDJtools, or other formatted file without any
reformatting.
"""
import multiprocessing
import pandas as pd
import pwseqdist as pw
from tcrdist.rep_funcs import _pws, _pw
df_airr = pd.read_csv("dash_beta_airr.csv")
# Choose the metrics you want to apply to each CDR
metrics = {'cdr3_aa' : pw.metrics.nb_vector_tcrdist,
           'cdr2_aa' : pw.metrics.nb_vector_tcrdist,
           'cdr1_aa' : pw.metrics.nb_vector_tcrdist}
# Choose the weights that are right for you.
weights = {'cdr3_aa' : 3,
           'cdr2_aa' : 1,
           'cdr1_aa' : 1}
# Provide arguments for the distance metrics
kargs = {'cdr3_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
                      'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 3, 'ctrim': 2, 'fixed_gappos': False},
         'cdr2_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
                      'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 0, 'ctrim': 0, 'fixed_gappos': True},
         'cdr1_aa' : {'use_numba': True, 'distance_matrix': pw.matrices.tcr_nb_distance_matrix,
                      'dist_weight': 1, 'gap_penalty': 4, 'ntrim': 0, 'ctrim': 0, 'fixed_gappos': True}}
# Here are your distance matrices
dmats = _pws(df = df_airr,
             metrics = metrics,
             weights = weights,
             kargs = kargs,
             cpu = 1,
             store = True)
dmats['tcrdist']
I want to use TCRrep but I want to keep my variable names
"""
If you already have a clones file and want
to compute 'tcrdistances' on a DataFrame with
custom column names:
1. assign TCRrep.clone_df
2. set infer_cdrs = False
3. set compute_distances = False
4. set deduplicate = False
5. customize the keys for metrics, weights, and kargs with the lambda
   customize = lambda d : {new_cols[k]:v for k,v in d.items()}
6. call .compute_distances()
"""
import pwseqdist as pw
import pandas as pd
from tcrdist.repertoire import TCRrep
new_cols = {'cdr3_a_aa':'c3a', 'pmhc_a_aa':'pa', 'cdr2_a_aa':'c2a','cdr1_a_aa':'c1a',
'cdr3_b_aa':'c3b', 'pmhc_b_aa':'pb', 'cdr2_b_aa':'c2b','cdr1_b_aa':'c1b'}
df = pd.read_csv("dash2.csv").rename(columns = new_cols)
tr = TCRrep(
    cell_df = df,
    clone_df = df,              #(1)
    organism = 'mouse',
    chains = ['alpha','beta'],
    infer_all_genes = True,
    infer_cdrs = False,         #(2)
    compute_distances = False,  #(3)
    deduplicate = False,        #(4)
    db_file = 'alphabeta_gammadelta_db.tsv')
customize = lambda d : {new_cols[k]:v for k,v in d.items()} #(5)
tr.metrics_a = customize(tr.metrics_a)
tr.metrics_b = customize(tr.metrics_b)
tr.weights_a = customize(tr.weights_a)
tr.weights_b = customize(tr.weights_b)
tr.kargs_a = customize(tr.kargs_a)
tr.kargs_b = customize(tr.kargs_b)
tr.compute_distances() #(6)
# Notice that pairwise results now have custom names
tr.pw_c3b
tr.pw_c3a
tr.pw_alpha
tr.pw_beta
I want distances from 1 TCR to many TCRs
"""
If you just want 'tcrdistances' of some target seqs against another set:
(1) cell_df is assigned the first 10 cells in dash.csv
(2) compute tcrdistances with default settings
(3) compute rectangular distances between clone_df and df2
(4) compute rectangular distances between clone_df and any
    arbitrary df3, which need not be associated with the TCRrep object
(5) compute rectangular distances with only a subset of TCRrep.clone_df
"""
import pandas as pd
from tcrdist.repertoire import TCRrep
df = pd.read_csv("dash.csv")
df2 = pd.read_csv("dash2.csv")
df = df.head(10) #(1)
tr = TCRrep(cell_df = df,  #(2)
            df2 = df2,
            organism = 'mouse',
            chains = ['alpha','beta'],
            db_file = 'alphabeta_gammadelta_db.tsv')
assert tr.pw_alpha.shape == (10,10)
assert tr.pw_beta.shape == (10,10)
tr.compute_rect_distances() # (3)
assert tr.rw_alpha.shape == (10,1924)
assert tr.rw_beta.shape == (10,1924)
df3 = df2.head(100)
tr.compute_rect_distances(df = tr.clone_df, df2 = df3) # (4)
assert tr.rw_alpha.shape == (10,100)
assert tr.rw_beta.shape == (10,100)
tr.compute_rect_distances(df = tr.clone_df.iloc[0:2,],  # (5)
                          df2 = df3)
assert tr.rw_alpha.shape == (2,100)
assert tr.rw_beta.shape == (2,100)
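A usage sketch for the rectangular output (assuming, as the assertions above imply, that rows of tr.rw_beta follow tr.clone_df.iloc[0:2] and columns follow df3): find the closest df3 clone for each query.
import numpy as np

# Index of the nearest df3 clone for each of the two query clones, by beta-chain distance
nearest = tr.rw_beta.argmin(axis=1)
for q, hit in enumerate(nearest):
    print(tr.clone_df['cdr3_b_aa'].iloc[q],
          df3['cdr3_b_aa'].iloc[hit],
          tr.rw_beta[q, hit])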
The degree of customization really is high, and admittedly it is not easy.
Life is good, and even better with you here. In the next post we will continue with more tcrdist3 analysis code.