Quick Start
This Python script shows how to use the LiRTMaTS Python package. The input data and retention time reference files used here are available at https://github.com/wanchanglin/lirtmats/tree/master/examples/data.
Setup
Users need to load the Python package LAMP before using LiRTMaTS. Its functions are used here for loading the data set and summarising the matching results. For details, see https://github.com/wanchanglin/lamp.
[1]:
import sqlite3
import pandas as pd
from lamp import anno
import lirtmats.lirtmats as rtm
Data Loading
LiRTMaTS supports text files separated by comma (,) or tab (\t). Microsoft XLSX files are also supported; use the argument sheet_name to indicate which sheet holds the input data. The default is 0, the first sheet.
Here we use a small example data set in TSV format. This data set includes a peak list and an intensity data matrix. LiRTMaTS requires the name, m/z value and retention time of each peak. The user needs to indicate the column positions of the feature name, m/z value and retention time, and the starting column of the data matrix. Here they are 1, 2, 3 and 4, respectively.
[2]:
cols = [1, 2, 3, 4]
data_fn = "./data/df_pos_3.tsv" # use tsv file
df = anno.read_peak(data_fn, cols, sep='\t')
df
[2]:
| | name | mz | rt | D121 | A122 | A125 | A126 | A127 | A128 | B131 | ... | E214 | E215 | E216 | H234 | H235 | H236 | H237 | H238 | H239 | H240 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M102T899 | 102.034153 | 898.850160 | 1.404584e+07 | 3.689953e+06 | 3.598363e+06 | 1.138875e+07 | 4.887524e+06 | 2.104782e+06 | 7.288258e+06 | ... | 3.125203e+06 | 3.608369e+06 | NaN | 4.763811e+06 | 2.281365e+06 | NaN | 3.404450e+06 | 3.720441e+06 | 4.539032e+05 | NaN |
| 1 | M102T849 | 102.034154 | 849.085350 | 1.473961e+07 | NaN | 5.934387e+06 | NaN | 4.607624e+06 | 5.969186e+06 | 3.367949e+06 | ... | 1.276006e+07 | 1.490770e+07 | 2.880142e+06 | 4.263577e+06 | NaN | NaN | NaN | 4.437697e+06 | 6.777076e+06 | 6.341930e+06 |
| 2 | M105T45 | 105.042677 | 45.353942 | 5.520865e+05 | 1.813279e+05 | 2.734923e+05 | 2.342655e+05 | 6.241395e+04 | 1.068277e+05 | 1.192451e+05 | ... | 3.092946e+04 | 1.788324e+05 | 1.810794e+05 | 3.225256e+05 | NaN | 3.734778e+05 | 1.935349e+05 | NaN | 1.094705e+05 | 1.946732e+05 |
| 3 | M105T54 | 105.054961 | 54.350049 | 6.669635e+05 | 4.833251e+06 | 2.137479e+06 | 1.552473e+06 | 1.753294e+06 | 2.301363e+06 | NaN | ... | 1.186390e+06 | 3.001167e+06 | 2.558921e+06 | NaN | NaN | 1.695460e+06 | NaN | 1.834140e+06 | 1.029692e+06 | 4.382618e+05 |
| 4 | M105T48_1 | 105.074216 | 47.538626 | 6.310113e+05 | NaN | 5.199302e+05 | 4.302566e+05 | 5.650141e+05 | 3.635406e+05 | 1.096530e+06 | ... | 7.882748e+05 | NaN | 9.822090e+05 | 4.974403e+05 | 3.604541e+05 | 1.340656e+06 | NaN | NaN | 6.020203e+05 | 3.597655e+05 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | M299T296 | 299.233645 | 295.569540 | 8.125150e+04 | 1.020165e+05 | 2.209362e+05 | 3.557402e+05 | 6.039153e+05 | NaN | 2.330915e+05 | ... | 3.872671e+05 | 1.632064e+05 | 7.224218e+04 | 3.678394e+04 | 9.526812e+04 | 5.785549e+04 | 6.183749e+05 | NaN | 2.915690e+04 | NaN |
| 1996 | M300T43_1 | 299.919504 | 42.832066 | 5.042924e+04 | NaN | NaN | 2.222376e+05 | 3.763288e+05 | 2.094474e+05 | 1.163715e+05 | ... | 4.035525e+05 | 2.032260e+05 | 2.700920e+05 | NaN | 2.675647e+05 | 2.695188e+05 | 2.750383e+05 | 2.882957e+05 | 6.720465e+04 | 3.352428e+05 |
| 1997 | M300T62 | 300.119720 | 62.428854 | NaN | 3.914945e+05 | 5.182468e+05 | 7.492101e+05 | 1.546338e+06 | 5.741346e+05 | 9.712791e+05 | ... | 8.554399e+05 | 7.431820e+05 | 8.878200e+05 | NaN | 3.625514e+05 | 4.987110e+05 | 1.393237e+06 | 5.217566e+05 | NaN | 1.257126e+05 |
| 1998 | M300T285_2 | 300.124255 | 285.061758 | NaN | 4.602130e+05 | 4.559729e+05 | 9.718658e+05 | 3.864969e+05 | 3.877729e+05 | 1.315307e+06 | ... | 2.418197e+06 | 2.917536e+06 | 9.108396e+05 | 4.583314e+05 | 4.022556e+05 | 2.673259e+05 | NaN | NaN | 8.926295e+04 | 2.126753e+04 |
| 1999 | M300T288 | 300.181271 | 287.944377 | 7.880306e+05 | 1.738638e+06 | 1.113482e+06 | 4.063701e+06 | 3.788191e+06 | 1.201084e+06 | 2.988076e+06 | ... | 2.907005e+06 | 3.365814e+06 | 2.761628e+06 | 1.865813e+06 | 1.956308e+06 | NaN | 2.918514e+06 | NaN | NaN | NaN |
2000 rows × 40 columns
The data frame df now contains only name, mz, rt and the intensity data matrix.
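To make the meaning of the column positions concrete, here is a minimal sketch in plain pandas. It is not LAMP's implementation of read_peak; it only assumes, from the description above, that the positions are 1-based. The toy table and the names raw, table and peaks are made up for illustration.

```python
import io
import pandas as pd

# Toy tab-separated peak table (made-up values), laid out like the example:
# column 1 = name, 2 = m/z, 3 = retention time, 4 onwards = intensities.
raw = (
    "name\tmz\trt\tS1\tS2\n"
    "M102T899\t102.034153\t898.85\t1.4e7\t3.7e6\n"
    "M105T45\t105.042677\t45.35\t5.5e5\t\n"
)
table = pd.read_csv(io.StringIO(raw), sep="\t")

cols = [1, 2, 3, 4]  # 1-based positions, as in the text (assumption)
name_i, mz_i, rt_i, start_i = (c - 1 for c in cols)  # convert to 0-based

# Keep the three annotation columns, then everything from the matrix start.
keep = [name_i, mz_i, rt_i] + list(range(start_i, table.shape[1]))
peaks = table.iloc[:, keep].copy()
peaks.columns = ["name", "mz", "rt"] + list(table.columns[start_i:])
print(peaks)
```

Missing intensities (the empty last field in the second row) are read as NaN, as in the output shown above.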
Retention Time Matching
To perform retention time matching, users can use either the default retention time library or their own reference file. The reference file must have one column, rt_lib, which is matched against the peak retention times within a tolerance given in seconds. The column ion_mode should also be included to indicate positive or negative mode matching. If ion_mode is not included in the reference file, all rows will be used for matching.
[3]:
ion_mode = "pos"
# ref_path = "" # if empty, use default reference file for matching
ref_path = "./data/rt_lib_202509.tsv"
ref = rtm.read_rt(ref_path, ion_mode=ion_mode)
ref
[3]:
| | identifier | metabolite_name | rt_lib | inchikey | ion_mod |
|---|---|---|---|---|---|
| 0 | ACMG_aqC18_POS_0001 | MS5029_Isovaleraldehyde | 24.6 | QPUYECUOLPXSFR-UHFFFAOYSA-N | positive |
| 1 | ACMG_aqC18_POS_0002 | LO57_Dihydroxyfumaric acid hydrate | 27.0 | SEKGMJVHSBBHRD-WZHZPDAFSA-M | positive |
| 2 | ACMG_aqC18_POS_0003 | LO61_Benzoic acid | 27.0 | DMBUODUULYCPAK-UHFFFAOYSA-N | positive |
| 3 | ACMG_aqC18_POS_0004 | LO52_Spermine | 28.2 | XDSPGKDYYRNYJI-IUPFWZBJSA-N | positive |
| 4 | ACMG_aqC18_POS_0005 | LO21_Spermidine | 30.0 | HELXLJCILKEWJH-NCGAPWICSA-N | positive |
| ... | ... | ... | ... | ... | ... |
| 2827 | ACMG_aqC18_POS_1412 | LIM3312_Cholesterol | 659.4 | ASOSVCXGWPDUGN-UHFFFAOYSA-N | negative |
| 2828 | ACMG_aqC18_POS_1413 | LO13_5alpha-Cholestan-3-one | 672.6 | XQCZBXHVTFVIFE-UHFFFAOYSA-N | negative |
| 2829 | ACMG_aqC18_POS_1414 | LIM3310_5alpha-Cholest-7-en-3beta-ol | 675.0 | WLFXSECCHULRRO-UHFFFAOYSA-N | negative |
| 2830 | ACMG_aqC18_POS_1415 | LO302_5alpha-Cholestanol | 681.6 | YCIMNLLNPGFGHC-UHFFFAOYSA-N | negative |
| 2831 | ACMG_aqC18_POS_1416 | LO45_10Z-Nonadecenoic acid | 723.6 | QIGBRXMKCJKVMJ-UHFFFAOYSA-N | negative |
2832 rows × 5 columns
rt_tol is a threshold for the retention time matching window. The unit is seconds and the default value is 5.
[4]:
rt_tol = 5
res = rtm.comp_match_rt(df, ref, rt_tol)
res
[4]:
| | id | rt | identifier | metabolite_name | rt_lib | inchikey | ion_mod | rt_range |
|---|---|---|---|---|---|---|---|---|
| 0 | M105T45 | 45.353942 | ACMG_aqC18_POS_0280 | LO309_Asymmetric dimethylarginine | 40.5 | ZDLDXNCMJBOYJV-YFKPBYRVSA-N | positive | 5 |
| 1 | M105T45 | 45.353942 | ACMG_aqC18_POS_0281 | MS5037_Ribonic acid gamma-lactone | 40.5 | DAUAQNGYDSHRET-UHFFFAOYSA-N | positive | 5 |
| 2 | M105T45 | 45.353942 | ACMG_aqC18_POS_0282 | LO18_L-Dihydroorotic acid | 40.5 | KCDXJAYRVLXPFO-UHFFFAOYSA-N | positive | 5 |
| 3 | M105T45 | 45.353942 | ACMG_aqC18_POS_0283 | LO30_Stachydrine | 40.5 | ITECRQOOEQWFPE-UHFFFAOYSA-N | positive | 5 |
| 4 | M105T45 | 45.353942 | ACMG_aqC18_POS_0284 | LO72_Aminoadipic acid | 40.5 | JYPHNHPXFNEZBR-UHFFFAOYSA-N | positive | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 150065 | M300T288 | 287.944377 | ACMG_aqC18_POS_0942 | MS5008_Ethyl crotonate | 291.0 | OZWKMVRBQXNZKK-UHFFFAOYSA-N | negative | 5 |
| 150066 | M300T288 | 287.944377 | ACMG_aqC18_POS_0943 | MS5032_2-Phenyl-1-propanol | 291.0 | DKYWVDODHFEZIM-UHFFFAOYSA-N | negative | 5 |
| 150067 | M300T288 | 287.944377 | ACMG_aqC18_POS_0944 | LO15_Methyl indole-3-acetate | 291.0 | RTIXKCRFFJGDFG-UHFFFAOYSA-N | negative | 5 |
| 150068 | M300T288 | 287.944377 | ACMG_aqC18_POS_0945 | LO03_Cinnamic aldehyde | 291.6 | FNYLWPVRPXGIIP-UHFFFAOYSA-N | positive | 5 |
| 150069 | M300T288 | 287.944377 | ACMG_aqC18_POS_0945 | LO03_Cinnamic aldehyde | 291.6 | FNYLWPVRPXGIIP-UHFFFAOYSA-N | negative | 5 |
150070 rows × 8 columns
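The idea behind tolerance-based matching can be sketched with plain pandas. This is an illustration only, not the code of comp_match_rt: it assumes a peak matches a reference entry when the absolute difference between rt and rt_lib is at most rt_tol (whether the library treats the boundary as inclusive is not stated here). The toy frames and values are made up.

```python
import pandas as pd

# Toy peak list and reference library (made-up values).
peaks = pd.DataFrame({"id": ["P1", "P2", "P3"], "rt": [45.4, 288.0, 900.0]})
lib = pd.DataFrame(
    {"metabolite_name": ["A", "B", "C"], "rt_lib": [40.5, 291.6, 51.0]}
)
rt_tol = 5  # seconds

# Pair every peak with every reference entry, then keep pairs whose
# retention times differ by no more than the tolerance.
pairs = peaks.merge(lib, how="cross")
within = (pairs["rt"] - pairs["rt_lib"]).abs() <= rt_tol
hits = pairs[within].reset_index(drop=True)
hits["rt_range"] = rt_tol
print(hits)
```

Because only the retention time is compared, one peak can match many reference entries, which is why the result above has far more rows than the peak list.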
Summarize Results
The function comp_summ in package LAMP summarises the retention time matching.
[5]:
sr, mr = anno.comp_summ(df, res)
This function combines the peak table with the retention time matching results and returns two results in different formats. sr is the single-row format, with one row for each peak id in the peak table df:
[6]:
sr
[6]:
| | name | mz | rt | rt_range | identifier | metabolite_name | rt_lib | inchikey | ion_mod |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M100T54 | 100.075925 | 53.810924 | 5.0 | ACMG_aqC18_POS_0389::ACMG_aqC18_POS_0389::ACMG... | LO488_Maleic acid::LO488_Maleic acid::LO321_L-... | 48.9::48.9::49.2::49.2::50.4::50.4::50.4::50.4... | JJVNINGBHGBWJH-UHFFFAOYSA-N::JJVNINGBHGBWJH-UH... | positive::negative::positive::negative::positi... |
| 1 | M1015T254 | 1014.985384 | 253.626177 | 5.0 | ACMG_aqC18_POS_0782::ACMG_aqC18_POS_0783::ACMG... | LO481_3-Hydroxydecanedioic acid::LIM3308_Suber... | 249.00000000000003::249.00000000000003::249.00... | FVWJYYTZTCVBKE-ROUWMTJPSA-N::TVZGACDUOSZQKY-UH... | positive::positive::negative::negative::positi... |
| 2 | M101T228 | 101.060060 | 228.125403 | 5.0 | ACMG_aqC18_POS_0654::ACMG_aqC18_POS_0654::ACMG... | MS5018_Dimethyl maleate::MS5018_Dimethyl malea... | 223.2::223.2::223.8::223.8::223.8::223.8::223.... | KIWQWJKWBHZMDT-UHFFFAOYSA-N::KIWQWJKWBHZMDT-UH... | positive::negative::positive::positive::positi... |
| 3 | M102T849 | 102.034154 | 849.085350 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | M102T899 | 102.034153 | 898.850160 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | M865T700 | 865.244172 | 700.365420 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1996 | M919T647 | 918.701782 | 646.988220 | 5.0 | ACMG_aqC18_POS_1407::ACMG_aqC18_POS_1407::ACMG... | LO05_Vitamin K1::LO05_Vitamin K1::LIM3314_Phyl... | 642.6::642.6::643.2::643.2::647.1::647.1 | ZFDIRQKJPRINOQ-HYXAFXHYSA-N::ZFDIRQKJPRINOQ-HY... | positive::negative::positive::negative::positi... |
| 1997 | M925T237_1 | 924.898294 | 236.964462 | 5.0 | ACMG_aqC18_POS_0690::ACMG_aqC18_POS_0691::ACMG... | LO306_Syringic acid::LO315_ortho-Hydroxyphenyl... | 232.2::232.2::232.2::232.2::232.2::232.2::232.... | AFBPFSWMIHJQDM-UHFFFAOYSA-N::OISVCGZHLKNMSJ-UH... | positive::positive::positive::positive::positi... |
| 1998 | M933T267 | 933.410460 | 266.976471 | 5.0 | ACMG_aqC18_POS_0839::ACMG_aqC18_POS_0840::ACMG... | MS5012_Diethyl malonate::MS5019_Trimethylaceti... | 262.2::262.2::262.2::262.2::262.2::262.2::262.... | KEVYVLWNCKMXJX-UHFFFAOYSA-N::WTTJVINHCBCLGX-ZD... | positive::positive::positive::negative::negati... |
| 1999 | M934T242 | 933.932365 | 242.395371 | 5.0 | ACMG_aqC18_POS_0720::ACMG_aqC18_POS_0721::ACMG... | LIM3312_Aspartame::MS5023_Ethyl levulinate::MS... | 237.6::237.6::237.6::237.6::237.6::237.6::237.... | MBDOYVRWFFCFHM-SNAWJCMRSA-N::XPFVYQJUAUNWIW-UH... | positive::positive::positive::positive::negati... |
2000 rows × 9 columns
mr is the multiple-row format, with one row per match when a peak matches more than one entry in the reference file:
[7]:
mr
[7]:
| | name | mz | rt | identifier | metabolite_name | rt_lib | inchikey | ion_mod | rt_range |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M100T54 | 100.075925 | 53.810924 | ACMG_aqC18_POS_0389 | LO488_Maleic acid | 48.9 | JJVNINGBHGBWJH-UHFFFAOYSA-N | positive | 5.0 |
| 1 | M100T54 | 100.075925 | 53.810924 | ACMG_aqC18_POS_0389 | LO488_Maleic acid | 48.9 | JJVNINGBHGBWJH-UHFFFAOYSA-N | negative | 5.0 |
| 2 | M100T54 | 100.075925 | 53.810924 | ACMG_aqC18_POS_0390 | LO321_L-Theanine | 49.2 | SULYEHHGGXARJS-UHFFFAOYSA-N | positive | 5.0 |
| 3 | M100T54 | 100.075925 | 53.810924 | ACMG_aqC18_POS_0390 | LO321_L-Theanine | 49.2 | SULYEHHGGXARJS-UHFFFAOYSA-N | negative | 5.0 |
| 4 | M100T54 | 100.075925 | 53.810924 | ACMG_aqC18_POS_0391 | LO310_Dihydrothymine | 50.4 | YPTJKHVBDCRKNF-UHFFFAOYSA-N | positive | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 150218 | M934T242 | 933.932365 | 242.395371 | ACMG_aqC18_POS_0775 | LO35_2-Methoxybenzoic acid | 247.2 | RFKITWRHKUYMRJ-UHFFFAOYSA-N | positive | 5.0 |
| 150219 | M934T242 | 933.932365 | 242.395371 | ACMG_aqC18_POS_0772 | MS5015_Phenylglyoxal | 247.2 | QWIZNVHXZXRPDR-WSCXOGSTSA-N | negative | 5.0 |
| 150220 | M934T242 | 933.932365 | 242.395371 | ACMG_aqC18_POS_0773 | MS5021_Ethyl 2-methylacetoacetate | 247.2 | BHTRKEVKTKCXOH-LBSADWJPSA-N | negative | 5.0 |
| 150221 | M934T242 | 933.932365 | 242.395371 | ACMG_aqC18_POS_0774 | LO12_Homoveratrumic acid | 247.2 | SEBFKMXJBCUCAI-UHFFFAOYSA-N | negative | 5.0 |
| 150222 | M934T242 | 933.932365 | 242.395371 | ACMG_aqC18_POS_0775 | LO35_2-Methoxybenzoic acid | 247.2 | RFKITWRHKUYMRJ-UHFFFAOYSA-N | negative | 5.0 |
150223 rows × 9 columns
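The relationship between the two formats can be illustrated with a small sketch: collapsing a multiple-row table into one row per peak by joining the matched values with "::", the separator visible in the sr output above. This is an assumption about how the two tables relate, not the implementation of comp_summ; the toy data and the helper join_vals are hypothetical.

```python
import pandas as pd

# Toy multiple-row table: P1 has two matches, P2 has none (made-up values).
mr_toy = pd.DataFrame(
    {
        "name": ["P1", "P1", "P2"],
        "rt": [53.8, 53.8, 849.1],
        "metabolite_name": ["A", "B", None],
        "rt_lib": [48.9, 49.2, None],
    }
)

def join_vals(s):
    """Join the non-missing values of a group with '::', or return NaN."""
    vals = s.dropna().astype(str)
    return "::".join(vals) if len(vals) else float("nan")

# One row per peak; matched values are concatenated into a single string.
sr_toy = mr_toy.groupby(["name", "rt"], as_index=False).agg(
    metabolite_name=("metabolite_name", join_vals),
    rt_lib=("rt_lib", join_vals),
)
print(sr_toy)
```

The single-row format is convenient for merging with the peak table, while the multiple-row format is easier to filter and sort by individual matches.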
All results can be saved into a sqlite3 database and viewed with DB Browser for SQLite. Alternatively, the results can be saved separately in other formats, such as TSV, CSV or XLSX.
[8]:
f_save = False # here we do NOT save results
db_out = "test.db"
sr_out = "test_s.tsv"
mr_out = "test_m.tsv"
xlsx_out = "test.xlsx"
[9]:
if f_save:
    # save all results into a sqlite3 database
    conn = sqlite3.connect(db_out)
    df[["name", "mz", "rt"]].to_sql("peaklist",
                                    conn,
                                    if_exists="replace",
                                    index=False)
    mr.to_sql("anno_mr", conn, if_exists="replace", index=False)
    sr.to_sql("anno_sr", conn, if_exists="replace", index=False)
    conn.commit()
    conn.close()
    # save results into text files
    sr.to_csv(sr_out, sep="\t", index=False)
    mr.to_csv(mr_out, sep="\t", index=False)
    # save results into Excel format
    with pd.ExcelWriter(xlsx_out, mode="w", engine="openpyxl") as writer:
        sr.to_excel(writer, sheet_name="single-row", index=False)
        mr.to_excel(writer, sheet_name="multiple-row", index=False)
Note that saving an Excel file takes much longer than saving text files.
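Besides DB Browser for SQLite, a saved database can also be read back with pandas. The sketch below uses an in-memory database and a made-up table so that it runs on its own; with a real file you would connect to the path used for db_out instead. The table name anno_sr follows the saving code above.

```python
import sqlite3
import pandas as pd

# Toy single-row results (made-up values), standing in for sr.
sr_toy = pd.DataFrame(
    {"name": ["P1", "P2"], "metabolite_name": ["A::B", None]}
)

conn = sqlite3.connect(":memory:")  # use the database file path for real data
try:
    sr_toy.to_sql("anno_sr", conn, if_exists="replace", index=False)

    # List the tables stored in the database.
    tables = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type='table'", conn
    )
    # Load only the peaks that have at least one match.
    matched = pd.read_sql_query(
        "SELECT * FROM anno_sr WHERE metabolite_name IS NOT NULL", conn
    )
finally:
    conn.close()

print(tables)
print(matched)
```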
End User Usages
LiRTMaTS provides two options for end users: a command line interface (CLI) and a graphical user interface (GUI).
To use the GUI, open a terminal and type:
$ lirtmats gui
To use the CLI, open a terminal and type the command with the required arguments, for example:
lirtmats cli \
--input-data "./data/df_pos_3.tsv" \
--input-sep "tab" \
--col-idx "1, 2, 3, 4" \
--rt-path "" \
--rt-sep "tab" \
--rt-tol "5.0" \
--ion-mode "pos" \
--save-db \
--summ-type "xlsx"
Running this command produces df_pos_3_rtm.db and df_pos_3_rtm.xlsx in the directory ./data/. If summ-type is tsv or csv, the files df_pos_3_rtm_s.tsv and df_pos_3_rtm_m.tsv (or df_pos_3_rtm_s.csv and df_pos_3_rtm_m.csv) are saved into ./data/.
As a best practice, you can put these CLI arguments in a bash script .sh (Linux and macOS) or a Windows script .bat. Change the parameters in these files each time you process a new data set.
For example, there are lirtmats_cli.sh and lirtmats_cli.bat in https://github.com/wanchanglin/lirtmats/tree/master/examples.
For Linux and MacOS terminal:
$ chmod +x lirtmats_cli.sh
$ ./lirtmats_cli.sh
For Windows terminal:
$ lirtmats_cli.bat
Note that when xlsx files are used for the input data and the reference file with the GUI or CLI, all data must be in the first sheet. If you use LiRTMaTS functions in your own Python scripts, there is no such requirement.
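As an alternative to a shell or batch script, the same CLI arguments can be assembled in Python. The sketch below only builds and prints the command; it does not run LiRTMaTS. The helper build_cli is hypothetical, and the flags and values are copied from the CLI example above.

```python
import shlex

def build_cli(input_data, ion_mode="pos", rt_tol=5.0, summ_type="xlsx"):
    """Return the lirtmats CLI call as a list of arguments (hypothetical helper)."""
    return [
        "lirtmats", "cli",
        "--input-data", input_data,
        "--input-sep", "tab",
        "--col-idx", "1, 2, 3, 4",
        "--rt-path", "",  # empty, as in the CLI example above
        "--rt-sep", "tab",
        "--rt-tol", str(rt_tol),
        "--ion-mode", ion_mode,
        "--save-db",
        "--summ-type", summ_type,
    ]

cmd = build_cli("./data/df_pos_3.tsv")
print(shlex.join(cmd))  # shell-quoted form, for inspection or logging
```

With LiRTMaTS installed, the list cmd could be passed to subprocess.run(cmd, check=True) to execute it; only the arguments of build_cli need to change for a new data set.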