Quick Start

This Python script shows how to use the LiRTMaTS Python package. The input data and retention time reference files used here are available at https://github.com/wanchanglin/lirtmats/tree/master/examples/data.

Setup

Users need to load the Python package LAMP before using LiRTMaTS. Its functions are used here for loading the data set and summarising the matching results. For details, see https://github.com/wanchanglin/lamp.

[1]:

import sqlite3
import pandas as pd
from lamp import anno
import lirtmats.lirtmats as rtm

Data Loading

LiRTMaTS supports text files separated by comma (,) or tab (\t). Microsoft XLSX files are also supported; use the argument sheet_name to indicate which sheet holds the input data. The default is 0, the first sheet.
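
As a rough illustration of these three input formats, the sketch below picks a pandas reader from the file extension. The helper read_table is hypothetical and is not part of LiRTMaTS or LAMP; it only assumes that .csv means comma-separated, that any other text file is tab-separated, and that sheet_name=0 selects the first sheet of an XLSX file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_table(path, sheet_name=0):
    """Hypothetical helper: choose a pandas reader from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".xlsx":
        # sheet_name=0 selects the first sheet (needs an Excel engine such as openpyxl)
        return pd.read_excel(path, sheet_name=sheet_name)
    # assumption: .csv is comma-separated, everything else is tab-separated
    sep = "," if suffix == ".csv" else "\t"
    return pd.read_csv(path, sep=sep)


# Write a one-row TSV file to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    tsv = Path(tmp) / "peaks.tsv"
    tsv.write_text("name\tmz\trt\nM105T45\t105.042677\t45.353942\n", encoding="utf-8")
    demo = read_table(tsv)

print(demo)
```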

Here we use a small example data set in TSV format. It contains a peak list and an intensity data matrix. LiRTMaTS requires the name, m/z value and retention time of each peak. The user needs to indicate the column positions of the feature name, m/z value and retention time, and the column where the data matrix starts. Here they are 1, 2, 3 and 4, respectively.

[2]:

cols = [1, 2, 3, 4]
data_fn = "./data/df_pos_3.tsv"                 # use tsv file
df = anno.read_peak(data_fn, cols, sep='\t')
df

[2]:

	name	mz	rt	D121	A122	A125	A126	A127	A128	B131	...	E214	E215	E216	H234	H235	H236	H237	H238	H239	H240
0	M102T899	102.034153	898.850160	1.404584e+07	3.689953e+06	3.598363e+06	1.138875e+07	4.887524e+06	2.104782e+06	7.288258e+06	...	3.125203e+06	3.608369e+06	NaN	4.763811e+06	2.281365e+06	NaN	3.404450e+06	3.720441e+06	4.539032e+05	NaN
1	M102T849	102.034154	849.085350	1.473961e+07	NaN	5.934387e+06	NaN	4.607624e+06	5.969186e+06	3.367949e+06	...	1.276006e+07	1.490770e+07	2.880142e+06	4.263577e+06	NaN	NaN	NaN	4.437697e+06	6.777076e+06	6.341930e+06
2	M105T45	105.042677	45.353942	5.520865e+05	1.813279e+05	2.734923e+05	2.342655e+05	6.241395e+04	1.068277e+05	1.192451e+05	...	3.092946e+04	1.788324e+05	1.810794e+05	3.225256e+05	NaN	3.734778e+05	1.935349e+05	NaN	1.094705e+05	1.946732e+05
3	M105T54	105.054961	54.350049	6.669635e+05	4.833251e+06	2.137479e+06	1.552473e+06	1.753294e+06	2.301363e+06	NaN	...	1.186390e+06	3.001167e+06	2.558921e+06	NaN	NaN	1.695460e+06	NaN	1.834140e+06	1.029692e+06	4.382618e+05
4	M105T48_1	105.074216	47.538626	6.310113e+05	NaN	5.199302e+05	4.302566e+05	5.650141e+05	3.635406e+05	1.096530e+06	...	7.882748e+05	NaN	9.822090e+05	4.974403e+05	3.604541e+05	1.340656e+06	NaN	NaN	6.020203e+05	3.597655e+05
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1995	M299T296	299.233645	295.569540	8.125150e+04	1.020165e+05	2.209362e+05	3.557402e+05	6.039153e+05	NaN	2.330915e+05	...	3.872671e+05	1.632064e+05	7.224218e+04	3.678394e+04	9.526812e+04	5.785549e+04	6.183749e+05	NaN	2.915690e+04	NaN
1996	M300T43_1	299.919504	42.832066	5.042924e+04	NaN	NaN	2.222376e+05	3.763288e+05	2.094474e+05	1.163715e+05	...	4.035525e+05	2.032260e+05	2.700920e+05	NaN	2.675647e+05	2.695188e+05	2.750383e+05	2.882957e+05	6.720465e+04	3.352428e+05
1997	M300T62	300.119720	62.428854	NaN	3.914945e+05	5.182468e+05	7.492101e+05	1.546338e+06	5.741346e+05	9.712791e+05	...	8.554399e+05	7.431820e+05	8.878200e+05	NaN	3.625514e+05	4.987110e+05	1.393237e+06	5.217566e+05	NaN	1.257126e+05
1998	M300T285_2	300.124255	285.061758	NaN	4.602130e+05	4.559729e+05	9.718658e+05	3.864969e+05	3.877729e+05	1.315307e+06	...	2.418197e+06	2.917536e+06	9.108396e+05	4.583314e+05	4.022556e+05	2.673259e+05	NaN	NaN	8.926295e+04	2.126753e+04
1999	M300T288	300.181271	287.944377	7.880306e+05	1.738638e+06	1.113482e+06	4.063701e+06	3.788191e+06	1.201084e+06	2.988076e+06	...	2.907005e+06	3.365814e+06	2.761628e+06	1.865813e+06	1.956308e+06	NaN	2.918514e+06	NaN	NaN	NaN

2000 rows × 40 columns

The data frame df now includes only name, mz, rt and the intensity data matrix.
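
To make the meaning of the four column positions concrete, here is a minimal pandas sketch of the selection they describe. It is not the LAMP implementation of read_peak: the function select_peak_columns and the extra column sample_note are made up for illustration, and the positions are assumed to be 1-based, as in the example above.

```python
import pandas as pd

# A raw table with one extra leading column that should be dropped.
raw = pd.DataFrame({
    "sample_note": ["a", "b"],            # hypothetical column, not in the example data
    "name": ["M105T45", "M105T54"],
    "mz": [105.042677, 105.054961],
    "rt": [45.353942, 54.350049],
    "D121": [5.520865e+05, 6.669635e+05],
    "A122": [1.813279e+05, 4.833251e+06],
})


def select_peak_columns(table, cols):
    """Keep name, m/z, rt and every column from the start of the data matrix."""
    name_i, mz_i, rt_i, data_i = (c - 1 for c in cols)   # 1-based -> 0-based
    peaks = table.iloc[:, [name_i, mz_i, rt_i]].copy()
    peaks.columns = ["name", "mz", "rt"]
    intensities = table.iloc[:, data_i:]                  # data matrix runs to the last column
    return pd.concat([peaks, intensities], axis=1)


out = select_peak_columns(raw, [2, 3, 4, 5])
print(out)
```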

Retention Time Matching

To perform retention time matching, users can use either the default retention time library or their own reference file. The reference file must have the column rt_lib, which is matched against the peak retention times within a tolerance in seconds. The column ion_mode should also be provided to indicate positive or negative mode matching. If ion_mode is not included in the reference file, all rows will be used for matching.

[3]:

ion_mode = "pos"
# ref_path = ""  # if empty, use default reference file for matching
ref_path = "./data/rt_lib_202509.tsv"
ref = rtm.read_rt(ref_path, ion_mode=ion_mode)
ref

[3]:

	identifier	metabolite_name	rt_lib	inchikey	ion_mod
0	ACMG_aqC18_POS_0001	MS5029_Isovaleraldehyde	24.6	QPUYECUOLPXSFR-UHFFFAOYSA-N	positive
1	ACMG_aqC18_POS_0002	LO57_Dihydroxyfumaric acid hydrate	27.0	SEKGMJVHSBBHRD-WZHZPDAFSA-M	positive
2	ACMG_aqC18_POS_0003	LO61_Benzoic acid	27.0	DMBUODUULYCPAK-UHFFFAOYSA-N	positive
3	ACMG_aqC18_POS_0004	LO52_Spermine	28.2	XDSPGKDYYRNYJI-IUPFWZBJSA-N	positive
4	ACMG_aqC18_POS_0005	LO21_Spermidine	30.0	HELXLJCILKEWJH-NCGAPWICSA-N	positive
...	...	...	...	...	...
2827	ACMG_aqC18_POS_1412	LIM3312_Cholesterol	659.4	ASOSVCXGWPDUGN-UHFFFAOYSA-N	negative
2828	ACMG_aqC18_POS_1413	LO13_5alpha-Cholestan-3-one	672.6	XQCZBXHVTFVIFE-UHFFFAOYSA-N	negative
2829	ACMG_aqC18_POS_1414	LIM3310_5alpha-Cholest-7-en-3beta-ol	675.0	WLFXSECCHULRRO-UHFFFAOYSA-N	negative
2830	ACMG_aqC18_POS_1415	LO302_5alpha-Cholestanol	681.6	YCIMNLLNPGFGHC-UHFFFAOYSA-N	negative
2831	ACMG_aqC18_POS_1416	LO45_10Z-Nonadecenoic acid	723.6	QIGBRXMKCJKVMJ-UHFFFAOYSA-N	negative

2832 rows × 5 columns
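
If you want to prepare your own reference file, the following sketch builds a small table with the rt_lib and ion_mode columns described above and writes it as tab-separated text. The three rows are copied from the output shown here; the filtering step is only an assumption about what "positive mode matching" means, and it is worth checking the exact column name in your own file (the output above labels the column ion_mod).

```python
import io

import pandas as pd

# Three entries copied from the reference table shown above.
ref_demo = pd.DataFrame({
    "metabolite_name": ["LO61_Benzoic acid", "LO52_Spermine", "LIM3312_Cholesterol"],
    "rt_lib": [27.0, 28.2, 659.4],
    "ion_mode": ["positive", "positive", "negative"],
})

# Write to an in-memory buffer; use a file path such as "my_rt_lib.tsv" in practice.
buf = io.StringIO()
ref_demo.to_csv(buf, sep="\t", index=False)

loaded = pd.read_csv(io.StringIO(buf.getvalue()), sep="\t")
if "rt_lib" not in loaded.columns:
    raise ValueError("reference file needs an rt_lib column")

# Assumed behaviour: keep one ion mode if the column exists, otherwise keep all rows.
if "ion_mode" in loaded.columns:
    pos_only = loaded[loaded["ion_mode"] == "positive"]
else:
    pos_only = loaded

print(pos_only)
```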

rt_tol is a threshold for the retention time matching window. The unit is seconds and the default value is 5.

[4]:

rt_tol = 5
res = rtm.comp_match_rt(df, ref, rt_tol)
res

[4]:

	id	rt	identifier	metabolite_name	rt_lib	inchikey	ion_mod	rt_range
0	M105T45	45.353942	ACMG_aqC18_POS_0280	LO309_Asymmetric dimethylarginine	40.5	ZDLDXNCMJBOYJV-YFKPBYRVSA-N	positive	5
1	M105T45	45.353942	ACMG_aqC18_POS_0281	MS5037_Ribonic acid gamma-lactone	40.5	DAUAQNGYDSHRET-UHFFFAOYSA-N	positive	5
2	M105T45	45.353942	ACMG_aqC18_POS_0282	LO18_L-Dihydroorotic acid	40.5	KCDXJAYRVLXPFO-UHFFFAOYSA-N	positive	5
3	M105T45	45.353942	ACMG_aqC18_POS_0283	LO30_Stachydrine	40.5	ITECRQOOEQWFPE-UHFFFAOYSA-N	positive	5
4	M105T45	45.353942	ACMG_aqC18_POS_0284	LO72_Aminoadipic acid	40.5	JYPHNHPXFNEZBR-UHFFFAOYSA-N	positive	5
...	...	...	...	...	...	...	...	...
150065	M300T288	287.944377	ACMG_aqC18_POS_0942	MS5008_Ethyl crotonate	291.0	OZWKMVRBQXNZKK-UHFFFAOYSA-N	negative	5
150066	M300T288	287.944377	ACMG_aqC18_POS_0943	MS5032_2-Phenyl-1-propanol	291.0	DKYWVDODHFEZIM-UHFFFAOYSA-N	negative	5
150067	M300T288	287.944377	ACMG_aqC18_POS_0944	LO15_Methyl indole-3-acetate	291.0	RTIXKCRFFJGDFG-UHFFFAOYSA-N	negative	5
150068	M300T288	287.944377	ACMG_aqC18_POS_0945	LO03_Cinnamic aldehyde	291.6	FNYLWPVRPXGIIP-UHFFFAOYSA-N	positive	5
150069	M300T288	287.944377	ACMG_aqC18_POS_0945	LO03_Cinnamic aldehyde	291.6	FNYLWPVRPXGIIP-UHFFFAOYSA-N	negative	5

150070 rows × 8 columns
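
The idea of a matching window can be sketched with plain pandas. This is not the code of comp_match_rt; it assumes that a peak matches a library entry when the absolute difference between rt and rt_lib is at most rt_tol, which is consistent with the rows above (for example, 45.353942 against 40.5). The values are copied from the tables in this section.

```python
import pandas as pd

peaks = pd.DataFrame({
    "name": ["M105T45", "M300T288", "M102T899"],
    "rt": [45.353942, 287.944377, 898.850160],
})
lib = pd.DataFrame({
    "metabolite_name": ["LO30_Stachydrine", "LO03_Cinnamic aldehyde", "LO52_Spermine"],
    "rt_lib": [40.5, 291.6, 28.2],
})
rt_tol = 5

# Pair every peak with every library entry (pandas >= 1.2), then keep the
# pairs whose retention times differ by at most rt_tol seconds.
pairs = peaks.merge(lib, how="cross")
hits = pairs[(pairs["rt"] - pairs["rt_lib"]).abs() <= rt_tol].reset_index(drop=True)

print(hits)
```

Because every library entry inside the window is reported, a single peak can produce many rows, which is why the result above is much longer than the peak list.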

Summarize Results

The function comp_summ in package LAMP summarises the retention time matching.

[5]:

sr, mr = anno.comp_summ(df, res)

This function combines the peak table with the retention time matching results and returns the results in two formats. sr holds a single row of results for each peak in the peak table df:

[6]:

sr

[6]:

	name	mz	rt	rt_range	identifier	metabolite_name	rt_lib	inchikey	ion_mod
0	M100T54	100.075925	53.810924	5.0	ACMG_aqC18_POS_0389::ACMG_aqC18_POS_0389::ACMG...	LO488_Maleic acid::LO488_Maleic acid::LO321_L-...	48.9::48.9::49.2::49.2::50.4::50.4::50.4::50.4...	JJVNINGBHGBWJH-UHFFFAOYSA-N::JJVNINGBHGBWJH-UH...	positive::negative::positive::negative::positi...
1	M1015T254	1014.985384	253.626177	5.0	ACMG_aqC18_POS_0782::ACMG_aqC18_POS_0783::ACMG...	LO481_3-Hydroxydecanedioic acid::LIM3308_Suber...	249.00000000000003::249.00000000000003::249.00...	FVWJYYTZTCVBKE-ROUWMTJPSA-N::TVZGACDUOSZQKY-UH...	positive::positive::negative::negative::positi...
2	M101T228	101.060060	228.125403	5.0	ACMG_aqC18_POS_0654::ACMG_aqC18_POS_0654::ACMG...	MS5018_Dimethyl maleate::MS5018_Dimethyl malea...	223.2::223.2::223.8::223.8::223.8::223.8::223....	KIWQWJKWBHZMDT-UHFFFAOYSA-N::KIWQWJKWBHZMDT-UH...	positive::negative::positive::positive::positi...
3	M102T849	102.034154	849.085350	NaN	NaN	NaN	NaN	NaN	NaN
4	M102T899	102.034153	898.850160	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...
1995	M865T700	865.244172	700.365420	NaN	NaN	NaN	NaN	NaN	NaN
1996	M919T647	918.701782	646.988220	5.0	ACMG_aqC18_POS_1407::ACMG_aqC18_POS_1407::ACMG...	LO05_Vitamin K1::LO05_Vitamin K1::LIM3314_Phyl...	642.6::642.6::643.2::643.2::647.1::647.1	ZFDIRQKJPRINOQ-HYXAFXHYSA-N::ZFDIRQKJPRINOQ-HY...	positive::negative::positive::negative::positi...
1997	M925T237_1	924.898294	236.964462	5.0	ACMG_aqC18_POS_0690::ACMG_aqC18_POS_0691::ACMG...	LO306_Syringic acid::LO315_ortho-Hydroxyphenyl...	232.2::232.2::232.2::232.2::232.2::232.2::232....	AFBPFSWMIHJQDM-UHFFFAOYSA-N::OISVCGZHLKNMSJ-UH...	positive::positive::positive::positive::positi...
1998	M933T267	933.410460	266.976471	5.0	ACMG_aqC18_POS_0839::ACMG_aqC18_POS_0840::ACMG...	MS5012_Diethyl malonate::MS5019_Trimethylaceti...	262.2::262.2::262.2::262.2::262.2::262.2::262....	KEVYVLWNCKMXJX-UHFFFAOYSA-N::WTTJVINHCBCLGX-ZD...	positive::positive::positive::negative::negati...
1999	M934T242	933.932365	242.395371	5.0	ACMG_aqC18_POS_0720::ACMG_aqC18_POS_0721::ACMG...	LIM3312_Aspartame::MS5023_Ethyl levulinate::MS...	237.6::237.6::237.6::237.6::237.6::237.6::237....	MBDOYVRWFFCFHM-SNAWJCMRSA-N::XPFVYQJUAUNWIW-UH...	positive::positive::positive::positive::negati...

2000 rows × 9 columns
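
The single-row layout joins all matches of a peak with the separator "::". A minimal sketch of such a collapse is given below; it is an illustration with pandas groupby, not the code of comp_summ, and the helper join_values is hypothetical. The rows are taken from the matches of peak M100T54.

```python
import pandas as pd

matches = pd.DataFrame({
    "name": ["M100T54"] * 3,
    "identifier": ["ACMG_aqC18_POS_0389", "ACMG_aqC18_POS_0390", "ACMG_aqC18_POS_0391"],
    "metabolite_name": ["LO488_Maleic acid", "LO321_L-Theanine", "LO310_Dihydrothymine"],
    "rt_lib": [48.9, 49.2, 50.4],
})


def join_values(values):
    """Join all values of one column into a single '::'-separated string."""
    return "::".join(str(v) for v in values)


# One row per peak name; every other column becomes a joined string.
single = matches.groupby("name").agg(join_values).reset_index()

print(single.loc[0, "rt_lib"])
```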

mr uses a multiple-row format: a peak that matches more than one entry in the reference file gets one row per match:

[7]:

mr

[7]:

	name	mz	rt	identifier	metabolite_name	rt_lib	inchikey	ion_mod	rt_range
0	M100T54	100.075925	53.810924	ACMG_aqC18_POS_0389	LO488_Maleic acid	48.9	JJVNINGBHGBWJH-UHFFFAOYSA-N	positive	5.0
1	M100T54	100.075925	53.810924	ACMG_aqC18_POS_0389	LO488_Maleic acid	48.9	JJVNINGBHGBWJH-UHFFFAOYSA-N	negative	5.0
2	M100T54	100.075925	53.810924	ACMG_aqC18_POS_0390	LO321_L-Theanine	49.2	SULYEHHGGXARJS-UHFFFAOYSA-N	positive	5.0
3	M100T54	100.075925	53.810924	ACMG_aqC18_POS_0390	LO321_L-Theanine	49.2	SULYEHHGGXARJS-UHFFFAOYSA-N	negative	5.0
4	M100T54	100.075925	53.810924	ACMG_aqC18_POS_0391	LO310_Dihydrothymine	50.4	YPTJKHVBDCRKNF-UHFFFAOYSA-N	positive	5.0
...	...	...	...	...	...	...	...	...	...
150218	M934T242	933.932365	242.395371	ACMG_aqC18_POS_0775	LO35_2-Methoxybenzoic acid	247.2	RFKITWRHKUYMRJ-UHFFFAOYSA-N	positive	5.0
150219	M934T242	933.932365	242.395371	ACMG_aqC18_POS_0772	MS5015_Phenylglyoxal	247.2	QWIZNVHXZXRPDR-WSCXOGSTSA-N	negative	5.0
150220	M934T242	933.932365	242.395371	ACMG_aqC18_POS_0773	MS5021_Ethyl 2-methylacetoacetate	247.2	BHTRKEVKTKCXOH-LBSADWJPSA-N	negative	5.0
150221	M934T242	933.932365	242.395371	ACMG_aqC18_POS_0774	LO12_Homoveratrumic acid	247.2	SEBFKMXJBCUCAI-UHFFFAOYSA-N	negative	5.0
150222	M934T242	933.932365	242.395371	ACMG_aqC18_POS_0775	LO35_2-Methoxybenzoic acid	247.2	RFKITWRHKUYMRJ-UHFFFAOYSA-N	negative	5.0

150223 rows × 9 columns
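
The multiple-row format is convenient for further filtering and counting. As one possible use, the sketch below counts how many rows and how many distinct library identifiers each peak has. The small frame mr_demo stands in for mr and contains rows copied from the output above; the column names n_rows and n_identifiers are chosen for this example only.

```python
import pandas as pd

mr_demo = pd.DataFrame({
    "name": ["M100T54"] * 5 + ["M934T242"] * 2,
    "identifier": [
        "ACMG_aqC18_POS_0389", "ACMG_aqC18_POS_0389",
        "ACMG_aqC18_POS_0390", "ACMG_aqC18_POS_0390",
        "ACMG_aqC18_POS_0391",
        "ACMG_aqC18_POS_0775", "ACMG_aqC18_POS_0775",
    ],
    "ion_mod": ["positive", "negative", "positive", "negative", "positive",
                "positive", "negative"],
})

# Rows per peak, and distinct library entries per peak.
summary = mr_demo.groupby("name").agg(
    n_rows=("identifier", "size"),
    n_identifiers=("identifier", "nunique"),
)

print(summary)
```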

All results can be saved into a sqlite3 database and viewed with DB Browser for SQLite. Alternatively, the results can be saved separately in other formats, such as TSV, CSV or XLSX.

[8]:

f_save = False          # here we do NOT save results
db_out = "test.db"
sr_out = "test_s.tsv"
mr_out = "test_m.tsv"
xlsx_out = "test.xlsx"

[9]:

if f_save:
    # save all results into a sqlite3 database
    conn = sqlite3.connect(db_out)
    df[["name", "mz", "rt"]].to_sql("peaklist",
                                    conn,
                                    if_exists="replace",
                                    index=False)
    mr.to_sql("anno_mr", conn, if_exists="replace", index=False)
    sr.to_sql("anno_sr", conn, if_exists="replace", index=False)

    conn.commit()
    conn.close()

    # save results into text files
    sr.to_csv(sr_out, sep="\t", index=False)
    mr.to_csv(mr_out, sep="\t", index=False)

    # save results into Excel format
    with pd.ExcelWriter(xlsx_out, mode="w", engine="openpyxl") as writer:
        sr.to_excel(writer, sheet_name="single-row", index=False)
        mr.to_excel(writer, sheet_name="multiple-row", index=False)

Note that saving an Excel file takes much longer than saving text files.
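
Besides DB Browser for SQLite, the saved tables can be read back with pandas. The sketch below uses an in-memory database so that it runs on its own; in practice you would connect to the saved file (test.db above) and query the tables peaklist, anno_mr or anno_sr. The 4-second threshold in the query is an arbitrary value for illustration.

```python
import sqlite3

import pandas as pd

# Stand-in for the multiple-row results; rows copied from the output above.
mr_demo = pd.DataFrame({
    "name": ["M100T54"] * 3,
    "rt": [53.810924] * 3,
    "metabolite_name": ["LO488_Maleic acid", "LO321_L-Theanine", "LO310_Dihydrothymine"],
    "rt_lib": [48.9, 49.2, 50.4],
})

conn = sqlite3.connect(":memory:")        # use sqlite3.connect("test.db") for a saved file
mr_demo.to_sql("anno_mr", conn, if_exists="replace", index=False)

# Select only matches that are closer than 4 seconds to the library value.
query = """
    SELECT name, metabolite_name, rt, rt_lib
    FROM anno_mr
    WHERE ABS(rt - rt_lib) <= ?
"""
close_hits = pd.read_sql_query(query, conn, params=(4.0,))
conn.close()

print(close_hits)
```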

End User Usages

LiRTMaTS provides two computation options: a command line interface (CLI) and a graphical user interface (GUI).

To use the GUI, open a terminal and type:

$ lirtmats gui

To use the CLI, open a terminal and type the command with the required arguments, for example:

lirtmats cli \
  --input-data "./data/df_pos_3.tsv" \
  --input-sep "tab" \
  --col-idx "1, 2, 3, 4" \
  --rt-path "" \
  --rt-sep "tab" \
  --rt-tol "5.0" \
  --ion-mode "pos" \
  --save-db \
  --summ-type "xlsx"

Running this command produces df_pos_3_rtm.db and df_pos_3_rtm.xlsx in the directory ./data/. If summ-type is tsv or csv, the files df_pos_3_rtm_s.tsv and df_pos_3_rtm_m.tsv (or df_pos_3_rtm_s.csv and df_pos_3_rtm_m.csv) will be saved into ./data/.
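
The same command can also be assembled and launched from Python, which may be handy when several data sets are processed in a loop. The sketch below only uses the options shown in the command above; the function build_cli_args and the RUN switch are hypothetical names for this example, and the command is printed rather than executed unless RUN is set to True.

```python
import shlex
import subprocess

RUN = False   # set to True to actually call lirtmats (it must be installed and on PATH)


def build_cli_args(input_data, ion_mode="pos", rt_tol=5.0, summ_type="xlsx"):
    """Return the argument list for the CLI call shown above."""
    return [
        "lirtmats", "cli",
        "--input-data", input_data,
        "--input-sep", "tab",
        "--col-idx", "1, 2, 3, 4",
        "--rt-path", "",            # empty string, as in the command above
        "--rt-sep", "tab",
        "--rt-tol", str(rt_tol),
        "--ion-mode", ion_mode,
        "--save-db",
        "--summ-type", summ_type,
    ]


args = build_cli_args("./data/df_pos_3.tsv")
print(shlex.join(args))             # shell-quoted command line (Python >= 3.8)

if RUN:
    subprocess.run(args, check=True)
```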

As a best practice, you can create a bash script .sh (Linux and macOS) or a Windows script .bat that contains these CLI arguments. Change the parameters in these files each time you process a new data set.

For example, there are lirtmats_cli.sh and lirtmats_cli.bat in https://github.com/wanchanglin/lirtmats/tree/master/examples.

For Linux and MacOS terminal:

$ chmod +x lirtmats_cli.sh
$ ./lirtmats_cli.sh

For Windows terminal:
```
$ lirtmats_cli.bat
```

Note that when XLSX files are used for the input data or the reference file with the GUI or CLI, all data must be in the first sheet. There is no such requirement if you call LiRTMaTS functions from your own Python scripts.