Using the ChEMBL ORM#

In this notebook there are some code examples illustrating how to use the object relational mapper to query the ChEMBL database. You do not have to use this, some people may choose to query the database directly. However, it is useful if you want your queries to be largly agnostic of the database backend.

We will use the SQLite ChEMBL distribution that can be downloaded from ChEMBL downloads. This is a tar.gz, after downloaded, do tar -xzvf <tar file> and in the extracted directory there should be a database file names chembl_<version>.db. This is the SQLite database file and contains a full copy of ChEMBL. There are other database distributions on the download site, this code should work with those as well, the only thing you will have to change is the connection URL in the call to create_engine.

[1]:
from chembl_orm import orm as o
from sqlalchemy import create_engine, and_
from sqlalchemy.orm import sessionmaker

Get the database and connect#

First we will define the location of the database, so we can open it using SQLAlchemy, your location will be different.

[2]:
db_path = "/data/chembl/chembl_33.db"

Now we will use SQLAlchemy to open the database, see the SQLAlchemy docs for details. At present the chembl-orm package uses SQLAlchemy 1.3 but will be updated in future.

[3]:
engine = create_engine(f"sqlite:///{db_path}")

# create a configured "Session" class
Session = sessionmaker(bind=engine)

# create a Session
session = Session()

Some basic queries#

Now you are good to go. With the SQLAlchemy ORM, you can issue queries joining tables, or use it similar to an API, traversing relationships between objects.

You can also perform simple count operations on query objects

[4]:
# Count the number of molecules in the database, the real version has over 1M concepts
session.query(o.MoleculeDictionary).count()
[4]:
2399743
[5]:
# Count the number of targets
session.query(o.TargetDictionary).count()
[5]:
15398

Have a look at some descriptions, this will return target dictionary objects

[6]:
# using limit to reduct the number of results to 2
q = session.query(o.TargetDictionary).limit(2)

for row in q:
    print(row)
<TargetDictionary(tid=1, target_type=SINGLE PROTEIN, pref_name=Maltase-glucoamylase, tax_id=9606, organism=Homo sapiens, chembl_id=CHEMBL2074, species_group_flag=False)>
<TargetDictionary(tid=2, target_type=SINGLE PROTEIN, pref_name=Sulfonylurea receptor 2, tax_id=9606, organism=Homo sapiens, chembl_id=CHEMBL1971, species_group_flag=False)>

You do not have to return all of the columns, i.e. a whole object representing the table. You can ask for specific column names.

[7]:
for row in session.query(o.TargetDictionary.chembl_id, o.TargetDictionary.pref_name).limit(2):
    print(row.chembl_id, row.pref_name)
CHEMBL2074 Maltase-glucoamylase
CHEMBL1971 Sulfonylurea receptor 2

You can add filters, which are like the where clause in an SQL query. Here we are limiting the targets to human targets AND protein complexes. This also demonstrates how to format the query code as the queries get longer.

[8]:
sql = session.query(
    o.TargetDictionary.chembl_id,
    o.TargetDictionary.pref_name,
    o.TargetDictionary.target_type,
    o.TargetDictionary.organism
).filter(
    and_(
        o.TargetDictionary.target_type == 'PROTEIN COMPLEX',
        o.TargetDictionary.organism == 'Homo sapiens',
    )
).limit(2)

# We will store the ChEMBL IDs and use them below
chembl_ids = []
for row in sql:
    chembl_ids.append(row.chembl_id)
    print(row)
('CHEMBL612409', 'Anti-estrogen binding site (AEBS)', 'PROTEIN COMPLEX', 'Homo sapiens')
('CHEMBL1907588', 'Acetylcholine receptor; alpha1/beta1/delta/gamma', 'PROTEIN COMPLEX', 'Homo sapiens')

We can add one or more joins, here we are looking at the individual componenets of our targets. In this case they are individual proteins and we have got the SwissProt identifier for them. Obviously with this you need to know the database schema, to know which columns to join.

[9]:
sql = session.query(
    o.TargetDictionary.chembl_id,
    o.TargetDictionary.pref_name,
    o.TargetDictionary.target_type,
    o.TargetDictionary.organism,
    o.ComponentSequences.db_source,
    o.ComponentSequences.accession
).join(
    o.TargetComponents,
    o.TargetComponents.tid == o.TargetDictionary.tid
).join(
    o.ComponentSequences,
    o.ComponentSequences.component_id == o.TargetComponents.component_id
).filter(o.TargetDictionary.chembl_id.in_(chembl_ids))

# This will be used for the demonstration query below
for row in sql:
    print(row)
('CHEMBL1907588', 'Acetylcholine receptor; alpha1/beta1/delta/gamma', 'PROTEIN COMPLEX', 'Homo sapiens', 'SWISS-PROT', 'P02708')
('CHEMBL1907588', 'Acetylcholine receptor; alpha1/beta1/delta/gamma', 'PROTEIN COMPLEX', 'Homo sapiens', 'SWISS-PROT', 'P07510')
('CHEMBL1907588', 'Acetylcholine receptor; alpha1/beta1/delta/gamma', 'PROTEIN COMPLEX', 'Homo sapiens', 'SWISS-PROT', 'P11230')
('CHEMBL1907588', 'Acetylcholine receptor; alpha1/beta1/delta/gamma', 'PROTEIN COMPLEX', 'Homo sapiens', 'SWISS-PROT', 'Q07001')
('CHEMBL612409', 'Anti-estrogen binding site (AEBS)', 'PROTEIN COMPLEX', 'Homo sapiens', 'SWISS-PROT', 'Q15125')
('CHEMBL612409', 'Anti-estrogen binding site (AEBS)', 'PROTEIN COMPLEX', 'Homo sapiens', 'SWISS-PROT', 'Q9UBM7')

We can also use the objects themselves to do a similar thing. So the ORM classes act in a similar way to a dedicated API. Here we are traversing to the relationships for the specific ChEMBL Ids that we stored earlier and we are gathering the same information. Here we do not need to know as much about the schema, as we are relying on the pre-defined relationships within the ORM.

[10]:
sql = session.query(
    o.TargetDictionary
).filter(
    o.TargetDictionary.chembl_id.in_(chembl_ids)
)

for row in sql:
    print(f"=== ChEMBL ID: {row.chembl_id} ===")
    for comp in row.target_components:
        print(f" * {comp.component_sequences.db_source} > {comp.component_sequences.accession}")

=== ChEMBL ID: CHEMBL1907588 ===
 * SWISS-PROT > P02708
 * SWISS-PROT > P07510
 * SWISS-PROT > P11230
 * SWISS-PROT > Q07001
=== ChEMBL ID: CHEMBL612409 ===
 * SWISS-PROT > Q15125
 * SWISS-PROT > Q9UBM7

Summary#

So this is just to give you an idea of how to connect and issue basic queries ising the ORM classes defined in the chembl-orm package. When I get chance, I will add some more informative queries in here.