Probing the published literature for previously-unknown relationships of a nominated drug is a formidable task. By definition, such relationships are neither obvious nor discernible through simple observation. While the published literature is an invaluable source of data, several barriers make it an inhospitable environment. In contrast to the data stored in structured databases, natural language text within the published literature is unstructured, amorphous and difficult to mine using traditional algorithms. In practice, the greatest obstacle is missing data within a curated dataset that is small (thousands of data items) by comparison with the much larger datasets (millions of data items) mined in commercial operations such as the analysis of high street supermarket transactions. Signals from small datasets may be weak, rendering previously-unknown relationships difficult to detect. CEME overcomes the problem of missing data within small datasets by adapting a statistical method known as ‘antecedent surrogate variables’. This technique is discussed in an earlier News item; however, its use is restricted to a single therapeutic category such as postoperative pain, insomnia or asthma.

Another method incorporated within the CEME algorithm is akin to “reverse engineering”. This method, while time-consuming, has the advantage that it is not restricted to searching within a single therapeutic category. Notably, “reverse engineering” enhances the detection, within small curated datasets, of previously-unknown relationships that would otherwise remain undetected.

If the nominated drug targets a signalling entity such as an enzyme, ion channel or receptor, then CEME can follow the trail of downstream mediators back upstream to the signalling entity targeted by the nominated drug. This method is described in brief below, and was recently used by McCormack Pharma to show that the cardioprotective drug dexrazoxane targets the nucleic acid poly(ADP-ribose), which is synthesised by the protein poly(ADP-ribose) polymerase (read the published report).

Discovering the trail by reverse engineering

Consider the case where the drug under analysis interacts with a signalling entity that represents a previously-unknown target for this drug; this interaction results in measurable changes in the levels and/or activity of specific downstream mediators. Most importantly, these changes are likely to be inconsistent with prevailing dogma on the drug’s mechanism of action. Accordingly, we must consider the possibility that current teachings are flawed or untenable, and/or that a mechanism of action exists about which we know nothing. In order to identify this target, the initial task is to list all reported mediators of signal transduction from each publication in which the drug under analysis is a key word.

The working hypothesis of reverse engineering is that modulation of signalling at an upstream target by the drug under scrutiny induces adaptive reprogramming of signalling circuitry and architecture that becomes manifest as changes in levels and/or activity of specific downstream mediators. Signalling networks are context-specific. Consequently, within a similar physiological context, but without the drug under scrutiny, comparable changes in levels and/or activity of the same downstream mediators are the result of modulation, by either an endogenous or exogenous agent, of signalling at an upstream target that is also a target of the drug under scrutiny.

Three core tenets characterise this working hypothesis. The first tenet states that in a variety of cell types, and within different cytoplasmic and nuclear compartments, modulation of signalling at an upstream target (“signalling entity”) results in a diverse cascade of effects through the regulation of an array of numerous downstream mediators. Importantly, regulation of specific downstream mediators is critically context-dependent. On this basis, for an individual signalling entity, while several combinations of specific downstream mediators will be identified from different publications, within an identical or a similar physiological context, the trail upstream will lead to the same signalling entity. The second tenet asserts that in some contexts, redundancy in function may exist for some mediators within a combination. Consequently, the third tenet dictates that for a group of mediators within an identical or a similar physiological context, combinations of mediators from this group, devoid of redundant contributors, provide an upstream trail with a greater probability of locating the same signalling entity, thereby enhancing sensitivity in detection.

In applying the above reverse engineering process, the first step is to perform an All Fields PubMed search in which the nominated drug is a keyword.

From the output of this search, publications are selected in which transduction pathways consisting of two or more mediators are reported within the text. For the purposes of illustration, mediators here are shown as letters of the alphabet (a, b, c, d, e, f, ...) where, for example, a = nitric oxide, b = nuclear factor kappa B, c = protein kinase B, d = phosphoinositide 3-kinase, e = glycogen synthase kinase 3beta, f = mammalian target of rapamycin, and so on. From each publication, shown here as One, Two, Three, Four, ..., all mediators are listed in no special order. Thus, as illustrated below, publication One contains five mediators, coded here as abxjp. These publications containing the drug under analysis are described herein as “parent publications”.

Parent One: abxjp

Parent Two: ajpilbx

Parent Three: mxq

Parent Four: bypxj

Parent Five: styu

etc

For each publication where the number of downstream mediators is two or more (n ≥ 2), CEME calculates the number of combinations (C) of n mediators taken r at a time using nCr = n!/(r!(n-r)!)

For example:

Parent publication One: There are five mediators (n = 5) abxjp

Using nCr = n!/(r!(n-r)!) and arranging the outputs in groups of four mediators (r = 4; five combinations), three mediators (r = 3; ten combinations) and two mediators (r = 2; ten combinations) we have:

bxjp axjp abjp abxp abxj (r = 4)

abx abj abp axj axp ajp bxj bxp bjp xjp (r = 3)

ab ax aj ap bx bj bp xj xp jp (r = 2)

Total number of combinations, derived from parent publication One: Twenty-five
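The enumeration above can be reproduced with a short script. This is a minimal sketch rather than CEME's own code: the function name sub_combinations and its min_size parameter are hypothetical, and the letters are the placeholder mediator codes used in this illustration. Note that the order within the r = 4 group differs from the listing above, although the set of combinations is the same.

```python
from itertools import combinations
from math import comb


def sub_combinations(mediators, min_size=2):
    """Group the combinations of a parent's mediators by size r.

    Covers r = n - 1 down to min_size; the full parent combination
    (r = n) is not included. Names here are illustrative assumptions.
    """
    n = len(mediators)
    return {
        r: ["".join(c) for c in combinations(mediators, r)]
        for r in range(n - 1, min_size - 1, -1)
    }


parent_one = "abxjp"  # placeholder codes for five mediators
groups = sub_combinations(parent_one)

for r, combos in groups.items():
    # Cross-check the enumeration against nCr = n!/(r!(n-r)!)
    assert len(combos) == comb(len(parent_one), r)
    print(f"r = {r}: {' '.join(combos)}")

total = sum(len(combos) for combos in groups.values())
print("Total number of combinations:", total)
```

For parent publication One this gives 5 + 10 + 10 = 25 combinations, in agreement with the total stated above.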

Omitting the drug under analysis, the next step is to perform All Fields searches using the combination within each parent publication, together with the sequentially-deficient combinations (r = 4, r = 3, r = 2 for example) derived from the parent combination. The basic searching format is illustrated below.

Searches are conducted in PubMed using Boolean logic operators. Key words are chosen that characterize, in a very broad context, the principal action of the drug under analysis. There is no limit on the number of key context words; the search templates below show four placeholders. Using the cardioprotective drug dexrazoxane for the purpose of illustration, the context component input could include cardioprotect* OR cytoprotect* OR protect* OR survival OR autophagy (five terms in this instance), with wild card entries (*) where appropriate.

Search 1 using parent publication One: All Fields (first context word) OR (second context word) OR (third context word) OR (fourth context word) AND a AND b AND x AND j AND p NOT (drug under analysis)

Search 2 using r = 4: All Fields (first context word) OR (second context word) OR (third context word) OR (fourth context word) AND b AND x AND j AND p NOT a (first mediator in parent combination abxjp)

Search 3 using r = 4: All Fields (first context word) OR (second context word) OR (third context word) OR (fourth context word) AND a AND x AND j AND p NOT b (second mediator in parent combination abxjp)

Search 4 using r = 4: All Fields (first context word) OR (second context word) OR (third context word) OR (fourth context word) AND a AND b AND j AND p NOT x (third mediator in parent combination abxjp), and so forth for Searches 5 and 6.

Search 7 using r = 3: All Fields (first context word) OR (second context word) OR (third context word) OR (fourth context word) AND a AND b AND x NOT j (fourth mediator in parent combination abxjp) NOT p (fifth mediator in parent combination abxjp)

Search 8 using r = 3: All Fields (first context word) OR (second context word) OR (third context word) OR (fourth context word) AND a AND b AND j NOT x (third mediator in parent combination abxjp) NOT p (fifth mediator in parent combination abxjp), and so forth for Searches 9 to 16.

Search 17 using r = 2: All Fields (first context word) OR (second context word) OR (third context word) OR (fourth context word) AND a AND b NOT x (third mediator in parent combination abxjp) NOT j (fourth mediator in parent combination abxjp) NOT p (fifth mediator in parent combination abxjp)

Search 18 using r = 2: All Fields (first context word) OR (second context word) OR (third context word) OR (fourth context word) AND a AND x NOT b (second mediator in parent combination abxjp) NOT j (fourth mediator in parent combination abxjp) NOT p (fifth mediator in parent combination abxjp), and so forth for Searches 19 to 26.

The above process is repeated for parent publications Two, Three, etc.
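The search strings for one parent publication can be generated mechanically. The following is a minimal sketch under stated assumptions, not the CEME implementation: the function names build_query and searches_for_parent are hypothetical, the letters are placeholder mediator codes (real searches would use the mediator names), and the context terms are wrapped in one outer pair of parentheses as an added precaution so that the OR group is evaluated as a unit. The order in which the r = 4 searches are produced differs from the numbering used above.

```python
from itertools import combinations


def build_query(context_words, included, excluded):
    """Assemble one Boolean search string (hypothetical helper)."""
    # Assumption: group the context terms so the ORs are evaluated together.
    query = "(" + " OR ".join(f"({w})" for w in context_words) + ")"
    for term in included:
        query += f" AND ({term})"
    for term in excluded:
        query += f" NOT ({term})"
    return query


def searches_for_parent(context_words, mediators, drug, min_size=2):
    """Return the parent search followed by all sequentially-deficient searches."""
    # Search 1: the full parent combination, omitting the drug under analysis.
    queries = [build_query(context_words, mediators, [drug])]
    n = len(mediators)
    for r in range(n - 1, min_size - 1, -1):
        for combo in combinations(mediators, r):
            # Mediators dropped from the parent combination are excluded with NOT.
            dropped = [m for m in mediators if m not in combo]
            queries.append(build_query(context_words, combo, dropped))
    return queries


context = ["cardioprotect*", "cytoprotect*", "protect*", "survival", "autophagy"]
queries = searches_for_parent(context, list("abxjp"), "dexrazoxane")

print(len(queries), "searches for parent publication One")
print(queries[0])
print(queries[-1])
```

For a parent publication with five mediators this yields 26 search strings: one for the parent combination and 25 for the derived combinations.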

From these new search outputs (1 ... i), only those publications in which mediators show changes in the same direction and/or with comparable magnitude to the original parent publication are selected.

From these selected publications, the next stage is to assemble the data within a table in which each horizontal row represents a publication. From each publication, “items” (mediators) are tabulated from left to right, with upstream mediators listed in the far left-hand columns followed by downstream mediators toward the right. Individual “items” are then extracted and transferred to a single row at the top of a new table, again listed from left to right with upstream mediators to the left, while “transactions” (publications) are listed in a single vertical column at the left-hand side. Within this table each cell has a score of 1 or 0 according to whether the mediator is present in the “transaction” or absent, respectively. Subsequently, the objective is to identify association rules. An association rule of the form X => Y states that there is a correlation or association between occurrences of the itemset X, known as the left-hand side or antecedent, and the itemset Y, known as the right-hand side or consequent. In order to select interesting rules (previously-unknown relationships) from the set of all possible rules, constraints on various measures of significance and interest are used (to be continued).
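The binary table and a simple rule evaluation can be sketched as follows. Several assumptions apply: the transactions are the placeholder letter codes used earlier, standing in for publications selected from the new searches (they are not real data); columns are ordered by first appearance as a stand-in for the upstream-to-downstream order; and support and confidence are used only because they are the standard textbook measures for association rules, since the measures actually applied by CEME are not specified here.

```python
# Placeholder transactions: each code lists the mediators reported in one
# publication (illustrative letter codes only, not real data).
transactions = {
    "One": "abxjp",
    "Two": "ajpilbx",
    "Three": "mxq",
    "Four": "bypxj",
    "Five": "styu",
}

# Column headings ("items"), ordered by first appearance. In practice the
# order would run from upstream to downstream mediators.
items = []
for code in transactions.values():
    for mediator in code:
        if mediator not in items:
            items.append(mediator)

# Binary table: 1 if the mediator is present in the transaction, else 0.
table = {
    pub: [1 if mediator in code else 0 for mediator in items]
    for pub, code in transactions.items()
}


def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for code in transactions.values() if set(itemset) <= set(code))


def support(itemset):
    """Fraction of transactions containing the itemset."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule X => Y."""
    denominator = count(antecedent)
    if denominator == 0:
        return 0.0
    return count(set(antecedent) | set(consequent)) / denominator


print("items:", " ".join(items))
for pub, row in table.items():
    print(f"{pub:>5}:", " ".join(map(str, row)))
print("support({x, j}) =", support("xj"))
print("confidence(x => j) =", confidence("x", "j"))
```

With these placeholder transactions, x and j occur together in three of the five publications, and three of the four publications containing x also contain j.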