Redis and MongoDB in the biomedical domain


In this guest article, Varun Singh, a software engineer at IBM's Watson Health, looks at how MongoDB and Redis can be used to manage biomedical data.

For bioinformatics software engineers, unstructured data is a fact of everyday life. From centralized repositories containing information about biomedical entities such as genes and proteins to journal publishers providing their papers as PDFs, unstructured data is the most common form of data available. Further, research in this domain is spread across numerous academic institutions around the world, and its output is often deposited in small databases hosted by those institutions. This leads to several problems, the two most common being data model design and external dependencies.

Data Model Design

A logical place to start when designing a data model is understanding the real-world entities. A biological entity such as the gene EGFR (Epidermal Growth Factor Receptor) ends up in several databases around the world because of its significance (mutations in this gene are linked to lung cancer). It exists in the HUGO Gene Nomenclature Committee (HGNC) database with identifier 3236. It also exists in the protein database UniProt with identifier P00533, because it encodes a protein with the same name - EGFR! Furthermore, this gene is referenced in the literature by several synonyms such as ERBB, HER1, ERBB1 and NISBD2.

When creating a relational data model, the primary key for a GENE table containing such genes could be a VARCHAR string or a numeric value such as 123. One would also need another table, say SYNONYMS, to hold the synonyms of each gene, and yet another table, say EXTERNAL_IDENTIFIERS, to hold the database identifiers from HGNC and UniProt. But there is a catch: the identifier used by a database such as HGNC is numeric, while that of UniProt is a string. It's easy to see how modeling even a single entity, a gene, becomes cumbersome.
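
To make the point concrete, here is a minimal sketch of what such a relational schema might look like, using SQLite purely for illustration (the table and column names are hypothetical, not a recommended design):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One table per concern: the gene itself, its synonyms, and its external identifiers.
# EXTERNAL_ID has to be a string because some sources use numeric IDs (HGNC)
# while others use alphanumeric ones (UniProt).
cur.executescript("""
CREATE TABLE GENE (
    GENE_ID INTEGER PRIMARY KEY,
    SYMBOL  VARCHAR(32) NOT NULL
);
CREATE TABLE SYNONYMS (
    GENE_ID INTEGER REFERENCES GENE(GENE_ID),
    SYNONYM VARCHAR(32) NOT NULL
);
CREATE TABLE EXTERNAL_IDENTIFIERS (
    GENE_ID     INTEGER REFERENCES GENE(GENE_ID),
    DB_NAME     VARCHAR(32) NOT NULL,
    EXTERNAL_ID VARCHAR(32) NOT NULL
);
""")

cur.execute("INSERT INTO GENE VALUES (1, 'EGFR')")
cur.executemany("INSERT INTO SYNONYMS VALUES (?, ?)",
                [(1, "ERBB"), (1, "HER1"), (1, "ERBB1")])
cur.executemany("INSERT INTO EXTERNAL_IDENTIFIERS VALUES (?, ?, ?)",
                [(1, "HGNC", "3236"), (1, "UNIPROT", "P00533")])
conn.commit()

Even this toy version needs three tables and a foreign key just to describe a single gene; the MongoDB document shown later collapses all of this into one record.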

Taking into account other entities such as diseases, proteins, mutations and unstructured literature text (containing the evidence linking these entities), along with the need to maintain data consistency across tables, one can see how immensely complicated this gets. This limits the suitability of traditional relational databases.

External dependencies

This is the other big, obvious issue. Since data comes from several disparate sources, refreshing it is a challenge in itself, for several reasons.

First of all, access to these remote databases is often limited by quotas. These quotas can be based on IP addresses or applied per account, to mitigate the load that automated scripts place on the server. Further, since many of these in-house academic databases are not exactly production quality, one may encounter unplanned outages that disrupt work. When a source is unavailable, the only options are to temporarily halt the update or to do a partial update, which carries the risk of introducing inconsistencies into the data.

A combination of a NoSQL database such as MongoDB and an in-memory data structure server such as Redis can help mitigate these issues and speed up the development time by reducing the complexity of the data models and caching data whenever an external data source is unavailable.

Why use MongoDB?

MongoDB is a NoSQL database. While not the first, it is the most popular NoSQL database in the world. Unlike SQL databases, where similar data is stored as rows in tables, NoSQL databases treat such data as documents in collections. Documents are JSON-like data structures with no requirements placed on data types; while usually desirable, a similar structure is not even required across documents in the same collection. This flexible schema is the antithesis of traditional SQL-based design, where a schema has to be created before any data is inserted, and it is incredibly helpful in domains such as bioinformatics where data is often very unclean or unstructured. In the gene example above, additional external database identifiers can simply be appended to the EGFR document as list elements, even if they are of different data types. There is no danger of violating a foreign key constraint, since all the identifiers live in the same document instead of in multiple tables joined by a foreign key as in a SQL database. Here is an example of creating a simple gene in a collection called GENE in the database BIOMED:

use BIOMED  
db.GENE.insertOne(
    {
        SYMBOL: "EGFR",
        SYNONYMS: ["ERBB", "HER1", "ERBB1"],
        IDENTIFIERS: [
            {"DB" : "HGNC", "ID" : 3236},
            {"DB" : "ENTREZ", "ID" : 1956},
            {"DB" : "RANDOM DATABASE", "ID" : "ABCDEF"}  //ID is a char string
        ]
    }
)

While creating one big table (or collection, in the MongoDB world) does lead to data denormalization (redundant data), it results in simpler data models and faster development.

Another big advantage of using MongoDB is the availability of drivers and clients in several programming languages, including Python, Java and NodeJS. Biomedical engineers tend to use Python a lot, since the language has extensive support for scientific and domain-specific libraries. The MongoDB API for Python can be used in a manner very similar to the native Mongo shell, which lets developers test queries in the native shell and then easily migrate them over to Python. SQL databases, unfortunately, don't offer the same convenience. To illustrate this point, the following piece of code creates a GENE collection and then inserts a new gene from a Python shell:

from pymongo import MongoClient

client = MongoClient("localhost:27017")  
db = client["BIOMED"]  
# insert_one creates the GENE collection implicitly on first insert
db.GENE.insert_one(
    {
        "SYMBOL" : "EGFR", 
        "SYNONYMS" : ["ERBB", "HER1", "ERBB1"],
        "IDENTIFIERS" : [
            {"DB" : "HGNC", "ID" : 3236},
            {"DB" : "ENTREZ", "ID" : 1956},
            {"DB" : "RANDOM DATABASE", "ID" : "ABCDEF"}
        ]
    }
)

As is evident from the above, the syntax for the Mongo shell and the Python shell is very similar. There are slight differences between the two, however, and one should read the PyMongo documentation carefully. For example, to update a document in the Mongo shell the syntax is one of db.GENE.update(), db.GENE.updateOne() or db.GENE.updateMany(), while in PyMongo the corresponding methods are update_one() and update_many().
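
As a minimal sketch of that difference, here is how one might add a newly encountered synonym to the EGFR document from the Python shell above (the $addToSet operator only inserts the value if it is not already present):

db.GENE.update_one(
    {"SYMBOL" : "EGFR"},                          # match the document to update
    {"$addToSet" : {"SYNONYMS" : "NISBD2"}}       # append the synonym, avoiding duplicates
)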

Along with the above features, MongoDB also offers built-in replication and sharding for high availability without a very complex configuration, which makes it easy to use in small academic environments that may lack dedicated DBAs.
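
For instance, assuming a three-member replica set named rs0 is already running (the host names here are hypothetical), connecting to it from PyMongo is a one-liner:

from pymongo import MongoClient

# the driver discovers the replica set members and routes operations to the current primary
client = MongoClient("mongodb://db1.example.org:27017,db2.example.org:27017,db3.example.org:27017/?replicaSet=rs0")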

What about Redis?

Now let's take a look at the other frequently encountered issue: dealing with access to databases where restrictions have been placed on the frequency of access, either by design or due to inherent limitations of the server. Pubmed is arguably the most important database of biomedical journal articles in the world. It contains tens of millions of citations, with thousands more added every week. Pubmed is hosted by the National Center for Biotechnology Information (NCBI), a research center within the National Institutes of Health, a US government agency. NCBI facilitates research by providing databases like Pubmed along with dozens of others containing information about genes, proteins, mutations and so on. This data is used by scientists, researchers and software engineers alike, for purposes as varied as basic research and the development of biomedical web applications.

The official "Data Access Policy" for Pubmed states that automated scripts should only be run on weekends or between 9pm and 5am Eastern Time. Furthermore, there should be no more than 3 requests every 1 second. While such policies are reasonable from NCBI perspective since it allows more people to be able to access the resources, they are bad from a software development perspective. Redis can be used to mitigate the effects of such policies. As an example, consider a web application which allows the user to do a very customized search of the available biomedical literature. There are 3 potential ways in which this can be done:

  1. Develop an application with a backend database containing periodic snapshots of Pubmed. The database can be updated on a configurable schedule such as monthly. This has the advantage of complete independence from Pubmed but may not contain up to date information at a given time.

  2. Develop an application which merely provides a wrapper around Pubmed: any search done in the application is rerouted to Pubmed and the fetched results are shown to the user in a more customizable UI. This has the advantage of containing up-to-date information but suffers from the "Data Access Policy" restrictions mentioned above.

  3. Develop an application which provides a wrapper around Pubmed with an intermediate cache layer. This layer, which can easily be implemented using Redis, caches the queries sent to Pubmed along with the resulting document IDs, while the actual unstructured documents are stored in a NoSQL database such as Mongo. The advantage of this approach is that results are more up to date than with the first approach, and the Data Access Policy restrictions of the second approach are avoided, since the server can serve cached results instead of hitting the Pubmed server with repeated requests for the same query.

Now, some may question the advantage of using Redis for caching as opposed to other techniques, such as the caching implemented by ORM layers like Hibernate. While those technologies may serve the purpose, they have the disadvantage of being tightly coupled with the application. Since caching is not usually core business logic, it is best decoupled from the application where possible. Furthermore, Redis is not merely a simple key-value store like some other technologies: it provides richer data structures such as strings, hashes, lists, sets and sorted sets for ease of use within applications. Redis also has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster. It is also very easy to configure, and one can get started with Redis within minutes, unlike more complicated application-layer configurations.
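
As a small illustration of those richer data structures (the key names here are made up for the example), one might keep a gene record in a hash and a tally of popular search terms in a sorted set using redis-py:

from redis import StrictRedis

cache = StrictRedis(host = "localhost", port = 6379, db = 0, decode_responses = True)

# store a small gene record as a hash rather than a flat string
cache.hset("gene:EGFR", mapping = {"SYMBOL": "EGFR", "HGNC_ID": 3236})
cache.expire("gene:EGFR", 24 * 3600)            # let the entry age out after a day

# keep a leaderboard of search terms in a sorted set
cache.zincrby("popular_queries", 1, "EGFR lung cancer")
print(cache.zrevrange("popular_queries", 0, 4, withscores = True))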

Let's look at some sample code showing how MongoDB and Redis can be used together to interact with Pubmed in a simple Python application:

from pymongo import MongoClient
from redis import ConnectionPool
from redis import StrictRedis
from requests import get
import cherrypy

# Cached query results expire after a day so they stay reasonably fresh
MAX_EXPIRE_DURATION = 24 * 3600
# E-utilities esearch endpoint, used to look up the Pubmed IDs matching a query
PUBMED_ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

client = None
cache_pool = None

def configure(max_db_pool_size = 2, max_redis_connections = 2):
    '''
    This method should be executed during web server startup
    '''
    if not globals()['client']:
        # maxPoolSize is the PyMongo 3+ name of the connection pool size option
        client = MongoClient(host = cherrypy.config["server.mongo_host"],
                             port = cherrypy.config["server.mongo_port"],
                             maxPoolSize = max_db_pool_size)
        globals()['client'] = client

    if not globals()['cache_pool']:
        # decode_responses makes Redis return strings instead of raw bytes
        cache_pool = ConnectionPool(host = 'localhost', port = 6379, db = 0,
                                    max_connections = max_redis_connections,
                                    decode_responses = True)
        globals()['cache_pool'] = cache_pool

def get_db_connection():
    '''
    This method should be called to get a connection to the MongoDB server
    '''
    return globals()['client']

def get_cache_connection():
    '''
    This method should be called to get a connection to the Redis server
    '''
    return StrictRedis(connection_pool = globals()['cache_pool'])

def search(query):
    '''
    The method which performs the actual search. It first checks the cache to see
    whether the query has been seen recently. If it has, the documents corresponding
    to the cached Pubmed IDs are retrieved from Mongo; otherwise an internet search
    is performed and the results are cached.
    '''
    cache = get_cache_connection()
    if cache.exists(query):
        publications = lookup_cache_key(query)
    else:
        pubmed_ids = pubmed_lookup(query)
        publications = get_publications_from_pmids(pubmed_ids)
    return publications

def pubmed_lookup(query):
    '''
    Do an internet lookup on Pubmed, store the results in MongoDB and cache the IDs
    '''
    response = get(PUBMED_ESEARCH_URL,
                   params = {"db": "pubmed", "term": query, "retmode": "json"})
    pubmed_ids = response.json()["esearchresult"]["idlist"]
    save_to_db(pubmed_ids)
    save_query_to_cache(query, pubmed_ids)
    return pubmed_ids

def save_to_db(pubmed_ids):
    '''
    Implement this method to fetch the full article records (e.g. with the efetch
    E-utility) and insert them into the PUBMED collection, keyed by their "ID" field
    '''
    pass

def lookup_cache_key(cache_key):
    '''
    Lookup a key in the Redis cache
    '''
    cache = get_cache_connection()
    # the IDs were stored with RPUSH, so read the whole list back
    ids = cache.lrange(cache_key, 0, -1)
    publications = get_publications_from_pmids(ids)

    return publications

def get_publications_from_pmids(pubmed_ids):
    '''
    Get the actual Pubmed articles from MongoDB (PMIDs are stored as strings)
    '''
    db_client = get_db_connection()
    db = db_client["BIOMED"]
    return db.PUBMED.find({"ID" : {"$in" : pubmed_ids}})

def save_query_to_cache(cache_key, pubmed_ids):
    '''
    Save the Pubmed IDs of the matching documents into the Redis cache
    '''
    cache = get_cache_connection()
    if pubmed_ids:
        # unpack the list so that each ID becomes a separate list element
        cache.rpush(cache_key, *pubmed_ids)
        cache.expire(cache_key, MAX_EXPIRE_DURATION)
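
A minimal way to exercise this module, assuming MongoDB and Redis are running locally and using hypothetical configuration values, might look like this:

# assumes MongoDB is listening on localhost:27017 and Redis on localhost:6379
cherrypy.config["server.mongo_host"] = "localhost"
cherrypy.config["server.mongo_port"] = 27017
configure()
for publication in search("EGFR lung cancer"):    # hypothetical query
    print(publication["ID"])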

Conclusion

Bioinformatics is a relatively young field of study combining traditional IT with disciplines such as biomedical engineering, medicine and genetics. With cheap computing available, more and more research and experiments are being done in silico, producing an almost unending supply of data deposited in the cloud as well as on traditional servers. Managing such data while maintaining consistency and dealing with unexpected outages is a task so huge that it can take an army of DBAs and developers. While a combination of MongoDB and Redis is not going to solve every problem, it can definitely help by reducing the complexity of data models, distributing data across multiple machines for high availability, and caching frequently used but mutable data to mitigate network outages while keeping it up to date.

Varun Singh is a software engineer on the Watson for Genomics team at IBM, where he works on applying machine learning approaches to problems in unstructured data and natural language processing. He is very interested in issues related to the representation, storage and processing of big and small data, particularly in the bioinformatics domain.

