What is restez?
R packages for interacting with the National Center for Biotechnology Information (NCBI) have, to date, depended on API query calls via NCBI’s Entrez. For computational analyses that require the automated look-up of reams of biological sequence data, piecemeal querying via bandwidth-limited requests is evidently not ideal. These queries are not only slow, but they depend on network connections and the remote server’s consistent behaviour. Additionally, users who make very large requests over extended periods of time run the risk of being blocked.
restez attempts to make large queries to NCBI GenBank more efficient by allowing users to download whole sections of GenBank, create a local database from these downloaded files and then query this mini-GenBank version instead. This process is far more efficient as the downloaded files are compressed and users can limit the size of the database by only creating it from sequences of interest (limiting by taxonomic domain and/or sequence size).
restez tries to be user-friendly: a database can be set up in just a few function calls (set path, download and create), a database can be queried with a consistent set of functions (the gb_*_get() functions), the number of arguments per function is limited, and the package is designed to integrate with pre-existing R packages that interact with NCBI (rentrez and phylotaR).
For a more detailed description and for tutorials of the package, please visit the restez website.
Figure 1. Diagrammatic outline of the restez functions and folder structure. Data is downloaded from NCBI into a file path set by the user. Raw downloads are stored in “downloads/” and the generated database is stored in “sql_db/”. This database can then be queried with a series of gb_*_get() functions as well as some additional wrappers.
Installation
restez (v1.0.0) is available from CRAN.
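As a minimal sketch, a CRAN release can be installed with base R's standard installer:

```r
# Install the released version of restez from CRAN
install.packages("restez")
```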
Alternatively, the latest development version can be downloaded from restez’s GitHub page.
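One common way to install a development version from GitHub is the remotes package. The sketch below assumes the repository is hosted at ropensci/restez; check the GitHub page for its current location before running it.

```r
# install.packages("remotes")  # if remotes is not already installed

# Assumed repository location: ropensci/restez
remotes::install_github("ropensci/restez")
```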
Usage
To walk us through the basics of restez, let’s pretend we’re a microbiologist interested in the sequence diversity among all the bacteriophages that infect Escherichia bacteria. First, we will need to create our sequence database. Second, we will need to identify the Escherichia phage sequences in this database. Finally, we will need to write out the sequences in a suitable format for a sequence diversity analysis. Thankfully, restez can perform all these steps.
Set-up
To get started with restez we first have to download and create a database. This set-up consists of three steps: setting the restez path, downloading the GenBank files and creating the database.
Depending on how many GenBank files you select to download, the above process can take up to several hours. In this example, however, we will only download and set up a database for ‘phage’ sequences, which should take 5-10 minutes depending on your machine and internet connection.
Path
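A minimal sketch of this step. The folder name is hypothetical and only for illustration; the sketch also creates the folder first, on the assumption that restez expects it to already exist.

```r
library(restez)

# Hypothetical location for the database; any writable folder will do
rstz_pth <- file.path(path.expand("~"), "restez_phage")
if (!dir.exists(rstz_pth)) {
  dir.create(rstz_pth, recursive = TRUE)
}

# Tell restez where to store the downloads and the database
restez_path_set(filepath = rstz_pth)
```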
The above code will set the restez path. All downloaded files and the created database will be stored in this path. (Keep a note of the restez path; you will need it later.)
Download
To download sequences, run the interactive function db_download().
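In its simplest form the call takes no arguments. This sketch assumes the restez path has already been set in the current R session.

```r
library(restez)

# Interactive: prompts for which GenBank section(s) to download.
# Assumes restez_path_set() has already been called in this session.
db_download()
```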
This will print a numbered list of GenBank sections to choose from.
We can download all phage sequences by typing 19. After pressing Enter, we will be told the likely total file size of the download. If you have enough free space, press any key to continue. This will then initiate a download process for all phage sequence files on GenBank.
Create
After the download process has completed, we can create the database with db_create().
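A minimal sketch of the call with its default arguments, assuming the path has been set and the download has finished. The function's help page describes further options, such as restricting which sequences are added.

```r
library(restez)

# Build the local database from the files fetched by db_download().
# Assumes restez_path_set() has already been called in this session.
db_create()
```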
This will add all of the downloaded files to the database. It will take a while to complete. When it finishes, we can then identify our Escherichia phage sequences!
Querying
Status
After we have built the database, we can query it! For every new R session, we will always need to point restez to the database using restez_path_set() and then connect to the database with restez_connect(). To get started, let’s see the database status: is it ready for querying?
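A sketch of how a new session might begin. The folder name is hypothetical; use the path that was set when the database was created.

```r
library(restez)

# Hypothetical path; replace with the restez path noted during set-up
rstz_pth <- file.path(path.expand("~"), "restez_phage")

# Point restez to the database and open a connection
restez_path_set(filepath = rstz_pth)
restez_connect()

# Print a status report for the database
restez_status()
```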
The above status report tells us the database exists, has data and is connected – which means it’s ready for queries. (To get a simple TRUE or FALSE for whether the database is ready, use restez_ready().)
Get-tools
restez comes with a series of gb_*_get() functions for parsing the GenBank records to pull out specific elements (sequences, definition lines, whole records). We can find records in the database using Accession IDs. To list all Accession IDs in a database, we can use list_db_ids().
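A short sketch of these functions in use, assuming the database is set up and connected. It takes the first listed ID simply as an example record.

```r
library(restez)

# List up to 100 Accession IDs held in the local database
# (restez may warn that the number of returned IDs was limited)
ids <- list_db_ids(db = "nucleotide", n = 100)
id <- ids[1]

# Pull specific elements out of that record
def <- gb_definition_get(id = id)  # definition line
sq <- gb_sequence_get(id = id)     # sequence
rec <- gb_record_get(id = id)      # whole GenBank record as text

print(def)
nchar(sq)
```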
Escherichia phage sequences
In our scenario, we’re interested in finding and writing out all the Escherichia phage sequences. We can do this by looking up the organism names of the sequence sources of all the sequences in the database. We can then parse these names for "escherichia" and write out the resulting list of sequences.
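The steps above can be sketched as two small functions. The helper names, the output file name and the assumption that gb_organism_get() returns a character vector named by Accession ID are illustrative, not taken from the package documentation.

```r
# Return the names (Accession IDs) of the entries in `orgs` whose organism
# name matches `pattern`, ignoring case. `orgs` is assumed to be a character
# vector of organism names, named by Accession ID.
match_organism <- function(orgs, pattern) {
  names(orgs)[grepl(pattern = pattern, x = orgs, ignore.case = TRUE)]
}

# Look up all organism names in a connected restez database, keep the
# matching records and write them to `outfile` (hypothetical path) as FASTA.
write_matching_fasta <- function(pattern, outfile) {
  # n = NULL requests all IDs rather than the default subset
  ids <- suppressWarnings(restez::list_db_ids(db = "nucleotide", n = NULL))
  orgs <- restez::gb_organism_get(id = ids)
  hits <- match_organism(orgs, pattern)
  fasta <- restez::gb_fasta_get(id = hits)
  cat(fasta, file = outfile, sep = "\n")
  invisible(hits)
}

# With the database set up and connected:
# write_matching_fasta("escherichia", "escherichia_phages.fasta")
```

Keeping the name matching separate from the database calls makes it easy to swap in a different organism pattern.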
In this little example, we could identify our sequences of interest using restez itself. Ordinarily, however, because sequences can only be looked up via Accession IDs, users will probably not use restez for sequence discovery, only retrieval. For a more adaptable example of searching and fetching sequences, see “How to search and fetch sequences”.
Integrations
To minimise the coding effort on the part of a user, restez has been built to work with R packages that already connect to NCBI’s Entrez. After setting up a restez database, the same functions of these other packages can be used to query NCBI Entrez. Internally, restez will query its local database and, if it cannot find all of the requested sequences, it will pass the arguments on to these other packages.
For example, users can use the entrez_fetch() function of the rentrez package. Running this function through restez means a user can first check the local database rather than make lots of queries over the internet. The function arguments are exactly the same.

Additionally, users can set up a restez database before launching a phylotaR run. phylotaR searches NCBI for orthologous sequence clusters for a given taxonomic ID. If a restez database is set up, phylotaR will first search the local database before downloading via Entrez.
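As a sketch of the rentrez integration, the call below uses restez's own entrez_fetch() wrapper with the same arguments one would pass to rentrez. It assumes a connected database and that rentrez is installed; the IDs are taken from the local database purely for illustration.

```r
library(restez)

# A handful of Accession IDs to look up (restez may warn that the
# number of returned IDs was limited)
ids <- list_db_ids(db = "nucleotide", n = 5)

# Same arguments as rentrez::entrez_fetch(); the local database is
# checked first and any IDs not found there are passed on to rentrez
res <- restez::entrez_fetch(db = "nucleotide", id = ids, rettype = "fasta")
cat(res)
```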
For more information on these integrations see the additional documentation: “rentrez and restez” and “phylotaR and restez”.
Future
We have many ideas for improving restez and we welcome forks and pull requests! Our current list of ideas for improvement includes:
- Protein database – the current code could be easily duplicated for working with protein databases, not just GenBank.
- Taxonomy – integration of existing taxonomic packages with restez.
- Retmodes – restez only supports text-based return modes; it could be expanded to include XML.
Please see the contributing page for more details and any updates.
If you have any ideas of your own for new features, then please open a new issue.
Acknowledgements
Big thanks to Evan Eskew and Naupaka Zimmerman for reviewing the package; to Carl Boettiger and Noam Ross for useful comments during the review; and, of course, to Scott Chamberlain for editing!
Useful Links
Reference
Bennett, D.J., Hettling, H., Silvestro, D., Vos, R. and Antonelli, A. 2018. restez: Create and Query a Local Copy of GenBank in R. Journal of Open Source Software, 3(31), 1102, https://doi.org/10.21105/joss.01102