For the Lagotto application to scale to millions of articles (e.g. the more than 10 million in the CrossRef Labs DET Server), it makes more sense for third parties to push data into the application (push) than for Lagotto to collect data from external sources (pull). We have identified the following architecture and implementation steps:
Add an API that takes push requests in a standardized format that describe events around articles. The API has the following features:
We want to separate out the agent functionality from our sources, so that agents can be part of the Lagotto software, or run somewhere else and deposit their data via the new push API. Sources should become generic enough that we hopefully don't need to subclass the Source class anymore, but move all that functionality into a new Agent model. In the beginning all sources will have a corresponding agent, but that can change over time.
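The separation described above can be sketched as follows. This is only an illustration under stated assumptions: the `Agent` and `StaticAgent` classes, the `depositor` callable and the `source_id` and `work_id` keys are hypothetical names, not part of the current Lagotto code. The point is that an agent only knows how to collect events and hand them to a depositor, so the same agent could run inside Lagotto or somewhere else.

```ruby
# Hypothetical base class: an agent collects events and hands each one to a
# depositor. The depositor is any object that responds to #call(event), for
# example a lambda that posts the event to the push API.
class Agent
  def initialize(name, depositor)
    @name = name
    @depositor = depositor
  end

  # Concrete agents override this to talk to an external source.
  def collect
    raise NotImplementedError, "#{self.class} must implement #collect"
  end

  # Tag every collected event with the agent name and deposit it.
  def run
    collect.map { |event| @depositor.call(event.merge("source_id" => @name)) }
  end
end

# Trivial agent that returns a fixed list of events, for illustration only.
class StaticAgent < Agent
  def initialize(name, depositor, events)
    super(name, depositor)
    @events = events
  end

  def collect
    @events
  end
end

# Usage: the depositor here just appends to an array instead of calling an API.
deposited = []
depositor = ->(event) { deposited << event; event }
agent = StaticAgent.new("example", depositor, [{ "work_id" => "10.5555/12345678" }])
agent.run
```

Because the depositor is injected, swapping the in-memory array for an HTTP client would not require changes to the agent itself.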
All API responses from external sources should go through the new push API to make the workflow consistent. We can modify the perform_get_data method to achieve this.
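To make the idea of a single, consistent entry point more concrete, here is a minimal sketch of how a push request body might be checked before it is accepted. The request format is not defined in this document, so the field names (`work_id`, `source_id`, `event_url`, `occurred_at`) and the `parse_push_event` and `iso8601?` helpers are assumptions made purely for illustration.

```ruby
require "json"
require "time"

# Hypothetical required fields for one pushed event; the real format is
# still to be defined.
REQUIRED_FIELDS = %w[work_id source_id event_url occurred_at].freeze

# True if the value parses as an ISO 8601 timestamp.
def iso8601?(value)
  Time.iso8601(value)
  true
rescue ArgumentError
  false
end

# Parse a JSON request body and return [event, errors].
# event is nil whenever errors is non-empty.
def parse_push_event(body)
  event = JSON.parse(body)
  return [nil, ["payload must be a JSON object"]] unless event.is_a?(Hash)

  missing = REQUIRED_FIELDS.reject do |field|
    event[field].is_a?(String) && !event[field].empty?
  end
  errors = missing.map { |field| "missing field: #{field}" }

  # Only check the timestamp format if the field is present at all.
  if !missing.include?("occurred_at") && !iso8601?(event["occurred_at"])
    errors << "occurred_at is not an ISO 8601 timestamp"
  end

  [errors.empty? ? event : nil, errors]
rescue JSON::ParserError
  [nil, ["payload is not valid JSON"]]
end
```

Returning the list of errors rather than raising lets the API report all problems with a payload in one response.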
Once we have separated the agent functionality from sources, we can start rewriting our existing sources to collect events from external sources more efficiently. The F1000 source is a good starting point: the new agent should parse the F1000 XML file and then deposit the payload in the new push API. We can consider packaging the internal agent as a Ruby gem if the functionality is decoupled enough.
Use the standard webmention format to feed in data around events.
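In the webmention protocol, a sender notifies a receiver with a form-encoded POST request that carries two URLs: `source` (the page that mentions something) and `target` (the page being mentioned). The sketch below builds such a request with Ruby's standard library without sending it. The `build_webmention` helper and all URLs are hypothetical, and how Lagotto would expose a webmention endpoint is an assumption.

```ruby
require "net/http"
require "uri"

# Build (but do not send) a webmention request for the given endpoint.
# source is the URL that mentions the work, target is the URL of the work.
def build_webmention(endpoint, source, target)
  request = Net::HTTP::Post.new(URI(endpoint))
  # set_form_data encodes the body as application/x-www-form-urlencoded
  request.set_form_data("source" => source, "target" => target)
  request
end

# Hypothetical URLs, for illustration only.
request = build_webmention(
  "https://lagotto.example.org/webmention",
  "https://blog.example.org/posts/1",
  "https://doi.org/10.5555/12345678"
)
```

Sending the request would then be a matter of passing it to `Net::HTTP#request` on a connection to the endpoint host.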
To fully support data-level metrics, the following changes need to be made in the Lagotto software:
This is an important feature for data because of versioning and subsets of data (isPartOf). This functionality is also needed for journal articles, allowing us to describe the relationship between different versions of an article, and related content such as corrections, as well as component DOIs for figures and tables.
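One possible way to represent such relationships is as typed links between work identifiers, with the inverse link derived automatically. The sketch below assumes a hypothetical `Relation` struct and a small, hand-picked set of relation types. Only `isPartOf` comes from the text above; the other names are illustrative and loosely follow the naming used in DataCite-style metadata.

```ruby
# A typed link between two works, e.g. figure --isPartOf--> article.
Relation = Struct.new(:subject, :type, :object)

# Hypothetical relation types and their inverses.
FORWARD_TYPES = {
  "isPartOf" => "hasPart",
  "isNewVersionOf" => "isPreviousVersionOf"
}.freeze
INVERSE_TYPES = FORWARD_TYPES.merge(FORWARD_TYPES.invert).freeze

# Return the given relations together with their inverse relations.
def with_inverses(relations)
  relations.flat_map do |r|
    [r, Relation.new(r.object, INVERSE_TYPES.fetch(r.type), r.subject)]
  end
end

# All works that work_id points to with the given relation type.
def related(relations, work_id, type)
  relations.select { |r| r.subject == work_id && r.type == type }.map(&:object)
end

# Usage with made-up identifiers: a figure that is part of an article.
article = "10.5555/12345678"
figure = "10.5555/12345678.g001"
relations = with_inverses([Relation.new(figure, "isPartOf", article)])
parts = related(relations, article, "hasPart")
```

Storing both directions makes it cheap to answer "which components belong to this article?" as well as "which article does this figure belong to?".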
As the number of Lagotto installations increases, we need to start thinking about server-to-server replication, so that multiple Lagotto servers are not all collecting the same information from external data sources.
To make this replication performant, we ideally want to use a native database replication tool. Part of the implementation should therefore be a re-evaluation of MySQL and CouchDB as databases used in Lagotto.
We want to make all data collected by Lagotto available publicly. While monthly reports can be generated as CSV files and uploaded to a data repository such as figshare, we need a different mechanism to include the raw data collected from external sources. A database is not the best place for this kind of data and we need to look at other services to handle this, e.g. fluentd and Amazon Glacier.