In order for the Lagotto application to scale to millions of articles - e.g. the more than 10 million articles in the CrossRef Labs DET Server - it makes more sense for third parties to push data into the application (push) than for Lagotto to collect data from external sources (pull). We have identified the following architecture and implementation steps:

Add push API endpoint

Add an API that takes push requests in a standardized format describing events around articles. The API has the following features (a request sketch follows the list):

  • HTTP REST (POST and possibly GET)
  • allow pushing events for a single article or for multiple articles
  • include at least the following information in the payload: article ID (DOI), source, timestamp, and event information that depends on the source (e.g. event_count, event_url, information about individual events)
  • authentication via API token
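A minimal sketch of what such a push request could look like in Ruby. The endpoint path, token header format, and payload structure beyond the fields listed above are assumptions, not a final API design:

    require "net/http"
    require "json"
    require "uri"

    # Hypothetical payload describing one event record for one article
    payload = {
      deposits: [{
        article:     "10.1371/journal.pone.0036240", # article ID (DOI)
        source:      "facebook",
        timestamp:   "2014-09-15T12:00:00Z",
        event_count: 42,
        event_url:   "https://www.facebook.com/plosone" # example URL
      }]
    }.to_json

    # Hypothetical endpoint; authentication via API token in a header
    uri = URI("https://alm.example.org/api/deposits")
    request = Net::HTTP::Post.new(uri, "Content-Type"  => "application/json",
                                       "Authorization" => "Token token=YOUR_API_TOKEN")
    request.body = payload

    response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
      http.request(request)
    end
    puts response.code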

Separate out agent functionality from source model

We want to separate out the agent functionality from our sources, so that agents can either be part of the Lagotto software or run somewhere else and deposit their data via the new push API. Sources should become generic enough that we hopefully no longer need to subclass the Source class, moving all source-specific functionality into a new Agent model instead. In the beginning every source will have a corresponding agent, but that can change over time.
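A rough sketch of the intended split, using hypothetical class and method names. The source becomes a generic store of events, while the agent holds the source-specific collection logic and hands its results to the push API:

    # Generic: stores events per article, no source-specific behavior
    class Source < ActiveRecord::Base
      has_many :retrieval_statuses
    end

    # Source-specific: knows how to collect data and deposit it
    class Agent
      def run(article)
        raise NotImplementedError
      end

      # Hypothetical helper that POSTs the payload to the new push API
      def deposit(payload)
        PushApiClient.post(payload)
      end
    end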

Push all API responses through push API

All API responses from external sources should go through the new push API to make the workflow consistent. We can modify the perform_get_data method to achieve this.
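A minimal sketch of what this change could look like, assuming hypothetical helper names around the existing perform_get_data method:

    def perform_get_data(rs_id)
      rs = RetrievalStatus.find(rs_id)
      data = get_data(rs.article)      # existing pull from the external API

      # New: instead of writing the response directly,
      # deposit it via the push API for a consistent workflow
      deposit(article:   rs.article.doi,
              source:    name,
              timestamp: Time.zone.now.utc.iso8601,
              events:    data)
    end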

Rewrite F1000 source as internal agent

Once we have separated out the agent functionality from sources, we can start rewriting our existing sources to collect events from external sources more efficiently. The F1000 source is a good starting point: the new agent should parse the F1000 XML file and then deposit the payload in the new push API. We can consider packaging the internal agent as a Ruby gem if the functionality is decoupled enough.
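A sketch of what such an internal agent could look like, assuming a hypothetical XML layout and the deposit helper from above; the actual F1000 feed format may differ:

    require "nokogiri"

    class F1000Agent < Agent
      # Parse the F1000 XML file and deposit one payload per article
      def run(xml)
        Nokogiri::XML(xml).xpath("//Article").each do |node|
          deposit(article:     node.at("Doi").text,
                  source:      "f1000",
                  event_count: node.at("TotalScore").text.to_i,
                  event_url:   node.at("Url").text)
        end
      end
    end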

Add generic webmention endpoint

Use the standard webmention format to feed in data about events: a webmention is a simple POST carrying the URL of the page that mentions the article (source) and the URL of the article itself (target).
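Per the webmention specification, a request is a form-encoded POST with source and target parameters; the endpoint path here is an assumption:

    require "net/http"
    require "uri"

    # Hypothetical Lagotto webmention endpoint
    uri = URI("https://alm.example.org/api/webmentions")

    response = Net::HTTP.post_form(uri,
      "source" => "http://blog.example.com/post-discussing-the-article",
      "target" => "http://dx.doi.org/10.1371/journal.pone.0036240")
    puts response.code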

To fully support data-level metrics, the following changes need to be made in the Lagotto software:

  • support for relationships between resources (isNewVersionOf, isPartOf, etc.)
  • configuration changes to some sources, e.g. Europe PMC database links
  • additional sources (e.g. usage stats for data)

Support for relationships between resources

This is an important feature for data because of versioning and subsets of data (isPartOf). The same functionality is also needed for journal articles: it allows us to describe the relationships between different versions of an article and related content such as corrections, as well as component DOIs for figures and tables.
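A hedged sketch of how such relationships might be modeled, using hypothetical table and attribute names:

    # Each row links two resources with a typed relation
    class Relationship < ActiveRecord::Base
      belongs_to :subject, class_name: "Article"
      belongs_to :object,  class_name: "Article"

      # e.g. "isNewVersionOf", "isPartOf", "isCorrectedBy"
      validates :relation_type, presence: true
    end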

As the number of Lagotto installations increases, we need to start thinking about server-to-server replication, so that multiple Lagotto servers are not all collecting the same information from external data sources.

To make this replication performant, we ideally want to use a native database replication tool. Part of the implementation should therefore be a re-evaluation of MySQL and CouchDB as the databases used in Lagotto.

We want to make all data collected by Lagotto publicly available. While monthly reports can be generated as CSV files and uploaded to a data repository such as figshare, we need a different mechanism for the raw data collected from external sources. A database is not the best place for this kind of data, and we need to look at other services to handle it, e.g. fluentd and Amazon Glacier.