When we integrate two systems, we often use REST APIs to access the data. Using these APIs we first get access to the historical data, and once we have processed all of it, we want to pick up any newer data as time passes. The APIs present the data as pages of records from the source system (for more, see the Feed and Pagination patterns). The pagination is usually based on the last-modified date-time of the record. This is simple for the consumer of the API to understand, as the data is arranged in chronological order of change. But something interesting goes on with the records that have been modified in the past few seconds. If these few seconds are not handled correctly by the API provider, they can cause a lot of issues for the consumer of the API, including lost records.
To understand this, let's start with a simple system. The table below describes the sequence of the last modified date time assigned to a record, it getting saved in the database, and then appearing in a paginated feed.
| STEPS | TIME | EVENT |
| --- | --- | --- |
| S0 | T0 | The user saves a record |
| S1 | T1 | The server application assigns current time = T1 to the record's last-modified date-time |
| S2 | T2 | The record gets saved in the database |
| S3 | T3 | The record appears on the page requested by the API user at time = T3 |
Let's see what happens to the records coming into the system from various users.
| STEPS | USER 1 R1 | USER 2 R2 | USER 3 R3 | USER 4 R4 | USER 5 R5 |
| --- | --- | --- | --- | --- | --- |
| S0 | T0 | T100 | T200 | T300 | T410 |
| S1 | T10 | T110 | T210 | T310 | T420 |
| S2 | T20 | T120 | T220 | T320 | T430 |
If an API user asks for all records newer than T0 with pageSize = 3, the API will provide the data in two pages as [R1 (T10), R2 (T110), R3 (T210)] followed by [R4 (T310), R5 (T420)] in step S3. For simplicity, we assumed that the time gap between the various steps is the same across different users.
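This happy path can be sketched in a few lines of Python. The `Record` model and `page` function below are illustrative, not from any particular framework; times are the last-modified (S1) values from the table, in units of T:

```python
from dataclasses import dataclass

# Illustrative model: each record carries the last-modified
# timestamp assigned at step S1 (times in units of T).
@dataclass
class Record:
    name: str
    last_modified: int

records = [
    Record("R1", 10), Record("R2", 110), Record("R3", 210),
    Record("R4", 310), Record("R5", 420),
]

def page(records, newer_than, page_size, offset=0):
    """Return one page of records modified after `newer_than`,
    in chronological order of last modification."""
    matching = sorted(
        (r for r in records if r.last_modified > newer_than),
        key=lambda r: r.last_modified,
    )
    return matching[offset:offset + page_size]

print([r.name for r in page(records, newer_than=0, page_size=3)])
# ['R1', 'R2', 'R3']
print([r.name for r in page(records, newer_than=0, page_size=3, offset=3)])
# ['R4', 'R5']
```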
But this assumption does not hold in practice, for a few reasons: different amounts of data processed for each record, thread scheduling, IO availability, other processes interfering, etc. This inconsistency in the time gap between S1 and S2, which puts the writer and the API caller in a race condition, causes step S3 not to work as one would expect. Let's see how.
Let's use the same scenario as above but change the timings for USER 4 - note the time gap between steps.
| STEPS | USER 1 R1 | USER 2 R2 | USER 3 R3 | USER 4 R4 | USER 5 R5 |
| --- | --- | --- | --- | --- | --- |
| S0 | T00 | T100 | T200 | T202 | T410 |
| S1 | T10 | T110 | T210 | T208 | T420 |
| S2 | T20 | T120 | T220 | T223 | T430 |
There are a couple of common ways to implement pagination, both described here - let's see what resources we get in response with each approach.
1. Using offset and limit
(a) When the API user requests the first page at T222 (limit=3, offset=0), s/he gets R1, R2, and R3. Then the API user navigates to the next page at time T230.
(b) At T230 (limit=3, offset=3), the second page contains R3 again - R4 (last modified T208) has become visible and slots in before R3, shifting R3 into the second page. R5 will appear only after T430. Note that R4 will never appear for this user. But if the user came much later (say T500), offset=0 and offset=3 would return all the records, as {R1, R2, R4} and {R3, R5}.
2. Using seek pagination
(a) When the API user requests the first page at time T222 (limit=3, time > T00), s/he gets R1, R2, and R3.
(b) Then the API user navigates to the next page at time T230 (limit=3, time > T210, the last-modified time of R3). The second page is empty at T230 and, after T430, contains only R5 - again missing R4, whose timestamp T208 is behind the cursor. Here too, if the user starts over much later (time > T00), s/he gets all the records.
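Both failure modes can be simulated. The sketch below uses a hypothetical `Record` model with the times from the second table; `saved_at` is the S2 time at which a record becomes visible to API callers:

```python
from dataclasses import dataclass

# Times from the second table: last_modified is S1, saved_at is S2
# (the moment the record becomes visible to API callers).
@dataclass
class Record:
    name: str
    last_modified: int
    saved_at: int

records = [
    Record("R1", 10, 20), Record("R2", 110, 120), Record("R3", 210, 220),
    Record("R4", 208, 223), Record("R5", 420, 430),
]

def visible(now):
    """Records already saved at `now`, ordered by last-modified time."""
    return sorted((r for r in records if r.saved_at <= now),
                  key=lambda r: r.last_modified)

def offset_page(now, offset, limit=3):
    return [r.name for r in visible(now)[offset:offset + limit]]

def seek_page(now, after, limit=3):
    return [r.name for r in visible(now) if r.last_modified > after][:limit]

print(offset_page(now=222, offset=0))  # ['R1', 'R2', 'R3']
print(offset_page(now=230, offset=3))  # ['R3'] -- R4, now 3rd in order, is skipped
print(seek_page(now=222, after=0))     # ['R1', 'R2', 'R3']
print(seek_page(now=230, after=210))   # [] -- R4 (T208) is behind the cursor
```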
The main reason R4 is missed is that its timestamp has been assigned but the record is not yet visible to the API user, because it has not yet been saved. The caller of the API and the system assigning the timestamp and saving the record are in a race with each other.
Consequence
In some software systems, missing the odd record may not be an issue, e.g. notifications. But in others, e.g. a business process flow, it can be a bigger problem. If one is reading through multiple such APIs from a source, covering dependent entities (e.g. customer, account, transaction), then missing customer records also means not being able to process the accounts of those customers. In a large graph of entities, this can have a cascading effect.
Solutions
API Provider
Once we are aware of this issue, there are a number of solutions we can come up with using messaging, another data store, etc. But there is also a simple solution one can employ without adding the complexity of additional infrastructure. We have already hinted at it above.
Records can be missed only when the API call and the timestamp assignment happen at nearly the same real time. What if we avoid this window? The API provider can simply cut from the paginated response all resources that have been created/updated in, let's say, the last 1 minute.
Assuming that 1 minute translates to 20 units of T, let's see what happens.
1. Using offset and limit
(a) At time T222 (limit=3, offset=0), the user gets R1, R2.
(b) At time T230 (limit=3, offset=2), the user gets R4, R3 (and R5 on a later call, once it is saved and past the cutoff).
2. Using seek pagination
(a) At time T222 (limit=3, time > T00), the user gets R1, R2.
(b) At time T230 (limit=3, time > T110, the last-modified time of R2), the user gets R4, R3 (and later R5).
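A minimal sketch of this provider-side cutoff, assuming a window of 20 units of T - any value comfortably larger than the worst-case S1-S2 gap works (the `Record` model and times are the same illustrative ones as above):

```python
from dataclasses import dataclass

# Same illustrative model and times as in the race example above.
@dataclass
class Record:
    name: str
    last_modified: int  # S1
    saved_at: int       # S2

records = [
    Record("R1", 10, 20), Record("R2", 110, 120), Record("R3", 210, 220),
    Record("R4", 208, 223), Record("R5", 420, 430),
]

CUTOFF = 20  # must exceed the worst-case gap between S1 and S2

def stable_feed(now):
    """Saved records, excluding anything modified within the cutoff window."""
    return sorted(
        (r for r in records
         if r.saved_at <= now and r.last_modified <= now - CUTOFF),
        key=lambda r: r.last_modified,
    )

def offset_page(now, offset, limit=3):
    return [r.name for r in stable_feed(now)[offset:offset + limit]]

print(offset_page(now=222, offset=0))  # ['R1', 'R2'] -- R3 held back by the cutoff
print(offset_page(now=230, offset=2))  # ['R4', 'R3'] -- R4 is no longer lost
```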
In your system, you can decide the right time duration to use in place of the 1 minute we used here. This duration essentially needs to be the maximum expected time gap between S1 and S2. As you can see, in most systems 1 minute will be sufficient.
API Consumer
It is possible that you are integrating with another system and have no way to get this race condition fixed at the source. In that case, as a consumer of the API, you can choose not to process records whose timestamps fall within the last minute (or a time duration of your judgment). It is important to ascertain that the timestamp you are getting in the API resources is semantically what you expect it to be - rule out timezone and other such issues, so that the timestamp matches the wall-clock time.
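On the consumer side, this guard can be a simple filter before processing. The sketch below assumes a hypothetical page of resources with an ISO-8601 `lastModified` field in UTC; `stable_records` and `SAFETY_MARGIN` are illustrative names, not from any particular client library:

```python
from datetime import datetime, timedelta, timezone

# Consumer-side guard: skip any resource modified within the last
# minute, so records still racing between S1 and S2 are not consumed.
SAFETY_MARGIN = timedelta(minutes=1)

def stable_records(page, now):
    """Keep only resources whose lastModified is at least
    SAFETY_MARGIN older than `now`."""
    cutoff = now - SAFETY_MARGIN
    return [r for r in page
            if datetime.fromisoformat(r["lastModified"]) <= cutoff]

now = datetime(2022, 7, 19, 12, 0, tzinfo=timezone.utc)
page = [
    {"id": 1, "lastModified": "2022-07-19T11:58:00+00:00"},
    {"id": 2, "lastModified": "2022-07-19T11:59:30+00:00"},  # too fresh, skip
]
print([r["id"] for r in stable_records(page, now)])  # [1]
```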
Conclusion
This issue may seem trivial to consider and solve - but for anyone who has to support such systems in production, it is quite important that it was never there.
Author: Vivek Singh
Published on: 19-July-2022