This is a nontrivial problem when using properly modularized code and libraries that perform logging. They can’t tell whether their operational error is also a program-level error, which can depend on usage context, but they still want to log the operational error themselves, in order to provide the details that aren’t accessible to higher-level code. This lower-level logging has to choose some status.
Should only “top-level” code ever log an error? That can make it difficult to identify the low-level root causes of a top-level failure. It also can hamper modularization, because it means you can’t repackage one program’s high-level code as a library for use by other programs, without somehow factoring out the logging code again.
A common problem in Java is that someone will drop in a log line that looks something like this: `log.trace("Doing " + foo + " to " + bar);`
The problem, especially in a hot loop, is that the throwaway string concatenation can ultimately be a performance problem, especially if `foo` or `bar` have particularly expensive `toString` methods.
The proper way to do something like this in Java is either

  log.trace("Doing $1 to $2", foo, bar);

or

  if (log.traceEnabled()) {
    log.trace("Doing " + foo + " to " + bar);
  }

If you want the built-in interpolation to become a no-op in the face of runtime log disabling, then the logging library has to be a built-in too.
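Either way, the point of the parameterized form is that nothing gets rendered unless the level is enabled. Here's a minimal sketch of what such logger internals might look like, using SLF4J-style `{}` placeholders (an illustration, not any particular framework's code):

  public final class Log {
      private volatile boolean traceEnabled = false;

      public void trace(String template, Object... args) {
          if (!traceEnabled) {
              return; // args were evaluated at the call site, but nothing is formatted
          }
          // Render the "{}" placeholders only on the enabled path.
          StringBuilder sb = new StringBuilder();
          int argIndex = 0;
          int from = 0;
          int at;
          while ((at = template.indexOf("{}", from)) >= 0) {
              sb.append(template, from, at);
              sb.append(argIndex < args.length ? args[argIndex++] : "{}");
              from = at + 2;
          }
          sb.append(template, from, template.length());
          System.out.println(sb); // stand-in for the real log sink
      }
  }

Note the loophole: the argument expressions are still evaluated at the call site, so only the formatting and toString cost is deferred.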
A logging framework may have the equivalent of prepared statements. You may also nudge usage where the raw string API is `log.traceRaw(String rawMessage)` while the parametrized one has the nicer naming `log.trace(Template t, param1, param2)`.
There are many ideas if you look at SQL libs. In my example I used a different type, but there are other solutions. Be creative.
  logger.log(new Template("foo"))
  logger.log("foo", [])
  logger.prepare("foo").log()

You pass "foo" to Template. The Template will be instantiated before log ever sees it. You conveniently left out where the Foo string is computed from something that actually needs computation.
Like both:
new Template("doing X to " + thingBeingOperatedOn)
new Template("doing " + expensiveDebugThing(thingBeingOperatedOn))
You just complicated everything to get the same class of error. Heck, even the existing good way of doing it, which is less complicated than your way, still isn't safe from it:

  logger("doing {}", expensiveDebugThing(thingBeingOperatedOn))
All your examples have the same issue, both with just string concatenation and with more expensive calls. You can only get around an unknowing or lazy programmer if the compiler can be smart enough to skip these entirely (JIT or not - a JIT would need to see that these calls never amount to anything and decide to skip them after a while. Not deterministically useful, of course).

> You conveniently left out where the Foo string is computed from something that actually needs computation.
I left it out because the comment I was replying to was pointing out that some logs don't have params.
For the approach using a `Template` class, the expectation would be that the docs would call out why this class exists in the first place, namely to enable lazy computation. Doing string concatenation inside a template constructor should raise a few eyebrows when writing or reviewing code.
I wrote `logger.log(new Template("foo"))` in my previous comment for brevity as it's merely an internet comment and not a real framework. In real code I would not even use stringy logs but structured data attached to a unique code. But since this thread discusses performance of stringy logs, I would expect log templates to be defined as statics/constants that don't contain any runtime value. You could also integrate them with metadata such as log levels, schemas, translations, codes, etc.
Regarding args themselves, you're right that they can also be expensive to compute in the first place. You may then design the args to be passed via a callback, which would allow deferring the param computation.
A possible example would be:
  const OPERATION_TIMEOUT = new Template("the operation $operationId timed-out after $duration seconds", {level: "error", code: "E_TIMEOUT"});

  // ...

  function handler(...) {
    // ..
    logger.emit(OPERATION_TIMEOUT, () => ({operationId: "foo", duration: someExpensiveOperationToRetrieveTheDuration()}))
  }
This is still not perfect as you may need to compute some data before the log "just in case" you need it for the log. For example you may want to record the current time, then do the operation. If the operation times out, you use the time recorded before the op to compute how long it ran. If you did not time out and don't log, then getting the current system time is "wasted".

All I'm saying is that `logger.log(str)` is not the only possible API, and that splitting the definition of the log from the actual "emit" is a good pattern.
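For what it's worth, the JDK's own java.util.logging already supports the lazy-args half of this via its Supplier overloads. A small sketch (expensiveDebugThing is a stand-in name):

  import java.util.logging.Logger;

  public class LazyLogDemo {
      private static final Logger LOG = Logger.getLogger(LazyLogDemo.class.getName());

      static String expensiveDebugThing() {
          return "details"; // imagine a costly serialization here
      }

      public static void main(String[] args) {
          // The Supplier is only invoked if FINE is actually enabled, so the
          // expensive computation is deferred, not just the concatenation.
          LOG.fine(() -> "doing X to " + expensiveDebugThing());
      }
  }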
https://typing.python.org/en/latest/spec/literal.html#litera...
#define foo(x) ( \
(void)std::integral_constant<char, (x)[0]>::value, \
foo_impl(x) \
)
(The re-evaluation of x doesn't matter if it compiles.) You can also use a user-defined literal, which has a different ergonomic problem.

For Rust’s macros, a literal can be expressed as `$arg:literal`. This does allow other literals as well, such as int or float literals, but typically the generated code would only work for a string literal.
You may be misunderstanding something here.
If you follow the varargs-style recommendation, then concatenation occurs in the log class.
If you follow the guard-style recommendation, then the interpolated expressions will not be evaluated unless the log level matches.
In the naive approach, concatenation always occurs and all expressions which are part of the interpolation will be evaluated no matter the log level.
Could it be that you were thinking about StringBuffer vs. concatenation, an entirely unrelated problem?
The former still creates strings for the garbage collector to mop up, even when log.traceEnabled() is false, no?
Also, even if the former or latter is implemented as:
  fn trace(log, str, args...) {
    if (!log.tracing) return;
    // ...
  }
Most optimising JIT compilers will hoist the if-condition out when log.tracing is false, anyway.

That's not an overhead at all. Even if it were, it's not comparable to string concatenation.
Regarding the overhead of lambdas and copying params: depends on the language, but usually strings are passed by reference and pass-by-value arguments are just one word long, so we are talking one cycle per variable and 8 bytes of memory, which were already paid anyway.
That said, logging functions that just take a list of vars are even better, like Python's print():
> printtrace("var x and y",x,y)
> def printtrace(*kwargs):
>> print(kwargs) if trace else None
Python gets a lot of flak for being a slow language, but you get so much expressiveness that you can invest in optimization after paying a flat cycle cost.
The problem the OP is pointing out is that some programmers are incompetent and do string concatenation anyway. A mistake which if anything is even easier in Python thanks to string interpolation.
1. Production code running somewhere on a cluster.
2. Released code running somewhere on an end-user's machine.
3. Released production code running somewhere on an end-user's cluster.
And errors happen at weird times, like 3am on a Sunday morning on someone else's cluster. So I'd just as soon not have to wake up, figure out all the paperwork to get access to some other company's cluster, and then figure out how to attach a debugger. Especially when the error is some non-reproducible corner case in a distributed algorithm that happens once every few months, and the failing process is long gone. Just no.
It is so much easier to ask the user to turn up logging and send me the logs. Nine times out of ten, this will fix the problems. The tenth time, I add more logs and ask the user to keep an eye open.
This is more flexible as it still allows runtime configuration of the logging level.
The OP is simply pointing out that some programmers are incompetent and call the trace function incorrectly.
Of course they shouldn't just be dumping text to stdout/stderr, but as long as the library logging is optional (or only logs when the library has reached some kind of unrecoverable state with instructions to file a bug report), logging is often the right call.
It's easier to have logs and turn them off at compile time/runtime than to not have logs and need them once deployed.
Even relatively complex operations, like say converting a document into a PDF, basically only have two useful states: either it worked, or something specific failed, at which point just tell me that thing.
Now, independent software like web servers or databases can have useful logs because they have completely independent interfaces with the outside world. But I call libraries; they don’t call me.
Any library can do a bad job here, that doesn’t come down to logging vs error messages.
Second, it can lose information about at what exact time and in what exact order things happened. For example, cleanup operations during stack unwinding can also produce log messages, and then it’s not clear anymore that the original error happened before those.
Even when you include a timestamp at each level, that’s often not sufficient to establish a unique ordering, unless you add some sort of unique counter.
It gets even more complicated when exceptions are escalated across thread boundaries.
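One cheap way to get a stable total order regardless of timestamp resolution is to stamp every record with a process-wide counter at creation time. A minimal sketch (not any particular framework's API):

  import java.util.concurrent.atomic.AtomicLong;

  final class OrderedLog {
      private static final AtomicLong SEQ = new AtomicLong();

      static void log(String level, String message) {
          // The sequence number is unique and monotonic even when many
          // records share the same millisecond timestamp.
          long seq = SEQ.incrementAndGet();
          System.out.printf("%d %d %s %s%n", System.currentTimeMillis(), seq, level, message);
      }
  }

Of course, across threads this only orders records by when they reached the logger, which is exactly the caveat above.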
Personally I don't mind it... the whole "$outer: $inner" convention naturally lends to messages that still parse in my brain and actually include the details in a pretty natural way. Something like:
"Error starting up: Could not connect to database: Could not read database configuration: Could not open config file: Permission denied"
Tells me the config file for the database has broken permissions. Because the permission denied error caused a failure opening the config file, which caused a failure reading the database configuration, which caused a failure connecting to the database, which caused an error starting up. It's deterministic in that for "$outer: $inner", $inner always caused $outer.
Maybe it's just experience though, in a sense that it takes a lot of time and familiarity for someone to actually prefer the above. Non-technical people probably hate such messages and I don't necessarily blame them.
Then catch the exception on the backup path and wrap it in a custom exception that conveys to the handler the fact that you were on the backup path. Then throw the new exception.
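In Java terms that might look like the following, where BackupPathException and restoreFromBackup are hypothetical names:

  import java.io.IOException;

  // Custom exception conveying that the failure happened on the backup path.
  class BackupPathException extends RuntimeException {
      BackupPathException(String message, Throwable cause) {
          super(message, cause); // the original error stays reachable via getCause()
      }
  }

  class Restore {
      void restoreFromBackup() throws IOException {
          throw new IOException("backup volume unreachable"); // stand-in failure
      }

      void run() {
          try {
              restoreFromBackup();
          } catch (IOException e) {
              // Wrap and rethrow: the handler learns which path failed
              // without losing the low-level cause.
              throw new BackupPathException("could not restore from backup", e);
          }
      }
  }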
Errors are massaged for the reader - a database access library will know that a DNS error occurred and that is (the first step for debugging) why it cannot connect to the specified datastore. The service layer caller does not need to know that there is a DNS error, it just needs to know that the specified datastore is uncontactable (and then it can move on to the appropriate resilience strategy: retry that same datastore, fall back to a different datastore, or tell the API that it cannot complete the call at all).
The caller can then decide what to do (typically say "Well, I tried, but nothing's happening, have yourself a merry 500").
It makes no sense for the Service level to know the details of why the database access layer could not connect, no more than it makes any sense for the database access layer to know why there is a DNS configuration error - the database access just needs to log the reasons (for humans to investigate), and tell the caller (the service layer) that it could not do the task it was asked to do.
If the service layer is told that the database access layer encountered a DNS problem, what is it going to do?
Nothing, the best it can do is log (tell the humans monitoring it) that a DB access call (to a specific DB service layer) failed, and try something else, which is a generic strategy, one that applies to a host of errors that the database call could return.
Should it throw an exception for that to let you know, or should it gracefully fall back so your service stays alive? The middle ground is leaving a log and chugging along; your proposition throws that out of the window.
There are a lot of libraries that have non-idempotent actions. There are a lot of inputs that can be problematic to log, too.
I guess in those cases standard practice is for the lib to return a detailed error, yeah.
As far as traces, trying to solve issues that depend on external systems is indeed a tall order for your code. Isn't it beyond the scope of the thing being programmed?
It is very, very common that the code that you have written isn't even the code that executes. It gets modified by enterprise antivirus or "endpoint security". All too often I see "File.Open" calls return true that the caller has access, but what's actually happened is that the AV has intercepted the call, blocked it improperly, and returned a 0-byte file that exists (even though there is actually a larger file there) instead of saying the file cannot be opened.
I will never, in a million years, be granted access to attach a debugger to such a client computer. In fact, they will not even initially disclose that they are using antivirus. They will just say the machine is set up per company policy and that your software doesn't work, fix it. The assumption is always that your software is to blame, and they give you nearly nothing, except for the logs.
The only way I ever get this solved in a reasonable amount of time is by looking at verbose logs, determining that the scenario they have described is impossible, explaining which series of log messages is not able to occur, yet occurred on their system, and ask them to investigate further. Usually this ends up being closed with a resolution like "Checked SuperProtectPro360 logs and found it was writing infernal error logs at the same time as using the software. Adjusted the monitoring settings and problem is now resolved."
Either way logging the input (file name) is notably not sufficient for debugging if the file can change between invocations. The action can be idempotent and still be affected by other changes in the system.
> trying to solve issues that depend on external systems is indeed a tall order for your code. Isn't it beyond the scope of the thing being programmed?
If my program is broken I need it fixed regardless of why it’s broken. The specific example here of a file changing is likely to manifest as flakiness that’s impossible to diagnose without detailed logs from within the library.
I will say that error handling and logging in general is one of my weak points, but I made a comment about my approach so far being dbg/pdb based: attaching a debugger and creating breakpoints and prints ad hoc rather than writing them in code. I'm sure there are reasons why it isn't used as much and logging in code is so much more common, but I have faith that it's a path worth specializing in.
Back to the file reading example, for a non-idempotent function. Considering we are using an encapsulating approach we have to split ourselves into 3 roles. We can be the IO library writer, we can be the calling code writer, and we can be an admin responsible for the whole product. I think a common trap engineers fall for is trying to keep all of the "global" context (or as much as they can handle) at all times.
In this case of course we wouldn't be writing the non-idempotent library, so that's not a hat we wear; we don't much care about the innards of the function and its state, rather we have a well-defined set of errors that are part of the interface of the function (EINVAL, EACCES, EEXIST).
In this sense we respect the encapsulation boundaries and are provided the information necessary by the library. If we ever need to dive into the actual library code, then first, the encapsulation is broken and we are dealing with a leaky abstraction; second, we just dive into the library code (or the filesystem admin logs themselves).
It's not precisely the type of responsibility that can be handled at design time and in code anyway; when we code we are wearing the calling-module programmer hat. We cannot think of everything that the sysadmin might need at the time of experiencing an error; we have to trust that they will be sufficiently armed with enough tools to gather the information necessary. And thank god for that! Checking /proc/fs, looking at crash dumps, and attaching to processes with dbg will yield far better info than relying on whatever print statements you somehow added to your program.
Anyways at least that's my take on the specific example of glibc-like implementations of POSIX file operations like open(). I'm sure the implications may change for other non-idempotent functions, but at some point, talking about specifics is a bit more productive than talking in the abstract.
I generally agree that I would not expect a core library to do a bunch of logging, at least not onto your application logs. This stuff generally is very stable with a clean interface and well defined error reporting.
But there’s a whole world of libraries that are not as clean, not as stable, and not as well defined. Most libraries in my experience are nowhere near as clean as standard IO libraries. They often do very complex stuff to simplify for the calling application and have weakly defined error behavior. The more complexity a library contains, the more it likely has this issue. Arguably that is leaky abstraction but it’s also the reality of a lot of software and I’m not even sure that’s a bad thing. A good library that leaks in unexpected conditions might be just fine for many real world purposes.
You don't want to throw, because destroying someone's production isn't worth it. You don't want to silently continue in that state, because realistically there's no way for the application owner to understand what is happening and why.
Anything that has a measurable impact on production should be logged above that, except if your system ignores log levels in the first place, but that's another can of worms.
Warnings are just errors that no one wants to deal with.
One test case here is that your library has existed for a decade and was fast, but Java removed a method that let you make it fast; you can still run slow without that API. The Java runtime has a flag that the end user can enable to turn it back on as a stopgap. How do you expect this to work in your model? Do you expect to have an onUnnecessarilySlow() callback already set up, that all of your users have hooked up, which is never invoked for a decade, and then once it actually happens you start calling it and expect it to do something at all sane in those systems?
The second example is all of the scenarios where you're some transitively used library for many users; it makes any callback strategy immediately not work if the person who needs to know about the situation and could take action is the application owner rather than the people writing the library code which called you. It would require every library to offer these same callbacks and transitively propagate things, which would only work if it were such a firm idiomatic pattern in some language ecosystem, and I don't believe that it is in any language ecosystem.
>but Java removed a method that let you make it fast, but you can still run slow without that API
I’d like to see an example of that, because this is an extremely hypothetical scenario. I don’t think any library is so advanced as to anticipate such scenarios and write something to log. And of course Java specifically has a longer cycle of deprecation and removal. :)
As for your second example, let’s say library A is smart and can detect certain issues. Library B depending on it is at higher abstraction level, so it has enough business context to react on them. I don’t think it’s necessary to propagate the problem and leak implementation details in this scenario.
https://github.com/protocolbuffers/protobuf/issues/20760
Java Protobuf also logs a warning now if it can tell you are using gencode old enough that it's covered by a DoS CVE. They actually did a release that broke compatibility of the CVE-covered gencode, but restored it and print a warning in a newer release.
Java 8 is still really popular, probably the most popular single version. It's not just servers in this context, but also Android, where Java 8 is the highest safe target; it's not clear what decade we'll be in when VarHandle would be safe to use there at all.
VarHandle was Java 9 but MemorySegment was Java 17. And the rest of FFM is only in 25 which is fully bleeding edge.
Protobuf may realistically try to move off of sun.misc.Unsafe without the performance regressions, in a way that avoids adopting MemorySegment so as to sidestep the versioning problem, but it takes significant and careful engineering time.
That said, it's always possible to have a waterfall of preferred implementations based on what's supported; it's just always an implementation/verification cost.
It’s very rarely realistic that a client would code up meaningful paths for every possible failure mode in a library. These callbacks are usually reserved for expected conditions.
Yes, that’s the point. You log it until you encounter it for the first time, then you know more and can do something meaningful. E.g. let’s say you build an API client and the library offers a callback for HTTP 429. You don’t expect it to happen, so you just log the errors in a generic handler in client code, but then after some business logic change you hit 429 for the first time. If the library offers you control over what is going to happen next, you may decide how exactly you will retry and what happens to your state in between the attempts. If the library just logs and starts a retry cycle, you may get a performance hit that will be harder to fix.
I would much prefer a library have sane defaults, reasonable logging, and a way for me to plug in callbacks where needed. Writing On429 and a hundred other functions that just point to Logger.Log is not a good use of time.
I still don’t want to have to provide a 429 callback just to log, though. The library should log by default if the callback isn’t registered.
  var client = aClient()
      .onError((request, response) -> {
          LOG.debug(…);
          return FAIL;
      }).build();

And eventually you do this:

  return switch (response.code()) {
      case 429 -> RETRY;
      default -> FAIL;
  };

Or something more interesting, e.g. with more details of the retry strategy.

The application owner should be able to adjust the contexts up or down. This is the point of ownership and where responsibility over which logs matter is handled.
A library author might have ideas and provide useful suggestions, but it's ultimately the application owner who decides. Some libraries have huge blast radius and their `error` might be your `error` too. In other contexts, it could just be a warning. Library authors should make a reasonable guess about who their customer is and try to provide semantic, granular, and controllable failure behavior.
As an example, Rust's logging ecosystem provides nice facilities for fine-grained tamping down of errors by crate (library) or module name. Other languages and logging libraries let you do this as well.
That capability just isn't adopted everywhere.
This is difficult to reconcile with libraries only logging on a debug level.
Usually a program is being used by the user to accomplish something, and if logging is meaningful it's either in a CLI context or a server context. In both cases, errors are more often seen by people/users than by code. Therefore printing them to logs makes sense.
A lib, meanwhile, is being used by a program. So it has a better way to communicate problems with the caller (exceptions, error values; choose the poison of your language). But I almost never want a library to start logging shit, because it’s almost guaranteed to not follow the same conventions as I do in my program elsewhere. Return me the error and let me handle it.
It’s analogous to how Go has an implicit rule that a library should never let a panic escape the library. Internally, fine. But at the package boundary, you should catch panics and return them as an error. You don’t know if the caller wants the app to die because of an error in your lib!
It’s very reasonable that a logging framework should allow higher levels to adjust how logging at lower levels is recorded. But saying that libraries should only log debug is not. It’s very legitimate for a library to log “this looks like a problem to me”.
Consider a Smalltalk-like system, or something like TCL, that doesn’t distinguish between programs and libraries regarding invocation mechanism. How would you handle logging in that case?
When you publish a library, people are going to use it more liberally and in a wider range of contexts (which are therefore harder to predict, including whether a given violation requires human intervention)
A library might also be used in multiple place, maybe deeply in a dependency stack, so the execution context (high level stack) matters more than which library got a failure.
So handling failures should stay in the hands of the developer calling the library and this should be a major constraint for API design.
Ultimately, though, it's important to be using a featureful logging framework (all the better if there's a "standard" one for your language or framework), so the end user can enable/disable different levels for different modules (including for your library).
WARN Foo not available; falling back to Bar
In server contexts this is usually unnecessary noise, if it’s not change detection. Of course, a good logging framework will help you mute irrelevant messages, but as I said in another comment here, it’s a matter of practicality. A library shouldn’t create extra effort for its users to fine-tune logging output, so it must use reasonable defaults.

Some languages (e.g. Java) include a stack trace when reporting an error, which is extremely useful when logging the error. It shows at exactly which point in the code the error was generated, and what the full call stack was to get there.
It's a real shame that "modern" languages or "low level" languages (e.g. Go, Rust) don't include this out of the box, it makes troubleshooting errors in production much more difficult, for exactly the reason you mention.
C++23 added std::stacktrace, but until it includes stack traces from arbitrary exceptions, we're sticking with Boost.
(Even in C, though... errors should be surfaced as return values from functions causing the error, not just logged somewhere. Debug info, sure, have a registerable callback for that.)
I think creating the hooks is very close to just not doing anything here, if no one is going to use the hooks anyway then you might as well not have them.
1) Thrown errors should track the original error to retain its context. In JavaScript errors have a `cause` option which is perfect for this. You can use the `cause` to hold a deep stack trace even if the error has been handled and wrapped in a different error type that may have a different semantics in the application.
2) For logging that does not stop program execution, I think this is a great case for dependency injection. If a library allows its consumer to provide a logger, the application has complete control over how and when the library logs, and can even change it at runtime. If you have a disagreement with a library, for example it logs errors that you want to treat as warnings, your injected logger can handle that.
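A minimal sketch of that injection point, assuming a deliberately tiny logger interface (all names illustrative, not a real framework):

  // The library defines (or borrows) a small logging interface...
  interface LibLogger {
      void warn(String message);
      void error(String message);
  }

  // ...and accepts an implementation from the application.
  class SomeLibrary {
      private final LibLogger log;

      SomeLibrary(LibLogger log) {
          this.log = log;
      }

      void doWork() {
          log.error("could not refresh cache"); // the app decides what "error" means
      }
  }

  class App {
      public static void main(String[] args) {
          // Application side: demote the library's errors to warnings.
          LibLogger demoted = new LibLogger() {
              public void warn(String m)  { System.out.println("WARN  " + m); }
              public void error(String m) { System.out.println("WARN  " + m); }
          };
          new SomeLibrary(demoted).doWork();
      }
  }

The application-provided implementation is where severity disagreements get settled.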
- Critical / Fatal: Unrecoverable without human intervention, someone needs to get out of bed, now.
- Error : Recoverable without human intervention, but not without data / state loss. Must be fixed asap. An assumption didn't hold.
- Warning: Recoverable without intervention. Must have an issue created and prioritised. ( If business as usual, this could be downgrading to INFO. )
The main difference therefore between error and warning is "We didn't think this could happen" vs "We thought this might happen".

So for example, a failure to parse JSON might be an error if you're responsible for generating that serialisation, but might be a warning if you're not.
For me logs should complement metrics, and can in many instances be replaced by tracing if the spans are annotated sufficiently. But making metrics out of logs is both costly and a bit brittle.
God doesn't tell you the future so good luck figuring out which logs you really need.
For example, when a process implies a conversion according to the contract/convention, but we know that this conversion may not be the expected result and the input may be based on semantic misconceptions. E.g., assemblers and contextually truncated values for operands: while there's no issue with the grammar or syntax or intrinsic semantics, a higher-level misconception may be involved (e.g., regarding address modes), resulting in a correct but still non-functional output. So: "In this individual case, there may or may not be an issue. Please, check. (Not resolvable on our end.)"
(Disclaimer: I know that this is very much classic computing and that this is now mostly moved to the global TOS, but still, it's the classic example for a warning.)
What about conditions like "we absolutely knew this would happen regularly, but it's something that prevents the completion of the entire process which is absolutely critical to the organization"
The notion of an "error" is very context dependent. We usually use it to mean "can not proceed with action that is required for the successful completion of this task"
Crashing your web stack because one route hit an error is a dumb idea.
And no, calling it a warning is also dumb idea. It is an error.
This article is a navel gazing expedition.
They're kind of right but you can turn any warning into an error and vice versa depending on business needs that outweigh the technical categorisation.
Warning to me is an error that has very little business logic side effects/impact as opposed to an Error, but still requires attention.
On the other hand it could be that the API had changed slightly. Say they for some reason decided to rename the input parameter postcode to postal_code and I didn’t change my code to fix this. This is 100% a programming error that would be classified as critical but I would not want to panic() the entire server process over it. I just want an alert that hey there is a programming error, go fix it.
But what could also happen is that when I try to construct a request for the external API and the OS is out of memory. Then I want to just crash the process and rely on automatic process restarts to bring it back up. BTW logging an error after malloc() returns NULL needs to be done carefully since you cannot allocate more memory for things like a new log string.
- info - when this was expected and the system/process is prepared for it (like automatic retry, fallback to local copy, offline mode, event-driven with persistent queue, etc.)
- warning - when the system/process was able to continue but in a degraded manner, maybe leaving the decision to retry to the user or another part of the system, or maybe just relying on someone checking logs for unexpected events; this of course depends on whether that external system is required for some action or in some way optional
- error - when the system/process is not able to continue and the particular action has been stopped immediately; this includes situations where a retry mechanism is not implemented for a step required for completion of the particular action
- fatal - you need to restart something, either manually or by an external watchdog; you don't expect this kind of log for a simple 5xx
This scenario may or may not result in data/state loss; it may also be something that you yourself can't immediately fix. And if it's temporary, what is the point of creating an issue and prioritizing it?
I guess my point is that to any such categorization of errors or warnings there are way too many counter examples to be able to describe them like that.
So I'd usually think that Errors are something that I would heuristically want to quickly react to and investigate (e.g. by being paged), while Warnings are something I would periodically check in on (e.g. weekly).
That being said, I find tying the level to expected action a more useful way to classify them.
If for debugging, the levels seem relevant in the sense of how quickly we are able to use that information to understand what is going wrong. Out of a potential sea of logs we want to see first what the most likely culprits were for something going wrong. So the higher the log level, the higher the likelihood of this event having caused something to go wrong.
If for alerting, they should reflect how bad this particular thing happening is for the business, and would help us to set a threshold for when we page or have to react to something.
Also, you should have event logs you can look at to make administrative decisions. That information surely fits into those; you will want to know about it when deciding to switch to another provider or renegotiate something.
For service A, a 500 error may be common and you just need to try again, and a descriptive 400 error indicates the original request was actually handled. In these cases I'd log as a warning.
For service B, a 500 error may indicate the whole API is down, in which case I'd log a warning and not try any more requests for 5 minutes.
For service C, a 500 error may be an anomaly, in which case I'd treat it as a hard error and log as error.
Also, you can't know how frequently you'll get 500s at the time you're doing integration, so you'll have to go back after some time to revisit log severities. Which doesn't sound optimal.
In theory some sort of internal API status tracking thing would be better that has some heuristic of is the API up or down and the error rate. It should warn you when the API is down and when it comes back up. Logging could still show an error or a warning for each request but you don’t need to get an email about each one.
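A rough sketch of that heuristic, logging only on state transitions instead of per request (the threshold is arbitrary):

  import java.util.concurrent.atomic.AtomicInteger;

  // Tracks consecutive failures and logs only when the API's perceived state flips.
  final class ApiHealth {
      private static final int DOWN_THRESHOLD = 5;
      private final AtomicInteger consecutiveFailures = new AtomicInteger();
      private volatile boolean down = false;

      void recordFailure() {
          if (consecutiveFailures.incrementAndGet() == DOWN_THRESHOLD && !down) {
              down = true;
              System.out.println("ERROR third-party API appears to be down");
          }
      }

      void recordSuccess() {
          consecutiveFailures.set(0);
          if (down) {
              down = false;
              System.out.println("INFO third-party API recovered");
          }
      }
  }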
Because what I'd want to know is how often does it fail, which is a metric not a log.
So expose <third party api failure rate> as a metric not a log.
If feeding logs into Datadog or similar is the only way you're collecting metrics, then you aren't treating your observability with the respect it deserves. Put in real counters so you're not just reacting to what catches your eye in the logs.
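With Micrometer, for instance, that can be as small as the following (the metric name and tag are made up):

  import io.micrometer.core.instrument.Counter;
  import io.micrometer.core.instrument.MeterRegistry;
  import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

  public class ThirdPartyMetrics {
      public static void main(String[] args) {
          MeterRegistry registry = new SimpleMeterRegistry();
          // Tag by provider so dashboards can split the failure rate out.
          Counter failures = registry.counter("thirdparty.api.failures", "provider", "example");
          failures.increment(); // call wherever a third-party request fails
          System.out.println(failures.count());
      }
  }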
If the third party being down has a knock-on effect to your own system functionality / uptime, then it needs to be a warning or error, but you should also put in the backlog a ticket to de-couple your uptime from that third-party, be it retries, queues, or other mitigations ( alternate providers? ).
By implementing a retry you planned for that third party to be down, so it's just business as usual if it succeeds on retry.
How do you define uptime? What if e.g. it's a social login / data linking and that provider is down? You could have multiple logins and your own e-mail and password, but you still might lose users because the provider is down. How do you log that? Or do you only put it as a metric?
You can't always easily replace providers.
The different issue is when third party broke the contract, so suddenly you get a lot of 4xx or 5xx responses, likely unrecoverable. Then you get ERROR level messages in the log (because it’s unexpected problem) and an alert when there’s a spike.
It’s not controversial; you just want something different. I want the opposite: I want to know why/how it fails; counting how often it does is secondary. I want a log that says "I sent this payload to this API and I got this error in return", so that later I can debug if my payload was problematic, and/or show it to the third party if they need it.
eg: 2.0 for "trace" / 1.0 for "debug" / 0.0 for "info" / -1.0 for "warn" / -2.0 for "error that can be handled"
But it is still an error condition, i.e. something does need to be fixed - either something about the connection string (i.e. in the local system) is wrong, or something in the other system or somewhere between the two is wrong (i.e. and therefore needs to be fixed). Either way, developers on this end (I mean someone reading the logs - true that it might not be the developers of the SMTP mailer) need to get involved, even if it is just to reach out to the third party and ask them to fix it on their end.
A condition that fundamentally prevents a piece of software from working not being considered an error is mad to me.
Plus, a remote server not being reachable doesn't say anything about where the problem lies. Did you mess up a routing table? Did your internet connection get severed? Did you firewall off an important external server? Did you end up on a blacklist of some kind?
These types of messages are important error messages for plenty of people. Just because your particular use case doesn't care about the potential causes behind the error doesn't mean nobody does.
Maybe or maybe not. If the connection problem is really due to the remote host then that's not the problem of the sender. But maybe the local network interface is down, maybe there's a local firewall rule blocking it,...
If you know the deployment scenario then you can make reasonable decisions on logging levels but quite often code is generic and can be deployed in multiple configurations so that's hard to do
But if you are the SMTP library and you unilaterally log that as an error, that is an issue.
The closest we have is something like Java with exceptions in type signatures, but we would have to ban any kind of exception capture except from final programs, and promote basically any logger call into an exception that you could remotely suppress.
We could philosophize about a world with compilers made out of unobtanium - but in this reality a library author cannot know what conditions are fixable or necessitate a fix or not. And structured logging has way too many deficiencies to make it work from that angle.
I don't even know how to say whether these definitions are right or wrong, it's just whatever you feel like it should be. The important thing is what your program logs should be documented somewhere, the next most important thing is that your log levels are self consistent and follow some sort of logic, and that I would have done it exactly the same is not really important.
At the end of the day, this is just bikeshedding about how to collapse ultra specific alerting levels into a few generic ones. E.g. RFC 5424 defines 8 separate log levels for syslog and, while that's not a ceiling by any means, it's easy to see how there's already not really going to be a universally agreed way to collapse even just these down to 4 categories.
There’s a whole industry of “we’ll manage them for you” which is just enabling dysfunction.
That's exactly why you log it as a warning. People get warned all the time about the dangers of smoking. It's important that people be warned about smoking; these warnings save lives. People should pay attention to warnings, which let them know about worrisome concerns that should be heeded. But guess what? Everyone has a story about someone who smoked until they were 90 and died in a car accident. It is not an error that somebody is smoking. Other systems will make their own bloody decisions and firewalling you off might be one of them. That is normal.
What do you think a warning means?
- An error is an event that someone should act on. Not necessarily you. But if it's not an event that ever needs the attention of a person then the severity is less than an error.
Examples: invalid credentials, HTTP 404 Not Found, HTTP 403 Forbidden (all of the HTTP 400s, by definition).
It's not my problem as a site owner if one of my users entered the wrong URL or typed their password wrong, but it's somebody's problem.
A warning is something that A) a person would likely want to know and B) wouldn't necessarily need to act on
INFO is for something a person would likely want to know and unlikely needs action
DEBUG is for something likely to be helpful
TRACE is for just about anything that happens
EMERG/CRIT are for significant errors of immediate impact
PANIC the sky is falling, I hope you have good running shoes
Some of these things can be ameliorated with well-behaved UI code, but a lot cannot, and if your primary product is the API, then you're just going to have scads of ERRORs to triage where there's literally nothing you can do.
I'd argue that anything that starts with a 4 is an INFO, and if you really wanted to be thorough, you could set up an alert on the frequency of these errors to help you identify if there's a broad problem.
In other words, of course you don't alert on errors which are likely somebody else's problem. You put them in the log stream where that makes sense and can be treated accordingly.
Personally, I'd further qualify that. It should be logged as an error if the person who reads the logs would be responsible for fixing it.
Suppose you run a photo gallery web site. If a user uploads a corrupt JPEG, and the server detects that it's corrupt and rejects it, then someone needs to do something, but from the point of view of the person who runs the web site, the web site behaved correctly. It can't control whether people's JPEGs are corrupt. So this shouldn't be categorized as an error in the server logs.
But if you let users upload a batch of JPEG files (say a ZIP file full of them), you might produce a log file for the user to view. And in that log file, it's appropriate to categorize it as an error.
You cannot accurately assign responsibility until you understand the problem.
4xx is for client side errors, 5xx is for server side errors.
For your situation you'd respond with an HTTP 400 "Bad Request" and not an HTTP 500 "Internal Server Error" because the problem was with the request not with the server.
So the library you are using fires too many debug messages? You know that you can always turn them off by ignoring specific sources, like ignoring namespaces? So what exactly do you lose? Right. Almost nothing.
As for my code and libraries I always tend to do both, log the error and then throw an exception. So I am on the safe side both ways. If the consumer doesn’t log the exception, then at least my code does it. And I give them the chance to do logging their way and ignore mine. I am doing a best-guess for you… thinking to myself, what’s an error when I’d use the library myself.
You don’t trust me? Log it the way you need to log it, my exception is going to transport all relevant data to you.
This has saved me so many times, when getting bug reports by developers and customers alike.
There are duplicate error logs? Simply turn my logging off and use your own. Problem solved.
If it is a program-level error, maybe a warning and returning the error is the correct way to go. Maybe it’s not? It depends on the context.
And this basically is the answer to any software design question: It depends.
https://docs.openstack.org/oslo.log/latest/user/guidelines.h...
FWIW, "ERROR: An error has occurred and an administrator should research the event." (vs WARNING: Indicates that there might be a systemic issue; potential predictive failure notice.)
Say I have a daily cron job that is copying files to a remote location (e.g. backups), and the _operation_ fails because for some reason the destination is not writable.
Your suggestion would get me _both_ alerts, as I want; the article's suggestion would not alert me about the operation failing because, after all, it's not something happening in the local system, the local program is well configured, and it's "working as expected" because it needs neither code nor configuration fixing.
Obviously this depends on teams, application context and code bases. But "knowing if action needs to be taken" can't be boiled into a simple log level for most cases.
There is a reason most alerting software like PagerDuty is just a trigger interface, and the logic for what constitutes the "error" is typically some data-level query in something like Datadog, Sumo Logic, Elasticsearch, or Grafana that either looks for specific string messages, error types, or a collection of metric conditions.
Cool if you want to consider that any error-level log needs to be an actionable error, but what quickly happens is that some error cases are auto-retryable due to infrastructure conditions that the application has completely no knowledge of. And running some sort of infrastructure query at error-write time in code, e.g.
1. Error is thrown
2. Prior to logging, guess/determine if the case can be retried through a few HTTP calls
3. Log either a warning or an error
Seems to be a complete waste when we could just write some sort of query in our log/metrics management platform of choice which takes into account the infrastructure conditions for us.
In your example,
- "Error while serving file" would be a bad error message,
- "Failed to read file 'foo/bar.html'" would be acceptable, and
- "Failed to read file 'foo/bar.html' due to EIO: Underlying device error (disk failure, I/O bus error). Please check the disk integrity." would be perfect (assuming the http server has access to the underlying error produced by the read operation).
The first error message was "No usable public key found, which is required for the deployment" which doesn't tell me what I have to do to correct the problem. Nothing about even where it's looking for keys, what is supposed to create the key or how I am supposed to create the key.
There are other examples and discussion of what they should say in the link.
Edit: Here's another one that I filed: https://github.com/containers/podman/issues/20775
At work, I can think of cases where we error when data mismatches between two systems. It’s almost always the fault of system B but we present the mismatch error neutrally. Experienced developers just know to fix B but we shouldn’t rely on that.
https://web.mit.edu/jemorris/humor/500-miles
Or you can't connect because of a path MTU error.
Or because the TTL is set too low?
Your software at the server level has no idea what's going wrong at the network level, all you can send is some kind of network problem message.
My 3D printer will try to walk you through basic fixes with pictures on the device's LCD panel, but for some errors it will display a QR code to their wiki which goes into a technical troubleshooting guide with complex instructions and tutorial videos.
What is possible is to include as much information as possible about what the system was trying to do. If there's a file IO error, include the full path name. Saying "file not found" without saying which file was not found infuriates me like few other things.
If some required configuration option is not defined, include the name of the configuration option and from where it tried to find said configuration (config files, environment, registry etc). And include the detailed error message from the underlying system if any.
Regular users won't have a clue how to deal with most errors anyway, but by including details at least someone with some system knowledge has a chance of figuring out how to fix or work around the issue.
If yes: ERROR
If I want to check it tomorrow: WARNING
If it's useful for debugging: INFO
Everything else: DEBUG
The problem with the article's approach is that libraries don't have enough context. A timeout calling an external API might be totally fine if you're retrying, but it's an ERROR if you've exhausted retries and failed the user's request.
We solve this by having libraries emit structured events with severity hints, then the application layer decides the final log level based on business impact. A 500 from a recommendation service? Warning. A 500 from the payment processor? Error.
A connection timed out, retrying in 30 secs? That's a warning. Gave up connecting after 5 failed attempts? Now that's an error.
I don't care so much if the origin of the error is within the program, or the system, or the network. If I can't get what I'm asking for, it can't be a mere warning.
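That escalation is easy to encode directly in the retry loop; a minimal sketch (tryConnect is a stand-in):

  class RetryingConnector {
      // Warn per failed attempt; only exhaustion is logged as an error.
      void connectWithRetry(int maxAttempts) throws InterruptedException {
          for (int attempt = 1; attempt <= maxAttempts; attempt++) {
              if (tryConnect()) {
                  return;
              }
              if (attempt < maxAttempts) {
                  System.out.println("WARN connection timed out, retrying in 30s (attempt " + attempt + ")");
                  Thread.sleep(30_000);
              }
          }
          System.out.println("ERROR gave up connecting after " + maxAttempts + " failed attempts");
      }

      boolean tryConnect() {
          return false; // stand-in for the real connection attempt
      }
  }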
A warning can be ignored safely. Warnings may be 'debugging enabled, results cannot be certified' or something similar.
An error should not be ignored, an operation is failing, data loss may be occurring, etc.
Some users may be okay with that data loss or failing operation. Maybe it isn't important to them. If the program continues and does not error in the parts that matter to the user, then they can ignore it, but it is still objectively an error occurring.
A fatal message cannot be ignored; the system has crashed. It's the last thing you see before shutdown is attempted.
if we're talking about logs from our own applications that we have written, the program should know because we can write it in a way that it knows.
user-defined config should be verified before it is used. make a ping to port 25 to see if it works before you start using that config for actual operation. if it fails the verification step, that's not an error that needs to be logged.
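a minimal sketch of that verification step in Java (host and timeout are placeholders):

  import java.io.IOException;
  import java.net.InetSocketAddress;
  import java.net.Socket;

  class SmtpConfigCheck {
      // returns true if the configured mailserver accepts a TCP connection on port 25
      static boolean reachable(String host, int timeoutMillis) {
          try (Socket socket = new Socket()) {
              socket.connect(new InetSocketAddress(host, 25), timeoutMillis);
              return true;
          } catch (IOException e) {
              return false; // reject the config up front instead of logging errors later
          }
      }
  }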
i'm following the author's example that an SMTP connection error is something you want to investigate and fix. if you have a different system with different assumptions where your response to a mailserver being unreachable is to ignore it, obviously that example doesn't apply for you. i'm not saying, and i don't think the author is saying that SMTP errors should always or never be logged as errors.
when the mailserver endpoint has changed, you should do the thing that makes sense in the context of your application. if it's not something that the person responsible for reviewing the logs needs to know about, don't log it. if it is, then log it.
Even if your libraries use nothing but exceptions or return codes you still end up with levels. You still end up with logs that have information in them that gets ignored when it shouldn't be because there's so much noise that people get tired of all the "cries of wolf."
Occasionally one is at a high enough level to know for sure that something needs fixing and for this I use "CRITICAL" which is my code for "absolutely sure that you can't ignore this."
IMO it's about time AI was looking at the logs to find out if there was something we really need to be alerted to.
This post frames the problem almost entirely from a sysadmin-as-log-consumer perspective, and concludes that a correctly functioning system shouldn’t emit error logs at all. That only holds if sysadmins are the only "someone" who can act.
In practice, if there is a human who needs to take action - whether that’s a developer fixing a bug, an infra issue, or coordinating with an external dependency - then it’s an error. The solution isn’t to downgrade severity, but to route and notify the right owner.
Severity should encode actionability, not just system correctness.
Did it do what it was supposed to do, but in a different way or defer for retrying later? Then WARN.
Did it fail to do what it needed to do? ERROR
Did it do what it needed to do in the normal way because it was totally recoverable? INFO
Did data get destroyed in the process? FATAL
It should be about what the result was, not who will fix it or how. Because that might change over time.
> Did it fail to do what it needed to do? ERROR
> Did it do what it needed to do in the normal way because it was totally recoverable? INFO
We have a web-facing system (it uses a custom request-response protocol on top of Websocket... it's an old system) that users are routinely trying to, ahem, automate even though it's technically against ToS but hey, as long as we don't catch them? Anyway, it's quite often to see user connections that send malformed commands and then get disconnected after we send them a critical_error/protocol_error message — we do have quite extensive validation logic for user commands.
So, how should such errors be logged in your opinion? I know that we originally logged them as errors but very quickly changed to warnings, and precisely for the reasons outlined in TFA: if some kewl haxxor can't figure out how to quote strings in JSON, it's not really something we can fix. We probably should keep the records, just to know that "oh, some script kiddie was trying to hack us during that time period" but nothing more than that; it definitely doesn't warrant the "hey, there are too many errors in sfo2 location, please take a look" summons at 3:00 AM from the ops team.
So it's natural for error messages to be expected, as you progressively add and then clear up edge cases.
My company now has a log aggregator that scans the logs for errors, when it finds one, creates a Trello card, uses opus to fix the issue and then propose a PR against the card. These then get reviewed, finished if tweaks are necessary and merged if appropriate.
Our production system pages oncall for any errors. At night it will only wake somebody up for a whole bunch of errors. This discipline forces us to take a look at every ERROR and decide if it is spurious and out of our control, or something we can deal with. At some point our production system will reach a scale where there are errors logged constantly and this strategy doesn't make sense any more. But for now it helps keep our system clean.
  if error == NULL and operationFailed then log error

Otherwise, let the client side do the error handling (in terms of logging).
A mail program not being able to, *checks notes*, send emails sounds like an error to me. (Unless you implement retries.)
I think discussions that argue over a specific approach are a form of playing checkers.
You’re kind of telling a story to future potential trouble-shooters.
When you don’t think about it at all (it doesn’t take much), you tend to log too much and too little and at the wrong level.
But this article isn’t right either. Lower-level components typically don’t have the context to know whether a particular fault requires action or not. And since systems are complex, with many levels of abstractions and boxes things live in, actually not much is in a position to know this, even to a standard of “probably”.
error_msg = "xyz went wrong"
log.warn(error_msg)
My comment on the CR was about this being an inherent contradiction and incredibly confusing to know if it's actually an error or a warning..In an ideal world things like logs and alarms (alerting product support staff) should certainly cleanly separate things that are just informative, useful for the developer, and things that require some human intervention.
If you don't do this then it's like "the boy who cried wolf", and people will learn to ignore errors and alarms since you've trained them to understand that usually no action is needed. It's also useful to be able to grep through log files and distinguish failures of different categories, not just grep for specific failures.
* Database timeout (the database is owned by a separate oncall rotation that has alerts when this happens)
* ISE in downstream service (return HTTP 5xx and increment a metric but don’t emit an error log)
* Network error
* Downstream service overloaded
* Invalid request
Basically, when you make a request to another service and get back a status code, your handler should look like:
  logfunc = logger.error if 400 <= status <= 499 and status != 429 else logger.warning
(Unless you have an SLO with the service about how often you’re allowed to hit it and they only send 429 when you’re over, which is how it’s supposed to work but is sadly rare.)

So people writing software are supposed to guess how your organization assigns responsibilities internally? And you're sure that the database timeout always happens because there's something wrong with the database, and never because something is wrong on your end?
As for queries that time out, that should definitely be a metric, but not pollute the error loglevel, especially if it’s something that happens at some noisy rate all the time.
1. When I personally see database timeouts at work it's rarely the database's fault, 99 times out of 100 it's the caller's fault for their crappy query; they should have looked at the query plan before deploying it. How is the error-handling code supposed to know? I log timeouts (that still fail after retry) as errors so someone looks at it and we get a stack trace leading me to the bad query. The database itself tracks timeout metrics but the log is much more immediately useful: it takes me straight to the scene of the crime. I think this is OP's primary point: in some cases, investigation is required to determine whether it's your service's fault or not, and the error-handling code doesn't have the information to know that.
2. As with exceptions vs. return values in code, the low-level code often doesn't know how the higher-level caller will classify a particular error. A low-level error may or may not be a high-level error; the low-level code can't know that, but the low-level code is the one doing the logging. The low-level logging might even be a third party library. This is particularly tricky when code reuse enters the picture: the same error might be "page the on-call immediately" level for one consumer, but "ignore, this is expected" for another consumer.
I think the more general point (that you should avoid logging errors for things that aren't your service's fault) stands. It's just tricky in some cases.
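To sketch the retry-then-error pattern from point 1 (a hypothetical run_query helper; only Python's stdlib logging is assumed):

import logging

logger = logging.getLogger("db.client")

def run_with_retry(run_query, conn, sql):
    # run_query is a hypothetical helper that raises TimeoutError on timeout
    try:
        return run_query(conn, sql)
    except TimeoutError:
        try:
            return run_query(conn, sql)  # one retry for transient blips
        except TimeoutError:
            # exc_info=True attaches the traceback, leading straight to the
            # call site that issued the slow query
            logger.error("query timed out after retry", exc_info=True)
            raise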
Not OP, but this part hits the same for me.
In the case where your client app is killing the DB through too many calls (e.g. your cache is not working), you should be able to detect it and react without waiting for the DB team to come to you after they've investigated the whole thing.
But you can't know in advance whether the DB connection errors are your fault or not, so logging them to cover the worst-case scenario (you're the cause) is sensible.
I feel you're thinking about system-wide downtime with everything timing out consistently, which would be detected by the generic database server vitals and basic logs.
But what if the timeouts are sparse and only 10 or 20% above usual from the DB's point of view, yet they affect half of your registration service's requests? You need it logged application-side so the aggregation layer has any chance of catching it.
On writing to ERROR or not: the thresholds should be whatever your dev and on-call teams decide. Nobody outside of them will care; I feel it's like arguing about which drawer the socks should go in.
I was in an org where any single error below CRITICAL was ignored by the on-call team, and everything below that only triggered alerts on aggregation or under special conditions. Pragmatically, we ended up slicing it as: ERROR goes to the APM; anything below gets no aggregation and is just available when a human wants to look at it for whatever reason. I'd expect most orgs to end up with that kind of split, where the levels are hooked to processes rather than to some base meaning.
In the general case, the person writing the software has no way of knowing that "the database is owned by a separate oncall rotation". That's about your organization chart.
Admittedly, they'd be justified in assuming that somebody is paying attention to the database. On the other hand, they really can't be sure that the database is going to report anything useful to anybody at all, or whether it's going to report the salient details. The database may not even know that the request was ever made. Maybe the requests are timing out because they never got there. And quite possibly the requests are timing out because you're sending too many of them.
I mean, no, it doesn't make sense to log a million identical messages, but that's rate limiting. It's still an error if you can't access your database, and for all you know it's an error that your admin will have to fix.
As for metrics, I tend to see those as downstream of logs. You compute the metric by counting the log messages. And a metric can't say "this particular query failed". The ideal "database timeout" message should give the exact operation that timed out.
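One concrete way to read “metrics downstream of logs” (a sketch assuming Python's stdlib logging; a real system would export the counts to a metrics backend instead of holding them in memory):

import logging
from collections import Counter

class CountingHandler(logging.Handler):
    # Sketch only: derives per-level counters from the log stream.
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        self.counts[record.levelname] += 1

handler = CountingHandler()
logging.getLogger("db").addHandler(handler)
# handler.counts["ERROR"] is now the error-count metric, while the log
# records themselves still name the exact query that failed.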
Across everywhere I’ve worked, my rough guess is that 75% of incidents on internal services were reported only by service consumers (humans posting in channels). Of the remaining 25% that were detected by monitoring, the vast majority were detected long after consumers started seeing errors.
All the RCAs and “add more monitoring” sprints in the world can’t add accountability equivalent to “customers start calling you/having tantrums on Twitter within 30sec of a GSO”, in other words.
The corollary is “internal databases/backend services can be more technically important to the proper functioning of your business, but frontends/edge APIs/consumers of those backend services are more observably important by other people. As a result, edge services’ users often provide more valuable telemetry than backend monitoring.”
In brief: drivers don’t obey the speed limit and backend service operators don’t prioritize monitoring. Both groups are supposed to do those things, but they don’t and we should assume they won’t change. As a result, it’s a good idea to wear seatbelts and treat downstream failures as urgent errors in the logs of consuming services.
Does it?
Don't most stacks have an additional layer of triage on logs to detect anomalies, etc.? It can be your New Relic/Datadog/Sentry or a self-made filtering system, but nowadays I'd assume the base log levels are only a rough estimate of whether a single event has any chance of being problematic.
I'd bet the author also has strong opinions about HTTP error codes, and while I empathize, those ships have long sailed.
What he meant is that it's an unexpected condition, one that should never have happened but did, so it needs to be fixed.
> How is someone supposed to even notice a random error log?
Logs should be monitored.
> At the places that I've worked, trying to make alerting be triggered on only logs was always quite brittle, it's just not best practice.
Because the logs sucked. It's not common practice, but it should be best practice.
> Throw an exception / exit the program if it's something that actually needs fixing!
I understand the sentiment, but some programs cannot/should not exit. Or you have an error in a subsystem that should not bring down everything.
I completely agree with the author's approach, but I also understand that good logging discipline is rare. I've worked in many places where the logs sucked (they just dumped stuff) and had to restructure them.
For impossible errors, exiting and sending the dev team as much info as possible (thread dump, memory dump, etc.) is helpful.
In my experience, logs are good for finding out what is wrong once you know something is wrong. Also, if the server is written to have enough, but not too much, logging, you can read the logs over and get a feel for normal operation.
e.g. a message logged at level WARN whose text begins "This error is...", which then trips an "error" match in monitoring and pages someone out.
Probably breaching multiple rules here around not parsing logs like that, etc., but it's cropped up so many times that I get quite annoyed by it.
If your parsing, filtering, and monitoring setup parses strings that happen to correspond to log level names in positions other than that of log levels as having the semantics of log levels, then that's a parsing/filtering error, not a logging error.
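For illustration, assuming structured (JSON) logs with a hypothetical event shape, the difference between the two looks like this:

import json

line = '{"level": "WARN", "message": "This error is expected, retrying"}'
event = json.loads(line)

# Broken: a substring match pages on a WARN whose *text* mentions "error".
pages_naively = "error" in line.lower()        # True

# Correct: only the dedicated level field carries level semantics.
pages_correctly = event["level"] == "ERROR"    # False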
That said, the thing I've come to find useful as a subcategory of error is distinguishing errors due to data problems from errors due to other issues.
I hate the concept of “errors” in general. They’re an excuse to avoid responsibility and ship broken software with known undefined behavior.
The very notion of an error basically means “there was behavior I chose not to handle or do anything about, but which I knew would happen”, which is essentially just negligence.
Not everything that a library considers an error is an application error. If you log an error, something is absolutely wrong and requires attention. If you consider such a log as "possibly wrong", it should be a warning instead.
How do you know?
Expecting arbitrary services to be able to deal with absolutely any kind of failure in such a way that users never notice is deeply unrealistic.
Warning, in contrast, is what I use for a condition that the developer predicted and handled but probably indicates the larger context is bad, like "this query arrived from a trusted source but had a configuration so invalid we had to drop it on the floor, or we assumed a default that allowed us to resolve the query but that was a massive assumption and you really should change the source data to be explicit." Warning is also where I put things like "a trusted source is calling a deprecated API, and the deprecation notification has been up long enough that they really should know better by now."
Where all of this matters is process. Errors trigger pages. Warnings get bundled up into a daily report that on-call is responsible for following up on, sometimes by filing tickets to correct trusted sources and sometimes by reaching out to owners of trusted sources and saying "Hey, let's synchronize on your team's plan to stop using that API we declared is going away 9 months ago."
“External service down, not my problem, nothing I can do” is hardly ever the case - e.g. you may need to switch to a backup provider, initiate a support call, or at least try to figure out why it’s down and for how long.
If a fix is needed on your side for this matter, having a conversation with a customer might be useful before breaking more stuff. ("We have no state code in EU. Why is that mandatory?").
Bold of you to assume that there are system administrators. All too often these days it's "devops" aka some devs you taught how to write k8s yamls.
I could live with 4:
- Error - alert me now.
- Warning - examine these later.
- Info - important context for investigations.
- Debug - usually off in prod.
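Wiring those four levels to processes could look roughly like this (a sketch using Python's stdlib logging, with file handlers standing in for real sinks such as a pager integration):

import logging

root = logging.getLogger()
root.setLevel(logging.INFO)  # Debug: usually off in prod

pager = logging.FileHandler("pager.log")      # stand-in for a real pager hook
pager.setLevel(logging.ERROR)                 # Error: alert me now

review = logging.FileHandler("review.log")    # Warning: examine these later
review.setLevel(logging.WARNING)

context = logging.FileHandler("app.log")      # Info: investigation context
context.setLevel(logging.INFO)

for h in (pager, review, context):
    root.addHandler(h)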
That says it all:
- Backseat driving
- Not a developer by trade