Open Source Infrastructure and DBIx::Class Diagnostics Improvements

Many people know that Peter Rabbitson has been wrapping up his time with DBIx::Class after his attempt to get funding for working on it didn’t work out. I have long had some scraps of notes on a post about that whole situation and how troubling it is but I could just never make it happen. The following is the gigantic commit message of the merge of a large chunk of his work. I offered to host it since I think that it should actually get read. I have left it almost completely unchanged, except to make things proper links. More thoughts after the post.


Merge the ResultSource diagnostics rework

…And this is what the products that we make do: these are the consequences. They either empower people, or they steal bits of their lives. Because experiences are all we have in life: if you think about them as grains of sand in an hour glass, once those grains are gone – they are gone. And experiences with people and experiences with things: they use up the same grains.

That’s why we have a profound responsibility to respect the experiences of the people that we build for…

Aral Balkan: Free is a Lie TNW 2014

This set of commits is unusual - the 2+kloc of changes (in lib/ alone) do not add any new runtime functionality, nor do these changes alter significantly any aspect of DBIC’s runtime operation. Instead this is a culmination of a nearly 4 months long death-march ensuring the increasingly complex and more frequent (courtesy of rising use of Moo(se)) failure modes can be reasoned about and acted upon by ordinary users, without the need to reach out to a support channel.

The changeset has been extensively tested against 247 downstream CPAN dists (as described at the end of commit 12e7015) and against several darkpan test suites. As of this merge there are no known issues except RT#114440 and a number of dists (enumerated in 12e7015) now emitting REALLY LOUD though warranted and actionable, diagnostic messages.

The diagnostic is emitted directly on STDERR - this was a deliberate choice designed to:

  1. prevent various test suites from failing due to unexpected warnings

  2. make the warnings harder to silence by a well meaning but often too eager-yet-not-sufficiently-dilligent staffer, before the warnings had a chance to reach a senior developer

What follows is a little bit of gory technical details on the commit series, as the work is both generic/interesting enough to be applied to other large scale systems, and is “clever” enough to not be easily reasoned about without a summary. Think of this as a blog post within an unusual medium ;)

Background

Some necessary history: DBIC as a project is rather old. When it got started Moose wasn’t a thing. Neither (for perspective) was jQuery or even Tw(i)tt(e)r. The software it was modeled on (Class::DBI) has “single-level” metadata: you have one class per table, and columns/accessor were defined on that class and that was it. At the time mst made the brilliant decision to keep the original class-based API (so that the CDBI test suite can be reused almost verbatim, see ea2e61b) while at the same time moving the metadata to a “metaclass instance” of sorts. The way this worked was for each level of:

  • Individual Result Class (class itself, not instance)
  • Result Class attached to a Schema class
  • Result Class attached to a Schema instance

to have a separate copy-on-the-spot created metadata instance object of DBIx::Class::ResultSource. One can easily see this by executing:

~/dbic_checkout$ perl -Ilib -It/lib -MDBICTest -MData::Dumper -e '
  my $s = DBICTest->init_schema;
  $Data::Dumper::Maxdepth = 1;
  warn Dumper [
    DBICTest::Schema::Artist->result_source_instance,
    DBICTest::Schema->source("Artist"),
    $s->source("Artist"),
  ]
'

The technique (and ingenious design) worked great. The downside was that nobody ever really audited the entire stack past the original implementation. The codebase grew, and mistakes started to seep in: sometimes modifications (add_columns, etc) would happen on a derivative metadata instance, while the getters would still be invoked on the “parent” (which at this point was oblivious of its “child” existence, and vice versa). In addition there was a weird accessor split: given a result instance one could reach different metadata instances via either result_source() or result_source_instance(). To add insult to the injury the latter method is never defined anywhere, and was always dynamically brought to life at runtime via an accessor maker call on each individual class.

If that weren’t bad enough, some (but crucially not all) routines used to manipulate resultsource metadata were proxied to the main Result classes, also aiming at allowing the reuse of the existing Class::DBI test suite, and to provide a more familiar environment to Class::DBI converts. The complete map of current metadata manipulation methods and their visibility from a typical ResultClass can be seen at the end of commit message 28ef946.

The downside was that to an outsider it would seem only natural that if in order to make something metadata-related happen, one normally calls:

SomeResultClass->set_primary_key

then it makes sense that one should be able to override it via:

sub SomeResultClass::set_primary_key {
  my $ret = shift->next::method(@_);
  { do extra stuff }
}

That thinking has been applied to pretty much all straight-pass-through getters in the wild, with the expectation that DBIC will respect them throughout, like e.g.. In reality this never happened - half of DBIC would never even look at the Result class and instead simply called the needed method on the result source instance directly. As noted in 28ef946: the overwhelmingly common practice is to hook a method in a Result class and to “hope for the best”. A rare example of “doing it right” would be DBIx::Class::ResultSource::MultipleTableInheritance, but as can be seen from its SYNOPSIS the API is rather counterintuitive (what is table_class() anyway?!) and more importantly - the earlier example seems “just right”.

Another innovation (remember: pre-Moose) was the use of the just-in-time implemented alternative C3 method resolution order (MRO) right on top of the default perl DFS MRO. While DBIC used multiple inheritance (MI) from the start, with all the corresponding problems and non-scalable “solutions”, it wasn’t until C3 MRO became available that the true potential of the resulting plugin system became clear. To this day (mid-2016) MI, as used within the DBIC ecosystem, remains the single most flexible (and thus superior given the problem domain) plugin-system on CPAN, easily surpassing rigid delegation, and having an upper hand on role-based solutions as promoted by the Moo(se) ecosystem. It must be noted that delegation and/or roles are not without uses - they are an excellent (and frankly should be a default) choice for many application-level systems. It is the mid-level to low-level libraries like DBIC, where the stateless nature of a predictable yet non-coordinated call-order resolution truly begins to shine.

Problem(s)

Things stayed undisturbed for a while, until around 2012~2013 folks started showing up with more and more complaints which all traced to Moo(se)-based subclassing. Originally the C3 MRO composition worked just fine, because almost invariably a ->load_components() call (which explicitly switches the callER MRO) would have happened early enough in the life of any end-user Result/ResultSource class. But when extends()/with() got more prominent this was lost. The more complex the inheritance chain - the more likely that the topmost leaf class is in fact stuck under DFS mro with everything going sideways from there. Sometimes with truly mindbending failure cases. There was no clear solution at the time, and aside from some toothless documentation warnings nothing was done to address this (in fact even the doc-patch itself is incomplete.)

The inconsistencies, and the resulting mistakes, however, were all localized, and even though the problems were often major, each instance was sufficiently different (and bizarre) that each individual deployment could neither report them properly, nor find the time to reason through the layers of history in order to arrive at a solution they fully understand. Yet the original design which solidified towards the end of 2007 was just good enough to keep being kicked down the road.

But people kept writing more and more MOP-inspired stuff. Given the general tendency of perl code to get “all over the place”, the desire was only natural to standardize on “one true way” of doing OO throughout an entire end-user project/app. And there were more and more ways in the wild to combine/abstract individual Result classes and ResultSet components. The comprehensive DBIx::Class::Helpers are just the tip of the heap of all possible permutations DBIC is exposed to. Towards mid-2015 it became utterly untenable to brush off problems with “meh, just don’t do that and all will be be fine”.

On the personal front I first ran into the baroque jenga tower head-on when I tried to make sense of the ResultSource subsystem in an airport lounge pre-YAPC::EU 2011 (Riga). I honestly do not remember why I started digging in this direction but the result of that attempt (and the later effort to revive it) got immortalized in my local tree. Enough said.

Next was the dash to implement sane relationship resolution semantics in 03f6d1f, and then in 350e8d5 (which was actually needed to allow for d0cefd9 to take place… sigh). During that journey 4006691 made a subtle but fatal in the long run change - it upset the balance of which source instance object we looked at during some (but not all) codepaths. The really sad part is that I had the feeling that something is not right, and even made a record of it as the last paragraph of 350e8d5. But light testing did not reveal anything, and I irresponsibly shipped everything as-is a bit later. It wasn’t until Oct 2015 that someone noticed this being an actual problem. Early attempts to fix it quickly demonstrated just how deep the rabbit hole goes, and were the main reason the entirety of this work was undertaken: the accumulated debt simply did not leave any room for a half-way solution :/

Solution(s)

The writeup below describes only the final set of commits: it does not cover driving into and backing out of at least 3 dead-ends, nor does it cover the 5 distinct rewrites and re-shuffles of the entire stack as more and more involved testing revealed more and more involved failure modes. I must stress that if you plan to undertake a similar crusade against another projects architectural debt you are in for a rough (but not impossible!) ride. The height of the “tenacity-bar” necessary to pull off such work is not reflected in any way within the seemingly effortless walkthrough that follows. It is also worth acknowledging that the code at times is incredibly terse and hard to follow: this was a deliberate choice as the extra diagnostic sites that are enabled during runtime had to be implemented as “close to the VM”, so to speak, as possible. In isolation none of the contortions are warranted, but because I ended up with so many of them the result does pay off. See comments within individual commit messages for various performance impacts for more info.

As first order of business some mechanism was needed to track the logical relationship between the 3 levels of ResultSource instances as shown earlier in this writeup. Luckily, the user-unfriendly nature of the metadata stack meant there are very few spots on CPAN (and to the best of my knowledge on DarkPAN) that do anything exotic with the subsystem. This means the simplest thing would in fact work and was implemented as 534aff6: corral all instantiations of ResultSource objects (and Schema objects while we are at it.) This code ensured that nothing in the stack will create an instance of either class-type without our knowledge. With that in place, we also provide an explicit clone method encouraging folks to use that whenever possible. The switch of all relevant callsites within DBIC itself was verified through another check within new, guarded by the same compile-time assertion constant (which in turn was provided by both the CI and the local smoke-script from 5b87fc0)

With the above in place, ensuring 99.99% of the ResultSource “derivative” instances were obtained via $rsrc->clone, it was time for 0ff3368. A simple private registry hash with object addresses as keys and this hash as values:

{
  derivatives => {
    addr_derived_rsrc_1 => $reference_to_infohash_of_derived_rsrc_1,
    addr_derived_rsrc_2 => $reference_to_infohash_of_derived_rsrc_2,
    ...
  },
  weakref => $weak_reference_of_self,
}

As necessary for any structure holding addresses of object references, a CLONE “renumbering” routine takes care of keeping everything in sync on iThread spawns (if you believe that iThreads are evil and one shouldn’t go through the trouble: be reminded that any call of fork() within a Win32 perl is effectively an iThread, and fork() can and is being called by some CPAN modules implicitly).

Now that we had a good handle on “what came from where”, the first major diagnostic milestone 73f54e2 could be covered. As can be seen in the table of methods in commit 28ef946 there are only a handful of attributes on an actual ResultSource class. A couple new Class::Accessor::Grouped method types were added, which would behave just like the ‘simple’ and ‘component_class’ they were replacing, but with a twist:

The result is the exact warning as described in commit message 73f54e2. Of course there are some extra considerations - some high-level setters (e.g. remove_columns) do call a getter underneath to do their job. These cases had to be short-circuited by using a local()-based “setter callstack” mark. But in general the changeset has been surprisingly non-invasive: once the proper hook points were identified the rest was a breeze. There was also a brief scratching of heads when the last stages of DarkPAN tests emitted errors which I myself could not explain for a while, until the reason (and trivial solution) were identified in d56e05c and here.

As a brief detour, I considered switching ResultSource to a proper Moo class, but quickly abandoned this idea as there are no provision for clean get-time triggers. Nevertheless the attempt was a useful demonstration what does it take to switch a low-level class (which means many somewhat questionable uses by consumers in the wild) to Moo(se) with zero loss of functionality. The result is preserved for posterity as 8ae83f0e.

While working on the above and f064a2a (the solution to RT#107462), it occurred to me that the confusion of having both result_source_instance() and result_source() can be reduced further by forcing all “getter” calls to go through result_source() which is defined in Row.pm and is thus always available. The result was the improved diagnostic as described in the commit message of e570488, but also a useful set of assertions that were used to weed out many of the wrinkles.

The next major step was to resolve once and for all the fallout from incorrect inheritance composition. The highly dynamic nature of all Perl’s programs, an “eternal compile/execute/compile/execute… cycle”, meant that just “fixing things” as DBIC sees them would not work - calling set_mro() could do little when called late enough. This led to the revert of the originally-promising “forced c3-fication” of the stack 7648acb. Instead the practical design turned out to be “let the user know and carry on”.

The first part of getting there was to devise a way to precisely and very quickly tell “what does a class look like right now?” I have been brooding over how to do this since mid-February, but it wasn’t until I noticed the excellent App::Isa::Splain by @kentfredric, that the final interface came into focus: 296248c (with several minor fixups later on). Here I want to take a moment to apologize to @kentfredric, as he was led on a several week long wild-goose chase due to a misguided comment of mine :(

Amusingly while implementing this I hit a wall related to perl 5.8 (for the first time in 6+ years): As stated in the timings at the end of commit message 296248c and as elaborated here - the non-core MRO is just too expensive to work with. This resulted in a 1.5 week long detour to try to squeeze every last ounce of performance. Amusingly I ran into a lot of “interesting” stuff along the way. The result was not only a semi-usable 5.8 implementation, but even running on 5.10+ was sped up about 2 times in the end, which translated into tangible gains in the end: the number cited as 16% in 12e7015 was originally 28%(!). The moral of this story? - gerontoperlia makes your modern foundation code better.

With a reliable way to tell what each methods “variant stack” looks like, it was trivial to implement the valid_c3_composition part of ::SanityChecker - one would simply check a class’ MRO, and in case of dfs compare all stacks to what they would look like if the MRO were c3.

In parallel but unrelated to the above the ever increasing tightening of various DBIC internal callpaths (e505369, d99f2db, 3b02022) had to be addressed in some way. The urgency truly “hit home” when testing revealed RT#114440 - it was nothing short of a miracle this code survived that long without being utterly broken by other components. The solution came out of crossing the work on describe_class_methods (296248c) with the concept of the fail_on_internal_call guard (77c3a5d). We already have a list of method “shadowing stacks” (to borrow @kentfredric’s terminology) - if we find a way to annotate methods in a way that we can tell when a “non-overrideable” method was in fact overridden - we will be able to report this to the user.

The somewhat fallen out of favor subsystem of function attributes was chosen to carry out the “annotation” task. It must be noted that this is one of the few uses of attributes on CPAN that is architecturally consistent with how attributes were originally implemented. An attribute is meant to attach to a specific reference ( in our case a code reference ), instead of a name. This is also why the FETCH/MODIFY_type_ATTRIBUTE API operates strictly with references. As an illustration why tracking attributes by name is fraught with peril consider the following:

perl -e '
  use Data::Dumper;
  use Moose;
  use MooseX::MethodAttributes;

  sub somemethod :Method_expected_to_always_returns_true { return 1 };

  around somemethod => sub { return 0 };

  warn Dumper {
    attributes => __PACKAGE__->meta->get_method("somemethod")->attributes,
    result => __PACKAGE__->somemethod
  };
'

It should also be noted that as of this merge describe_class_methods lacks a mechanism to “see” code references captured by around-type modifiers, and by extension the “around-ed” function’s attributes will not appear in the “shadowed stack”. A future modification of Class::Method::Modifiers, allowing minimal introspection of what was done to which coderef should alleviate most of this problem.

Once all relevant methods were tagged with a DBIC_method_is_indirect_sugar attribute in 1b822bd, it was trivial to implement the schema sanity check no_indirect_method_overrides which simply ensures no user-provided method “shadows” a superclass method with the sugar attribute set.

The success of the attribute-based approach prompted a pass of annotating all the methods DBIC generates for one reason or another: 09d8fb4. Aside from enabling the last improvement, it also allowed to replicate a part of the DBIx::Class::IntrospectableM2M functionality in core, without elevating the status of the m2m sugar methods in any way (the historic labeling of these helpers as relationships is a long standing source of confusion). See the commit message of 09d8fb4 for a couple use-cases.

The last piece of the puzzle 28ef946 addressed the “override and hope for the best” duality of ResultSource proxied methods as described at the start of this writeup. What we essentially do is add an around() for every method in ResultSource, which then checks whether it was called via ResultSourceProxy (inherited from DBIx::Class::Core), or directly via the ResultSource instance: i.e. MySchema::Result::Foo->proxied vs $rsrc->proxied IFF we are called directly and there is an override of the same method on the currently-used $rsrc->result_class we either follow one of the options as given by an attribute annotation, or we emit a diag message so that the user can do something about it.

That was easy wasn’t it?

Final Thoughts

This work took about 50 person-days to carry out, and for obvious reasons expanded to cover a much larger period of actual wall-time. While I am by far not the most efficient developer that I have met, I am pretty sure that the process of planning, designing, implementing and testing all of this could not have been significantly accelerated. Even at the (laughable) rate of $50/h The Perl Foundation is willing to pay for unique talent this endeavor would cost at least $20,000 USD - way beyond the scope (and aim?) of a TPF grant. On the other hand it would be surprising if this work can be qualified as unnecessary. I personally estimate that the savings due to the proper diagnostics alone will “make up” for the effort within the first month of wide deployment of these improvements. Time will tell of course, as the stream of questions is only about to start come the first days of August.

In any case - this project is by far not the only one in dire need of such “humane” overhaul. Moo, Catalyst, various pieces of the toolchain, and other staples of what is known as “modern perl5” are in similar or worse shape: a situation which can not be rectified simply by “writing patches” without a concerted effort directed by a single dedicated individual.

I yet again strongly urge the “powers of Perl” to reconsider their hands-off approach to funding the consistently shrinking pool of maintainers. PLEASE consider stealing (in the spirit of tradition) the proven successful model of RubyTogether before you end up losing even more maintainers like myself.

Peter “ribasushi” Rabbitson

Outgoing maintainer of a cornerstone Perl5 ecosystem project


Here are a few takeaways from the post from my perspective.

Volunteer Burnout

People get burned out all the time. The causes are myriad: not enough rest, depression, unrewarding work, etc. I feel like OSS and volunteers in general get burnt out for more specific reasons.

First off, they aren’t getting paid, so they tend to be doing this OSS in their rest time. ribasushi tried to resolve this, but companies who use OSS were unable, unwilling, or too out-of-touch to fund him in his efforts.

Second, I think that people who are volunteering get an unreasonable amount of responsibility put on their shoulders. This is internal: feeling guilty for not fixing a bug or whatever. (I know ribasushi suffer’s from this.) And ashamedly, external: when people point the finger and demand work. I have seen this happen within the Perl community and am disgusted that someone would demand a gift like this. Luckily for the individual, I didn’t save the link in my notes. (Ask me privately and I may relay the story; I recall it, and it’s public, but don’t think it’s worth the effort to preserve the event in infamy.)

OSS Infrastructure is Not Solved

Open Source is great for scratching your own itch. I consider my body of work to be a testament to this. In general, most of my own effort is small bits of effort that pay off well for me, and maybe a few other people. I think this is often why OSS software is often hastily put together. The entire idea is predicated on New Jersey Style work, of which I am an unashamed believer in.

The problem comes with things that are too complicated to be built this way. A webserver is simple. An ORM that supports many databases as well as DBIx::Class is not. I think it is reasonable and unsurprising the the DBIx::Class tarball is over 4 times as big as Plack. The problem is that the money for pure OSS (that is, stuff that is not using a freemium model like Chef or Puppet) is fairly scant.

The Linux Kernel is funded, but honestly not really, because the majority of the contributors are paid to work on it for hardware or software their employers use. A model where a single individual (or even a small group) support a piece of software that many companies use does not seem to be common. We all know what happened to OpenSSL in 2014, and that is another critical piece of OSS infrastructure. There has been some effort to fund it, but given that it’s so foundational compared to DBIx::Class, I just cannot see that model working for DBIx::Class either.

Stack Decoration with Attributes

I think this is a really interesting technique and think that it would be an interesting way to, for example, filter backtraces. Like it would be nice not the have Try::Tiny in the middle of all my backtraces; this technique could resolve that. The current method involves weird regular expressions to filter packages etc. Could be cool.


I know this has been a long post. I hope some people read it and consider how some of these problems can be solved in the long run. For my part, I bid ribasushi a fond farewell. I do not have any problem with how long his work has taken except that was more time stolen from him to live his life. If DBIx::Class does indeed get a new janitor, I wish him or her luck. These are big shoes to fill.

Posted Mon, Aug 1, 2016