Wednesday, April 29, 2009

Interesting week ... and it is only Wednesday!

It all started with a small observation last Friday night ... I was randomly sampling some data I had generated and noticed a mismatch: I was too pissed, so I just logged out and did not check anything till Sunday.

So come Sunday, and the week starts off with one potential bug, one 'crazy idea' in the mail, a new clarification issue, and one pending task. Being the procrastinator that I am, and because the idea was really quite 'crazy' and interesting, I started off with that and discussed/designed/hacked away until we ended up with something pretty neat - something which might become quite useful for our future projects. I was damn happy and cocky ... what could go wrong now! Oh boy - the shitstorm waiting for me blindsided me completely.

For one, the clarification issue, though kind of braindead and trivial from a strict URI/HTTP point of view, ended up being quite messy - normal conventions on the web, interacting with a fairly long extraction pipeline combining multiple 'methods' and existing Yahoo infrastructure, kicked up quite a bit of dust. Though after two days, I hope it is settled now!


In the middle of all this, we still have our apparently "innocuous" bug - the star of the show, as revealed only later. The reason I thought it was harmless was 'cos I had fixed something similar before - multiple times, actually, in the past, because of Pig bugs and their side effects. The basic problem is something like this:
You start with N entities, generate candidates for each, and then generate features for each (entity, candidate) pair. The features are generated as multiple vertically partitioned files, like:
file1 having (id, candidate, feature1_0, feature1_1)
file2 having (id, candidate, feature2)
and so on.
So we finally combine the results as a join on (id, candidate) to generate
output_file having (id, candidate, feature1_0, feature1_1, feature2, ...)
(Note: this is done with data which can be web-scale - so we can't use a DB; and the number of 'tables' can be quite large.)
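To make that concrete, here is a minimal Pig Latin sketch of the combining step - the file names, field names and tab-delimited storage are all illustrative assumptions, not our actual scripts:

---
-- illustrative file/field names only
F1 = load 'file1' using PigStorage('\t') as (id, candidate, feature1_0, feature1_1);
F2 = load 'file2' using PigStorage('\t') as (id, candidate, feature2);

-- stitch the vertically partitioned features back together
-- by joining on the composite (id, candidate) key
combined = join F1 by (id, candidate), F2 by (id, candidate);

out = foreach combined generate F1::id, F1::candidate,
                                F1::feature1_0, F1::feature1_1, F2::feature2;

store out into 'output_file';
---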

To give some context, the project uses Pig, which runs on top of Hadoop on a fairly large cluster (1k+ nodes) - though you don't use the whole cluster for all your jobs, just as much as you need.


Something as simple as this ended up hitting quite a lot of parsing bugs and unexpected pipeline corner cases in Pig - so the number of workarounds in our scripts and code was already pretty high.


Given this past history, I was expecting to be able to 'solve' it reasonably quickly - heck, I had done it three times already, can't be tough, can it?! Inspection, careful checking, etc. led to nothing.
OK, good old-fashioned debugging then - let us look at the data generated at each step (thankfully, we 'need' to checkpoint output in the pipeline).
So I went through the snapshots in the pipeline, narrowed things down after a few hours to a small-ish section where the issue could be hiding, generated a small snippet just to test that section, kicked it off, and went to bed.
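(For the curious, eyeballing a checkpointed snapshot in Pig is just a load plus a peek - the path and schema below are placeholders:)

---
-- placeholder path and schema for one checkpointed step
snap = load 'pipeline/step3_out' as (id, candidate, f1, f2);
few = limit snap 20;   -- pull a handful of tuples to inspect
dump few;
---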

14 hours later, what had initially taken about an hour on 50 machines (the full pipeline) is still chugging along at 10% completion on the snippet! This can't be right, for obvious reasons ...
A few retries later, it is the same ... causing me to attempt a move to another cluster - with no success there either (that one is too 'full').

In retrospect, it seems I had hit a Hadoop bug - and weirdly, others have hit it too - though they just did not seem to notice that their jobs, which should take 1 hour to finish, were taking 48+ hours on 100 nodes!

So now we have one Hadoop bug and one potential Pig/my bug. After more than a few hours of wrangling with the cluster, I found a (lucky) workaround for the Hadoop issue and restarted the jobs ... and then started debugging in earnest.
We hit an interesting issue, paraphrased below:

---
A1 = load 'input1';
A2 = load 'input2';

J1 = join A1 by some_id1, A2 by some_id2;

A3 = load 'input3';
A4 = load 'input4';

result = join A3 by id_1, A4 by id_2, J1 by id1;
---

... or something to this effect.
input* to this snippet is fine, but result has only about 30% of the expected tuples!
So either I am mucking up the joins (they are outer joins with missing-value fills), or it is a Pig pipeline bug?!
Apparently, a Pig pipeline bug - requiring us to store the J1 output into an intermediate file and load it back in to 'fix' the issue. By now, it is about 52 hours since the start, with about 9 hours of sleep (we have a deadline coming up): and that ended the 'bug'.
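For the record, the workaround looked roughly like this - again paraphrased; the 'tmp_j1' path and the reloaded schema are placeholders I am making up here:

---
J1 = join A1 by some_id1, A2 by some_id2;

-- workaround: materialize J1 to disk and read it back,
-- instead of feeding the join result straight into the next join
store J1 into 'tmp_j1';
J1r = load 'tmp_j1' as (id1, candidate, f1, f2);  -- illustrative schema

result = join A3 by id_1, A4 by id_2, J1r by id1;
---

(Depending on the Pig version, the store and the reload may have to happen in two separate script runs, so that 'tmp_j1' actually exists before it is loaded.)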

So let us recap: a Hadoop/grid issue and a pipeline issue - unrelated to each other, and to our code - hitting us in the same shot! I would have been happier if there was something, anything, wrong with my code to cause/trigger this ... it would have made it feel like it was at least partly my fault (yes, you do feel this way sometimes too).



So all we have left now is the pending task - a really interesting problem, and I am going to be so screwed 'cos of it ... as I said, this is just the middle of an interesting week!

2 Comments:

Anonymous said...

Interesting post.

Rajsekhar Narayan

4/30/2009 01:41:00 AM  
Moonjungle said...

corrrectt

5/02/2009 09:58:00 PM  
