Processing large datasets in real time
I’ve just written up a summary of our latest project on the Workingmouse wiki:
I’ve had the good fortune to recently complete a project at Veitch Lister Consulting (VLC), a transport planning consultancy, processing large datasets in real time.
As a consultancy we usually work in corporate environments and as such are bound by the architectural constraints of the organisation. These typically include the usual “enterprise” constraints such as language (Java 1.4 is common), application servers (usually WebLogic or WebSphere) and databases (Oracle or DB2). However, this was one of those rare projects where you have no technical constraints and have the fortune to be working with great people.
Going in, we had three basic requirements: 1) it must be callable from a Rails-based web app, 2) it must return responses to typical requests to the web app in real time (30 seconds was our target) and 3) it must query against a dataset that was initially estimated at around 45 TB, but later came down to around 100 GB. As we were contracted for a finite period, we also needed to make sure we had trained up the existing three developers in whatever tools we chose. I won’t be talking about the process we used; suffice to say it was XP-like, supported by tools such as my BDD framework Instinct and web-based agile project management tools.
Hey Tom,
Great article as I said before. It might be good to mention that we profiled with YourKit and got the biggest performance boost by removing redundant objects and optimizing hashCode() - so simple, yet so good for you!
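For readers wondering what "optimizing hashCode()" can look like in practice, here is a minimal sketch. It is not the project's actual code: the `TripKey` class and its zone fields are hypothetical, chosen only to suggest a transport-style lookup key. The idea is that an immutable key can compute its hash once in the constructor, so each map lookup returns a stored int instead of recomputing the hash or allocating temporary objects.

```java
/** Hypothetical immutable map key; the name and fields are illustrative only. */
final class TripKey {
    private final int originZone;
    private final int destinationZone;
    // Computed once in the constructor; safe because the fields never change.
    private final int hash;

    TripKey(int originZone, int destinationZone) {
        this.originZone = originZone;
        this.destinationZone = destinationZone;
        this.hash = 31 * originZone + destinationZone;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof TripKey)) {
            return false;
        }
        TripKey that = (TripKey) other;
        return originZone == that.originZone
                && destinationZone == that.destinationZone;
    }

    @Override
    public int hashCode() {
        // Returns the cached value: no recomputation or allocation per lookup.
        return hash;
    }
}
```

Whether a change like this helps at all depends on the workload, which is why it is worth confirming with a profiler on realistic data before and after.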
Also, premature optimization is totally unnecessary. None of the things we thought would really hold us back (such as text files over binary files) did in reality. Profiling with realistic data showed us exactly what should be optimized.
sanj
20 Dec 07 at 8:39 pm