Node Js And Hof Discussion

NodeJs (a JavaScript platform) and HOF Discussion

Continued from ArrayDeletionExample


A striking demonstration of eval()'s insufficiency, and of the power of closures, is NodeJs, a platform for server-side JavaScript. The genius of Node is that every single I/O call is non-blocking; rather than returning its result directly, each call takes a callback function, reminiscent of ContinuationPassingStyle. For example, reading from a file looks like this:

  fs.readFile("path/to/file.ext", function(err, data) {
    // do stuff with file data in here
  });

Because every call is non-blocking, like the above, Node essentially has concurrency ForFree. It's a fantastic way to write all sorts of applications, especially servers. And it's absolutely 100% impossible for any eval() system to achieve this kind of evented structure (at least, without actually being a closure system in disguise).
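To make the callback shape concrete, here's a minimal sketch (hypothetical function names, no real I/O) contrasting a direct-return function with its continuation-passing equivalent, using the same error-first callback signature Node's I/O functions use:

```javascript
// Direct style: the result comes back as a return value.
function addDirect(a, b) {
  return a + b;
}

// Continuation-passing style: the result is delivered to a callback.
// Node's I/O APIs use this shape, with an error-first argument.
function addCps(a, b, callback) {
  callback(null, a + b);
}

var direct = addDirect(2, 3);        // 5
var viaCallback;
addCps(2, 3, function(err, sum) {
  viaCallback = sum;                 // 5, delivered rather than returned
});
```

With real I/O the callback would fire later, after the operation completes, rather than immediately as here; the calling code is written identically either way.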

To show exactly why eval() can't possibly do this, here's a code example from a simple Node server:

  http.createServer(function(req, res) {
    res.writeHead(200, {'Content-Type': 'text/html'});
    withImage(function(img) {
      res.end("<img src='" + img + "'>");
    });
  }).listen(8000);

Notice that the res variable is accessed in both the outer and inner scopes. Without LexicalScoping and proper closures, that variable would be gone by the time withImage() fetched the image and called the inner callback. With those two things, this code works perfectly and is quite natural.
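The scoping point can be shown without any server at all. In this sketch, withImage is a hypothetical stand-in that delivers its result via a callback (here it calls back immediately, where a real one would call back after fetching); the outer variable stays available inside the inner callback because the function closes over it:

```javascript
// Hypothetical stand-in for withImage(): hands a value to a callback,
// as an async image-fetcher would.
function withImage(callback) {
  callback("cat.png");
}

function handle(requestId, done) {
  var res = "response for " + requestId;  // outer-scope variable
  withImage(function(img) {
    // `res` is still in scope here, thanks to LexicalScoping and closures.
    done(res + ": " + img);
  });
}

var output;
handle("A", function(text) { output = text; });
// output is now "response for A: cat.png"
```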

-DavidMcLean

I've almost never had a real problem with concurrency for web apps or client/server architecture. If it becomes an issue, then I rewrite it to use the concurrency-handling capability of the database, such as ACID and transactions. You appear to be using a file system for data "storage" when a database would perhaps be more appropriate. If you need to store content such as photos with concurrent uploads, then generate a unique ID first in the image or article tracker or record (usually article ID), and use that as part of the destination file name to avoid user collisions. (It's also possible to store images directly in a database, but that's not always an option. There are other approaches, but I'd have to see the domain details to recommend something specific.) It's common for us CBA developers to rely on the concurrency engine in the RDBMS for concurrency needs. Why do you HOF fans keep trying to sell refrigerators to us Eskimos? -t

Ah, figures. The fact that I mentioned files at all clearly means I advocate them over databases for all data storage. I merely selected the file-read function to demonstrate in a simple way the callback-oriented I/O; in retrospect, this was a very bad move considering to whom I'm talking.

Node's non-blocking I/O doesn't just apply to files. It works equally well with any kind of I/O, including databases, and it does so using the exact same callback design. That withImage() function actually fetches an image URL from the Google Images API, for example, not from anything on the local filesystem (unless your local filesystem just happens to be a Google datacentre).

Actually, given what I've seen of TableOrientedProgramming, I think Node would be an excellent fit for it. TOP clearly is full of database queries. You know how whenever you make a database query there's network overhead? When you use Node, that network overhead is put to good use. While it's waiting on the results of a query, your app is free to receive other requests, perform other computations, and so on. Node's weak point is stuff that requires a lot of processing directly in-app, rather than through I/O, since its magical concurrency design doesn't inherently include threads. Because TOP involves having the database do as much processing as possible, Node's weakness is dodged nicely.

-DavidMcLean

I'm not sure what you mean by "network overhead" here.

Re: "While it's waiting on the results of a query, your app is free to [do other things]". That hasn't been a common (explicit) need for the web-apps I've worked on. But it can possibly be handled with "hidden" frames or iframes. Web pages make it pretty easy for the user to "spawn" other pages if they want while waiting. One can put up a page that says, "While waiting for your request to process, here are some other articles/topics you can peruse". In-line ads already do this pretty much. Just make sure to set "target='_blank'" so the original "waiting" page is not wiped out.

On client-server apps, one typically makes the long process spawn a secondary process and "status screen" that says something like, "Your query/report is being processed, please wait...". When it's ready, then a notice and a "View Report" button comes up on that secondary screen.

You send a request to your database. There is some delay, due to the time taken to deliver the request and response across the network. You receive a result from your database. Network overhead. (Also, there's overhead in the DBMS itself, especially as queries grow more complex. There's plenty of overhead in the whole process for Node to exploit.) Under most platforms, that network overhead slows down your app, as synchronous code has to block while it waits for results; under Node, overhead gives your app the chance to get some work done, processing events from the queue.

You say there hasn't been an explicit need for concurrent behaviour in your Web apps, but in fact all Web apps need to be concurrent, because they will inevitably receive requests from more than one user. The "explicit" part, I think, highlights that you've missed the point a little: Why should you have to be explicit? It's a Web app; of course you're going to want decent multiuser handling. Node's evented structure means that you get multiuser concurrency in your app server implicitly, without needing to give it special consideration.

While I cannot condone using frames to implement asynchronous Web-apps (you do know what AJAX is, right?), the client-server technique you describe is a sane one. However, the idea of offloading complex processing to a separate process is the crux of Node's nimble event-loop, so that design automagically works in Node without much special server setup! The platform makes it extremely natural and clean to set up many concurrent situations, including the one you describe.

-DavidMcLean

Maybe if we explore some kind of semi-realistic scenario. The devil's in the details of the domain needs. -t

You'd like a specific scenario? Okay. Here's a simple example:

  var db = require('db');
  var http = require('http');
  var s = http.createServer(function(req, res) {
    res.writeHead(200, {"Content-Type": "text/html"});
    db.query("SELECT * FROM tweets", function(err, rows) {
      for (var i = 0; i < rows.length; i++) {
        var tweet = rows[i];
        res.write("<p>" + tweet.text + "</p>");
      }
      res.end();
    });
  });
  s.listen(8080);

The above code, based somewhat on an example program from the book Node Up And Running, implements part of a Twitter clone in Node; the design is fairly conservative so as not to distract from the following explanation of the eventing process (a production version would almost certainly use some HTML templating system, rather than generate HTML directly inside the query callback, as this example does, for instance). This basic code concept would be perfectly applicable to report-writing business applications, among others.

Let's look at how the above code is processed, under Node. When the application starts, practically all of the code shown above is run almost immediately. This "first pass" in Node usually does very little actual processing; in this application, the first pass sets up some callbacks and tells Node to listen on port 8080. After the first pass completes, in any Node application, Node will stop and wait for events to show up in its event queue.

Now our server is up and ready to use. Let's have client A make an HTTP request. The request arrives as an event in Node's queue; Node dequeues it and runs the server callback, which writes the response headers and issues the database query. Issuing the query doesn't block: the query callback is registered, the server callback returns, and Node goes back to its queue.

Now, one of two things can happen next:

  1. Client A's query results come back first. Node runs the query callback, writes out the rows, and finishes A's response.
  2. Another client, B, makes a request before A's results arrive. Node handles B's request exactly as it handled A's, issuing B's query while the database is still working on A's.

That about covers the Node eventing model. The most important point is that the only time Node isn't doing something is when there is nothing to do, when there are no events waiting in the queue to be processed. Node never stops to wait for a database query, or file read, or call out to a shell script, or any of those things. It's essentially always getting something done.
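The eventing cycle described above can be modelled in miniature with a plain array standing in for Node's event queue. This is an illustrative toy, not Node's actual internals:

```javascript
// Toy event loop: callbacks accumulate in a queue during the "first pass",
// then are drained one at a time. Nothing here blocks.
var queue = [];
var log = [];

function emit(callback) { queue.push(callback); }

// First pass: set things up, schedule events, do no waiting.
log.push("listening on 8080");
emit(function() {
  log.push("request from client A");
  // Issuing A's "query" just schedules another event; it doesn't block,
  // so client B's request can be handled in the meantime.
  emit(function() { log.push("query results for A; response sent"); });
});
emit(function() { log.push("request from client B"); });

// Event loop: run whatever is in the queue until it's empty.
while (queue.length > 0) {
  queue.shift()();
}
// log order: listening, request A, request B, then A's query results --
// B's request was served while A's "query" was outstanding.
```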

It should be apparent from the above example why Node has such good concurrent performance, as well as why it's weakened by applications needing significant processing directly within Node rather than through external processes.

-DavidMcLean

No no no no. I'm not looking for a technical-only demonstration. I want to see it solve a stated semi-realistic business need. Something along the lines of, "I'm a warehousing company and need a report that shows both the dfasdf and the asdfasdf every third Tuesday of the month because that's when our fasdfjk deliveries roll in. Here is how to best satisfy this need..." Your demo is more along the lines of "I can print 'Hello World' 0.02 seconds faster!" Which may be neat in a MentalMasturbation kind of way, but is not shown solving a real need out in the field.

You asked "How exactly does it improve performance noticeably?", so I assumed you wanted an explanation of how exactly it improves performance noticeably. Regardless, the example code is easily applicable to business applications, as I already mentioned; it's essentially a (very simple) report writer, much like Challenge #6. (I chose this specifically with the expectation that you would consider report writers business apps, so I'm a little surprised at your response.)

It seems odd to complain that my demonstration shows a way to produce results faster, when I'm demonstrating a concurrency system. Replacing your strawman description with something actually describing what I've just shown, such as "I can print your business's reports seconds-to-minutes faster", should show that what I've provided is actually a highly desirable quality in a business app.

-DavidMcLean

Sorry, I don't see what bottleneck it's allegedly plugging. What exactly is serial before and now parallel under your gizmo? The delays in most custom web apps are NOT caused by slow server app processing, but by the delay of bytes across the network "wires", and secondly by the database processing. Now, databases can and do use parallelism, but you have to have the hardware and indexes etc. set up properly, which will probably be an issue in any shared, multi-user data "storage", and databases are probably more mature in that department. Most of the same kinds of decisions and trade-offs will be relevant to anything that fulfills that role. Parallelism and concurrency are often not a free lunch: they can require more discipline and planning. Let's not spend our time coding/tweaking around with such if it's not necessary, because single-threading and serial processing are usually easier to debug and grok. Thus, one shouldn't parallelize willy-nilly.

Further, most web servers already do parallel processing, because each user's request is or can be a separate process. Splitting further at the sub-user level doesn't gain one anything. Say there are 16 concurrent users and 8 processors. The web server (IIS or Apache etc.) will typically split these 16 users among the 8 processors such that two users (or requests) are assigned per processor. If you split at the sub-user level also, then you simply have more little processes waiting in line. 10 2-inch processes waiting in line is not going to be better than 5 4-inch processes waiting (especially since they can be allocated to different CPUs when ready). Now, when typical servers have say 100 processors, it may start to be a help, but the app code (outside of queries) is generally not doing anything processor-intensive anyhow, such that 90 of those will be doing NO-OPs (or working on database requests). We're not calculating pi to 500 million decimal places or predicting the weather 2 weeks ahead. If your app code (outside of queries) is the performance bottleneck, then usually you are doing something wrong, or at least the hard way. Typically the sin is not taking advantage of the database's capabilities and instead doing mass DatabaseVerbs in app code. CBA app code should mostly be like a building receptionist: guiding inputs, outputs, and service requests to their proper destinations based on the business rules. The receptionists should not be processing mass piles of paper or forms. If they are doing those kinds of tasks, then you are misusing them.

A similar issue comes with queries per unit of time. Sub-user parallelism may be able to submit more queries per second, but if all the other users are also submitting queries, then we are right back to the same kind of problem. From the database side, the number of queries it can process per second (from multiple users) is going to be a far, far bigger factor than how many queries we can submit to the database (queue) per second. It doesn't address the actual/typical bottlenecks. It would be roughly comparable to bridge toll booths. We can make the highway going up to the booths wider (more lanes, say 12), but if we only have 6 booths actively processing (the database), then the same number of cars are still coming out the other end per minute (after they pay their tolls). We are not noticeably reducing a person's trip time. The wider feeder highway is mostly a waste of space and tax money. If the traffic is light enough not to bottle up the toll booths, then most likely the 12 feeder lanes will be sparse and wasted, such that they are not helping the lighter-traffic trip times either.

It's rarely economical to spend your resources/time on the non-bottlenecks.

[Top, database processing is not the bottleneck per se -- even low-end DBMSs on low-end hardware can handle dozens of simultaneous queries. The bottleneck is within applications designed to work serially: They wind up waiting for apparently unrelated parts of themselves. This is exemplified by having to wait for content A to load before you can work on unrelated content B, where users are typically forced to stare at some irritating spinning "please wait" icon or irrelevant substitute content. It's because all the application's processing is serialised, thus forcing the user to perceive the cumulative effect of various inevitable small delays. Concurrent event-driven models essentially eliminate these.]

[What's notable about using concurrent event-driven models with proper support for higher-order functions -- as with AJAX and appropriate client-side libraries, Node.js, Windows 8 modern-style apps, and so on -- is that massive concurrency is essentially free. Once you grasp the underlying approach, it is no more difficult to develop and debug them than conventional "serial" applications, but the improvement in responsiveness and fluidity represents not only a better user experience, but a potential competitive advantage. It doesn't matter if you still develop apps that permit the same overall throughput as your competitor (which, of course, might even be a fellow developer within your department.) If her apps are perceived to be fluid and responsive, but your apps are perceived to be clunky because they stall and/or show a tumbling hourglass, who's going to win?]

You must be doing something weird. In my typical biz web apps, the database is usually the bottleneck. If I comment out the database portions, the rest runs in a snap. I do this frequently when tuning page esthetics. Browsers already have built-in parallelism and speed short-cut mechanisms, and one can leverage this. Your parallelism GoldPlating is a waste of programmer time. (There are other ways to speed up perceived and real web-page rendering without JS and HOF's, but we are wandering off topic here.)

And parallel algorithms are in general more difficult to debug because the order can be different on each run. It's like trying to do science in which you cannot isolate one variable because the other variables keep changing upon each test. HeisenBug risk. We can limit these problems to some extent by making certain assumptions and sticking to certain rules, but we are then accepting down-sides to gain the benefits. If the benefits are small for a particular situation, then the down-sides are not worth it. Even if, in theory, two processes/events should be "parallel safe", clients (browsers & GUI engines) are often buggy such that events can affect each other. I don't want to take my chances with potential HeisenBugs unless the payoff is fairly large. -t

I think you'll find the database isn't the bottleneck in your applications; waiting for the database is. If you use an event-driven concurrency system, you never wait for the database, because there's more important stuff to do. As for difficulty in debugging, the callback-oriented structure means that coding under these systems is actually very similar to coding purely serial code. (Because they're still single-threaded, you don't have to worry about thread safety; many threading-related issues vanish when using single-threaded concurrency, partially because closures keep state encapsulated and local.) However, in evented-callback systems, you can be sure that any particular piece of code is only delayed by the things it specifically depends on; if a query is needed to show page A but not page B, then page B won't be delayed by attempts to construct page A.

Heck, even constructing a single page can have fewer delays. Suppose page A requires several database queries (Plus maybe some other external source. Perhaps some HTML sourced from the Google APIs?). In serial code, you'd make each of those queries in sequence, one by one, waiting for each query to complete before making the next one. With concurrent code, you can make all of those queries at once; databases themselves can easily handle multiple queries at once, so you'll get your results much faster than if you made the queries one by one.

-DavidMcLean

I'm sorry, I still don't know what you are talking about. What is the "more important stuff to do"? Running a Honey Boo Boo video while the report is being generated? If I as a user click "View Report", I don't want to see Honey Boo Boo because I didn't ask for Honey Boo Boo. The button lied. (Don't need HOF's for that anyhow.) The bottom 10% of users get confused and call the help desk if too much is going on at the same time. If this is about making dancing spam more "efficient", I'm not interested in that topic today.

At least in HofPattern, the multi-panel real-time status monitor scenario was something of utility. We just disagreed about whether a JavaScript client should be the reference point for measuring "good". I'd like to get away from GUI-intensive scenarios if possible because the trade-offs depend heavily on the client technology being used or available, which greatly complicates the comparison. A GUI of some sort is fine in a scenario as long as it doesn't become the focus point of the differences. If HOF's are mostly about making GUI's/UI's "better" in CBA, then perhaps we should spawn a more narrow topic on that alone.

No, stupid. Like I've already explained, basically anything the app needs to do is more important than sitting around waiting for a query. Receiving a request from another client, perhaps. Or finishing off a request from another client, because the results of that client's query just came back. Or, to use an example from my previous paragraph that I was sure you'd love, using the time waiting for one database query to make another database query. I have no clue why you're discussing advertising so much; I'm beginning to suspect it's a strawman tactic. -DavidMcLean

What biz scenario would you want the app to do that? Why get report B if I, the user, only asked for report A?

I didn't actually say that, but there's an obvious reason you'd want that to happen: if a different user asked for report B, then of course you'll want to retrieve that report as well.

Use query results caching. Most RDBMS support it. Even if it didn't, I don't see what you are trying to do from a business reason perspective.

… what does query-result caching have to do with anything we've been discussing, in the slightest?

It should really be obvious from a business perspective why you'd want this. It both speeds up construction of a single report and allows for multiple users to request reports simultaneously (or for a single user to request several different reports in separate tabs, which is equivalent from an HTTP perspective). Unless you actually want slower response times from your software, the value of evented-I/O concurrency should be apparent at this point. -DavidMcLean

It sounds to me like a similar issue in the HofPattern topic re the multi-panel real-time monitor screen matrix scenario (AKA: Brady Bunch intro). However, I cannot be sure without specifics. There are different ways to skin the cat, and the choice depends on the domain details/requirements. If we are forced to use JavaScript as the client, then yes we may have to pretty much use HOF's, but that's a client-specific issue and I don't want to explore client specifics/limitations, I want to explore solving CBA problems in a more general sense, not compare browsers to VB to PowerBuilder to Delphi etc. Other than that, I have no idea what the hell you are getting at. You called me "stupid" and I am itching to retaliate at this point. Where's my breathing exercises link? Break your scenario down step-by-step: who, what, when, where, and why. See UseCase. If you want to communicate, roll up your sleeves and do it right. If it turns out your claims are client-specific, then I am bailing out.

The "stupid" comment was in direct response to your suggestion that the only useful thing for an app to do while it waits for queries is play a Honey Boo Boo video. I mean, come on. Why would you jump immediately to something as random and worthless as that, when we'd already gone over a lot of more useful things? It's either stupidity or trolling, and I chose to attribute to ignorance what I could instead have attributed to malice.

I don't care what we're using on the client, and it's irrelevant to the topic at hand. Nothing about anything we've mentioned is client-specific. Since we've been talking about Web apps, the client probably would indeed use JavaScript, but there doesn't necessarily need to be any client-side scripting going on in these apps. Note that Node can do stuff other than Web apps: There are libraries for building a more traditional desktop GUI, the ability to access stdin and stdout for writing command-line apps, as well as provision for TCP sockets such that HTTP isn't the only option for servers. It's a very flexible platform, although Web apps are the usual choice.

Concurrency through evented I/O is a general pattern. It doesn't really need to be plugged into specific UseCases to be demonstrably useful; it has already been explained how evented I/O can improve the performance of a report-writing application, however. -DavidMcLean

I'm sorry, I don't see what's explicitly being improved. You seem to be making some unrealistic assumptions. Parallelism alone is no guarantee of speed improvement. That's why I want to walk through a specific scenario. You are being too general and vague. I'm fucking tired of foo/lab examples of FP being great. I want real beef from a real goddam cow!

I already gave a simplified-but-practical example of how Node.js uses evented I/O to achieve improved concurrent performance, using a business-domain application (a report-writer). Did you not understand how it works? I'll try to explain in more detail, if required. -DavidMcLean

That's not a UseCase. There are ways to run multiple threads without having to use (exposed) HOF's on clients and/or servers. You haven't ruled those out. Why are they "bad"? And cranking up the number of threads if the bottleneck is the RDBMS will do us no good.

Because they're multiple threads in the first place, which raises concerns of thread safety, race conditions, and so on. Evented I/O is usually single-threaded (Node is), making it simpler to work with. The preceding description of how Node's eventing system works may be worth another read; if you're still equating it with multi-threading, you haven't really got the basic concept. And the bottleneck isn't the RDBMS, as we've explained: it's local app code waiting for the database.

And the fact that these concurrency systems use explicit anonymous functions isn't a weakness. It's a strength, because functions are very easy to manipulate to do cleverer stuff. For example, retrieving two database queries in parallel to use in one report, a possibility I mentioned above, would be rather convoluted and messy using pure callbacks: You'd need to code up some reference-counting junk and it'd be annoying. However, because higher-order functions are so general and flexible, you can write libraries to wrap up these sorts of concurrency patterns. In fact, if I wanted to implement the above two-query thing, I wouldn't even consider writing the callback structure myself manually. I'd just load up the async library and do this:

  async.parallel({
    users: makeQuery("SELECT * FROM users"),
    posts: makeQuery("SELECT * FROM posts")
  }, function(err, results) {
    var users = results.users;
    var posts = results.posts;
    // can do whatever you want with these two now
    res.write(aReportMadeUsing(users, posts));
    res.end();
  });

Bam. Two queries performed in parallel, used to construct one report. Tidy and intuitive. It'd be impossible to provide nice libraries like async.js if Node's concurrency didn't use handy things like higher-order functions. -DavidMcLean
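For illustration, here's roughly what such a library helper has to do internally. This is a minimal, hypothetical version of async.parallel (the real async.js is far more thorough), shown with a stubbed db object so it runs standalone:

```javascript
// Stubbed database: calls back immediately with fake rows. A real driver
// would call back later, when results arrive over the network.
var db = {
  query: function(sql, callback) { callback(null, ["row for: " + sql]); }
};

function makeQuery(sql) {
  return function(callback) { db.query(sql, callback); };
}

// Minimal parallel(): start every task, count completions, and call
// done() exactly once, with either the first error or all the results.
function parallel(tasks, done) {
  var keys = Object.keys(tasks);
  var results = {};
  var pending = keys.length;
  var failed = false;
  keys.forEach(function(key) {
    tasks[key](function(err, value) {
      if (failed) return;
      if (err) { failed = true; return done(err); }
      results[key] = value;
      if (--pending === 0) done(null, results);
    });
  });
}

var report;
parallel({
  users: makeQuery("SELECT * FROM users"),
  posts: makeQuery("SELECT * FROM posts")
}, function(err, results) {
  report = results;  // both result sets available together here
});
```

The reference-counting (the pending counter) is exactly the "junk" mentioned above; writing it once inside a higher-order helper means application code never has to.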

Usually an SQL JOIN or UNION is done to "combine two queries". The RDBMS can potentially parallelize multiple sub-queries. Also, multiple different techniques can implement a parallel "makeQuery" function. Bam! Granted many existing web frameworks and languages don't make doing such very easy, but that's likely because the need is not very common. The few times I can recall when I couldn't use JOIN or UNION to get the database to do it, the queries had "lopsided" profiles such that parallelizing them would not double the speed. For example, one may take 500 milliseconds and the other take 50 milliseconds. A non-parallel version would then take 550 ms and the parallel version would take 500 ms (under ideal conditions). That's hardly enough savings to bother in most cases. Optimizing the graphics on the page would probably give the app more of a boost per time spent, and keep the code simpler. Further, if the server is taxed, it may not be able to parallelize them anyhow, and/or they could end up competing for the same resource, such as disk or network I/O such that they end up waiting on each other anyhow. A lot of circumstances would have to line up just right to get a noticeable boost. If you look at the math in context of real systems and real bottlenecks, parallelism is often over-rated for CBA. I vaguely remember one profiling expert saying that as a rule of thumb, in production you get about 20% to 40% of the theoretical maximum of the savings. Thus, if "unstacking" two queries of the same size could in theory boost the speed from 2000 ms to 1000 ms, then the typical actual average would be something like 1700 ms (1000 + (1000 - 30% * 1000)). Spending that unstacking time tweaking with the query statements or indexes may give more speed per programmer time.

And databases are starting to work parallelism into their Stored Procedure languages. See also example "fern01" later.

makeQuery() isn't a parallel function. It'd be defined like this:

  function makeQuery(sql) {
    return function(callback) {
      db.query(sql, callback);
    };
  }

And you still seem to be assuming the bottleneck is the database itself. It's not. The app code that has to wait for database results is. When you use evented I/O like Node, your app code doesn't have to wait for database results. Therefore that bottleneck is reduced. It's fairly simple, really. -DavidMcLean

What else is it going to do during that time? If the user asks for Report X by pressing the Report X button, then the app has to run the necessary query(s) for Report X before delivering Report X to the user. Thus, either the user waits for the database to complete its job, or pressing Report X does something else besides (in addition to) deliver Report X, which would make the button a liar. Thus, it's either Lie or Wait. There is no 3rd option known to mankind. Maybe it can run Seti@Home while waiting so that aliens can answer that difficult question. (Seti@Home is a different app, but maybe you mix and match in weird ways such that your vision of "application" differs greatly from mine. It reminds me of the old joke: "The Emacs operating system needs a better editor.")

Example: Frame-mania

  -----------------
  [Run Report A]
  -----------------
  Report B is finished. [View]
  -----------------
  Report C is running. [Cancel]
  -----------------
  [Run Report D]
  -----------------
  [Run Report E]
  -----------------
  Report F is running. [Cancel]
  -----------------
  [Run Report G]
  -----------------
  Report H is finished. [View]
  -----------------
Compare ordinary blocking code with evented code. With a blocking call, the debug line prints only after the query has completed:

  results = db.query("SOME QUERY HERE");
  print("DEBUG: ran query");
  buildReportWith(results);

With an evented call, db.query() returns immediately, so the debug line prints before the report is built; the query callback runs later, when the results event arrives:

  db.query("SOME QUERY HERE", function(results) {
    buildReportWith(results);
  });
  print("DEBUG: ran query");