Atomic Bank Balance Transfer with CouchDB
March 13, 2014 at 10:03 PM | Uncategorized

Googling around the other day I was disappointed to find that the internet has a few incorrect examples of how atomic bank account transfers can be implemented with CouchDB... but I wasn't able to find any correct examples.
So here it is: the internet's first 100% complete and correct implementation of the classic "atomic bank balance transfer problem" in CouchDB.
First, a brief recap of the problem: how can a banking system which allows money to be transferred between accounts be designed so that there are no race conditions which might leave invalid or nonsensical balances?
There are a few parts to this problem:
First: the transaction log. Instead of storing an account's balance in a single record or document — {"account": "Dave", "balance": 100} — the account's balance is calculated by summing up all the credits and debits to that account. These credits and debits are stored in a transaction log, which might look something like this:
{"from": "Dave", "to": "Alex", "amount": 50} {"from": "Alex", "to": "Jane", "amount": 25}
And the CouchDB map-reduce functions to calculate the balance could look something like this:
POST /transactions/balances
{
    "map": function(txn) {
        emit(txn.from, txn.amount * -1);
        emit(txn.to, txn.amount);
    },
    "reduce": function(keys, values) {
        return sum(values);
    }
}
For completeness, here is the list of balances:
GET /transactions/balances
{
    "rows": [
        { "key": "Alex", "value": 25 },
        { "key": "Dave", "value": -50 },
        { "key": "Jane", "value": 25 }
    ],
    ...
}
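(As an aside: those /transactions/balances paths are shorthand. In a real CouchDB the view lives in a design document, so a query from the application might look roughly like the sketch below, using the requests library. The design document name "txns" and the localhost address are assumptions for illustration, not part of the example above.)

import requests

COUCH = "http://localhost:5984"  # assumed CouchDB address

def account_balances():
    # Query the balances view, grouping by key so the reduce runs once
    # per account instead of over the whole view.
    resp = requests.get(
        COUCH + "/transactions/_design/txns/_view/balances",
        params={"group": "true"},
    )
    resp.raise_for_status()
    # Returns something like {"Alex": 25, "Dave": -50, "Jane": 25}
    return dict((row["key"], row["value"]) for row in resp.json()["rows"])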
But this leaves the obvious question: how are errors handled? What happens if someone tries to make a transfer larger than their balance?
With CouchDB (and similar databases) this sort of business logic and error handling must be implemented at the application level. Naively, such a function might look like this:
def transfer(from_acct, to_acct, amount):
    txn_id = db.post("transactions", {"from": from_acct, "to": to_acct, "amount": amount})
    if db.get("transactions/balances", id=from_acct) < 0:
        db.delete("transactions/" + txn_id)
        raise InsufficientFunds()
But notice that if the application crashes between inserting the transaction and checking the updated balances the database will be left in an inconsistent state: the sender may be left with a negative balance, and the recipient with money that didn't previously exist:
// Initial balances: Alex: 25, Jane: 25
db.post("transactions", {"from": "Alex", "to": "Jane", "amount": 50})
// Current balances: Alex: -25, Jane: 75
How can this be fixed?
To make sure the system is never in an inconsistent state, two pieces of information need to be added to each transaction:
- The time the transaction was created (to ensure that there is a strict total ordering of transactions), and
- A status — whether or not the transaction was successful.
There will also need to be two views — one which returns an account's available balance (ie, the sum of all the "successful" transactions), and another which returns the oldest "pending" transaction:
POST /transactions/balance-available
{
    "map": function(txn) {
        if (txn.status == "successful") {
            emit(txn.from, txn.amount * -1);
            emit(txn.to, txn.amount);
        }
    },
    "reduce": function(keys, values) {
        return sum(values);
    }
}

POST /transactions/oldest-pending
{
    "map": function(txn) {
        if (txn.status == "pending") {
            emit(txn._id, txn);
        }
    },
    "reduce": function(keys, values) {
        var oldest = values[0];
        values.forEach(function(txn) {
            if (txn.timestamp < oldest.timestamp) {
                oldest = txn;
            }
        });
        return oldest;
    }
}
The list of transfers might now look something like this:
{"from": "Alex", "to": "Dave", "amount": 100, "timestamp": 50, "status": "successful"} {"from": "Dave", "to": "Jane", "amount": 200, "timestamp": 60, "status": "pending"}
Next, the application will need to have a function which can resolve transactions by checking each pending transaction in order to verify that it is valid, then updating its status from "pending" to either "successful" or "rejected":
def resolve_transactions(target_timestamp):
    """ Resolves all transactions up to and including the transaction
    with timestamp ``target_timestamp``. """
    while True:
        # Get the oldest transaction which is still pending
        txn = db.get("transactions/oldest-pending")
        if txn.timestamp > target_timestamp:
            # Stop once all of the transactions up until the one we're
            # interested in have been resolved.
            break
        # Then check to see if that transaction is valid
        if db.get("transactions/balance-available", id=txn.from) >= txn.amount:
            status = "successful"
        else:
            status = "rejected"
        # Then update the status of that transaction. Note that CouchDB
        # will check the "_rev" field, only performing the update if the
        # transaction hasn't already been updated.
        txn.status = status
        db.put(txn)
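A quick note on that last comment: if two clients try to resolve the same pending transaction at once, the slower PUT will fail with a conflict. A minimal sketch of how the application might deal with that (the ConflictError name depends on the CouchDB client library being used, so treat it as a placeholder):

def update_transaction_status(txn, status):
    txn.status = status
    try:
        # CouchDB compares the document's "_rev"; if someone else has
        # already updated this transaction, the PUT fails with a conflict.
        db.put(txn)
    except ConflictError:
        # Another worker resolved this transaction first. Their result is
        # just as valid as ours, so there is nothing more to do.
        pass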
Finally, the application code for correctly performing a transfer:
def transfer(from_acct, to_acct, amount):
    timestamp = time.time()
    txn = db.post("transactions", {
        "from": from_acct,
        "to": to_acct,
        "amount": amount,
        "status": "pending",
        "timestamp": timestamp,
    })
    resolve_transactions(timestamp)
    txn = db.get("transactions/" + txn._id)
    if txn.status == "rejected":
        raise InsufficientFunds()
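And, to make the behaviour concrete, a usage sketch (the account names and balances are made up):

# Initial balances: Alex: 25, Jane: 25
transfer("Alex", "Jane", 10)       # succeeds; Alex: 15, Jane: 35

try:
    transfer("Alex", "Jane", 100)  # more than Alex has available
except InsufficientFunds:
    # The transaction is stored with status "rejected", so it never
    # counts toward either account's balance.
    pass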
A couple of notes:
- For the sake of brevity, this specific implementation assumes some amount of atomicity in CouchDB's map-reduce. Updating the code so it does not rely on that assumption is left as an exercise to the reader.
- Master/master replication and CouchDB's document sync have not been taken into consideration; they make this problem significantly more difficult.
- In a real system, using time() might result in collisions, so using something with a bit more entropy might be a good idea; maybe "%s-%s" %(time(), uuid()), or using the document's _id in the ordering. Including the time is not strictly necessary, but it helps maintain a logical ordering if multiple requests come in at about the same time (a quick sketch follows).
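For example, a minimal sketch of that suggestion (the helper name is mine):

import time
import uuid

def transaction_timestamp():
    # A wall-clock timestamp with a UUID appended, so two transactions
    # created in the same instant still get distinct ordering keys.
    return "%s-%s" % (time.time(), uuid.uuid4())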
To deploy my application, I use a collection of bash scripts known lovingly as explode. At its core, explode uses rsync to copy my source tree exactly as it appears on my laptop up to the production server:
project="myproject" src_dir="~/code/myproject" target_dir="~/myproject-deploy" ssh "$server" " /etc/init.d/$project stop; cd $target_dir; git commit -am 'pre-deploy commit'; " rsync "$src_dir/" "$server:$target_dir/code/" ssh "$server" " cd $target_dir; git commit -am 'post-deploy commit'; /etc/init.d/$project start; "
Now, at this point you're probably thinking:
- Doesn't that mean that your deploy could contain files which aren't in source control?
- What if you forget to pull before you deploy?
And those are definitely real problems.
But I've found that the popular alternative, deploying from the project's source control repository, has two significant drawbacks when deployments to environments other than production (staging, testing, demos, etc) are considered:
- Pushing quick, experimental changes is annoying when a full commit to source control is required. You can often tell when a project deploys from source control because the commit log will contain clusters of commits with messages like "trying foo", "nope, that didn't work, trying bar", ... etc.
- If the deploy script doesn't verify that there are no uncommitted changes to the local source code before deploying, the deployed code may silently differ from the local code... And if it does verify, it can make quick deploys frustrating.
Obviously these are not insurmountable problems, but I've found it easier to base my deployments on rsync and add sanity checks around production deployments ("working tree contains uncommitted changes... are you sure you want to deploy?") than it is to base my deployments on source control and add workarounds for deployments to non-production environments.
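For what it's worth, that sanity check is only a few lines. Here's a rough sketch (in Python for consistency with the earlier examples, though explode itself is bash; the function name and prompt wording are mine):

import subprocess

def confirm_clean_tree():
    # "git status --porcelain" prints one line per modified, staged, or
    # untracked file; empty output means the working tree is clean.
    dirty = subprocess.check_output(["git", "status", "--porcelain"])
    if dirty.strip():
        answer = input(
            "working tree contains uncommitted changes... "
            "are you sure you want to deploy? [y/N] "
        )
        if answer.strip().lower() != "y":
            raise SystemExit("deploy aborted")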
Now, to be clear: deploying with rsync isn't the be-all-and-end-all... It definitely has problems: rsyncing a source tree is significantly (significantly!) slower than a git push, and it's nowhere near as reproducible as pulling from master. But it works well for me :)
Oh, and the git commit that's part of the deployment? While not necessarily related to rsync, I think it's worth mentioning too: the deployment's git repository contains the entire environment, including the Python virtual environment! So if, for just about any reason, a deploy fails, git checkout HEAD^ will restore the deployment to a working state (the repository doesn't contain logs, the database, or user data, though... As much fun as that would be, it would also be slightly impractical).
tl;dr: rsync makes it super easy to deploy to non-production environments, and deploying to production can be reasonably safe if a few sanity checks are performed.
In case of Wikimergency (removing Wikipedia's blackout)
January 18, 2012 at 02:06 AM | Uncategorized

Here's a bookmarklet that will remove the SOPA blackout overlay from Wikipedia:
javascript:(function(){document.getElementById("mw-sopaOverlay").style.display = "none"; var ids = ["mw-page-base", "mw-head-base", "content", "mw-head", "mw-panel", "footer"]; for (var i = 0; i < ids.length; i += 1) document.getElementById(ids[i]).style.display = "block";})()
A curious thing I've noticed recently is that a large number of startups don't include any information about the identity of the founders/employees.
I find this strange because, in my opinion, a big advantage of working with (or using the product of) a startup is that I can get to know the person (or people) behind the scenes. They might even be someone I know, or someone a friend knows.
I would go on, but I would just be repeating what Jason Cohen says in You're a little company, now act like one.
So, if you're part of a startup: please, include your identity and the identity of the other founders/employees somewhere on your website.
I've read a small but nontrivial number of resumes in my life, and each time I've vocalized strong feelings about what makes a resume good or bad.
Some people, presumably mistaking these strong feelings for expertise, have asked me for advice when writing their resume... So I will summarize here the things which I appreciate or despise while reading a resume.
Disclaimer: these are just my feelings. It's likely that people similar to me will have similar feelings, but it's unlikely that someone like me will be the first one reading your resume.
For many positions a non-technical recruiter or HR person will be the first to see your resume. They will likely apply naive pattern matching against a given set of buzzwords to determine the quality of applicants, so a section containing buzzwords and synonyms (e.g., "Python, CPython, Django 1.3, Django 1.2, Django forms, Django admin, AJAX, ...") would likely be helpful (although, again, I can't speak with authority as I'm not a non-technical recruiter).
As I'm looking at a resume I'm trying to answer three questions: what does the applicant know? How well do they know it? Would an interview with this applicant be a waste of time?
Some of the things I like to see are:
A summary of skills, programming languages and tools sorted by experience. I can't count the number of times an applicant includes a list like this on their resume:
- Programming languages:
- Python, Java, C/C++, Objective-C, JavaScript, C#, Visual Basic, HTML, CSS, Perl, Scheme, ML, Prolog.
- Tools:
- Make, Vim, Emacs, Word, Power Point, Terminal, Linux, Firefox, Flash Player.
This kind of list is absolutely useless. Actually, no, that's not true. It's very useful. It tells me that the applicant doesn't feel the need to distinguish between their knowledge of Prolog and their knowledge of JavaScript, and they think C is similar enough to C++ to warrant listing them together.
A list ordered by experience is much more useful. For example, my list might look something like this:
- Programming languages:
- Python (5 years, 50k SLOC), ActionScript (2 years, 20k SLOC), JavaScript (5 years, 5k SLOC), HTML+CSS (5 years), C (2 years, 2k SLOC), some knowledge of PHP, Scheme, Haskell, and ML.
- Tools:
- Vim (6 years), Bash (4 years), POSIX shell environment and utilities (4 years), Linux (8 years, mostly Debian and server administration), Mercurial (3 years), git (1 year).
Alternatively, the headings "most experience with" and "some experience with" could be used, as they give some idea of both expertise and breadth.
It's also worth noting that, at least in students, self assessment of knowledge is useless, so when I see words like "expert" or "very good" in a resume I mentally replace them with "dangerously ignorant".

Relevant links. I feel like this should go without saying, but I would guess that only 10% of the resumes I have read included any kind of link.
I appreciate any links to an applicant's online presence: a blog or personal website (if it has been updated in the last five years) is good, and an account on GitHub, BitBucket, StackOverflow, Reddit or other social site is even better. It's hard to convey interpersonal and communication skills on a resume, but social sites are a good way for applicants to demonstrate those skills.
And a corollary: irrelevant links (for example, to a "home page" without any information that isn't already in the resume) are worse than useless - they waste my time and make the applicant look silly.
Length. I don't care. After the summary of skills, languages, and tools I'm usually skimming for interesting things... And if the applicant has three pages of interesting things, I would like to read those three pages. But if the applicant has three boring pages, I'll just skim over them.
Relevant work experience. Specifically, specific work experience. A statement like "worked on a team" or "helped implement" isn't as helpful as "wrote server-side JavaScript" or "designed the database schema". Also, while others might, I don't care about unrelated work experience.
It's also useful to include the software/tools used at a job. For example: "MegaCorp - 2001 - 2003: Designed database schema to optimize sale transactions; performed A/B testing to optimize widget sales; software used: PostgreSQL, PHP, libABTestOMatic"
School. As a university dropout, I'm a bit biased here... But I care very little about school or degrees unless they are directly relevant to the job at hand. For example, if the position I'm hiring for would benefit from a strong knowledge of a particular academic field, formal education in that field would be important... But, in general, I haven't found formal education to be very important for the types of programming that I do, or the types of developers I hire.
Look, feel, and little details of the resume are important too. I want a programmer who cares as much as I do about consistency, attention to detail and never, ever, under any circumstances mixing tabs and spaces. If an applicant's resume has bullets that don't line up, text in random fonts, and consistently missing punctuation, I can't help but assume that their code will look similar.
Cover letters. I have read one or two very good cover letters where the applicant gave specific reasons they were interested in the job, and why they would be worth my time to interview. They were short, to the point, and distinctly lacking in phrases such as "I would make an excellent …". I appreciated these cover letters.
I have skimmed and subsequently ignored all other cover letters.
tl;dr: I want a resume to show me in as much detail as is sensible what the applicant has done, what they know, and why they are a good programmer. Bonus points for interesting content and links to social sites like StackOverflow or Reddit.