MobileMe Problems Show Apple Needs an Infrastructure Lesson

August 8 update: I’ve written some followup thoughts on this message here.

MobileMe Problems Show Apple Needs an Infrastructure Lesson – GigaOM:


Steve Jobs, in an internal email seen by Ars Technica, makes clear that he’s upset about the botched launch of MobileMe, Apple’s new online suite of applications that has been plagued with bugs, including being flat-out unavailable to some for days at a time.

Or as I have been saying to folks here at work, “just imagine Steve Jobs wandering the hall with a flame thrower in hand, asking random people ‘do you work on MobileMe?’”

I expect a bunch of friends and people I know were involved in that project, and I feel really bad for them. But the reality is, the thing wasn’t ready and the release got botched. And Steve and Apple aren’t terribly tolerant of that kind of major screwup. I expect heads have rolled and there are a few tanned hides waiting for the welts to go away.


“It was a mistake to launch MobileMe at the same time as iPhone 3G, iPhone 2.0 software and the App Store,” he says. “We all had more than enough to do, and MobileMe could have been delayed without consequence.”


There are two aspects to this. Steve is absolutely right, but also remember that ultimately it was Steve’s call to go live (or not). He’s never been afraid to say “this ain’t ready” and pull something from release; his rehearsals for Macworld keynotes are legendary (and sometimes brutal), and stuff has literally disappeared in the last 24 hours if he wasn’t satisfied with it.

So Steve has some responsibility here as well, but with a caveat: someone he depended on to tell him what reality was told him it was ready to roll, and Steve believed him. And whoever told him that was wrong, and made everyone (including Steve and Apple) look bad. That’s not a good way to advance your career at Apple.


In his email, Jobs says: “The MobileMe launch clearly demonstrates that we have more to learn about Internet services.” You can say that again. The big question in the wake of the MobileMe debacle is whether or not the company even knows how to plan for heavy load.

Or not. Gruber nails this (see below). MobileMe is a tiny thing compared to iTunes. Apple gets it, and executes it amazingly well.

That this release was botched isn’t about Apple not having a clue, but about the MobileMe people blowing it (I can think of any number of scenarios; scaling is hard). The technical failure seems to have been capacity planning mistakes more than anything else, if I’m guessing right. But the ultimate failure was not being willing to tell Steve “we aren’t ready” and take that heat. They thought they could release and make it work, and guessed very wrong (or thought they were in good shape, which is worse).


I have picked up some tidbits from my Internet infrastructure sources, who tell me that:

* There is no unified IT plan vis-a-vis applications; each has their own set of servers, IT practices and release scenarios.
* Developers do testing, load testing and infrastructure planning, all of which is implemented by someone else.
* There’s no unified monitoring system.
* They use Oracle on Sun servers for the databases and everything has its own SAN storage. They do not use active Oracle RAC; it is all single-instance, on one box, with a secondary failover.
* Apparently they are putting web servers and app servers on the same machines, which causes performance problems.

One of my sources opined that Apple clearly wasn’t too savvy about all the progress made in infrastructure over the past few years. If this insinuation is indeed true, then there is no way Apple can get over its current spate of problems. It needs a crash course in infrastructure and Internet services. Apple’s problem is that it doesn’t seem to have recognized the fact that it’s in the business of network-enabled hardware.

Not completely true, not necessarily a bad thing.

Some areas of Apple “run their own show”, effectively using Apple’s IT datacenters as a hosting facility. Others build and operate within Apple’s IT infrastructure. One of the groups that basically runs its own IT outside of Apple’s core IT group is Eddy Cue’s group, because of the way the stuff Eddy is in charge of gets built and managed.

There are unified monitoring services — and each service also tends to run a layer above that to monitor specific details. That’s not a negative.

The Oracle/Sun single-instance thing? True to a degree, but I don’t see it as a negative. And don’t forget, Apple runs the largest global single-instance SAP environment on this stuff. It’s not exactly doing things wrong.

The bottom line is — Apple’s got its act together here better than these informants want to imply. The failures aren’t because Apple doesn’t know how to do this — it does — it’s because this project got botched.

And now Eddy has been brought in to fix it, which means it’s going to get fixed.

Eddy’s name isn’t familiar to most Apple people, but he’s in his way as important to Apple’s success as Jonathan Ive. His specialty: the back-end infrastructures that make Apple’s online universe tick. His groups did the Apple online store, iTools (later .Mac), iTunes store, etc, etc. It’s the not-sexy part of the company, but it’s the guts that make all of the sexy front ends actually work.

I’m actually amazed that Eddy hasn’t been poached by a startup, much as I’m amazed that Tim Cook hasn’t been poached, but the reality is that if you survive and become one of Steve’s inner core of people he trusts (and that ain’t easy), you tend to stay. Apple execs don’t often get poached by startups or other places; anyone notice?

A lot of that is because it’s not easy working for Steve, but if you can do it, you get to do really great stuff, and that’s addictive. Trust me. You just don’t see people running off from Apple to CEO a startup the way you do from Yahoo or Google, not out of the top few levels of the company.

Eddy’s real specialty is to be able to take what Steve asks for, implement it, hit the target dates, make it work, and KEEP THE DAMN THING A SECRET UNTIL STEVE ANNOUNCES IT. That’s a big reason why his team is self-contained. It also means his people can do what needs to be done to implement things that never existed before and which don’t fit into normal IT “this is how we do things” standards. He and his teams spend most of their time off in uncharted territory, where being innovative and flexible is a must, and yet they have to do it on huge scales.

On the other hand, Eddy’s no easier to work with than Steve is, for obvious reasons. I invariably warned people not to hire into his groups unless they wanted to donate their life to the cause. When I was there, I worked pretty closely with various parts of his world, and it was populated in equal parts with people who were just as maniacal about this as Eddy and Steve, and people who were in the process of burning out. Not much middle ground (but it works).

(Full disclosure time: Laurie worked with Eddy way back when. Me, I once almost got re-orged into his world until management remembered my vow to die before working for him and re-arranged reality to fit; otherwise, lists.apple.com never would have existed. But I had a chance to deal with him while I was there, and I’ve got a lot more respect for him now than I used to. I still wouldn’t want to work in the kind of grind his organization demands, but it does pretty good work under really scary conditions.)

So you can bet, MobileMe will get fixed.


The looks, UI and edge devices are only as good as the networking experience — whether it comes from Apple or from its partners. MobileMe could just be the canary in the coal mine as far as the Cupertino Kingdom is concerned. MobileMe isn’t that big a portion of their revenues right now, but what happens when the problems hit the iTunes store? Imagine the uproar when your 3G connections slow to a crawl because AT&T’s wireless backhaul can’t handle the traffic surge.

It might not be a problem of Apple’s making but the company will face the brunt of the backlash. Remember, most of us instinctively blame the device first, then curse the carrier.

Daring Fireball has the right view here:

Daring Fireball Linked List: Om Malik on MobileMe’s Infrastructure:


But the iTunes Store does gangbuster traffic and has terrific track record for uptime. The message I read from yesterday’s reorg that put MobileMe under Eddy Cue (Apple’s VP for iTunes) is that MobileMe could and should be as responsive and reliable as the iTunes Store.

Om just doesn’t know the Apple internals very well. This wasn’t Apple failing, it was one group within Apple blowing chunks. That happens. Remember when Aperture was the state of the art? Now it’s fighting to catch up with Lightroom, and may simply never regain that dominance. Oh well.

Apple has the expertise; this isn’t a case of MobileMe problems crawling out into iTunes, but of Apple bringing the iTunes expertise into MobileMe. And having thrown Eddy Cue at the problem, that’s exactly what’s going to happen here.

  • http://tewha.net Steven Fisher

    For those commenters who suggested I was wrong in not accepting the iTunes store as a gold standard for up-time, the Canadian store is actually very slow and failing about 25% of the time RIGHT NOW.

  • shirsz

    In light of your points, perhaps Steve’s email and his words are a direct lesson to everyone that if something’s not ready, sure u might get screamed at BUT that’ll be the worst of it & we can always work around the problems & find solutions, no matter what.

  • Rick

    Was Eddy in charge of dot Mac? Surely that would be a negative on any resume?

  • walter

    i have never had ONE problem with iTunes. it has always just worked for me. I was an early user of iTools > then Dot Mac . and now MobileMe. I really liked dot Mac. except for backup which never worked for me. I had problems with MobileMe for a few days in that i was not able to delete some custom pages under Homepage. One email to support solved that issue. So the transition went smooth for me… But it is obvious that they will have to make improvements. It can be clunky and slow at times. Overall i think it’s at version .09 level. I look forward to version 1.5 under Cue. p.s. Why does the cal pane show the date but the Mail pane can’t show number of unread messages… come on apple guys & gals get your asses in gear and fix that ;=)

  • http://profile.typekey.com/danielodonnell/ DanO

    The engineering (human) infrastructure may need some attention too, at least inasmuch as some engineering resources need to be applied in places other than the back end infrastructure. BSM (auditing) and CAC (Common Access Card, a type of smartcard) are still broken in 10.5.x and have been broken for nearly two months now. As long as these don’t work, there is a niche market of users who cannot use machines – which means Apple does not make sales in those spaces (Federal and financial).

  • Bud

    The apparent randomness of the iTunes billing cycle mentioned in the comments? I think that is probably just matched to the randomness of one’s purchase habits, and when it makes sense to run invoices. Sometimes it seems they would rather give you one big invoice than many tiny ones. I have never heard anyone complain. It’s volume vs. scheduled billing; one way or the other it will work hand in hand in a sensible fashion.

  • Oliver

    …I must say this is the best comment on the whole MobileMe-issue I read so far…and believe me I read a couple of them…thanks for your thoughts about it and also thanks for the wonderful intro…It was just what I thought reading the lines of Steve’s mail…
    …let’s see if Eddy gets things fixed – what I believe, in total agreement with you, he (and his team of course) will – and probably also – with a better capacity planning this time – we’re going to see real push services coming back to MobileMe in the near future…
    Oliver

  • Paul Lambert

    That’s a very good guess, Matt—the “wait time” does vary by the size of the transaction, to consolidate credit card transactions.
    But my point remains: most of the iTunes store’s customer-driven transaction processing is handled by a different group.
    Eddy Cue’s iTunes team has done an amazing job in their design of the iTunes store. Their systems run very well, with very few staff, and they give publishers and content/editorial staff some of the best interfaces you could imagine, for making the store look good.
    They’re a very talented crew, and they’ve got great management to lead them, including Eddy.
    Scaling MobileMe is a real challenge: it can’t be cached. There will often be a lot more writes to MobileMe than reads, and that’s a whole new ballgame.
    I was disagreeing with Chuq’s suggestion that they already have experience with a site like MobileMe. They don’t. Can they make it work? Sure! If anyone can do it with the resources that they have, Eddy’s teams can.
    But it won’t be as simple as “reproduce what they did for the iTunes store” because they can’t just pass on writes to another group’s services.
    They haven’t yet had to design a system with responses that can’t be cached for at least a few minutes. The sheer number of writes, if MobileMe gets as large as it easily could, will be staggering. I don’t envy them the work. (Well, maybe I do. I actually love this performance scaling stuff…)
    Paul

  • Matt White

    I would actually guess that iTunes doesn’t process some card transactions immediately for an entirely different reason… If they wait a day or so for you to buy more and they bunch the transactions into one it saves them a ton in credit card processing fees. That counts for a lot with micropayments, on a percentage basis. I’ve noticed that if I buy something expensive (say, an entire album), the charge happens almost instantly. So, I wouldn’t use that as an example of poor performance on the part of the iTunes system.
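
Matt’s fee argument is easy to check with arithmetic. The sketch below assumes a card fee of a fixed 30 cents plus 2.9% per charge; those figures are placeholders chosen for illustration, not Apple’s actual rates, and `card_fee_cents` and `total_fee_cents` are hypothetical names.

```python
def card_fee_cents(amount_cents, fixed_cents=30, percent=2.9):
    """Fee for one card charge: a fixed part plus a percentage.
    Both rates are assumed values for illustration only."""
    return fixed_cents + amount_cents * percent / 100


def total_fee_cents(purchases_cents, batch):
    """Total fees when purchases are charged one by one or as a single batch."""
    if batch:
        # One charge for the combined amount: the fixed fee is paid once.
        return card_fee_cents(sum(purchases_cents))
    # One charge per purchase: the fixed fee is paid every time.
    return sum(card_fee_cents(p) for p in purchases_cents)


# Ten 99-cent songs bought over a day or two.
songs = [99] * 10
separate = total_fee_cents(songs, batch=False)
batched = total_fee_cents(songs, batch=True)
print(f"separate: {separate:.2f} cents, batched: {batched:.2f} cents")
```

Under these assumed rates, ten separate 99-cent charges cost about 329 cents in fees, against about 59 cents for one combined charge. The fixed per-charge component is what makes batching matter so much for micropayments.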

  • http://tewha.net Steven Fisher

    @jbb
    I actually wasn’t referring to iPhone activation — I don’t even have one — but rather the iTunes Store in general. There were sections that were erroring out 50% of the time all weekend.

  • http://tewha.net Steven Fisher

    @tim
    I was actually just using it as an example of an event I could point to. I regularly see iTunes Store errors here. Maybe the Canadian one is less stable, but I always thought they shared the same back end.

  • Paul Lambert

    Glenn, of course Apple batches transactions. But I don’t think that’s handled by Eddy Cue’s organization, and neither are the gift cards, and other OLTP backends.
    There’s certainly no policy to “sometimes hold transactions for 7 days to batch them, but other times run them through within 24 hours.” I don’t think I’m crazy to suggest that 24 hours is a sign of a policy, and 7 days is a sign of a process failure.
    To their credit, I’m pretty sure I’ve never not been charged. But the occasional long waits are far too random to be a purposeful policy.

  • http://www.focal-pointe.com Makea

    Aperture was state of the art: it was the only thing like it when it debuted.
    In fact, it was so ahead of the curve, Adobe rushed Lightroom out as a public beta.
    Apple squandered its lead with bad decisions (poor raw processing, inflexible image storage workflow, poorly written SQL queries which led to horrible performance).
    Apple is only now turning course by fixing nearly all of the issues with version 1. But they haven’t learned their lesson. The image editing plug-in architecture sounds awesome, but in practice its current implementation has no place in a ‘workflow application’.
    I’d say both products are comparable with their own advantages.
    However, Adobe definitely has the momentum and market share right now.

  • jbb

    @Steven Fisher:
    The iPhone activation server, while accessed via the iTunes client, is entirely separate from iTunes the service. My understanding is that albert (the activation server) is not within Eddy’s purview. At least it wasn’t on July 11. That’s not to say it wouldn’t have failed even if it was — it was a lot of demand all at one time — but using that event to tar iTunes is faulty.

  • akatsuki

    Of course, the bigger issue, is why?
    I understand they want to create another subscription revenue model, but giving a bunch of web apps that don’t do much differently than the free Google world (with the exception of OTA sync) seems pointless.
    I am sure they will fix it, all the while never really figuring out that they missed out on a ton of visionary opportunities to actually innovate in the online space. I really do believe that along with games, web services is Jobs’s other big weak point.

  • Pierce Wetter

    About 2 years ago now, I interviewed with the iTMS group for a WebObjects position. I was coming down with the flu the day I interviewed, so I only did fair, so they didn’t hire me.
    Which turned out to be a good thing because at the time, it struck me that all of their questions were detail-oriented, with few system-level questions. As a WebObjects developer, you need to be aware of how your code impacts the whole system; the big picture can often be more important than the details. But being the only system thinker in a detail-oriented shop sucks.
    Ever since then as I’ve watched Apple have teething problems, I’ve thought, yep, no system guys.

  • Jesse David Hollington

    If we’re going to talk about the iPhone v2.0 launch as well, however, remember that Apple weren’t the only ones with serious scalability problems that day.
    Rogers and Fido’s systems basically laid down and died for almost 12 hours due to an inability to keep up with the traffic.
    This was completely separate from Apple’s iTunes Activation Servers — Rogers staff couldn’t even look up people’s accounts to confirm whether they were eligible for hardware upgrades in the first place, much less actually sell them an iPhone.
    I have been working in the IT field for over 20 years now, and have managed various large-scale projects. Capacity planning and scalability are always elusive goals even for the most talented IT folks I’ve ever dealt with…. Sometimes you don’t know how much load you’re going to be able to put on something until it’s actually *there* and with new and custom-developed infrastructure you don’t always have industry-standard metrics to fall back on.
    Combined with more than a reasonable number of “Yes-men” in the industry who are willing to stick their necks out WAY farther than they should, this is often a recipe for disaster. I have seen the same debacle in 20,000-user organizations for the same basic reasons.

  • Kay

    I don’t remember Aperture ever being anywhere near state-of-the-art. I do remember a rapid fire series of price cuts when it failed to shift enough units, though.

  • tim

    @Steven Fisher
    Your type of response reminds me how few have this thing called perspective. During a massive release met with long lines – the activation software slowed down. People’s phones got activated. Within a day or so things returned to normal. It wasn’t a debacle by any stretch of the imagination. I cannot imagine another company that could handle that type of event so well. And also – last year during the initial iPhone launch – there were issues. Phones took longer to get activated than this year. And yet – no one remembers it. iTunes works day in and day out and has a proven track record.
    @Chuqui
    Excellent balanced article.

  • Andrew

    Paul Lambert nails it, above: there are very few orgs who can really do a fat suite of fully dynamic web apps and launch to a huge audience on day 1. Everyone else got there by doing public betas or growing slowly with users or features.
    Apple tried to be the first, ever, and stumbled. They’ll get it fixed, no question about it. But it’s not trivial stuff.

  • Ulysses

    re: Credit card transaction delay – I actually like this, and wish AmazonMP3 would copy it. With Apple, after 1-2 days of purchases, I get one charge. With Amazon, I get dozens of little .89 and .99 cent purchases and it’s a lot to filter through when scanning my statement for bogus charges.
    As for Eddy, if he’s got anything to do with the incredibly small font used in iTunes reviews, he needs to fix that. We’re talking like 6pt, and it’s unchangeable. Either make it bigger, or tell the iTunes team to give us a hidden Minimum font size pref.

  • http://tewha.net Steven Fisher

    I find it curious that people would hold up the iTunes Store as a good example of an Apple service done well. The release of iPhone OS 2.0 was not too long ago, and the store aspect of that release was a complete debacle. That wasn’t the first time the store went down hard, and I’m sure it won’t be the last.
    But sure, it’s better than MobileMe. And it might be the best example Apple has of a service staying together. But it is definitely not a great example overall.

  • http://profile.typekey.com/glennf/ Glenn Fleishman

    I laughed out loud at your intro. That made me think of the black-oil aliens in X Files convincing everyone to meet on a bridge and then ZAPPO.

  • mark

    Paul,
    I think the credit card transaction delay example is a bad one, as it’s by design. Apple aims to gather up multiple purchases for a single credit card transaction to reduce fees.

  • Paul Lambert

    Actually, the iTunes store has pretty trivial scaling problems, compared to MobileMe.
    Sure, it gets lots of hits. But they’re read only.
    What happens when a response is “stale” by 10 minutes? Nothing. You can cache the heck out of it, and no one will care.
    Does Eddy’s team handle the credit card transactions, the SAP server, gift cards, and all the other write traffic? Or does that go to another group entirely? You know the answer to that one. :-)
    If the only writes they handle are the comments, playlist updates, purchases, etc., then that can’t be much traffic at all.
    And those aren’t handled well, by most standards. How long until a transaction is posted to someone’s credit card? I, personally, have bought music that hasn’t posted to my credit card for over 7 days. It wasn’t even Christmas, either.
    For the 99.999% case, they’re working with content that is, by definition, already backed up, and they’re serving it to people who don’t care if they get a response that is stale by a few minutes.
    MobileMe is a different beast entirely. If a change doesn’t show up immediately, a user is likely to think it has failed.
    You can’t scale that by just adding more front ends, and pushing the writes to another group.
    Paul
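
Paul’s read-versus-write distinction can be sketched in a few lines. This is a minimal illustration, not a description of how the iTunes Store or MobileMe actually works: the `TTLCache` class, the `fetch` callback, the `album:42` key and the ten-minute lifetime are all assumptions made up for the example.

```python
import time


class TTLCache:
    """Serve cached values for up to ttl_seconds before refetching.
    Hypothetical helper for illustration only."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch      # called on a miss; stands in for the backend
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the example can fake time
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # possibly stale, but no backend hit
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value


# Assumed stand-in for a backend store, with a log of how often it is read.
backend = {"album:42": "price $9.99"}
backend_reads = []


def fetch(key):
    backend_reads.append(key)
    return backend[key]


fake_now = [0.0]
cache = TTLCache(fetch, ttl_seconds=600, clock=lambda: fake_now[0])

first = cache.get("album:42")        # miss: goes to the backend
backend["album:42"] = "price $7.99"  # a write lands in the backend
second = cache.get("album:42")       # hit: still returns the old value
fake_now[0] = 601.0                  # ten minutes later
third = cache.get("album:42")        # expired: refetched, new value
```

For a store catalog page, the stale `second` read is harmless and saves a backend hit. For a calendar entry or contact the user just edited, the same stale read looks like a lost change, which is why a write-heavy service cannot lean on this kind of cache the way a read-mostly store can.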