Data Deposit Box instead of data portability

I've been ranting of late about the dangers inherent in "Data Portability," which I would like to rename BEPSI to avoid using the motherhood word "portability" for something that really has a strong dark side as well as a light side.

But it's also important to come up with an alternative. I think the best alternative may lie in what I would call a "data deposit box" (formerly "data hosting.") It's a layered system, with a data layer and an application layer on top. Instead of copying the data to the applications, bring the applications to the data.

A data deposit box approach has your personal data stored on a server chosen by you. That server's duty is not to exploit your data, but rather to protect it. That's what you're paying for. Legally, you "own" it, either directly, or in the same sense as you have legal rights when renting an apartment -- or a safety deposit box.

Your data box's job is to perform actions on your data. Rather than giving copies of your data out to a thousand companies (the Facebook and Data Portability approach) you host the data and perform actions on it, programmed by those companies who are developing useful social applications.

As such, you don't join a site like Facebook or LinkedIn. Rather, companies like those build applications and application containers which can run on your data. They don't get the data, rather they write code that works with the data and runs in a protected sandbox on your data host -- and then displays the results directly to you.

To take a simple example, imagine a social application wishes to send a message to all your friends who live within 100 miles of you. Using permission tokens provided by you, it is able to connect to your data host and ask it to create that subset of your friend network, and then e-mail a message to that subset. It never sees the friend network at all. Let's say it wants to display that subset of friends to you. It puts in the query, and then directs your web browser to embed in the page a frame fetched from the data host keyed to that operation. The data host talks directly to your browser, the application company again never sees the data. (Your data host knows you and your browser via a cookie or similar technology.) This would also apply to custom applications running on your home PC -- they could also ask the data host to perform actions.
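As a concrete sketch of that division of labour, here is roughly what the data host might run on the application's behalf. This is illustrative Python only; the function names, the token check and the precomputed distance field are assumptions, not a spec.

```python
# Hypothetical data-host operation: the application supplies a permission
# token and a request; only an aggregate result ever leaves the host.
from dataclasses import dataclass

@dataclass
class Friend:
    name: str
    email: str
    distance_miles: float   # distance from the user, precomputed by the host

def run_nearby_mailing(token: str, granted_tokens: set[str],
                       friends: list[Friend], subject: str, body: str) -> int:
    """Runs entirely on the data host. The calling application learns how
    many messages went out, never the friend list itself."""
    if token not in granted_tokens:
        raise PermissionError("application holds no valid permission token")
    nearby = [f for f in friends if f.distance_miles <= 100]
    for f in nearby:
        send_mail(f.email, subject, body)   # mail goes out from the host
    return len(nearby)                      # only an aggregate count is returned

def send_mail(to: str, subject: str, body: str) -> None:
    pass   # placeholder: handed off to the host's own mail facility
```

The same pattern covers the display case: the host renders the result into HTML and serves it straight into the embedded frame.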

The problem with a first-level (API) approach to this is that one could never design a remote-procedure-call style API able to do everything that application developers can think of. The API would always lag behind the innovation, and different data hosting companies would support different generations of the API.

So I believe we may want to consider a less pure approach. In this approach, the data host runs a protected virtual machine environment, similar in nature to the Java virtual machine. This environment has complete access to the data, and can do anything with it that you want to authorize. The developers provide little applets which run on your data host and provide the functionality. Inside the virtual machine is a capability-based security environment which precisely controls what the applets can see and do. In addition, data on your host is stored encrypted so that even the data host can't access it. Rather, when you enable an application to run on your data, you give it access tokens (capability handles), which are decryption keys. When it wants to work on your data, it sends those decryption keys along with its applet, and only then is it physically able to access the data. The system should be devised so that access can be readily revoked for any party.
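A minimal sketch of the "capability handles are decryption keys" idea, using the Fernet primitive from Python's cryptography package purely as a stand-in for whatever scheme a real data host would adopt:

```python
# Sketch only: each data compartment is encrypted under its own key, the host
# stores ciphertext, and granting an applet access means handing it the key.
from cryptography.fernet import Fernet

friend_list_key = Fernet.generate_key()   # held by the user, not the host
stored_ciphertext = Fernet(friend_list_key).encrypt(b'["alice", "bob", "carol"]')

def run_applet(applet, capability_key: bytes, ciphertext: bytes):
    """The applet arrives with the capability handle granted by the user;
    only then can the sandbox materialise the plaintext for it."""
    plaintext = Fernet(capability_key).decrypt(ciphertext)
    return applet(plaintext)              # executes inside the sandbox

# Revocation would mean re-encrypting the compartment under a fresh key and
# withholding the new key from the applet in question.
```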

Now, even with this level of security, I think it would still be possible for a malicious applet developer to break the rules, and find ways to grab your data and copy it somewhere beyond your control. However, we've changed the rules of the game greatly. Currently, everybody is copying your data (and your friends' data) just as a matter of course. That's the default. They would have to work very hard not to keep a copy. In the data hosting model, they would have to work extra hard, and maliciously, and in violation of contract, to make a copy of your data. Changing it from an implicit act to an overt one can make all the difference.

You could have more than one data host, and even use different hosts for different personas. You could also personally run the data host if you have always-on facilities or don't need to roam to other computers. Or you might distribute your data hosting, so a computer in your own house is the data host when you are online, but a mirrored computer you rent space on elsewhere is the data host when you are roaming.

Social Graph Apps

Your database would store your own personal data, and the data your connections have decided to reveal to you. In addition, you would subscribe to a feed of changes from all friends on their data. This allows applications that just run on your immediate social network to run entirely in the data hosting server.

This approach still presents a problem, however, for social graph applications that go beyond dealing with friends to friends-of-friends (FoF) and friends-of-friends-of-friends (FoFoF). Fortunately, it has turned out that even FoF has spawned very few useful apps, and FoFoF has spawned perhaps one or two. It may be the case that the added security of data hosting outweighs the cost of making social graph apps harder.

Applications can traverse social graphs in one of two ways. Each user can generate an identity for an application or group of applications, and pass this to their connections. Each connection can get an ID made by combining these two identity codes. In that way, a connection can only be understood by applications that are trusted by both ends of the connection. Connections can be made public or semi-public, but applications can only understand the links between parties that have approved the application. The second way is for those who wish to use such applications to accept that they must make their graph more available to those apps. The people providing social graph search would get access to large groups of data, but because this is the exception, and not the rule, they could be held to higher standards of data protection when they do this. This provides functionality like we have today, but with better oversight and more isolation of the apps that truly need the data.
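A rough sketch of the first scheme, using hashed per-application pseudonyms; the construction below is illustrative only, not a proposal for the actual codes:

```python
# Each user mints a per-application identity code and shares it with their
# connections; an edge is labelled by a digest of the two codes, so only
# applications trusted by *both* ends can recognise the connection.
import hashlib

def app_identity(user_secret: bytes, app_id: str) -> bytes:
    """A per-user, per-application pseudonym (hypothetical construction)."""
    return hashlib.sha256(user_secret + app_id.encode()).digest()

def connection_id(my_code: bytes, their_code: bytes) -> str:
    """Order-independent ID for the edge between two pseudonyms."""
    a, b = sorted([my_code, their_code])
    return hashlib.sha256(a + b).hexdigest()
```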

Trusted 3rd party "external data" apps

This architecture does make it more expensive to produce apps that act as a trusted 3rd party, allowed to know things you would not actually disclose to your contacts. A typical example would be a "crush" app, where you can declare crushes which are only revealed when mutual. (There are ways to do this particular app without a 3rd party, but I am not sure that applies to the general problem.)
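For concreteness, here is a toy sketch of the trusted-3rd-party version, where the external matcher holds only opaque declarations. The hashing limits, but does not eliminate, what the matcher learns, which is exactly why such apps deserve the extra scrutiny discussed next.

```python
# Toy "crush" matcher run by an external trusted party; identifiers and the
# hashing scheme are illustrative only.
import hashlib

declarations: set[str] = set()   # held by the external matching service

def declare(my_id: str, target_id: str) -> bool:
    """Record 'my_id has a crush on target_id'; return True if it is mutual."""
    mine = hashlib.sha256(f"{my_id}->{target_id}".encode()).hexdigest()
    theirs = hashlib.sha256(f"{target_id}->{my_id}".encode()).hexdigest()
    declarations.add(mine)
    return theirs in declarations
```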

So it probably will come to pass that some apps will have a legitimate need to combine data over a large network of users. The goal is to make such applications the exception, rather than the rule. It is not to make it impossible to build apps that users want.

However, sites that want to keep the data can get extra scrutiny, and promise to remove it on demand.

Much simpler layers

For those who think this layered approach would be too difficult, consider that, in a way, it is already in use. Today, most personal data apps keep all the data in an SQL database, and the applications involve doing queries on the database, processing the results, and displaying them. The results are rarely remembered.

The layered approach simply involves putting the database under different ownership and rules from the application, and getting stricter about not remembering. While there are applications that need to remember things, this would again become the exception, rather than the norm.
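As a minimal illustration of that existing pattern (assuming a hypothetical friends table with name and distance_miles columns), the application layer queries, renders, and keeps nothing:

```python
# Query the data layer, render a view, discard the result set.
import sqlite3

def render_nearby_friends(db_path: str) -> str:
    conn = sqlite3.connect(db_path)   # the data layer, under its own ownership and rules
    try:
        rows = conn.execute(
            "SELECT name FROM friends WHERE distance_miles <= ?", (100,)
        ).fetchall()
        # the application formats a page; nothing is persisted on its side
        return "<ul>" + "".join(f"<li>{name}</li>" for (name,) in rows) + "</ul>"
    finally:
        conn.close()
```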

Building this infrastructure

Something must pay for this data hosting, and it generally needs to be on quality servers with good bandwidth, security and redundancy.

It would be ideal if users paid for it themselves, perhaps as part of the services they buy from their ISP, the same way they usually get web hosting. However, user-pay business models are rare on the internet today outside the ISP itself.

As such, it is expected that application providers will want a way to pay for data hosting for their users. They will want to offer it (seamlessly) to users who do not have data hosting that will serve this particular application. In the extreme case, we could end up with applications offering free hosting only for their own use, which is effectively how things are today, but even the concept of a firewall between data and application could have value.

While micropayments are not usually very useful, because of the human cost of thinking about small payments, they can make sense for corporate settlements. Data hosts not being paid by the user might accept requests that come with small micropayment tokens, if these could be standardized: a basic token worth enough to cover the CPU and bandwidth of a typical request. If this keeps prices at market levels, there is no reason companies should not feel happy to pay for outsourced hosting.
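A loose sketch of what such a standardized request token might carry; the fields, signature scheme and settlement flow here are entirely hypothetical:

```python
# A request arrives with a small settlement token the data host can redeem
# later in a batched, company-to-company settlement.
from dataclasses import dataclass

@dataclass
class RequestToken:
    issuer: str          # the application company paying for the hosting
    value_cents: float   # roughly the CPU + bandwidth cost of one request
    signature: bytes     # issuer's signature, checked before any work is done

def handle_paid_request(token: RequestToken, operation):
    if not verify_signature(token):
        raise PermissionError("unfunded request refused")
    result = operation()            # run the applet's query
    queue_for_settlement(token)     # corporate settlement, not per-user billing
    return result

def verify_signature(token: RequestToken) -> bool:
    return True                     # placeholder for a real signature check

def queue_for_settlement(token: RequestToken) -> None:
    pass                            # placeholder for the clearing process
```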

There is of course a danger that data hosts would appear which offer hosting for free in exchange for the right to exploit the data. Clearly this is what we're trying to avoid, but it still could be better to have just one such company holding your data, so you can keep watch on what it does and actually try to understand its contract, than to have 100 companies do it, with no time to consider the contracts.

It actually makes sense for users to join together on shared hosts for a variety of reasons, one of which may be shared negotiation. When a new cool external data application appears, and it wants users to authorize it for everything, large data hosts can negotiate how much data the application actually needs.

Updates

Local hosting on your own PC

One interesting possible architecture is local hosting on your own PC, with sync to an external data host for roaming and exchanging updates with contacts.

In this case, you go to the social networking site, which embeds an application. The application is an iframe sourced from yourname.datahostdns.com:port or similar. When at home, this resolves to localhost (127.0.0.1), meaning your own machine, where a data hosting server is running. Your own machine's data host looks at the request, which may trigger it to connect to the application's server for new code or data, but in many cases the code will be cached locally. (The URL would include the version number of the app it specifies, to save even that lookup.)

Then your own machine performs the operation and feeds back the resulting HTML to the embedded frame in your browser -- all very fast and all local.
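Here is a rough sketch of that local data host: a small server on 127.0.0.1 that answers the embedded frame's request, using cached applet code when it has it. The port, URL format and helper functions are all made up for illustration.

```python
# Minimal local data host: serves the iframe from your own machine, fetching
# applet code from the application's server only when it isn't cached.
from http.server import BaseHTTPRequestHandler, HTTPServer

APP_CACHE: dict[str, str] = {}   # "app@version" -> applet code, cached locally

class DataHostHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. /app/friendmap@1.4?op=nearby -- the version in the URL lets a
        # cached copy be used without even checking the app's server
        app_key = self.path.split("/")[-1].split("?")[0]
        code = APP_CACHE.get(app_key) or fetch_applet(app_key)
        APP_CACHE[app_key] = code
        html = run_sandboxed(code, self.path)   # runs against local data only
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(html.encode())

def fetch_applet(app_key: str) -> str:
    return "<p>applet placeholder</p>"   # placeholder: would pull signed code remotely

def run_sandboxed(code: str, request_path: str) -> str:
    return code                          # placeholder: would execute in the sandbox

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8800), DataHostHandler).serve_forever()
```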

Cloud data hosting is now much simpler and cheaper. It is used as a cloud cache. It handles feed updates with your contacts, syncs data with your personal workstations, or acts as your server when you are roaming to untrusted machines. It performs any other operations that require an always-on server, since your PC is not that. (Google's Browser Sync is a proto-example of this model.) The cloud-based host can also provide data hosting for users who can't or won't install a data hosting app on their PC, which may include most corporate users at the start. This is vastly cheaper to operate, and thus it's easier to finance this infrastructure.

This approach requires that you be able to really sandbox the social app code. Though it's interesting to consider that one reason we don't trust random code on our machines is to protect our personal data, so we shouldn't drive all our personal data into the hands of 3rd parties in the name of protecting it. We're not talking about running random code in your sandbox though, but only code from application developers you have decided to trust. In that sense it's not much different from running ActiveX controls or applets in your browser.

Another interesting approach would be to integrate data hosting into a home router (with a USB port so it can connect to a large flash drive). As long as apps aren't super CPU intensive, this could provide an always-on server that's in your house and isolated from your PCs.

Open Social

Kevin Marks writes that plans for OpenSocial involve implementing many elements of this architecture. I hope that's true, and I have more to learn about OpenSocial, since my first-blush evaluation mostly saw a repurposing of the Google Gadget framework, though the plans for the future go further.

More pros and cons

See this follow-up article on the pros and cons of data hosting, under the new data hosting tag.

Comments

I agree with most of your arguments, but am not sure about the conclusions - in particular I believe we essentially have most of the necessary infrastructure. Rather than thinking in terms of data being maintained in SQL DBs on or off the web, consider the web itself as the database. It really doesn't matter where the data is hosted as long as appropriate connections are in place. See http://en.wikipedia.org/wiki/Linked_Data

Because we want to lay down the duties of the people who host it for us, if we don't host it ourselves. (And most people won't host it themselves.)

If a host has just one duty, to keep the data safe and allow only authorized actions on it, then that is what the host will focus on. If the host is Facebook, whose duty is to find ways to monetize its database to maximize shareholder value, you will get a different result.

That's why the layers, and why for most people, not having too many hosts. (Some might decide to have multiple hosts to maintain multiple persona, such as a personal and work persona, but that's their decision.)

Good post.

It's clear that the data portability model is limited, but I would even go beyond hosting portability to service portability, and I would separate the hosting company from the value-added service provider.

First, the hosting model still leaves you at the mercy of the host if/when you decide to move... you still have to update your address at all of the service providers who might access the services of your hosting company. A more robust model would provide service portability through service discovery. Service providers don't need to know where the data is hosted, but rather where they can find the current host. That gives you a layer of indirection that lets you move your hosting company without needing to remember which service providers are currently relying on that host.

Second, you suggest that the hosting company's job is to perform actions on your data.

Why is that?

My DNS host doesn't perform functions on my DNS. Nor do I expect my webhost to perform actions on my website, although I do like to have a range of services I can easily install and run (such as installing WordPress through Fantastico).

I would argue that there is an inherent conflict of interest in the hosting company providing value-added services, and that in fact, what we should do is design an architecture where hosting is functionally distinct from value-added services. Any authorized value-add service provider should be able to access your data services, which leads to a cleaner architecture where companies that happen to provide both hosting and value-add services can do so with clear contracts and authorization.

Finally, there's no real need to have all of my services at the same hosting company, just like my DNS, my website, and my email can all painlessly be hosted anywhere I like. As long as the services can be discovered, there's no reason to have any individual's services centralized, nor a need to centralize many individuals' data in one place.

The collaborative/co-op type negotiating strategy you suggest can easily be implemented by a value-add service provider, independent of the source of the data. Users simply join that co-op or buying group and point the co-op to their discovery service. Flash-mobs rejoice.

All of which is to say, you are definitely going down the right path. More portability more better.

-j

Well, in the past we have used DNS as a way to name hosts and move them around. This could be used here, though not with a high-level domain of your own. I might get bradsdata.datadns.com as a subdomain that I can point at whatever data host I like. However, there could be other indirection or discovery protocols.
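As a tiny illustration of that indirection (the domain is hypothetical), a service provider resolves the personal subdomain each time rather than storing a host address, so repointing the record moves you to a new data host without notifying anyone:

```python
# Resolve the user's data subdomain at request time instead of storing a host.
import socket

def locate_data_host(user_data_domain: str) -> str:
    """e.g. 'bradsdata.datadns.com' -> the current data host's address."""
    return socket.gethostbyname(user_data_domain)
```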

Typically I see a main site providing your interface to social data apps. That site would embed other web pages which are served from your data host, running code provided by app providers. The DNS would direct your own browser at your own data host of choice. (In fact, this architecture allows the data host to be your own PC, if you don't need to roam. Your own PC at localhost:port would see a request for a social app window. The data host program on your PC would connect to the specified remote application's server for any code updates or special data, download them if it doesn't have them cached, execute the code and return the results to the iframe in the browser page.)

I think I'll update about that.

Now as for the central repository: This is complex. People are saying, it seems, that they don't want their data scattered around everywhere, partly because of lack of control, but more commonly because the UI to give apps access to it is too complex. If we can develop a good UI so that it is easy to give apps just the data they need, and no more, then scattering can be good. The data hosting model does not dictate scattering or centralization, but I agree that users will tend to centralize, just for ease of control. A central server contracted to me may be better than 30 servers with only loose bonds to me, such as 30 different social app companies each knowing different subsets of my data.

I really wonder how you can efficiently implement this use case:

"Show me the flickr photos that my friends and the friends of my friends faved in last 2 weeks, sorted by the total number of favs", while the photos are still stored on flickr.

I can see you can implement that with an "agent" that will crawl to the hosting sites of your friends and their friends, collect the data and come back. But that would be slow, especially if some of your friends host their data on PCs that are currently offline (remember The Eight Fallacies of Distributed Computing?). Do you have any solution for that in mind?

BTW, by creating a virtual machine you still provide an API to the applications. But it is quite broad, which means it is difficult to control security.

As I've noted, FoF apps turn out to be much less interesting than people thought at first. Do you really look at the photos of your FoFs? The main FoF app that seems to be useful is LinkedIn's "search your network," which can answer questions like "Who can I contact at Company X?" and handle dating introductions. FoFoF turns out to be surprisingly non-useful.

However, I won't proclaim that nobody can think of useful (or simply entertaining) apps here. So there need to be solutions, even if it turns out that those apps get access to large networks, but are the only ones that do. (Remember, our alternative today is zillions of apps getting access to this data.)

I don't expect home PCs to be required here. Everybody wants an always-on host for many functions. You don't need the home PC in the sense that I don't think "ask 100 hosts to search for a query" is a good implementation; rather, I would use the local hosts just as a way to do things efficiently for the user when the user is signed on. The cloud host would do things for others. Client data hosts and cloud data hosts would sync.

Your particular app, and other FoF apps, could be implemented, somewhat less efficiently, with data updates. That is to say, if you are using an FoF app, you would send changes to your friends, and they would forward those changes on to their friends as part of the update stream. In this case everybody is storing all the basic data (not big things like photos, just smaller stuff, including the URLs/access tokens of the photos) on their own host, and apps can operate on it. This is why it does not scale to FoFoF, but I think it could handle FoF if the updates are not large.
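A rough sketch of that update-forwarding idea, with a two-hop limit standing in for "friends of friends"; the structure and field names are illustrative, and a real system would also dedupe and avoid echoing updates back to their sender:

```python
# Small updates propagate at most two links, so each host accumulates its
# friends' and friends-of-friends' changes for apps to operate on locally.
from dataclasses import dataclass

@dataclass
class Update:
    origin: str       # pseudonymous ID of the user who made the change
    app_id: str       # only forwarded to friends who also run this app
    payload: dict     # small data only: fav counts, URLs/access tokens, not photos
    hops_left: int = 2

def receive_update(update: Update, my_store: list, my_friends: list[str]) -> None:
    my_store.append(update)                  # local apps can now query it
    if update.hops_left > 1:                 # forward once more, to reach FoF
        relayed = Update(update.origin, update.app_id,
                         update.payload, update.hops_left - 1)
        for friend in my_friends:
            send_to_host(friend, relayed)    # pushed into their update feed

def send_to_host(friend_id: str, update: Update) -> None:
    pass                                     # placeholder for host-to-host delivery
```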

Of course, you, your Fs and your FoFs must be running the same application, but that is the same as saying they are all members of Flickr. To implement Flickr itself you need more, though, and it may not even be possible to implement all apps in this manner. Still, it's better than implementing them all in a central repository manner.

The W5 project at MIT is looking at ways to solve these issues:

http://pdos.csail.mit.edu/~max/docs/w5.pdf

But I haven't studied the underlying systems proposed enough to judge if they can do it. One concern I see immediately regards whether developers can be talked into it. Part of the sex appeal of web 2.0 (meaning apps in the cloud) is that developers get free rein to write and maintain their apps using whatever platforms and tools they like. They are no longer limited to even the constraints and problems of writing code for a user's PC. Users at the same time love not having to install software, having somebody else maintain it all, and being able to roam.

My own proposals face this problem too. These abilities are very attractive to users and developers, and as long as they can get the functionality (which JavaScript has now made possible) they will rush to them.

It is for this reason that I have decided that some compromises will be needed, that we won't get to the level where we can safely run a malicious app on our data. That's because the programming hoops required to use a system that bars malicious apps may be too involved. Happy to be proven wrong, though. I would be happy just to reach the level where apps don't end up taking more data than they need, and don't end up storing copies of it.

Brad,

What you describe is what I've referred to as Data Spaces in the Clouds (Fourth Platform) for a while :-)

Yes, there is some confusion about the literal interpretation of the phrase: Data Portability (free movement of data across platforms). We don't necessarily want free movement of data across realms (unless we explicitly enable it in our space). Instead, I believe we seek Open Access to our Data Spaces with access control granularity.

To conclude, we do need Data Access by Reference facilitated by portable Data Containers (Data Spaces) in the Clouds. Of course, these containers can move themselves, or their data, from the clouds to other locations, wholesale or via replication and synchronization. In all cases using standard protocols and existing infrastructure such as the Internet and Web.

Links:

1. OpenLink Data Space Wikipedia Page
2. OpenLink Data Spaces (Open Source Edition) Home Page
3. My Data Space Profile Page
4. EC2 installation Guide
5. How to get a Gateway into your Data Space (i.e a URI for Your Data Space) in 5 minutes or less
6. Recent WWW2008 Presentation about Data Portability and Data Accessibility (PPT)

But it doesn't resolve -- perhaps nothing can -- the almost tautology that making it easier for outside programs to get at the data makes it easier for the data to flow outside. The challenge is how to serve two masters:

  1. Making it easy to program applications that use the data
  2. Keeping the data close to your vest

Brad,

Read your post at Scoble's blog after posting my own - yours must have come in about the same time.

I had to read (ok, quickly scanned) your post here, and find you and I are very much on the same page.

I am working with a group that wants to develop what it believes is the answer - perhaps we should talk.

Please contact me at your convenience,

Allan Sabo
Alti Success Strategies
Experts at Integrating Social Media and Internet Marketing

Send me an email!

I experiment with some of these concepts in ObjectCloud, my web server that's designed like an operating system. It acts as a form of a personal data store.

One way that I could accomplish running untrusted 3rd party code is with a simple in-browser Javascript sandbox and XHTML. There could be a way for my server to grab XHTML and Javascript from an untrusted source and filter it to a subset of trusted tags and scripts. I could then pass it various APIs for limited access to the users' data. (ObjectCloud has a rich Javascript API for many kinds of data.) The biggest challenge is some way to prevent malicious Javascript from generating script tags that let it communicate with APIs outside of the sandbox; thus leaking the users' data. (I know Yahoo has done some work in this area, but I don't know if it's applicable.)

One thing I envision is that when an application starts storing data about you on your own server, it might not just store the data but also offer classes which can be used to work with the data and abstract it. In fact, one can imagine arranging it so that the only way to get at most data is through the official interfaces of the classes, with a security model that prevents other approaches. This can allow things like logging access to data and so on. I don't know if Javascript can do this, perhaps it can.
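To illustrate that idea (in Python rather than the JavaScript being discussed, and with invented names), an application might ship an accessor class like this alongside its data, with the host's security model blocking any other path to the raw records:

```python
# Data is only reachable through the official interface, and every read is logged.
import logging

class FriendListView:
    """Hypothetical accessor stored with the application's data."""
    def __init__(self, raw_friends: list[dict], audit: logging.Logger):
        self._raw = raw_friends
        self._audit = audit

    def count_within(self, miles: float) -> int:
        self._audit.info("count_within(%s) called", miles)
        return sum(1 for f in self._raw if f["distance_miles"] <= miles)

    def emails_within(self, miles: float) -> list[str]:
        self._audit.info("emails_within(%s) called", miles)
        return [f["email"] for f in self._raw if f["distance_miles"] <= miles]
```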

Of course, just as there are translators from other languages to JS, there are implementations of JS for the JVM, though I don't know if they have fully caught up with everything in modern server side javascript.

Overall the goal is to provide important data access and sandbox features, but otherwise restrict the language choice of coders as little as possible.

I just shared this article with the guys at the Unhosted project, figured I might as well spread the link both ways.

Unhosted (www.unhosted.org) aims to realize a somewhat simplified version of this vision, creating a standard way to separate data storage from application logic in Javascript web-apps. It's a small step, but it's in the right direction.

As I mentioned to you in an e-mail a couple of months back, I'm working on a lower segment of the stack (www.pagekite.net), trying to give people a realistic way to run globally reachable servers on personal hardware, so they can actually self-host their sites (or just data) if they so choose.

Both projects would obviously benefit from advice, testing, advocacy or all three. :-)

Though I remain convinced that to get adoption beyond the true believers, the final result must match or exceed what people get from Facebook and similar sites in terms of ease of use, performance, abilities, and initially cost -- i.e. free.

Down the road I suspect people will pay for their own hosting to get better than what is offered free or free-with-ads. But it must be as easy to sign up as it is to sign up for Facebook, which is a high bar but one that must be overcome. Having to set up a server yourself is a non-starter.

Yes, for anything to take down Facebook et al, it has to actually be better, not just free-er.

However, I do disagree with the installing a server bit. It just has to be really easy.

People install software all the time - games, apps, networking tools like Skype. If anything, things like the Android and iPhone app stores are making this easier and more common.

The complexity of server software is not necessarily any higher than the complexity of something like a web browser or a word processing tool - making a server of some sort equally easy to install, use and keep secure/up-to-date is in no way impossible. Hard yes, but not impossible.

Arguably, companies like Skype are doing it already. They just don't tell people it's a server. :-P

The Skype trick has some merit, though some people are a bit bothered by it. In order to be superior to Facebook, it has to work from your mobile phone and laptop and desktop even when the others are off or disconnected from the net, just as Facebook and the rest do.

The question is, can you do that with the Skype supernode approach? With Skype, the calls go through the supernodes (and failed last week because of this) but they are encrypted end to end so the supernodes can't listen. If you want to rely on the servers of random people to provide hosting and processing for your data, then they could possibly look at the data and publish strange things they find -- a non-starter I think. You can't do it all with zero knowledge operations.
