
Why the OAuth 1.0 specification has a pretty big security flaw / IETF, get moving!!!

I have actually been a big fan of OAuth and its usage, until I read how it works yesterday!

Let me say that, while offering some security, OAuth is not safe for work at all!

Read the article about it at http://hueniverse.com/2009/04/explaining-the-oauth-session-fixation-attack/

This is not a small security hole, it’s a flaw by design!

I read that OAuth 2.0 is the upcoming version of OAuth and is already implemented, e.g. by Facebook’s Graph API and GitHub. However, the draft is still under development.

UPDATE: Having read the OAuth 2.0 specification, as far as I can see the problem has been resolved, because the spec now requires TLS for authorization requests and generates the token on the server and no longer on the client. The token could still be sniffed if it weren’t for TLS.

So what’s the issue with OAuth 1.0 all about?

How OAuth authorization works:

A consumer makes an authorization request to get access to the user’s data. (Actually it gets authorized to see some of the user’s data, not all of it.) To enable this, the consumer receives a so-called request token, which then gets exchanged for an access token plus an access token secret (both depending on the request token). This exchange does not happen securely or over an encrypted channel; the token is just appended to the URL.

How it should work:

The request token secret should be transmitted over a secure (encrypted with the provider’s public key) and authenticated (both sides know who is who) connection, similar to how the SSH handshake works.

The session fixation attack exploits exactly this flaw:

In the authorization workflow, the user authorizes (“signs”) a request secret that was generated on the client and appended to a request URL, so that the consumer is allowed to access the user’s data. That is OK under normal conditions, but the problem is that an attacker can read the request token secret in plain text (because it’s not transmitted safely).

Essentially, OAuth 1.0 is giving the master secret (~ the request token) to everyone.

Now even if a server developer restricts requests so that they may only originate from the client (= consumer), an attacker can still hijack this mechanism by making a request “through” the consumer using the request token, which gives him access to the result of the request!

Note: although the request token is not the shared secret, it directly “leads” to it, which makes it “equivalent” to the master key in many ways. Of course an attacker still has to go through the consumer site, which does the calculations necessary to access the user’s data for him (request token -> access token secret) plus the signing (= hashing over the consumer key and access token secrets), because the attacker has none of them, and you could build infrastructure that lets this be done only by authenticated consumers. But even if consumers are authenticated, as you can see, you can still go “through” the consumer and get the user’s data out!

Even worse, with IP and domain poisoning an attacker can trick the whole machinery (e.g. the service provider) into believing he is the consumer and do the whole access key + hash calculations himself, thus even getting hold of access keys, etc.

This attack just wouldn’t work if consumer and provider were authenticated to each other over a secure HTTPS channel.
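
To make the fixation scenario a bit more tangible, here is a toy simulation in JavaScript. It is only a sketch under my own assumptions: createProvider, issueRequestToken, authorize, exchange and the provider.example URL are made-up names, there is no real HTTP, signing or crypto involved, and the consumer is folded into the script for brevity. All it shows is that nothing ties the request token in the URL to the person who started the flow.

```javascript
// Toy model of the OAuth 1.0 request-token flow (no real crypto or HTTP).
// Every name here is hypothetical and only serves the illustration.
function createProvider() {
  var requestTokens = {}; // request token -> user who approved it (or null)
  var counter = 0;
  return {
    issueRequestToken: function () {
      var token = "rt-" + (++counter);
      requestTokens[token] = null;
      return token;
    },
    // The token arrives as a plain URL parameter; the provider cannot
    // tell who originally asked for it.
    authorize: function (token, user) {
      if (!(token in requestTokens)) throw new Error("unknown request token");
      requestTokens[token] = user;
    },
    exchange: function (token) {
      var user = requestTokens[token];
      if (!user) throw new Error("request token not authorized");
      delete requestTokens[token];
      return { accessToken: "at-" + token, grantsAccessTo: user };
    }
  };
}

var provider = createProvider();

// 1. The attacker starts the flow at the consumer and keeps the authorization URL.
var attackerToken = provider.issueRequestToken();
var authorizeUrl = "https://provider.example/authorize?oauth_token=" + attackerToken;

// 2. The victim is tricked into opening that URL and approves the request.
var tokenFromUrl = authorizeUrl.split("oauth_token=")[1];
provider.authorize(tokenFromUrl, "victim");

// 3. The attacker goes back to the consumer, which exchanges the very same token.
var grant = provider.exchange(attackerToken);
console.log(grant.grantsAccessTo); // "victim"
```

The access token that comes out of step 3 belongs to the victim’s data, although the attacker never knew any of the victim’s credentials.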


Google Percolator, Core data, CouchDB and the real time indexing chain

It was a really great read today to see how Google is revamping its search infrastructure to get even more real-time search results.

http://www.theregister.co.uk/2010/09/24/google_percolator/

I think Google’s instant search together with Percolator is intended to offer real-time updates to Google’s index.

A Google bot triggers such an update (effectively a pull request) and then pushes it through the Percolator pipeline, so that results show up in the index nearly immediately.

This is needed for data where real time is important, like news, Twitter feeds, etc.

I think the more elaborate information becomes, the less important real time gets!

However, it is great to see Google innovate and always be one step ahead of its competitors.

The problem is that if small portions of a document change, especially with regard to links to other web pages, MapReduce has to process the whole data set again to update the link counts in other documents for Google’s PageRank algorithm. Such link counts represent a sort of “dynamic” data in the database, which is why they do not get updated automatically by the MapReduce algorithm.

What’s happening here is that some objects in the data pool contain “aggregate” data about other objects.

The way Percolator does these updates, using what they call “observers”, reminds me a lot of how you can subscribe to update events in Cocoa’s Core Data using the Key-Value Observing mechanism.

Imagine you have an application with a departments table and an employees table. Let’s say you want the number of employees in a department to be stored as a property of the department. In Core Data you can subscribe the department’s attribute to get notified when the employees change, thus automatically updating the departments table as well.
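
The departments/employees idea can be sketched in a few lines of plain JavaScript. This is just a hypothetical in-memory model (createDepartment, observeEmployees and addEmployee are names I made up); it is neither Core Data’s KVO API nor Percolator’s observer API, it only shows the pattern of an aggregate property being kept up to date by a change notification.

```javascript
// Hypothetical in-memory model, not Core Data's or Percolator's API.
function createDepartment(name) {
  return { name: name, employees: [], employeeCount: 0, observers: [] };
}

// Register a callback that runs whenever the employee list changes.
function observeEmployees(department, callback) {
  department.observers.push(callback);
}

// Changing the employees notifies every registered observer.
function addEmployee(department, employee) {
  department.employees.push(employee);
  department.observers.forEach(function (notify) {
    notify(department);
  });
}

var sales = createDepartment("Sales");

// The observer keeps the aggregate property in sync with the employee list.
observeEmployees(sales, function (department) {
  department.employeeCount = department.employees.length;
});

addEmployee(sales, "Alice");
addEmployee(sales, "Bob");
console.log(sales.employeeCount); // 2
```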

What Google did here, however, is integrate this step completely at the database level, which is really where it belongs!

The above is particularly interesting, as I was thinking of using CouchDB for my upcoming project, which will use real-time data, and I wanted to know whether the limitations of MapReduce tables described in the above article are generic to every database and therefore also exist in CouchDB.

The short answer is unfortunately: YES!

There is no database (RDBMS or document-based) I know of that doesn’t have this limitation. (Correct me in the comments section if I am wrong.)

So first let’s take a look at CouchDB’s implementation of MapReduce:

I read up on CouchDB’s view index update implementation, and the important thing to note is that views are updated lazily and incrementally.

How CouchDB works:

CouchDB is essentially just a big set of unrelated data (documents), on which it imposes so-called “views”, which are B+trees over that data. The B+trees are the efficient way to access and search the data. They are “paths” into the data. You can look up data from different points of view (hence the term “view”), e.g. by location, time, etc.

Now if new documents get added to CouchDB, the index gets updated incrementally: CouchDB first looks for changes to the DB that arrived since the last view calculation and then updates the B+tree.

Now does this mean that our data will get updated automatically?

No, this just means that the B+tree gets updated; the documents themselves remain unmodified!

This means that CouchDB has this limitation too.
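
Here is a toy model of that behaviour in JavaScript. Take it as a sketch under my own assumptions, not as CouchDB’s actual code: createDatabase, saveDoc, createView and queryView are hypothetical helpers, the “index” is a plain object instead of a B+tree, and emit is passed to the map function as a second argument (in CouchDB itself it is a global function).

```javascript
// Toy model of a lazily and incrementally updated view.
// All names are hypothetical; this is not CouchDB's actual code or API.
function createDatabase() {
  return { docs: {}, changes: [] }; // changes: doc ids in the order they were written
}

function saveDoc(db, doc) {
  db.docs[doc._id] = doc;
  db.changes.push(doc._id);
}

function createView(mapFn) {
  return { mapFn: mapFn, rowsByDoc: {}, lastSeq: 0 };
}

// Querying is what triggers the (lazy) index update.
function queryView(db, view) {
  // Incremental: only look at changes that arrived since the last query.
  for (var seq = view.lastSeq; seq < db.changes.length; seq++) {
    var id = db.changes[seq];
    var emitted = [];
    view.mapFn(db.docs[id], function (key, value) {
      emitted.push({ id: id, key: key, value: value });
    });
    view.rowsByDoc[id] = emitted; // replaces rows from an older version of the doc
  }
  view.lastSeq = db.changes.length;

  // Flatten and sort by key (a real B+tree keeps the rows sorted all the time).
  var rows = [];
  Object.keys(view.rowsByDoc).forEach(function (docId) {
    rows = rows.concat(view.rowsByDoc[docId]);
  });
  return rows.sort(function (a, b) {
    return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
  });
}

var db = createDatabase();
var employeesByDepartment = createView(function (doc, emit) {
  if (doc.type === "employee") emit(doc.department, doc.name);
});

saveDoc(db, { _id: "dept-sales", type: "department", name: "sales" });
saveDoc(db, { _id: "emp-1", type: "employee", name: "Alice", department: "sales" });
queryView(db, employeesByDepartment); // indexes the first two changes

saveDoc(db, { _id: "emp-2", type: "employee", name: "Bob", department: "sales" });
var rows = queryView(db, employeesByDepartment); // only indexes the one new change

console.log(rows.length);           // 2
console.log(db.docs["dept-sales"]); // the department document itself is unchanged
```

The view knows about both employees, but the department document never gets an employee count written into it. That is exactly the limitation described above.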

I am really looking forward to seeing someone implement Percolator in CouchDB!


YQL website scraping with pagination

Recently I wrote about pagination and why the way it is implemented in major websites represents a major UI design mistake.

While the idea of limiting the results per page is a good one, as it greatly reduces load times and server load, it makes websites inefficient and complicated to use.

Today I want to talk about a further drawback of pagination, especially when it comes to retrieving information from websites that don’t offer RSS feeds or any other form of machine-readable data. The process of retrieving such information is called website scraping.

The use case:

I recently needed to get the email addresses of all doctors in my area. While public yellow pages exist that offer such information, they all present their results in chunks of about 10 per page.

This is obviously a problem if you have more than a hundred results and each one has the email address on a separate details page.

Well, if you are like me and don’t want to waste your time going through each of them one by one, but instead prefer spending your precious time on finding out how to get this stupid job done by a computer, then hold on. Here is the solution:

The solution:

This is where YQL comes into play.

If you don’t know YQL yet: it is a free web service offered by Yahoo, which essentially lets you scrape any website with a familiar SQL-like query syntax, using Yahoo’s botnet to get your data ;)

Let’s look at a simple query statement.

SELECT * FROM html WHERE url = 'http://www.amazon.de' AND xpath='descendant-or-self::title | descendant-or-self::meta'

This returns all title and meta elements in the amazon.de website. You can try this query out here.

What makes YQL even cooler is that you can create your own tables, which Yahoo calls Open Data Tables. These tables can run custom functions on Yahoo’s servers to make the queries even more sophisticated. Imagine it as some sort of stored procedure for YQL. The cool thing about this is that you can write such functions in JavaScript, then upload them to your server, and Yahoo will do the rest for you. This makes this tool a super-potent-nuclear-weapon scraper.

This is what I needed to get the pagination under control:

First let’s see what a query looks like:

select * from {table} where url="http://a-paginated-website.com/page1" AND xpath="//table[@class='results']" and paginationXPath="//li[@class='next']"

You first select the custom table (in my case the table is called otb; this is defined in the XML file that describes the table). Then you pass the following parameters to the select:

  • url = the URL of the first page you want to scrape
  • xpath = an XPath describing the content you are interested in (if XPath is unfamiliar to you, check out the great pages about XPath at w3schools.com)
  • paginationXPath = the XPath leading to the pagination link on the website

Now let’s have a look at the JavaScript for the YQL table:

// url, xpath and paginationXPath are the parameters passed in by the query
var paginationUrl = url;
var responseData = [];
var data;

while (true) {
	// fetch the current page and append the nodes matching xpath
	data = y.rest(paginationUrl).accept('text/html').get().response;
	responseData = responseData + y.xpath(data, xpath);

	// look for the link to the next page
	var x = y.xpath(data, paginationXPath);
	if (x.length() != 0) {
		paginationUrl = y.xmlToJson(y.tidy(x)).html.body.a[0].href;
		y.log(paginationUrl);
		if (paginationUrl.indexOf('http://') == -1) {
			y.log('not a link'); // but continue
		}
	} else {
		// no further 'next' link: hand everything collected so far back to YQL
		var XMLresponse = y.tidy(responseData).body.a;
		response.object = {XMLresponse};
		break;
	}
}

What this does is the following: 

  1. starts an endless loop
  2. makes a request to the url provided in the select statement
  3. gets the content selected by the xpath and concatenates it with any result from previous runs
  4. gets the pagination link specified by paginationXPath
  5. does some logic to ensure, that we actually found a valid link to the next page
  6. loops through all pages that have a valid ‘next’ link
  7. when there is no further link, writes all data into the response object needed by YQL to return results
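
If you want to play with the loop logic outside of YQL, the steps above can be sketched in plain JavaScript. Note the assumptions: fakeSite, fetchPage and scrapeAllPages are hypothetical names, the “website” is just an object in memory, and I added a maxPages limit and a visited check (neither is in the YQL table above) so that the sketch can never loop forever.

```javascript
// Hypothetical stand-in for a paginated website: url -> page content.
var fakeSite = {
  "http://example.com/page1": { items: ["a", "b"], next: "http://example.com/page2" },
  "http://example.com/page2": { items: ["c"], next: "http://example.com/page3" },
  "http://example.com/page3": { items: ["d", "e"], next: null }
};

// Stand-in for the HTTP request; returns undefined for unknown URLs.
function fetchPage(url) {
  return fakeSite[url];
}

// Follows "next" links starting at firstUrl and collects all items.
function scrapeAllPages(firstUrl, fetch, maxPages) {
  var results = [];
  var visited = {};
  var url = firstUrl;
  var pageCount = 0;

  while (url && !visited[url] && pageCount < maxPages) {
    visited[url] = true;
    pageCount++;

    var page = fetch(url);
    if (!page) break; // request failed or page does not exist

    // concatenate this page's content with the results of previous runs
    results = results.concat(page.items);

    // only follow the link if it looks like an absolute URL
    url = (page.next && page.next.indexOf("http://") === 0) ? page.next : null;
  }
  return results;
}

var allItems = scrapeAllPages("http://example.com/page1", fetchPage, 100);
console.log(allItems.join(",")); // a,b,c,d,e
```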

That’s it! We can now scrape through hundreds and hundreds of pages … and get back beautiful JSON!

You can download the full definition of the data table here and run a full query here! Have fun ;)

P.S. There are some considerations you should keep in mind while using this:

  • YQL spiders run under the name “Yahoo Pipes 2.0”
  • when a request is made, YQL looks in the robots.txt file on the server to see whether scraping of the website has been disabled. If it is disabled, YQL will not scrape the page!
  • YQL’s timeout is set to 30000 ms. If your query exceeds this time limit it will get aborted. I have no idea if it is possible to circumvent this!

UI Design: Modal dialogs and Lightboxes: A common UI-Design mistake in websites and some workarounds

I wrote about pagination as a common UI design mistake in my last post. This time I want to talk about another common design mistake: Modal Dialogs a.k.a. Popups.

Modal dialogs (e.g. lightboxes)

One common mistake in web and UI design is modal popups done with DHTML techniques, the most common being some form of Lightbox, Slimbox, etc.

What’s wrong with popup techniques?

Popups are always obtrusive, as they overlay the user’s screen and the user’s interaction with the page. They are a clear statement to the user.
The statement is: “I am not part of this website!”


I personally know just one good reason to use this technique: when you actually want to tell the user that something is not part of your website.
For example, if you have to bring the corporate identity of a 3rd party website to your website. An example: let’s imagine you have to authenticate your user with a common authentication service (e.g. Facebook Connect or Twitter) or send him to a 3rd party payment service (e.g. PayPal). In this situation you want to give a clear statement to the user: “Hey, this is the official Facebook site!” or “You are now redirected to PayPal!”.

In all other cases there are other, much more effective techniques to use, such as fixed or slide-out panels, menus, tabs, accordion panels, etc. I personally like everything that fades in and out with some cool animation, but that’s a matter of taste. What is more important is picking the right size for your panel; it can even be fullscreen!

The most common mistake I see is when websites want to display images or videos in a gallery.

Normally websites display small versions of images somewhere within the website and when the user clicks on such an image, present a larger version of the image to the user. And where do they show it? In a lightbox! Don’t do that! This is a complete fail for 2 reasons:

Not only does a lightbox waste precious screen real estate, which could be used to display details of an image, it also makes the rest of the page unusable. Moreover, many lightbox implementations only close when you click on a small close button, whereas they should close as soon as the user clicks somewhere outside the dialog!

What I want to say is that lightboxes have nearly no benefit at all.
Your user wanted to see the details of an image and you just showed him half of it, while disabling the whole page!

Remember: if a user clicks on a detail, he wants the page and controls out of the way and just wants to see as much of the picture/video as possible. So you want to give the picture or video as much space as possible, or as much as the user has available, and display it to him in fullscreen mode.
UPDATE: Sideways jQuery Plugin is a good example of how to do image galleries right.


There is one exception to this, which is previews. Previews work differently, in that the user wants a big representation but also wants to be able to switch back to the website immediately. But again, you shouldn’t display previews in a lightbox, because when the user opens a preview he wants to keep control over the rest of the page. So one good way to display previews is in a non-modal, tooltip-like fashion and not in a lightbox!

The worst things you can do with modal popups:


The worst thing you can do with lightboxes is to put them into scroll views or scrolling divs. Interestingly, even Google makes this mistake on its otherwise well-designed Google AdWords page.
[Screenshot: Google AdWords bad UI design]
Huh? Where are the controls here? How do you close this window?
[Screenshot: Google AdWords]

Ahh, the close button scrolled off the page ;)

Please: Don’t use modal dialogs in scroll views! Instead use fullscreen centered dialogs, and if not all controls fit in the window, don’t let the dialog scroll with the page but instead give the dialog its own scrollbars and keep the close button fixed on screen, so it cannot scroll off the screen like in the above example!

Make the web a better place, by adopting these techniques!


UI Design: Pagination: A common UI-Design mistake in websites and some workarounds

Today I would like to introduce to you a very common UI design mistake, still encountered in many major websites today.


Pagination:

While pagination itself is a good idea and actually a very common technique to save load time and reduce server load, the way it’s implemented on many sites nowadays is just not user friendly at all.

For anyone who doesn’t know what pagination on websites is, here is an example:

[Image: Google pagination links]

Pagination is a way of applying a limit and an offset to the items fetched, e.g. from a database, and presented to the user.
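
As a minimal sketch of what limit and offset mean, assuming the items are already sitting in an array (paginate and pageOffset are hypothetical helper names, and pages are counted from 1):

```javascript
// Returns at most `limit` items, starting at position `offset`.
function paginate(items, limit, offset) {
  return items.slice(offset, offset + limit);
}

// Offset of the first item on a given page (pages are counted from 1).
function pageOffset(pageNumber, limit) {
  return (pageNumber - 1) * limit;
}

// 25 fake search results
var results = [];
for (var n = 1; n <= 25; n++) {
  results.push("result " + n);
}

var limit = 10;
var page3 = paginate(results, limit, pageOffset(3, limit));
console.log(page3); // the last page only holds "result 21" to "result 25"
```

A database would do the same thing with a limit and an offset in the query instead of slicing an array.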

The problem with pagination is that the user has to reload the whole page and thus has to deal with long response times, waiting for new data and doing nothing. This completely breaks the user experience, which should be much more fluid and dynamic.

So how can we do pagination right?

A common technique used today to do pagination is AJAX lazy loading. (I will explain this technique in detail in another blog entry, so stay tuned ;)). This technique is used e.g. by Tumblr.com to load blog entries.

While I don’t see any good reason for web designers today not to use such techniques, they are still not very widely adopted. The AJAX and DOM manipulation needed is actually available nearly everywhere, even in very old browsers, and even if users have JavaScript turned off, sites can provide fallbacks to classical pagination links.


Most notably, even big sites like Google and Amazon use old pagination techniques. But this doesn’t make them more acceptable! Instead it’s just a sign of how old these pages actually are and how slow innovation is moving towards…

Fortunately there are workarounds available in form of browser extensions:
One of them is called AutoPagerize, and is written by a Japanese guy. (Fortunately Google Chrome has the ability to translate Japanese pages into something I can read and understand too; I really love to read the twitter feed of this guy in Japanese!!)

The extension is available for Chrome/Safari and Firefox.

With the extension installed you can now do AJAX pagination in all major sites, including Google, Amazon, etc.

It’s one of those rare extensions that, once installed, give you this “Man, how could I live without it till now?” feeling. You “feel” the difference! Well, that’s what good UI design is about after all!

The way the extension works is that it comes with a preset of so-called “SiteLinks”. SiteLinks are XPath descriptions which define, for a certain page, where to find the pagination links.

What’s even cooler is that you can provide further SiteLinks to AutoPagerize over the WeData.net wiki. WeData.net also provides an API to post new SiteLinks to it in the form of key/value pairs. So if a page has no pagination info available yet, why not write one of your own?



Inspecting cookies using Web Inspector’s / Firebug’s console

Ever wanted to know which cookies are set for your domain?

Modern browsers usually let you see the cookies saved on your system.

Google Chrome even has a very detailed list under “Preferences / Show Cookies and other data for websites”, which also gives you the possibility to see if a site uses HTML5 LocalStorage.

However, there is a simpler trick: just open up Web Inspector in Google Chrome/Safari or Firebug in Firefox and type “document.cookie” in the console window.
This will show you all cookies for the current website that are accessible to JavaScript (cookies flagged HttpOnly are not listed).
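
The value you get back is one long string of name=value pairs separated by semicolons. If you want something more readable, a small helper can split it up. This is only a sketch: parseCookies is a name I made up, the sample string is invented, and the helper assumes the names and values are URL-encoded the usual way.

```javascript
// Turns a cookie string such as document.cookie into a plain object.
function parseCookies(cookieString) {
  var cookies = {};
  if (!cookieString) return cookies;

  cookieString.split(/;\s*/).forEach(function (pair) {
    if (!pair) return; // skip empty fragments
    var idx = pair.indexOf("=");
    var name = idx === -1 ? pair : pair.slice(0, idx);
    var value = idx === -1 ? "" : pair.slice(idx + 1);
    cookies[decodeURIComponent(name)] = decodeURIComponent(value);
  });
  return cookies;
}

// Made-up sample in the same format that document.cookie uses.
var sample = "session=abc123; theme=dark; greeting=hello%20world";
var parsed = parseCookies(sample);
console.log(parsed.theme);    // "dark"
console.log(parsed.greeting); // "hello world"
```

In the browser console you would call parseCookies(document.cookie) instead of passing the sample string.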

Cool, isn’t it?


Let’s start this blog

function foo() {
  var me = {
    id: "dotmaster",
    age: 35,
    nationality: "Austria",
    twitter: "www.twitter.com/dotmaster",
    properties: [
      "Entrepreneur",
      "Manager",
      "AJAX Geek",
      "Technologist",
      "Awesome Coder;)"
    ]
  };
  return "Hello World!";
}


Guide to UI Design for Google TV: How Google TV Apps will introduce a whole new world of HTML5 and AJAX apps for TV

I kept thinking today about how Google TV will actually revolutionize the TV market and came to the conclusion that this will end up in something similar to what we are seeing happen on the iPhone.

I think we will see a whole new bunch of AJAX geekiness come together with Google TV and their concept of enhancing the UI with apps for TV.

Have a look at http://www.google.com/tv/spotlight.html just to see a few concepts of what we can expect to come up next.

I thought about what UI design for a TV might have to look like and came up with these few principles:

1. TV UI Design Principle No.1: Apps will have to split UI and control: control may reside on a mobile device with a touchscreen or on a computer, while the UI will be the television screen.

2. TV UI Design Principle No.2: The main screen being the television, apps will have a lot of “screen real estate”, while having to provide much bigger buttons than websites on a computer, because TV screens are normally far away. (Yeah, I see Web 2.0 buttons coming to the TV next to you ;)

This means that we will see many apps that look more like they do on an iPad today than on the iPhone.

3. TV UI Design Principle No.3: Rich remote controls: I think interacting with the TV will be very similar to how you use the iTunes Remote app, the remote control being a fully fledged web app too, with possibilities to search, select and switch TV programs as well as bring parts of the UI to the remote control itself.

In fact the TV will be like the extended arm of your mobile device.

Use your Facebook app, as long as you are not in front of the TV. As soon as you get near your TV, connect to it and see the whole navigation happen on your TV, while still controlling the Facebook app on your phone. I really do like that thought!

4. The API

I think some SDK will be necessary to bring in some important features that the browser is lacking right now, especially access to the TV and audio devices!

I wouldn’t be surprised to see this already integrated in Google TV’s version of “WebKit”?!

Hopefully Google will choose to make Google TV an open platform, like they did with Android!

UPDATE: So Google TV is actually based on Android, according to this article http://phandroid.com/2010/10/21/google-tv-root-first-signs-possibilities/ and the video inside.


So while I am still waiting and am curious about what this device will really be like, I recommend that you developers out there warm up your fingers and get ready to start doing some great stuff with Google TV!