Tuesday, February 2, 2010

WURFL to Perl Data Script

WURFL to Perl is an alternative way of using the WURFL XML mobile device data file in smaller web projects, and it is something I have been working on for a while. Instead of parsing the complete WURFL XML file or loading the data into a database platform like MySQL, WURFL to Perl converts the WURFL data into Perl code. That is, we use the WURFL device data as Perl hashes, together with a configuration file.

This discussion assumes you have some familiarity with WURFL XML device detection data. If not, please see http://wurfl.sourceforge.net/ for references and downloads of the actual WURFL XML data. Usage and adaptation are, of course, heavily dependent on your ability to program in Perl too.

Let's Get Small


The first step is to condense the WURFL data down to just what we need for our particular project. WURFL data is arguably the best and most complete reference for mobile device capabilities and, as such, is a very large XML file. Large projects may need to have all device information readily available, but many times a project just needs certain device capabilities exposed. Below is a snippet from our configuration file which shows several configuration settings.

# Our configuration variables...
$config_vars = {

  wurfl_path => './lib/WURFL/',
  wurfl_file => './lib/WURFL/wurfl.xml',
  wurfl_status => "WURFL Status: OK!\n",
  wurfl_select => [
    'ajax', 'bearer', 'cache', 'css', 'display', 'image_format', 'markup', 'rss', 'transcoding', 'xhtml_ui'
  ],                          # WURFL groups to include
  wurfl_min_width => 169,  # Minimum resolution width to accept (>=)
  wurfl_min_xhtml => 1,    # Minimum xhtml markup level to accept (>=)
  wurfl_dump_indent => 0   # Sets 'Data::Dumper' indentation level, (0 is none)

};

The 'wurfl_select' hash key is probably the most important, as it specifies which WURFL 'group' data types will be included in the final WURFL to Perl code. I added 'wurfl_min_width' and 'wurfl_min_xhtml' as they were settings which were of interest in my projects. Others can be added here as applicable. The advantage is that data not required will be filtered away, making the Perl code smaller.

The configuration file also includes code for 'encoding' common string segments of 'user agent' text into abbreviated ones. Here is a sample:

# Sub to encode common long string segments in user agents
sub user_agent {

    # Input a user agent string
    my $ua = shift;
    
    # Remove underscores and all non-word characters
    $ua =~ s/_//g;
    $ua =~ s/\W+//g;

    # Encode many common long strings
    $ua =~ s/^3GSonyEricsson/3gse_/;
    $ua =~ s/^SonyEricsson/snye_/;
    $ua =~ s/^ACSNF30NE/ac31_/;
    $ua =~ s/^AUDIOVOX/audx_/;
    $ua =~ s/^Alcatel/alcl_/;
    $ua =~ s/^BlackBerry/blck_/;
    $ua =~ s/^Compal/cpl_/;
    $ua =~ s/^DoCoMo/dcm_/;
    ....
    ...
    ..
    .

We also encode device capabilities for brevity:

# Hash of abbreviations for WURFL standard capabilities
$abbrev_names = {

    'ajax' => {
        'ajax_manipulate_css' =>  'css',
        'ajax_manipulate_dom' =>  'dom',
        'ajax_support_getelementbyid' => 'byid',
        'ajax_support_inner_html' =>  'inner',
        'ajax_support_javascript' =>  'js',
        'ajax_support_events' =>  'events',
        'ajax_support_event_listener' => 'listen',
        'ajax_xhr_type' =>   'xhr'
    },

    'bearer' => {
        'max_data_rate' =>  'rate'
    },
    ....
    ...
    ..
    .


The configuration file is important because we use it both to define and to access our WURFL to Perl code. My 'perl_wurfl.cgi' script is used to create your version of the WURFL to Perl code, based on your configuration settings. The 'perl_wurfl.cgi' script will massage and manipulate the latest WURFL XML data file, encoding common strings, filtering and condensing it down. The finished WURFL to Perl example files (included in our download) show a reduction of approximately 10:1.

Seeing is Believing



There is a lot going on here and the best way to see it is to install and run the 'perl_wurfl.cgi' script. Watch as it steps through each part of the process and you will get a feel for how it works and how much it is doing. Please note: the 'perl_wurfl.cgi' script will be very slow at first because of the 14+ MB size of the WURFL XML data file. So be patient while it is running. It also likes lots of memory ;)



The 'perl_wurfl.cgi' script does some fancy Perl comparisons and produces shortened 'unique' encoded user agent based hash keys. This is the reason your application will need to use the same configuration file as the 'perl_wurfl.cgi' script. From these Perl hash keys, we seek a 'best match' to the incoming user agent of the website visitor. This all works reasonably well because we use a combination of brute force searches (Regex) and textual data comparison.




From this point on, our process works much the same as other WURFL applications. We roll up the capabilities by combining each successive 'fall back' user agent (device id in WURFL terms) and provide a best guess of its current browser capabilities in Perl hash form. The difference is that we use textual data comparison in the final device detection stage instead of simple user agent string matching. This provides better detection for new user agents that haven't been incorporated into the WURFL data yet.
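The roll-up step can be sketched roughly as follows. This is only an illustration, written in JavaScript rather than Perl, and it is not the code from 'site_wurfl.cgi'. The device ids, the 'caps' key, the capability values and the function name rollUpCapabilities are all made up for the example; only the idea of following each device's 'fall_back' link comes from WURFL.

```javascript
// Hypothetical device table: the ids and capability values are made up,
// only the 'fall_back' chaining mirrors how WURFL links devices.
const devices = {
  generic:        { fall_back: 'root',           caps: { xhr: 'none', js: false, width: 90 } },
  vendor_generic: { fall_back: 'generic',        caps: { js: true, width: 176 } },
  vendor_model_x: { fall_back: 'vendor_generic', caps: { xhr: 'standard', width: 240 } }
};

// Walk the fall_back chain from the matched device up to the most generic
// entry, then merge so that more specific devices override generic values.
function rollUpCapabilities(deviceId, table) {
  const chain = [];
  const seen = new Set();            // guards against a circular fall_back
  let id = deviceId;
  while (id && table[id] && !seen.has(id)) {
    seen.add(id);
    chain.push(table[id]);
    id = table[id].fall_back;
  }
  const result = {};
  // Apply the most generic entry first, the matched device last.
  for (const entry of chain.reverse()) {
    Object.assign(result, entry.caps);
  }
  return result;
}
```

Calling rollUpCapabilities('vendor_model_x', devices) on this made-up table gives the generic values with the vendor and model settings layered on top, which is the 'best guess' capability hash described above.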

Give it a Try



Also included in my download is the 'site_wurfl.cgi' script. This is actually the more important of the two. My 'site_wurfl.cgi' script demos our WURFL to Perl data capabilities, while giving a realistic example of how it might be integrated into other Perl scripts. The 'perl_wurfl.cgi' script is only run when you download and update to the latest WURFL XML data file. Thus 'perl_wurfl.cgi' is just a 'helper' script and should not be given any website user access of any kind. I update locally (yes, my 4+ year old Toshiba laptop runs it just fine), which is the safest way to go.


Use the included 'site_wurfl.cgi' script as a guide to how we access our WURFL to Perl code. It is also a quick and convenient way to test configuration changes or new devices against current WURFL to Perl code. Or just use a 'User Agent Switcher' in your favorite browser and watch the process at work.


For more info, please see:


Demo the 'site_wurfl.cgi' script: Site-WURFL
Download Site-WURFL w/ WURFL-Perl code: http://code.google.com/p/mobilesiteos/
See WURFL to Perl in action on our website: OpenSiteMobile



Monday, December 7, 2009

How to Get Character Widths for a Browser's Font Size

A few days back, I blogged about "how to force string wrapping" for mobile device display. In order to know when to break a string of text, we would like to know how many characters of text fit into any given html element on the page. Simple, right? If you know an element's width, as I usually do since I use a fixed width 'body' div element and can make a reasonable guess as to what each child element's width will be, all I need is the width of each character. That way, I can find the number of characters per line for that particular element and break accordingly.

So how do you find the width of each character? Now we have a problem. There is no way to know before a page is rendered what the width of a character is for a given font, font size, font style, etc. in a given user's browser. Remember, we just specify what font and font size we want the browser to use. The browser can and will use whatever font and font size it deems appropriate. This is especially true for mobile devices, which vary greatly in screen size, available fonts, available font sizes and browser software.

That said, in many of my scripts I use a set character width vs font size relationship hash:

my %character_width = (
 'font_large' => 13,
 'font_medium' => 10,
 'font_small' => 8,
 'font_x_small' => 6,
 'font_xx_small'=> 5,
);

...where 'font_large' is the character width in pixels for a style specification of font-size:large, 'font_x_small' is for font-size:x-small, etc. This actually works reasonably well in many cases. Sometimes your text will push outside its container element, but on average it looks ok.
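To make the arithmetic concrete, here is a minimal sketch of how a table like this can be turned into a characters-per-line figure and then used to break text. It is written in JavaScript for illustration and is not taken from my scripts. The names characterWidth, charsPerLine and wrapText are hypothetical, and the greedy word wrap is just one simple way to do the breaking.

```javascript
// Character widths in pixels per font size (same values as the hash above).
const characterWidth = {
  font_large: 13,
  font_medium: 10,
  font_small: 8,
  font_x_small: 6,
  font_xx_small: 5
};

// Hypothetical helper: how many characters fit on one line of an element.
function charsPerLine(elementWidthPx, fontKey) {
  const width = characterWidth[fontKey];
  if (!width) throw new Error('Unknown font key: ' + fontKey);
  return Math.max(1, Math.floor(elementWidthPx / width));
}

// Hypothetical helper: greedy word wrap that returns an array of lines.
function wrapText(text, maxChars) {
  const lines = [];
  let current = '';
  for (let word of text.split(/\s+/).filter(Boolean)) {
    // Hard-break any word that cannot fit on a line by itself.
    while (word.length > maxChars) {
      if (current) { lines.push(current); current = ''; }
      lines.push(word.slice(0, maxChars));
      word = word.slice(maxChars);
    }
    if (!current) {
      current = word;
    } else if (current.length + 1 + word.length <= maxChars) {
      current += ' ' + word;
    } else {
      lines.push(current);
      current = word;
    }
  }
  if (current) lines.push(current);
  return lines;
}
```

For a 240 pixel wide element at font-size:medium, charsPerLine(240, 'font_medium') works out to 24 characters, and wrapText(someText, 24) then gives the lines to join with whatever break markup the page uses.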

But if you want a more accurate measure of character width for each user, you can try this technique. What we do is render a page and let JavaScript tell us what character width is being used for each font size. Here is the html we use in the polling page:
....
<div>
    <span id="xl" style="font-size: x-large;">Open Source</span>
    <span id="lg" style="font-size: large;">Open Source</span>
    <span id="md" style="font-size: medium;">Open Source</span>
    <span id="sm" style="font-size: small;">Open Source</span>
    <span id="xs" style="font-size: x-small;">Open Source</span>
    <span id="xx" style="font-size: xx-small;">Open Source</span>
</div>
....

Pretty basic. The surrounding div is usually set to style='visibility:hidden;', but I left it visible here for demo purposes. As you can see, just a simple line of text of known character length (11 characters in 'Open Source') for each font size span tag. And here is the JavaScript which sends character width information to your script:
....
function init_page(){
  var size_output = '';
  var site_url = 'http://your.website.com/cgi-bin/your_script.cgi?';
  var width_array = [ 'xl', 'lg', 'md', 'sm', 'xs', 'xx' ];
    
  for(var i=0; i<width_array.length; i++) {
    size_output += '&' + width_array[i]
      + '=' + get_character_width(width_array[i]);
  }
  size_output = size_output.substring(1);   // drop the leading '&'

  // Send info to your script...
  window.location = site_url + size_output;
}

// Helper functions...
 
function get_character_width(el_id) {
    var output = 0;
    var element = document.getElementById(el_id);
    var text = element.innerHTML;

    if (typeof element.clip !== "undefined") {
       output = element.clip.width || 0;
    } else {
        if (element.style.pixelWidth) {
            output = element.style.pixelWidth || 0;
        } else {
            output = element.offsetWidth || 0;
        }
    }
    output = round_up_down(output/text.length, 0);
    
    if (output < 5) output = 5;   // practical limit is 5px per character
    return output;
}

function round_up_down(number, num_of_dec_places) {
    // Scale up, round to the nearest integer, then scale back down
    var num_to_10th_pow = Math.pow(10, num_of_dec_places);
    return Math.round(number * num_to_10th_pow) / num_to_10th_pow;
}
....
Use the above in a 'polling' (index.html) page which runs the JavaScript at page load. I use the Dojo Toolkit, so I left that stuff out, but a simple <body onload="init_page();"> will work just fine. The JavaScript gets the overall width of each span of text, which has been set to a specific font size, divides it by the number of characters and voila, we have the character width in pixels. Cool eh? I just add this new info to my redirect URI string and it is sent to my script.

Now you might still be asking why this is important. Well as mobile sites become more advanced, there will be more info displayed from general databases and such. Since this data might not already be conditioned for mobile display, the above technique, among others, may become the norm and not the exception.

Here is an example of it in action. Go to my website: http://opensite.mobi. Watch as it first goes to our index.html page. It quickly redirects to our site. Now look at the redirect URI in your browser's address bar. I send the info bundled (I write a cookie to store this), but you will see something like 'rev_cookie=unknown:js_yes:14:10:9:8:6:6' if JavaScript was enabled. The numbers part is what we are interested in here. Largest to smallest, we now have the character widths as gathered for this browser. Note: on my website, once a cookie is written the index.html page skips the character size stuff, so you will have to clear your cookies and cache to see it again if you try multiple times.
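If you want to pick those numbers back out, something like the following sketch would do. It is only an illustration in JavaScript, not what my site runs. It assumes the last six colon-separated fields are the widths in the same order as the polling page (xl, lg, md, sm, xs, xx); the names SIZE_IDS and parseCharWidths are hypothetical.

```javascript
// Order of the font sizes, largest to smallest, as used in the polling page.
const SIZE_IDS = ['xl', 'lg', 'md', 'sm', 'xs', 'xx'];

// Hypothetical parser: assumes the last six colon-separated fields of the
// value are the character widths in pixels. Returns null if they are not.
function parseCharWidths(value) {
  const fields = value.split(':');
  if (fields.length < SIZE_IDS.length) return null;
  const nums = fields.slice(-SIZE_IDS.length).map(Number);
  // Reject anything that is not a positive whole number of pixels.
  if (nums.some(n => !Number.isInteger(n) || n <= 0)) return null;
  const widths = {};
  SIZE_IDS.forEach((id, i) => { widths[id] = nums[i]; });
  return widths;
}
```

With the sample value above, parseCharWidths('unknown:js_yes:14:10:9:8:6:6') gives xl as 14 and xx as 6. A value with no numbers on the end returns null, in which case you can fall back to a fixed width table.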

If you know of similar or other data preparation techniques for mobile display, please post a comment with code or a URL. I searched a good bit some time ago, but you never really know what's out there.