Fedora + Ruote workflow system

Workflow flowchart

Here is a high-level diagram of a workflow system that combines Ruote, ActiveMQ, and Fedora to create a flexible, extensible, lightweight workflow system:

Posted in Uncategorized.


Fedora, Blacklight, and Ruby on Rails

I’ve been playing with Blacklight, a catalog interface built on Solr, this weekend with fairly positive results. After some initial frustration trying to figure out the demo data, I switched gears and connected Blacklight to my own Solr data source, populated by a Fedora repository.

Two initial kinks here were:

  • The unique identifier field `id` is hard-coded into Blacklight, while my existing data used the field name `PID`; see CODEBASE-171
  • The unique identifiers in my repository began with a qualified namespace in the form “org.example.repository”, which broke the Ruby on Rails default routing system (Rails treats a period in a URL segment as the start of a format extension, such as .xml)

My quick fix for the routing issue was to change the formatting requirements for the id field in the router, so my resource map now looks like:


  map.resources(:catalog,
    :only => [:index, :show, :update],
  […]
    :requirements => { :id => /([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|~|_|(%[0-9A-F]{2}))+/ }
  )

The regular expression is a copy of the Fedora PID regular expression, except that I’ve disallowed periods in the identifier portion after the colon (they are still legal in the namespace, which I imagine is the common practice).
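As a quick sanity check, the requirement can be tried against a few identifiers outside of Rails. This is only a sketch: the sample PIDs are made up, and the pattern is anchored by hand on the assumption that the router matches the whole :id segment.

```ruby
# The :id requirement from the route above, copied verbatim.
PID_PATTERN = /([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|~|_|(%[0-9A-F]{2}))+/

# Anchored copy for standalone testing (assumption: the router matches the
# whole segment, so we anchor to the whole string here).
ANCHORED_PID = /\A#{PID_PATTERN.source}\z/

def valid_pid?(candidate)
  !!(candidate =~ ANCHORED_PID)
end

# Hypothetical identifiers, for illustration only.
samples = [
  "org.example.repository:1234", # periods in the namespace are allowed
  "demo:object_1",               # underscore in the identifier is allowed
  "demo:object.1",               # period in the identifier is rejected
  "no-namespace"                 # no colon, so no namespace/identifier split
]

samples.each { |pid| puts "#{pid} => #{valid_pid?(pid)}" }
```

Only the first two samples should pass, which is the behaviour the route needs: namespaces with periods are routable, while a period after the colon is left free for Rails to treat as a format separator.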

There is still a fair bit of work hooking in object views, but the catalog + discovery portions were quickly and easily done.

Posted in Repository.


BagIt workflows

Adapted from an email I just wrote, but I think there are some good resources here, so I thought I’d share more widely.

I’ve toyed around with the BagIt standard, and have a demonstrator for a very homogeneous use case (using Ruby, Ruote, and ruby-bagit), but it doesn’t factor into our DAM -> Fedora workflow yet. From my limited implementation, it would certainly be nice to see DAM systems beginning to adopt it, if a few issues can be addressed, either by the standard or by convention.

The biggest issue with the BagIt standard at this point is that it is exclusively a framework for transferring a collection of files, and doesn’t yet provide a way to create complex/compound objects out of the contents. The Library of Congress has been using BagIt for their Chronicling America newspaper project (tech notes), but the reconstruction of objects and relationships has been implicit (based on a file naming convention) or done manually. This probably works in the simplest cases, where each BagIt item can be mapped into a compound object with either limited or embedded metadata, but I’m not sure whether it could easily be applied to the problem of creating and relating multiple (heterogeneous) complex objects. Ben O’Steen at Oxford has proposed an extension that adds an RDF manifest to the BagIt package to express this sort of relationship, but I haven’t pursued that further. There has also been some recent development around combining BagIt and OAI-ORE, which might be a better way of approaching the problem using existing standards.

A further wrinkle, at our end, is that our Fedora repository is holding compressed access copies of the content, which cannot be stored in the DAM (because the DAM content model fails to account for proxy objects or similar). I imagine this is going to be a problem with almost all large datastreams, and something infrastructure will have to adapt to.

Posted in Uncategorized.


Solr Data Import Handler

This week, I had the opportunity to write a data import handler (DIH) for the Solr search server, which elegantly mapped a MySQL database to the Solr schema. Before this, I had been writing small scripts with an XML output, because the scope of the underlying data wasn’t neatly contained in a single document or database. The DIH is a new feature in Solr 1.3, and it really seems to make integrating search almost trivial, to the point where anyone who can write an SQL query can begin replacing built-in full-text engines with a Solr service, offering more flexibility, efficient faceting, and a document-centric view appropriate for search.

The basic skeleton looked something like this:

<dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver" batchSize="-1" url="jdbc:mysql://localhost:3306/cms?zeroDateTimeBehavior=convertToNull" user="root" />
    <document name="doc">
        <entity transformer="RegexTransformer" name="page" query="SELECT ... FROM ... JOIN ... JOIN ... JOIN ..">
            <field column="title" name="dc.title" />
            [...]
            <field column="names" splitBy="," name="dc.contributor" />
        </entity>
    </document>
</dataConfig>

A couple of things to note:

First, in the dataSource configuration, I’ve set batchSize="-1", which asks the MySQL JDBC driver to stream rows instead of buffering the whole result set. This lowers the number of rows kept in memory and prevents Solr (and the servlet engine) from running out of memory.

Second, in the JDBC configuration, I’m using zeroDateTimeBehavior=convertToNull, which is a very easy way of dealing with those pesky "0000-00-00 00:00:00" dates that normally come out of the database, and allows Solr to gracefully skip that field.

Third, in some multivalued field declarations (like names -> dc.contributor), I’m using the RegexTransformer and its splitBy attribute to reverse a MySQL GROUP_CONCAT() field, which at least saves a query (and forces more of the data marshaling logic into the SQL query, leaving the Solr mapping fairly straightforward).
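To make the splitBy step concrete, here is a small Ruby imitation of what happens to one row. It is not Solr code: the row contents and the split_field helper are hypothetical, and only the column names, field names, and the "," separator come from the configuration above.

```ruby
# Imitates splitBy: treat the separator as a regular expression and split the
# column's string value into an array suitable for a multivalued field.
# (split_field is a made-up helper, not part of Solr.)
def split_field(row, column, split_by)
  row.fetch(column, "").to_s.split(Regexp.new(split_by))
end

# Hypothetical row, shaped like the output of a query that uses GROUP_CONCAT()
# to fold several contributor names into one "names" column.
row = { "title" => "Annual report", "names" => "Alice Smith,Bob Jones,Carol Wu" }

doc = {
  "dc.title"       => row["title"],
  "dc.contributor" => split_field(row, "names", ",")
}

puts doc.inspect
```

One caveat with this approach: if a value can itself contain a comma, both GROUP_CONCAT() and the split need a different separator, which MySQL supports through the SEPARATOR keyword.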

The Solr transformers look incredibly powerful and almost certainly worth pursuing further in the future. One update I eagerly await is the integration of the DIH with Solr Cell, a text+metadata extraction service, under [#SOLR-1358], which would let you merge previously extracted (or entered) metadata with the fulltext of documents. When this feature is added, I think I can pretty much give up on my transforming scripts and switch to the DIH for all purposes.

Posted in Uncategorized.


Improved Poster Generator

Partly due to a lack of complete functionality with the Obama Poster Generator, I decided to turn it into a much more generic and useful script for taking an image, smoothing/filtering it, and then changing the palette of the result. This technique is most often used to generate things like President Obama’s campaign posters, or any t-shirt graphic that has 2 to 4 colors and very smooth lines (at least, the ones based on existing images of people and things). I think it could also be used, in principle, to create the effect of some of the cel-shaded video games, in the style of Jet Grind Radio, or even to create an animation style similar to that of A Scanner Darkly or the Charles Schwab commercials.

Why? A truecolor image contains 8 bits of data per color channel. In a picture of a “solid” color, the lower-order bits carry the “texture” of the image, created by small variations in color strength. If we chop off these lower-order bits and then repaint the whole image using a specific subset of the 24-bit truecolor space (still using 24 bits per color, so each palette entry can be chosen very precisely), the textures disappear, and the only variations that remain visible correspond mostly to edges and differences in lighting.

I’ve included an example of what our President would look like as a video game character, for effect. The results aren’t the best, because it would take a very long time to determine the correct palette for the effect that we want, and I simply generated a uniform palette that discards the lowest bits when converting. The similarity is still apparent.

In practice, generating high-quality animated-style images based on real photographs will require hours of tweaking the palette’s colors. While I was able to get away with only a few bits of resolution (I used 2 bits), a larger resolution will be necessary for something that looks high quality or has a large number of colors. Interestingly enough, for resolutions above 4 bits, the image looks basically the same, but weird in subtle ways that humans have trouble pinpointing. A resolution like this with a very restricted palette (say, the 3-bit one, properly expanded to provide information for all of the combinations) could provide very fine-grained control of the animated style.
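The bit-chopping idea can be sketched on a single pixel in Ruby. This is an illustration only, assuming the uniform palette described above (each bin repainted with the value at the bottom of the bin); the helper names are made up and the pixel is arbitrary.

```ruby
# Reduce an 8-bit channel value to target_bits of resolution. The integer
# division discards the low-order bits; multiplying back repaints the value
# at the bottom of its bin (a uniform palette).
def quantize_channel(value, target_bits)
  divide_factor = 256 / (2**target_bits)
  (value / divide_factor) * divide_factor
end

# Apply the same reduction to each of the red, green, and blue channels.
def quantize_pixel(rgb, target_bits)
  rgb.map { |value| quantize_channel(value, target_bits) }
end

# Number of distinct colors a palette needs at a given resolution.
def palette_entries(target_bits)
  (2**target_bits)**3
end

pixel = [200, 130, 37] # arbitrary example pixel

puts quantize_pixel(pixel, 2).inspect # prints [192, 128, 0]
puts palette_entries(2)               # prints 64
puts palette_entries(3)               # prints 512
```

With 2 bits per channel a palette has only 64 entries to tune by hand, while 3 bits already needs 512, which is part of why a hand-picked palette takes so long to get right.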

Without further ado, a re-post of the original picture, a much more accurate campaign-style poster, and a video game/commercial version of a portrait of President Obama, along with the palette files that I used for each of them:

Original portrait of President Obama

A Campaign Poster-style Version of the original portrait

A more cartoony, Jet Grind Radio-style version of the Portrait

Palettes:
For the campaign poster (this will need modification on a per-image basis, but the colors should be the same): click here.
For the cartoon-y image (I would recommend making one that is more appropriate for how you want the image to be transformed, this was just a rough starting point): click here.

And, the code. My first contribution to the open source movement, incidentally.

% *************************************************************************
% Image Poster Generator, v2.0
% This filters, decreases the resolution, and then repaints an image with
% values from a specified palette.
%
%   Copyright (C) 2009 Greg Malysa
% *************************************************************************
%    This program is free software: you can redistribute it and/or modify
%    it under the terms of the GNU General Public License as published by
%    the Free Software Foundation, either version 3 of the License, or
%    (at your option) any later version.
%
%    This program is distributed in the hope that it will be useful,
%    but WITHOUT ANY WARRANTY; without even the implied warranty of
%    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
%    GNU General Public License for more details.
%
%    You should have received a copy of the GNU General Public License
%    along with this program.  If not, see <http://www.gnu.org/licenses/>.
% *************************************************************************

% Specify the names of the palette image, the input, and the output
paletteName = 'palette2bit.png';
inputName = 'obama.jpg';
outputName = 'obama_filt_out.png';
outputType = 'png';

% The darken factor divides all of the pixel values in the image by the
% amount given here in order to change the brightness or darkness of the
% image as a whole.
darken = 1.35;

% Amount of smoothing to apply: any integer >= 1. Large values will make
% things become far too smooth though, so be careful. 1 does no
% smoothing.
smooth = 4;

% The new target resolution (in bits), which also determines the palette
% size. This must be >= 1 or nothing particularly interesting happens.
% Higher resolutions require larger palettes and allow for more colors.
% With resolution = 8, nothing happens, as all of these images are 8-bits
% per channel anyway (24-bit truecolor images).
targetBits = 2;
paletteSize = 2^targetBits;
divideFactor = 256 / paletteSize;

% The palette image is a row of paletteSize blocks, each paletteSize by
% paletteSize pixels, where the block number indicates the red channel
% index, the x offset within a block indicates the green channel index, and
% the y offset indicates the blue channel index. MATLAB specifies y (row)
% first and x (column) second when accessing an image.
paletteBase = cast(imread(paletteName), 'uint8');
palette = zeros([paletteSize paletteSize paletteSize 3]);
for block = 1:paletteSize
    blockBase = (block-1)*paletteSize;
    for green = 1:paletteSize   %x
        for blue = 1:paletteSize    %y
            palette(block, green, blue, :) = paletteBase(blue, blockBase+green, :);
        end
    end
end

% Format is rows by columns by channels (y, x, 3)
image = imread(inputName)/darken;

% Create a filter coefficient matrix based on the order of smoothness
% requested (this is like a 2-D averager, which is a form of low-pass
% filter). This can be changed to create an edge-detector, for instance.
B = ones(smooth)/(smooth*smooth);

% Apply to each channel separately
imagenew(:,:,1) = filter2(B, image(:,:,1));
imagenew(:,:,2) = filter2(B, image(:,:,2));
imagenew(:,:,3) = filter2(B, image(:,:,3));

% Decrease the resolution based on the bit number requested above, dealing
% with some inconsistencies between MATLAB's casting and traditional
% conventions, and MATLAB's array access mechanisms.
imagenew = cast(imagenew, 'double');
image = cast(floor(imagenew/divideFactor), 'uint8')+1;

% Channel 1 is red, channel 2 is green, and channel 3 is blue
for n = 1:length(image(:,1,1))
    for k = 1:length(image(1,:,1))
        image(n, k, :) = palette(image(n, k, 1), image(n, k, 2), image(n, k, 3), :);
    end
end

% Write out the final image
imwrite(cast(image, 'uint8'), outputName, outputType);

Posted in Code, Computer Science.


Obama Poster Generator

While playing with the idea of hiding images-within-images (aka steganography), I realized that by applying the right thresholds in The Gimp to the 4-bit downscaled version of the image I was trying to hide, it looked sort of like the posters that President Obama used during his campaign, except without the color. Inspired, I hacked together a MATLAB script that will do a reasonable job of converting an image from normal, full color into an image consisting of a peachy color, dark navy, light blue, and red, in a style similar to the campaign posters.

% This Obama-izes an image, applying the basic four-ish color pattern
% utilized in Obama's campaign posters.

% Format is rows by columns by channels (y, x, 3)
image = imread('guitar.jpg', 'jpg');

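% Downsample into five intensity levels (0-4). uint8 division rounds to the
% nearest integer, so the middle levels are 64 values wide and the two end
% levels are narrower.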
image = cast((image+1)/64, 'uint8');

% Channel 1 is red, channel 2 is green, and channel 3 is blue
for n = 1:length(image(:,1,1))
    for k = 1:length(image(1,:,1))
        if (image(n, k, 1) > 1 && image(n, k, 3) > 1)
            image(n, k, 1) = 252;
            image(n, k, 2) = 228;
            image(n, k, 3) = 166;
        elseif (image(n, k, 1) > image(n, k, 3))
            image(n, k, 1) = 217;
            image(n, k, 2) = 26;
            image(n, k, 3) = 33;
        elseif (image(n, k, 3) >= image(n, k, 1) && image(n, k, 3) > 0)
            image(n, k, 1) = 112;
            image(n, k, 2) = 151;
            image(n, k, 3) = 158;
        else
            image(n, k, 1) = 0;
            image(n, k, 2) = 52;
            image(n, k, 3) = 76;
        end
    end
end

imwrite(image, 'guitar_out.png', 'png');

The results were pretty good for a few minutes of hacking:

I had a lot of trouble finding a “universal” set of conversion rules. Instead, I mostly looked at the output for each image and tweaked the thresholds in each test, along with whether to use > or >= for the last check before falling back to the dark navy color. A more sophisticated approach to the conversion process would probably yield far better results and require less tweaking for each image, but I was surprised by how simple it was to generate a reasonable approximation.

A more versatile version of this code can be found at the Improved Poster Generator.

Note: I found three of these pictures on the Internet. They are not mine and I make no claim to them. The resulting transformed images (all four) may belong to me; if so, you are free to reproduce them, but I ask that you mention where you found them.

Posted in Computer Science.


Analog Multiplexer (Update–9/25/09)

After posting about digital multiplexers and decoders, I started thinking about analog multiplexers, which I have not used before, but which can be quite useful, for instance in bypassing an effects pedal on a guitar. In principle, because a CMOS transistor can function as a connected/disconnected switch for digital circuits, I figured that it would work similarly for analog circuits, and threw together this 2:1 analog multiplexer:

Analog 2:1 Multiplexer Circuit

V1 and V2 combine to form a single 0-5 V supply; I simply broke the two apart to demonstrate how the signal biasing could work. V3 and V6 represent the output pins on a microcontroller that are used to select between inputs. They must be able to source a few mA when driven high (this shouldn’t be a problem for practically all microcontrollers, on real I/O pins). A simple logic circuit (i.e. a decoder) could be used to take an encoded representation of the desired signal and drive these inputs appropriately. V4 and V5 are the input signals, referenced to any other signal, and they are properly biased (a good practice, even if they are supposedly referenced correctly) to ensure they are within the operating region of the rest of the circuit.

I chose 0-5 V rails to make the mux usable on any board that already had similar supplies for, say, a PIC microcontroller. Not using a -5 V rail simplifies the power requirements for simple circuits, but it does require that the signal be biased properly, in order to avoid clipping as it passes through the output buffer. The resistors and capacitors on the biasing circuit were selected to provide a cutoff frequency of ~0.3 Hz on the implicit high-pass filter, in order to avoid attenuating the low frequencies in audio applications and attempt to provide a relatively flat passband in the entire bandwidth of the NMOS transistors, which was approximately 1 MHz in the SPICE simulation I used. For higher frequency usage only, the values of both can be adjusted in order to make the circuit components easier to find, but care must be taken to limit the current that the input will be required to source/sink. If a piezoelectric transducer is used, its current capabilities are rather limited, so high resistances are preferable, as well as low capacitances (1 uF is large in this case, I think). A buffer stage could be introduced on the front end, but it would again require proper biasing to ensure that the signal is not clipped, which is highly likely using only a single +5 V supply.
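For anyone adjusting the biasing components, the cutoff of a first-order RC high-pass filter is f = 1 / (2πRC). The Ruby sketch below just evaluates that formula; the component values are picked for illustration and are not read from the schematic.

```ruby
# Cutoff frequency (in Hz) of a first-order RC high-pass filter.
def highpass_cutoff_hz(resistance_ohms, capacitance_farads)
  1.0 / (2 * Math::PI * resistance_ohms * capacitance_farads)
end

# Illustrative values only (assumptions, not taken from the circuit):
# 5 MOhm with 0.1 uF lands near the ~0.3 Hz target mentioned above.
puts highpass_cutoff_hz(5e6, 0.1e-6).round(3)

# Lowering the resistance to ease the load on the bias network raises the
# cutoff in proportion: 100 kOhm with the same capacitor.
puts highpass_cutoff_hz(100e3, 0.1e-6).round(1)
```

The trade-off described above shows up directly: keeping the cutoff low with a smaller resistance needs a proportionally larger capacitor, which in turn asks more current of the source.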

This circuit is provided for academic/hobbyist purposes. If you really need to use an analog multiplexer, I suggest you sample or purchase one from Analog Devices because it will be more robust and have fewer caveats than this circuit does (for instance, if you drive both inputs high on this circuit, it will gladly add the two signals together, which might be useful instead of using a summing amplifier). Comments and other feedback are welcome, as I’m by no means good at analog design, and I only did SPICE simulations for this circuit—I haven’t had a chance to build it and try it in the real world yet.

[UPDATE]
I was planning on building this circuit recently, both for a separate project and to test it out, when some shopping on Digi-Key led me to find that it is very difficult to purchase MOSFETs with substrate connections. MOSFETs are four-terminal devices, but most common applications do not use them this way. Digi-Key lists over 16,000 different MOSFETs and offers no easy way to search for a four-terminal one (many come in larger packages with multiple pins connected to the drain and source), so I abandoned the original design and spent some time working on an alternative circuit. A combination of MOSFETs and BJTs should be sufficient to emulate the four-terminal device, but it imposes additional restrictions on the signals that can be passed through for a given power supply. The turn-on voltages of the BJTs require a (device-specific) gap between the maximum value of the signal and the power rails, but most operational amplifiers have similar restrictions (although not as large in magnitude). To that end, this is the modified circuit that gives the same results in simulation. I’ve also added a pull-down resistor (which can be tied to the negative rail at 0 V, the positive rail at +5 V, or any available constant voltage) to deal with the input/offset current of the op amp. This resistor should ideally be as large as possible (5 MOhm, like the others, is fine), but it will, in theory, attenuate all frequencies equally, which means that something as low as 10 or 100 kOhm is acceptable, depending on how wide a voltage margin your op amp requires and how much current your signal transducers can source.

2:1 Analog Multiplexer

Posted in Hardware.


Repository workflows

Workflows, workflow engines, and workflow tools were the subject of one of the sessions at RIRI ’09, and a common theme across many of the conferences I attended this year.

Workflows initially came out of “enterprise” systems for managing web services, and moved into the repository in the hands of the repository managers. They establish, in advance, rule-based workflows to manage (primarily) submission and dissemination with the Business Process Execution Language (BPEL). While this may be perfect for describing complex interactions, the burden of creating workflows and the additional overhead make this approach seem like overkill.

From the scientific community come two more basic workflow systems, Taverna and Kepler. The key difference between these systems and BPEL seems to be the intended audience, which for these two applications is the scientists themselves, looking to manage and marshal their data in a manner specific to their needs. At this point, many of the applications seem ad hoc (although, judging by myExperiment, they seem to be gathering interest in the scientific community). While these systems are certainly applicable to making use of the material in a repository environment, their application to repository management seems questionable at this point.

A third option is a programmatic workflow engine like Ruote, which allows one to specify business processes in Ruby, JSON, or an XML syntax, and can be linked with Fedora’s Java Messaging Service using STOMP and some Fedora objects.

After the fold, I’ve outlined a very basic Ruote workflow for updating a Solr search index every time a Fedora object is updated, similar to the GSearch plugin. Forgive my rather ugly Ruby code; this is just a quick sketch of a possible service.
Continued…

Posted in Code.


Video4All: HTML5 <video> alternative [updated]

Earlier this week, Matt Mastracci released his video4all project, which replaces the HTML5 <video> element with a flowplayer-based alternative for non-compatible browsers. Independently, I’ve been working on bringing the HTML5 JavaScript API to some video plugins using a JavaScript wrapper. At this point it is still very basic, but hopefully it proves useful or interesting. I’ve created a basic flowplayer version, which currently requires the Prototype JavaScript library (although it should port trivially to the flowplayer subset). This layer supports functionality like play/pause, volume control, seeking/currentTime, and metadata, as well as more advanced features like cue ranges. Error states and events are not yet supported.

Here is a very basic demo that demonstrates play/pause and seeking to a time. I’ve only tested this in Safari 4 and Firefox 3.5, but I believe it should work in earlier versions. It relies on some __getter__/__setter__ JavaScript, which likely fails in Internet Explorer (I am aware of a project that offers a workaround, but I haven’t tried it out yet).

Posted in Code.


MALLET topic analysis of JCDL + Open Video tweets

I’m working towards some interesting visualizations of the Twitter streams from a number of conferences (starting with JCDL and Open Video this last week). I’m using Judith Bush’s very cool gawk script to parse the raw Atom files. My first step was to get topics for the corpus as a whole:

/Applications/mallet/bin/mallet train-topics --input data.mallet --num-topics 10 --output-state topic-stat.gz --output-doc-topics doc-topics --output-topic-keys doc-keys --num-iterations 2000 --optimize-interval 2500

JCDL

0	5	http bit ly org interesting marshall analysis wolf week existing pizza people
1	5	jcdl books data don works problem target foundation facilitate creating
2	5	jcdl libraries evaluation future discussion day multiple public lots univ
3	5	jcdl paper lightweight music back issues funny build dog
4	5	session user talk search talking papers documents great collection type tatted
5	5	conference library good mentors content students focus run building pints
6	5	jcdlgoogle www law participation dl dchud online nice bats duck
7	5	jcdl austin poster google tomorrow small librarian tonight nice
8	5	jcdl digital tags question collections social wikipedia war
9	5	workshop people time quality study alan live archive idea lots

Open Video

0	5	video conference open source net making metadata mozilla developers adobe brokep learned system presentation long openvideo ly app msf
1	5	openvideo ovc tv time gd week vlc html stuff folks nyc platform google meet checking slides startrek kdnlf ll
2	5	media goodman amy watch good great mainstream idea war days im tr flash tpb change put films class devine
3	5	openvideo youtube rt videos session world xenijardin system room art doesn show iran channel film audio totally activism presentation
4	5	openvideo content pirate public live sunde peter cc jardin project creative keynote speaker ogg sweden twitpic licensed seminar fisl
5	5	openvideo people internet talk access day tinyurl conf vid online storytelling awesome working hack digital miro final evolution similar
6	5	openvideo de free en la xeni el years amazing copyright film blog education works closed msurman tk iranian tagged
7	5	openvideo amp ted don great work culture fair back editing question technology site cable id lecture wiki form youtube
8	5	http bit ly check interviews wrap royblumenthal creativecommons based casts ll website footage archives ogg rad blogposts
9	5	openvideo org www openvideoconference make http web watching foss roflmemes put hope sessions online cool launches marketing rest rt
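For the follow-up analysis, the topic-key listings above are easy to load into a script. Here is a minimal Ruby sketch, assuming each line is tab-separated as topic number, a per-topic weight (which appears to be the Dirichlet parameter), and a space-separated word list. The struct and helper names are made up, and the sample text is abbreviated from the JCDL output.

```ruby
# One line of a topic-keys listing. The names here are illustrative.
TopicKey = Struct.new(:id, :alpha, :words)

# Parse tab-separated topic-key lines into TopicKey records, skipping blanks.
def parse_topic_keys(text)
  text.each_line.map(&:strip).reject(&:empty?).map do |line|
    id, alpha, words = line.split("\t", 3)
    TopicKey.new(Integer(id), Float(alpha), words.to_s.split)
  end
end

# Abbreviated sample taken from the JCDL listing above.
sample = "0\t5\thttp bit ly org interesting\n" \
         "1\t5\tjcdl books data don works\n" \
         "2\t5\tjcdl libraries evaluation future\n"

topics = parse_topic_keys(sample)

# Words that appear in more than one topic are usually uninformative
# (hashtags, URL fragments) and are candidates for a stop list.
counts = Hash.new(0)
topics.each { |topic| topic.words.uniq.each { |word| counts[word] += 1 } }
shared = counts.select { |_word, n| n > 1 }.keys

puts topics.map(&:id).inspect
puts shared.inspect
```

In the listings above, terms like "jcdl" and "openvideo" turn up in most topics, so filtering them out before training would likely give more distinctive topics.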

Future work will include temporal analysis and “speaker” analysis.

Posted in Uncategorized.