<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Authoritative Opinion &#187; Repository</title>
	<atom:link href="http://authoritativeopinion.com/blog/category/repository/feed/" rel="self" type="application/rss+xml" />
	<link>http://authoritativeopinion.com/blog</link>
	<description></description>
	<lastBuildDate>Mon, 19 Jul 2010 00:04:46 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1-alpha</generator>
		<item>
		<title>Digital Asset Management for Public Broadcasting: Interlude</title>
		<link>http://authoritativeopinion.com/blog/2010/05/28/digital-asset-management-for-public-broadcasting-blacklight-interlud/</link>
		<comments>http://authoritativeopinion.com/blog/2010/05/28/digital-asset-management-for-public-broadcasting-blacklight-interlud/#comments</comments>
		<pubDate>Fri, 28 May 2010 18:10:12 +0000</pubDate>
		<dc:creator>chris</dc:creator>
				<category><![CDATA[Repository]]></category>
		<category><![CDATA[TODO]]></category>

		<guid isPermaLink="false">http://authoritativeopinion.com/blog/?p=344</guid>
		<description><![CDATA[Just a quick update on my progress developing a shareable prototype. The basic integration work is functional, I&#8217;ve ripped out the previously-mentioned Camel workflow components in favor of ruote (which is so much easier to wrap my mind around &#8212; I&#8217;ve pushed the skeleton code for this out as a separate package called fedora-workflow), and [...]]]></description>
			<content:encoded><![CDATA[<p>Just a quick update on my progress developing a shareable prototype. The basic integration work is functional, I&#8217;ve ripped out the previously-mentioned Camel workflow components in favor of ruote (which is so much easier to wrap my mind around &#8212; I&#8217;ve pushed the skeleton code for this out as a separate package called <a href="http://github.com/cbeer/fedora-workflow">fedora-workflow</a>), and I&#8217;ve started doing some very basic datastream display work.</p>
<p>After this work is complete, I think a first-round alpha will be ready to publish within the next couple weeks.</p>
]]></content:encoded>
			<wfw:commentRss>http://authoritativeopinion.com/blog/2010/05/28/digital-asset-management-for-public-broadcasting-blacklight-interlud/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital Asset Management for Public Broadcasting: Blacklight (Part 3 of ??)</title>
		<link>http://authoritativeopinion.com/blog/2010/05/10/digital-asset-management-for-public-broadcasting-blacklight-part-3-of/</link>
		<comments>http://authoritativeopinion.com/blog/2010/05/10/digital-asset-management-for-public-broadcasting-blacklight-part-3-of/#comments</comments>
		<pubDate>Mon, 10 May 2010 22:03:28 +0000</pubDate>
		<dc:creator>chris</dc:creator>
				<category><![CDATA[Repository]]></category>
		<category><![CDATA[TODO]]></category>
		<category><![CDATA[blacklight]]></category>
		<category><![CDATA[digital asset management]]></category>
		<category><![CDATA[fedora]]></category>

		<guid isPermaLink="false">http://authoritativeopinion.com/blog/?p=337</guid>
		<description><![CDATA[In the previous parts, I wrote about two &#8220;back-office&#8221; open source applications (and tangentially discussed a few others) that are well-established in their communities and can support a wide variety of repository services. While it may be philosophically important that these are open source applications, I would argue that the next parts, in which I [...]]]></description>
			<content:encoded><![CDATA[<p>In the previous parts, I wrote about two &#8220;back-office&#8221; open source applications (and tangentially discussed a few others) that are well-established in their communities and can support a wide variety of repository services. While it may be philosophically important that these are open source applications, I would argue that the next parts, in which I want to talk about services and applications on top of the repository infrastructure, are the more crucial and benefit tremendously from the ability to create and customize interfaces for specific use cases to the full extent necessary by anyone with a fairly broad skill-set.</p>
<p><a href="http://projectblacklight.org">Blacklight</a> grew out of a next-generation library catalog interface, and while it still has very firm roots in the library world, it is also being used for archives, digital collections, and institutional repository interfaces. It is also an open source application, based on the Ruby on Rails framework.</p>
<p>Out of the box, it is a fairly generic interface to a Solr index (with a little sprinkling of optional MARC data) and some relatively benign application features (users, bookmarks, saved searches). Connecting it to our existing Solr index is fairly trivial and requires only a few configuration changes:</p>
<pre name="code" class="ruby">
config[:index_fields] = {
  :field_names =&gt; [
    "dc.description",
    "dc.creator",
    "dc.publisher",
    "dc.subject",
    "dc.date",
    "dc.format"
  ],
  :labels =&gt; {
    "dc.description" =&gt; "Description:",
    "dc.creator"     =&gt; "Creator:",
    "dc.publisher"   =&gt; "Publisher:",
    "dc.subject"     =&gt; "Subject:",
    "dc.date"        =&gt; "Date:",
    "dc.format"      =&gt; "Format:"
  }
}
</pre>
<p>Which gives you a very basic discovery interface into your collection.</p>
<p>Extending Blacklight to work with Fedora is also easy: in less than 50 lines of code, I had full access to the Fedora web services APIs and SPARQL interface. Adding management interfaces was also simple using normal Ruby on Rails techniques. With less than 500 lines of code, I had a passable repository manager interface and could import assets and metadata.</p>
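<p>To give a rough idea of what that thin Fedora layer might look like, here is a minimal sketch that only builds request URLs. It is not the code from the prototype: the <code>FedoraUrls</code> class, the base URL, and the exact path and parameter layout (modelled on the Fedora 3.x REST and resource index interfaces) are assumptions for illustration and should be checked against your Fedora version.</p>

```ruby
require 'uri'

# Hypothetical helper that builds request URLs for a Fedora 3.x repository.
# The base URL and the path layout are assumptions, not taken from the post.
class FedoraUrls
  def initialize(base_url = 'http://localhost:8080/fedora')
    # Drop any trailing slashes so paths can be appended safely.
    @base_url = base_url.sub(%r{/+\z}, '')
  end

  # URL for the content of one datastream of one object.
  def datastream_content(pid, dsid)
    "#{@base_url}/objects/#{escape(pid)}/datastreams/#{escape(dsid)}/content"
  end

  # URL for a SPARQL tuple query against the resource index.
  def sparql(query)
    params = {
      'type' => 'tuples',
      'lang' => 'sparql',
      'format' => 'CSV',
      'query' => query
    }
    "#{@base_url}/risearch?#{URI.encode_www_form(params)}"
  end

  private

  # Percent-encode a single path or query component.
  def escape(value)
    URI.encode_www_form_component(value)
  end
end
```

<p>The resulting strings can be handed to any HTTP client (for example <code>Net::HTTP.get(URI(url))</code>); keeping URL construction separate from the request itself makes the layer easy to test without a running repository.</p>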
<p>Adding a security layer on top of the repository content is also easy, thanks to the work the UPEI team put into the <a href="http://www.fedora-commons.org/confluence/display/ISLANDORA/Islandora+Guide#IslandoraGuide-DrupalServletFilter">DrupalServletFilter</a>, which allows Fedora to authenticate users against any SQL database. Because of this, we can use the XACML policy language built into Fedora to do record-level security (which I confess I don&#8217;t entirely understand; it is, however, an enormously powerful and expressive language if you like XML verbiage). For storing re-use rights, I am very intrigued by <a href="http://odrl.net">the Open Digital Rights Language</a>, which can integrate with Fedora and Blacklight to express non-object-security rights (re-use, segmentation, etc.) using my proof-of-concept <a href="http://github.com/cbeer/ruby-odrl">ruby-odrl</a>.</p>
<p>With these fundamentals in place (ingest services, security policies, and resource discovery), one can build more advanced services on top of the repository, like collections, batch and on-demand conversion/transcode services, export/transfer services (one-click &#8220;export to PBS COVE&#8221;?) &#8212; and, because this can be done as rails plug-ins, they are readily sharable outside of this single application and provide templates for others to continue to develop and extend similar services to evolving platforms.</p>
<p>Because setting up a Blacklight application is so painless, it would be easy for public broadcasting institutions to create custom-made (yet shareable) modules and views for specific purposes (news, productions, archiving, etc) that all share the same back-end infrastructure yet offer users an easy way to interact with their data in a way that makes sense for their work. As I mentioned in <a href="http://authoritativeopinion.com/blog/2010/05/04/digital-asset-management-for-public-broadcasting-fedora-commons-repository-part-1-of/">my Fedora article</a>, you aren&#8217;t limited to data you control and have locally, but can bring in data from external sources (say, pulling in metadata from the NPR API or an RSS feed from a stock footage house) and present it both coherently and cohesively.</p>
<p>I&#8217;m looking for a good source of freely available test data, and I would rather not invest too much time building a corpus of archival assets if something already exists. The biggest challenge I&#8217;m having is finding comprehensive metadata. The closest I&#8217;ve come are some podcast feeds from sources like Democracy Now!, but those don&#8217;t capture the breadth of materials I&#8217;d like to demonstrate.</p>
<p>Finally, a couple requisite screen-shots now that there is something visual to work with, using the default Blacklight theme with some quick interface hacks.</p>

<a href='http://authoritativeopinion.com/blog/2010/05/10/digital-asset-management-for-public-broadcasting-blacklight-part-3-of/screen-shot-2010-05-10-at-9-12-21-am/' title='Screen shot 2010-05-10 at 9.12.21 AM'><img width="150" height="150" src="http://authoritativeopinion.com/blog/wp-content/uploads/2010/05/Screen-shot-2010-05-10-at-9.12.21-AM-150x150.png" class="attachment-thumbnail" alt="Screen shot 2010-05-10 at 9.12.21 AM" title="Screen shot 2010-05-10 at 9.12.21 AM" /></a>
<a href='http://authoritativeopinion.com/blog/2010/05/10/digital-asset-management-for-public-broadcasting-blacklight-part-3-of/screen-shot-2010-05-10-at-9-05-14-am/' title='Screen shot 2010-05-10 at 9.05.14 AM'><img width="150" height="150" src="http://authoritativeopinion.com/blog/wp-content/uploads/2010/05/Screen-shot-2010-05-10-at-9.05.14-AM-150x150.png" class="attachment-thumbnail" alt="Screen shot 2010-05-10 at 9.05.14 AM" title="Screen shot 2010-05-10 at 9.05.14 AM" /></a>

]]></content:encoded>
			<wfw:commentRss>http://authoritativeopinion.com/blog/2010/05/10/digital-asset-management-for-public-broadcasting-blacklight-part-3-of/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital Asset Management for Public Broadcasting: Solr (Part 2 of ??)</title>
		<link>http://authoritativeopinion.com/blog/2010/05/08/digital-asset-management-for-public-broadcasting-solr-part-2-of/</link>
		<comments>http://authoritativeopinion.com/blog/2010/05/08/digital-asset-management-for-public-broadcasting-solr-part-2-of/#comments</comments>
		<pubDate>Sat, 08 May 2010 14:29:34 +0000</pubDate>
		<dc:creator>chris</dc:creator>
				<category><![CDATA[Repository]]></category>
		<category><![CDATA[TODO]]></category>
		<category><![CDATA[digital asset management]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://authoritativeopinion.com/blog/?p=335</guid>
		<description><![CDATA[The Lucene-based Apache Solr is an incredible platform for building decent search experiences with &#8212; especially compared to the &#8220;more traditional&#8221; database-driven approach with many SQL JOINs that it becomes difficult to efficiently add search features like stemming, ASCII-folding, term highlighting, facets, and synonyms which, I would argue, are essential parts of the discovery experience [...]]]></description>
			<content:encoded><![CDATA[<p>The Lucene-based <a href="http://lucene.apache.org/solr">Apache Solr</a> is an incredible platform for building decent search experiences, especially compared to the &#8220;more traditional&#8221; database-driven approach, where so many SQL JOINs are involved that it becomes difficult to efficiently add search features like stemming, ASCII-folding, term highlighting, facets, and synonyms. I would argue these are essential parts of the discovery experience, and you essentially get them for free with Solr. Another benefit Solr provides is a foundation for many light-weight interfaces on top of a single index (or across multiple indexes, because Solr enforces some decent scalability principles that make expanding to task-based indexes easier).</p>
<p>For a DAM project, each asset should appear in the search index with the basic layer of contributed metadata, relationships, metadata extracted from the assets, as well as the administrative metadata managed by Fedora. I would align the fields with the Dublin Core (and DCTerms) elements (which is probably all you can get users to contribute in any case). At this point, because legacy systems lack authority control, linked data, and the like, existing metadata is sparse, inaccurate, or limited. That sets the entry-level bar pretty low, so ease-of-use and metadata collection are the priorities. Eliding a lot of detail, here&#8217;s the skeleton schema:</p>
<pre name="code" class="xml">
  &lt;field name="id" type="string" indexed="true" stored="true" required="true" /&gt;
   &lt;field name="title" type="string" indexed="true" stored="true" multiValued="true"/&gt;
   &lt;field name="description" type="string" indexed="true" stored="true"/&gt;

   &lt;dynamicField name="dc.*" type="string" indexed="true" stored="true" multiValued="true"/&gt;
   &lt;dynamicField name="dcterms.*" type="string" indexed="true" stored="true" multiValued="true"/&gt;
   &lt;dynamicField name="rdf.*" type="string" indexed="true" stored="true" multiValued="true"/&gt;
   &lt;field name="text" type="text" indexed="true" stored="false" multiValued="true"/&gt;
   &lt;field name="payloads" type="payloads" indexed="true" stored="true"/&gt;
   &lt;field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/&gt;

   &lt;copyField source="title" dest="title_t" /&gt;
   &lt;copyField source="subject" dest="dc.subject" /&gt;
   &lt;copyField source="description" dest="description_t" /&gt;
   &lt;copyField source="comments" dest="text" /&gt;
   &lt;copyField source="dc.creator" dest="author" /&gt;
   &lt;copyField source="dc.*" dest="text" /&gt;
   &lt;copyField source="text" dest="text_rev" /&gt;
   &lt;copyField source="payloads" dest="text" /&gt;

  &lt;copyField source="dc.title" dest="dc.title_t" /&gt;
  &lt;copyField source="dc.description" dest="dc.description_t" /&gt;
  &lt;copyField source="dc.coverage" dest="dc.coverage_t" /&gt;
  &lt;copyField source="dc.contributor" dest="dc.contributor_t" /&gt;
  &lt;copyField source="dc.subject" dest="dc.subject_t" /&gt;
  &lt;copyField source="dc.contributor" dest="names_t" /&gt;
  &lt;copyField source="dc.coverage" dest="names_t" /&gt;
</pre>
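<p>As a rough illustration of how a record could be shaped to fit the schema above, the following sketch flattens Dublin Core metadata into a Solr document hash. The <code>solr_document</code> function and the shape of its input (a hash of element names to one or more values) are hypothetical; only the field names come from the schema.</p>

```ruby
# Hypothetical mapping from Dublin Core metadata to a flat Solr document.
# Field names follow the schema above; the input hash shape is an assumption.
def solr_document(pid, dublin_core)
  doc = { 'id' => pid }

  dublin_core.each do |element, values|
    # Accept a single value or a list, and drop nil or blank entries.
    cleaned = Array(values).map { |v| v.to_s.strip }.reject { |v| v.empty? }
    next if cleaned.empty?
    # Each element lands in a field matched by the dc.* dynamicField.
    doc["dc.#{element}"] = cleaned
  end

  # title (multi-valued) and description (single-valued) are explicit fields.
  titles = doc['dc.title']
  doc['title'] = titles if titles
  descriptions = doc['dc.description']
  doc['description'] = descriptions.first if descriptions

  doc
end
```

<p>Because <code>description</code> is not declared as multi-valued, this sketch keeps only the first description there, while every value remains available under <code>dc.description</code>.</p>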
<p>The new <a href="https://issues.apache.org/jira/browse/SOLR-1553">edismax query parser</a> provides such a good balance of flexibility, advanced query features, and ease-of-use that it seems like an obvious choice here.</p>
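<p>For a sense of what a request against this index could look like, here is a minimal sketch that builds the query string for a faceted, highlighted edismax search. The <code>solr_query_string</code> helper, the choice of <code>qf</code> fields, and the boost value are assumptions for illustration; the parameter names (<code>defType</code>, <code>qf</code>, <code>facet.field</code>, <code>hl.fl</code>) are standard Solr request parameters.</p>

```ruby
require 'uri'

# Hypothetical helper: query-string parameters for a faceted, highlighted
# edismax search. The qf fields and the boost are illustrative assumptions.
def solr_query_string(user_query, facet_fields = ['dc.subject', 'dc.format'])
  params = [
    ['q', user_query],
    ['defType', 'edismax'],
    ['qf', 'title_t^2 dc.description_t text'],
    ['facet', 'true'],
    ['hl', 'true'],
    ['hl.fl', 'dc.description_t'],
    ['wt', 'json']
  ]
  # facet.field may be repeated, once per field to facet on.
  facet_fields.each { |field| params.push(['facet.field', field]) }
  URI.encode_www_form(params)
end
```

<p>The string is meant to be appended to the select handler URL of whichever Solr core holds the index; passing a list of pairs rather than a hash keeps the repeated <code>facet.field</code> parameter intact.</p>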
<p>The only penalty you pay by using Solr is having to keep the Solr index synchronized with your data sources. For synchronizing data from Fedora, there is now a proliferation of options, ranging from the task-specific, with Java plugins like <a href="http://www.fedora-commons.org/confluence/display/FCSVCS/Generic+Search+Service+2.2">GSearch</a> and <a href="http://github.com/mediashelf/shelver">Shelver</a>, to the more generic (ESBs and all that), like <a href="http://camel.apache.org/">Apache Camel</a> or the Ruote-based <a href="http://github.com/cbeer/fedora-workflow">Fedora Workflow</a> component. Because DAM likely involves many different workflows, I lean towards the more generic solutions. Lately, I&#8217;ve given Camel a try, and after a couple days of Java-dependency-induced head pounding, I have something that works.</p>
<p>&#8212;</p>
<p>On Twitter, <a href="http://twitter.com/johntynan/status/13400294844">John Tynan requested</a> a virtual machine image to encourage others to begin playing with this software, so I&#8217;ve actually begun building some of these pieces. Currently, I have Fedora/Camel/Solr/Blacklight installed and functional, but before I try to package it up, I feel like I should add an easy-to-use ingest system to get data in.</p>
]]></content:encoded>
			<wfw:commentRss>http://authoritativeopinion.com/blog/2010/05/08/digital-asset-management-for-public-broadcasting-solr-part-2-of/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Digital Asset Management for Public Broadcasting: Fedora Commons Repository (Part 1 of ??)</title>
		<link>http://authoritativeopinion.com/blog/2010/05/04/digital-asset-management-for-public-broadcasting-fedora-commons-repository-part-1-of/</link>
		<comments>http://authoritativeopinion.com/blog/2010/05/04/digital-asset-management-for-public-broadcasting-fedora-commons-repository-part-1-of/#comments</comments>
		<pubDate>Tue, 04 May 2010 23:33:48 +0000</pubDate>
		<dc:creator>chris</dc:creator>
				<category><![CDATA[Repository]]></category>
		<category><![CDATA[TODO]]></category>
		<category><![CDATA[digital asset management]]></category>
		<category><![CDATA[public broadcasting]]></category>

		<guid isPermaLink="false">http://authoritativeopinion.com/blog/?p=331</guid>
		<description><![CDATA[In my previous post, I provided a broad overview of the challenges and opportunities for developing an open source digital asset management system within the public broadcasting community, and described some fundamental technology that is already being developed and deployed within institutions. In this post, I want to look specifically at the role the Fedora [...]]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://authoritativeopinion.com/blog/2010/05/03/digital-asset-management-for-public-broadcasting-part-0-of/">my previous post</a>, I provided a broad overview of the challenges and opportunities for developing an open source digital asset management system within the public broadcasting community, and described some fundamental technology that is already being developed and deployed within institutions. In this post, I want to look specifically at the role the <a href="http://fedora-commons.org">Fedora Commons repository architecture</a> can play in this environment. Additional reading is available from the Fedora Commons wiki, especially the <a href="http://www.fedora-commons.org/confluence/display/FCR30/Getting+Started+with+Fedora">Getting Started with Fedora</a> article, which articulates some of the strengths of their approach in the abstract.</p>
<p>The <a href="http://www.fedora-commons.org/confluence/display/FCR30/Fedora+Digital+Object+Model">Fedora Commons data model</a> is built on top of the <a href="http://www.cnri.reston.va.us/k-w.html">Kahn/Wilensky Architecture</a>, which describes a data structure for primary digital objects (irrespective of the data or formats contained within). Already, this is an improvement over some systems, which differentiate between content types, relegating some content formats to second-class citizenship. By providing a single, fundamental data type, one can build consistent user experiences on top of the discoverable components and interact with the digital objects to GET THINGS DONE.</p>
<p>Within digital objects are datastreams, which may include both data and metadata about the object, and are treated equally (more or less&#8230;). Datastreams can carry revision information, integrity checks, and other provenance information. By not distinguishing between &#8220;digital&#8221; assets (for which the data, e.g. the media files, are available electronically) and other kinds of assets (physical tapes, abstract entities, etc.), an asset management system can encompass the full range of materials within an active media archive.</p>
<p>Digital objects can be assigned content model types, which stipulate the required (and optional) component datastreams, as well as define the services that operate on objects of that type. These content types are simply structured digital objects within the repository, allowing repository managers (and content creators, given a sufficient interface) to define the structure of their content rather than structuring their content to meet the needs of the digital asset management system.</p>
<p>Types of datastreams natively supported include Inline XML datastreams, Managed Content, Externally Referenced Content, and Redirects. The datastream types do not speak to the format of content stored within them (except for inline XML), which allows content creators to easily provide content to the repository without first worrying about transcoding materials or other barriers to accessioning content (which is certainly not to say that standardizing content types archived within the repository is problematic &#8212; just that it shouldn&#8217;t interfere with getting the materials in the first place). This variety of types allows content to be stored and managed in the most appropriate places, rather than arbitrarily requiring centralization or &#8220;physical&#8221; ownership of content. Within a distributed organization like public broadcasting, this could be a powerful concept that allows content creators to control and manage their content at various stages of distribution (and, while this could be accomplished within traditional database driven systems, it would require custom application logic to do, which is likely not scalable across a wide variety of applications, frameworks, and languages). </p>
<p>While all datastreams are equal, there are four (or more?) that are more equal than others:</p>
<p>- AUDIT, which stores the history of the digital object as it is modified.</p>
<p>- DC, a Qualified Dublin Core datastream, that provides a minimal level of interoperability for the most generic of repository management interfaces. This is also the only fundamentally required datastream (without specifying required elements within it), and really is the bare minimum of information necessary to assert the existence of an object (if it doesn&#8217;t have a title, identifier, or description, what is it we&#8217;re talking about exactly?)</p>
<p>- RELS-EXT (and INT), an RDF-XML datastream in which one can assert relationships to other digital objects (which may exist within the repository, but may also exist (or not exist) elsewhere). These relationships can be from any vocabulary and reference any type of object, which is handy when you are dealing with complex relationships between media archive assets. This datastream is also generally indexed in an RDF triple-store to provide relationship querying.</p>
<p>- POLICY, which stores XACML security policies for the digital object. These can be used to restrict access to the datastreams, services, or the object itself, based on whatever the security needs are. Within the digital asset management context, this could also be used to restrict access to only the media files while still providing the metadata (so one could assert and describe the existence of an object without actually sharing it for whatever reason, which seems atypical for some commercial solutions).</p>
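<p>The object model described above can be sketched as a small in-memory structure. This is only an illustration, not Fedora&#8217;s own implementation: the <code>DigitalObject</code> and <code>Datastream</code> names are hypothetical, and the single-letter control group codes (X, M, E, R) are what I understand Fedora 3.x to use for the four datastream types, so treat them as an assumption.</p>

```ruby
# Hypothetical in-memory model of a Fedora digital object, for illustration.
# The control group codes are assumed to match Fedora 3.x conventions.
CONTROL_GROUPS = {
  'X' => 'Inline XML',
  'M' => 'Managed Content',
  'E' => 'Externally Referenced Content',
  'R' => 'Redirect'
}.freeze

# Datastream identifiers with special meaning to the repository.
RESERVED_DSIDS = ['AUDIT', 'DC', 'RELS-EXT', 'RELS-INT', 'POLICY'].freeze

Datastream = Struct.new(:dsid, :control_group, :mime_type, :location) do
  # True for the datastreams that are "more equal than others".
  def reserved?
    RESERVED_DSIDS.include?(dsid)
  end

  # Human-readable name of the datastream type.
  def kind
    CONTROL_GROUPS.fetch(control_group)
  end
end

class DigitalObject
  attr_reader :pid, :datastreams

  def initialize(pid)
    @pid = pid
    @datastreams = {}
  end

  # Adds or replaces a datastream; raises KeyError for an unknown type.
  def add(datastream)
    CONTROL_GROUPS.fetch(datastream.control_group)
    @datastreams[datastream.dsid] = datastream
    self
  end

  # DC is the only datastream described here as fundamentally required.
  def valid?
    @datastreams.key?('DC')
  end
end
```

<p>Note that the type of a datastream says where the bytes live, not what format they are in, which is why a tape described only by metadata and a managed video file can sit side by side in the same object model.</p>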
<p>By default, these datastreams (and the digital object wrapper) are stored on the file system in relatively comprehensible ways, which is a bonus to implementors who can set up underlying hardware or other technology in traditional ways and just begin to use the software without too much fuss. There is ongoing development to build in support for additional and evolving standards around digital object storage, serialization, access, and other services which should only help with making the process as transparent as possible.</p>
<p>All of this technology and flexibility comes &#8220;free&#8221; with the repository architecture and doesn&#8217;t try to interfere with actually making use of the assets (except as restricted by security policies, of course), which allows different use cases to be expressed in the most logical and straightforward way (rather than trying to bend the use cases or system in an attempt to mimic some of the elements the user needs). As a starting point for developing a digital asset management solution for media, I believe it offers a good balance of flexibility and requirements that can ensure user needs are met without sacrificing durability.</p>
<p>So, how can Fedora be applied in a digital asset management context for public broadcasting? First and foremost, Fedora provides a trusted platform for managing and maintaining content for many different contexts (production, long-term archiving, etc) on top of a variety of hardware and standards. By managing metadata and data together, physical and digital assets can be revealed in a common interface (when appropriate) to meet the needs of researchers and scholars (for whom the knowledge of the existence of the asset is more essential than on-demand access). Finally, by offering a stable API to a variety of resources, use-case driven interfaces can be developed, shared, and maintained to meet different needs sensibly.</p>
]]></content:encoded>
			<wfw:commentRss>http://authoritativeopinion.com/blog/2010/05/04/digital-asset-management-for-public-broadcasting-fedora-commons-repository-part-1-of/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Digital Asset Management for Public Broadcasting (Part 0 of ?)</title>
		<link>http://authoritativeopinion.com/blog/2010/05/03/digital-asset-management-for-public-broadcasting-part-0-of/</link>
		<comments>http://authoritativeopinion.com/blog/2010/05/03/digital-asset-management-for-public-broadcasting-part-0-of/#comments</comments>
		<pubDate>Mon, 03 May 2010 23:52:33 +0000</pubDate>
		<dc:creator>chris</dc:creator>
				<category><![CDATA[Repository]]></category>
		<category><![CDATA[TODO]]></category>
		<category><![CDATA[digital asset management]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[public broadcasting]]></category>

		<guid isPermaLink="false">http://authoritativeopinion.com/blog/?p=328</guid>
		<description><![CDATA[Digital asset management is hard. Many people have solved many parts of the problem, but for a reasonably complex use-case, many of the existing solutions just aren&#8217;t there yet, especially within a vendor-driven world for a niche market within a niche market, which is concerned with all levels and life-cycles of an asset (from production, [...]]]></description>
			<content:encoded><![CDATA[<p>Digital asset management is hard. Many people have solved many parts of the problem, but for a reasonably complex use-case, many of the existing solutions just aren&#8217;t there yet. This is especially true within a vendor-driven world serving a niche market within a niche market, one that is concerned with all levels and life-cycles of an asset (from production, to reuse, to archiving and back again) and is almost certainly not profitable given public broadcasting budgets. I believe this is an ideal area for the development of open source solutions based on some existing works of open source software.</p>
<p>The &#8220;easy&#8221; part in the DAM ecosystem, I would argue, is archiving the material and ensuring its long-term preservation (and accessibility!). I&#8217;ve done a couple projects and prototypes now based on the <a href="http://fedora-commons.org">Fedora Commons</a> repository architecture, and it seems to be a promising platform for this kind of development. Objects and datastreams are stored on the file-system, which IT staff are traditionally prepared to manage (vs some unique database structure almost certainly obfuscated in layers of (de-)normalization). Fedora will happily manage security policies, object relationships, data transformation services, and (shortly) more advanced file system interactions, while exposing a (relatively) consistent HTTP interface.</p>
<p>Discovery interfaces are probably the next easiest piece, having been examined and developed out of the information sciences communities. Using a combination like Solr and Blacklight (deployed successfully for WGBH&#8217;s <a href="http://openvault.wgbh.org">Open Vault</a> website), one can rapidly create interfaces to the underlying content that satisfy the many use cases. With Solr, you get a bunch of discovery mechanisms and options, including relevancy, term highlighting, faceting, etc.</p>
<p>From here, we start getting into the hard parts. Ingest and metadata editing are difficult to solve well in a content- and use-case-agnostic way, which is the approach most systems seem to take. While the need for a generic asset management view is important (and solved!), if the collection of services fails to meet the needs of the users, encouraging adoption (nicely) is problematic. By using infrastructure elements with open and well-documented APIs, developers can extend and customize the user experiences to match the underlying data and processes. This is an area in which the adoption and support of open source projects can encourage sustainable development of these interfaces.</p>
<p>It seems like, after clearing these obstacles, many systems fail to account for the use and re-use of these objects within the media communities. Few systems account for batch encoding video and audio for web distribution, one-click publishing systems to blogs, social networking sites, or video portals, integration into broadcasting chains, etc &#8212; for very good reasons, there simply isn&#8217;t the incentive when faced with large upfront development costs for unique development. Given an open source platform, however, that supports (and encourages) sharable development of solutions, maybe we could start finding answers to these persistent problems (without re-inventing the wheel!).</p>
<p>I believe most of the core infrastructure pieces are there:<br />
- Fedora, as I mentioned, which provides preservation and management services;<br />
- Solr, which provides a discovery framework (and associated metadata extraction utilities like Tika);<br />
- Blacklight, which provides discovery and access services;<br />
- ESB or other workflow solutions like Camel, Ruote, or otherwise;<br />
- Generic metadata editing options, like XForms, Django, etc;<br />
- Open standards that allow for publishing and reuse (Atom, MediaRSS, RDF, ???);<br />
- FFmpeg, which offers encoding and transcoding services.</p>
<p>It isn&#8217;t an extensive development problem. These are well-established communities in their fields; it&#8217;s a simple matter of getting initial momentum in tying the complex pieces together and creating interesting and useful services on top.</p>
<p>So, why aren&#8217;t we doing this? Money, time, lack of a collaborative/communicative culture, and apathy toward (and acceptance of) second-rate, buggy commercial solutions that fail to address all aspects of a media object&#8217;s life cycle as it goes from the rapid iterations in production to many different distribution channels and back to relative obscurity in an archival context (until a new production pulls it out again). Without full support, no step in the process can realize the potential of the content or have the incentive to put in the hard work to ingest and describe the asset.</p>
]]></content:encoded>
			<wfw:commentRss>http://authoritativeopinion.com/blog/2010/05/03/digital-asset-management-for-public-broadcasting-part-0-of/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Fedora and Microservices</title>
		<link>http://authoritativeopinion.com/blog/2010/03/04/fedora-and-microservices/</link>
		<comments>http://authoritativeopinion.com/blog/2010/03/04/fedora-and-microservices/#comments</comments>
		<pubDate>Fri, 05 Mar 2010 00:38:49 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://authoritativeopinion.com/blog/2010/03/04/fedora-and-microservices/">chris</span></dc:creator>
				<category><![CDATA[Repository]]></category>
		<category><![CDATA[digital repositories]]></category>
		<category><![CDATA[fedora]]></category>
		<category><![CDATA[microservices]]></category>

		<guid isPermaLink="false">http://authoritativeopinion.com/blog/?p=300</guid>
		<description><![CDATA[In this post, I want to discuss repository architecture philosophies, although I will focus primarily on Fedora and California Digital Library microservices, there are some generalizations one can pull out of this. It would also be interesting to pull in some very different repository models, like iRODS or a triple-store-backed system, but that&#8217;s outside of [...]]]></description>
			<content:encoded><![CDATA[<p>In this post, I want to discuss repository architecture philosophies. Although I will focus primarily on Fedora and California Digital Library microservices, there are some generalizations one can draw from this. It would also be interesting to pull in some very different repository models, like iRODS or a triple-store-backed system, but that&#8217;s outside of my expertise.</p>
<h3>The basics</h3>
<p>This is not a section I really want to write, but I don&#8217;t know of a high-level answer to &#8220;when we say repository, this is what we mean&#8221;. I spent a little time looking around for a summary, but more often than not I found more questions (or technology-based answers rather than use-driven ones, which may be more useful but are inappropriate for my purposes), so I&#8217;ve taken a stab at addressing what I believe are some key issues:</p>
<p>Repositories are a collection of services, with well-defined interfaces, for storing and managing data (both content and metadata) in a format-neutral, display-independent manner. Repositories can be used as preservation repositories, as access repositories, as centralized aggregations of far-flung data, etc., and operate on any scale for any audience. Furthermore, there are existing standards and agreements about what it means to be a certain type of repository (TDR, OAIS, etc). All of these repositories, however, share some common services, whether implemented as software, external processes, or manual processes.</p>
<p>Some essential repository services are:</p>
<ul>
<li>Identifier services, which may include assignment + registration</li>
<li>Storage services (although the content stored may be only pointers to the &#8220;actual&#8221; content)</li>
<li>Content identification, matching identifiers to content items</li>
<li>Ingest workflows</li>
<li>Access mechanisms</li>
</ul>
<p>Without these services in place, a repository system would face some difficult obstacles in creating and providing value-added services. Repositories may provide multiple flavors of these services, some of which may be defined in generally accepted standards, models, and specifications.</p>
<p>Other basic services, which operate on top of the services above and are fairly common in well-developed repository frameworks, include:</p>
<ul>
<li>Dissemination services, to transform repository data into other forms + formats</li>
<li>Authorization services</li>
</ul>
<p>More advanced services may include:</p>
<ul>
<li>preservation services, including checksum (generation + verification), file format migration, support for models like LOCKSS</li>
<li>relationship services, using an RDF triplestore or similar, offering SPARQL endpoints, inferencing, etc</li>
<li>discovery services, using Lucene/Solr/etc, to provide relevancy, optimized user experience, drill-down faceting</li>
</ul>
<p>These more advanced services are likely separate applications in the repository ecosystem and are generally useful utilities independent of any repository system. Repositories generally integrate with these external applications in a modular, mix-and-match manner using well-defined interfaces.</p>
<h3>Fedora</h3>
<p>One approach to repository services is the &#8220;repository-in-a-box&#8221; model, where you can install and configure a base set of services provided by a single application. Within this group of services, Fedora provides a very basic implementation of the core repository services (vs a full-stack application like DSpace, which provides production-ready user interfaces). Fedora bills itself as a Flexible, Extensible Digital Object Repository Architecture. Out of the box, it provides:</p>
<ul>
<li>Identifier services, through PIDGen which provides sequential identifiers per-namespace</li>
<li>Maps HTTP URIs to dereferenceable URIs for files</li>
<li>REST + SOAP APIs for Ingest + Delivery</li>
<li>Dissemination services using WSDL</li>
<li>Authorization using XACML (and authentication using a number of plugins)</li>
<li>Integrates with the Mulgara triplestore and a Lucene index (by default)</li>
</ul>
<p>Fedora provides many opportunities for customization and enhancement through custom development:</p>
<ul>
<li>the Fedora REST, SOAP, and triple-store APIs allow developers to build  on top of low-level services, which may include access interfaces, administrative interfaces, or otherwise</li>
<li>the Fedora application publishes Java Message Service (JMS) events when objects within the repository are created, deleted, or modified, and developers can build applications that listen to these events and trigger actions (Shelver &lt;<a href="http://yourmediashelf.com/blog/2010/03/01/blacklight-activefedora-and-shelver-interplay-between-searching-managing-and-indexing-in-a-repository-solution/">http://yourmediashelf.com/blog/2010/03/01/blacklight-activefedora-and-shelver-interplay-between-searching-managing-and-indexing-in-a-repository-solution/</a>&gt;, fedora-workflow &lt;<a href="http://github.com/cbeer/fedora-workflow">http://github.com/cbeer/fedora-workflow</a>&gt;, etc)</li>
<li>the Fedora application is built modularly, and Java developers are able to develop and use components as needed, if they conform to the Fedora interfaces</li>
</ul>
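<p>As a rough illustration of the first point, here is a minimal Ruby sketch of talking to the REST API. The base URL, the demo:1 PID, and the helper names are my own placeholders, and the paths follow the Fedora 3.x REST API as I understand it, so check them against your own install:</p>
<pre name="code" class="ruby">
require 'net/http'
require 'uri'

# Assumed location of a local Fedora 3.x webapp; adjust for your install.
FEDORA_BASE = 'http://localhost:8080/fedora'

# URI for an object, e.g. /objects/demo:1
def object_uri(pid, base = FEDORA_BASE)
  URI.parse("#{base}/objects/#{pid}")
end

# URI for the content of one datastream,
# e.g. /objects/demo:1/datastreams/DC/content
def datastream_content_uri(pid, dsid, base = FEDORA_BASE)
  URI.parse("#{base}/objects/#{pid}/datastreams/#{dsid}/content")
end

# Fetch a datastream body, or nil when the server does not answer 200 OK.
# This needs a running repository, so it is not called here.
def fetch_datastream(pid, dsid)
  response = Net::HTTP.get_response(datastream_content_uri(pid, dsid))
  response.is_a?(Net::HTTPOK) ? response.body : nil
end

puts datastream_content_uri('demo:1', 'DC')
</pre>
<p>Anything built this way (an access interface, an indexer, an admin tool) only depends on HTTP, which is a large part of why the API layer is a comfortable place to start customizing.</p>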
<p>As services go beyond the basic, common applications present in institutional repositories, enhanced repository services require custom development or supplemental services outside of the repository services. For most, this includes integration with a more advanced search provider (like Solr). At some point,  additional services can blur the lines between the repository services and front-end user interfaces (which have to respond to local customization to meet user needs).</p>
<p>Repository-independent services, or third-party services, require some wrapper to make them interoperable with the Fedora APIs, which makes integration with existing technology more difficult. Even DuraSpace&#8217;s DuraCloud offering is (currently) built as separate services with some possibility of storage-level integration. Preservation support services will bypass the repository APIs and provide those services against the file system instead.</p>
<p>Considering the services Fedora doesn&#8217;t provide or the obstacles Fedora creates in integration, many ask why they should start using Fedora anyway. The strongest response to this, I believe, is that it provides a common structure to basic repository services, while at the same time not creating major obstacles to future expansion or migration outside Fedora. Out of the box, Fedora provides a set of &#8220;training wheels&#8221; (ht Mike Giarlo &lt;<a href="http://lackoftalent.org/michael/blog/">http://lackoftalent.org/michael/blog/</a>&gt;) for repository services development that can be removed when unnecessary, but in the meantime offers structure for the creation of new repositories and support for repository services as needed.</p>
<h3>CDL Microservices</h3>
<p>Another approach to repository services is &#8220;microservices&#8221; like those designed by the California Digital Library (CDL). These provide standards and specifications for individual repository services, which form a structure for standardized, mix-and-match repository services that can integrate, interoperate, and take advantage of existing technology independent of a repository application like Fedora. This, conceivably, allows all domain developers to take advantage of these common projects without using a specific technology. CDL provides microservices specifications for:</p>
<ul>
<li>identifier assignment + registration, using NOID, which can act as a CLI tool or a CGI service</li>
<li>file-system structures, using the Pairtree convention</li>
<li>data exchange and verification, using BagIt</li>
<li>access standards, using the ARK URL format</li>
</ul>
<p>The standards are developed in line with the &#8220;UNIX philosophy&#8221;:</p>
<blockquote><p>  Write programs that do one thing and do it well. Write programs to work together.  &#8212; Doug McIlroy
</p></blockquote>
<p>These basic services can be organized and crafted using the existing capabilities in web servers, file systems, etc. More advanced services can act within this structure, using individual standards when needed. While significant development and customization may be required to get a microservices architecture to a usable state, the end result is more flexible and targeted to an institution&#8217;s needs.</p>
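<p>To give a sense of how small these specifications are, here is a rough Ruby sketch of the BagIt layout as I understand it: a bagit.txt declaration, payload files under data/, and a manifest with one checksum per payload file. The method names are my own, and the SHA-256 manifest and version string are choices made for illustration rather than anything CDL prescribes:</p>
<pre name="code" class="ruby">
require 'digest'
require 'fileutils'
require 'tmpdir'

MANIFEST = 'manifest-sha256.txt'

# Write a minimal bag from a hash of { file name => content }.
def write_bag(bag_dir, payload)
  FileUtils.mkdir_p(File.join(bag_dir, 'data'))
  File.write(File.join(bag_dir, 'bagit.txt'),
             "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
  lines = payload.map do |name, content|
    File.write(File.join(bag_dir, 'data', name), content)
    "#{Digest::SHA256.hexdigest(content)}  data/#{name}\n"
  end
  File.write(File.join(bag_dir, MANIFEST), lines.join)
end

# Re-compute each checksum in the manifest and return the payload paths
# that are missing or no longer match.
def invalid_payload_files(bag_dir)
  entries = File.readlines(File.join(bag_dir, MANIFEST), chomp: true)
  failed = entries.map { |line| line.split(/\s+/, 2) }.reject do |checksum, path|
    file = File.join(bag_dir, path)
    File.file?(file) ? Digest::SHA256.file(file).hexdigest == checksum : false
  end
  failed.map { |_checksum, path| path }
end

Dir.mktmpdir do |dir|
  write_bag(dir, { 'master.txt' => 'payload bytes' })
  puts "invalid files: #{invalid_payload_files(dir).inspect}"
end
</pre>
<p>Nothing here needs a repository application: the bag is plain files and plain text, so any tool that can read a directory can verify it.</p>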
<h3>Flexing Fedora</h3>
<p>These two approaches are certainly not incompatible, and Fedora is quite capable of using some of these micro-services standards under the hood (replacing custom developed approaches to these basic services). By taking this approach, Fedora could act as a management application on top of generic repository data, allow both Fedora-based and microservices-based services to operate on the data, and make it easier to reach around Fedora when necessary (or, go so far as to remove it entirely).</p>
<p>What follows is a short summary of on-going work in this area, which mostly focuses on removing the Fedora-centric definitions of /how/ or /where/ services act. The majority of these ideas build on new developments and best practices in the repository community (since Fedora was initially created) as a result of increased adoption or awareness of issues. Where available, I&#8217;ve included links to projects in-the-works.</p>
<p>Some of this work is quite easy to do:</p>
<ul>
<li>integration of NOID identifier services by creating a web-services consumer for Fedora identifier assignment &lt;<a href="http://gist.github.com/273584">http://gist.github.com/273584</a>&gt;</li>
<li>replacing the custom, timestamp-hash file store with a Pairtree structure (the prototype is limited, however, by Fedora&#8217;s hard-coded distinction between object and datastream filestores &lt;<a href="http://gist.github.com/280020">http://gist.github.com/280020</a>&gt;)</li>
<li>using Memento HTTP headers to provide versioning &lt;<a href="http://www.fedora-commons.org/jira/browse/FCREPO-604">http://www.fedora-commons.org/jira/browse/FCREPO-604</a>&gt;</li>
</ul>
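<p>For the Memento item, the client side is little more than one extra request header. A minimal sketch, assuming a datastream URL would honor Memento&#8217;s Accept-Datetime header (the URL and the method name are hypothetical, and how Fedora would actually respond depends on how FCREPO-604 turns out):</p>
<pre name="code" class="ruby">
require 'net/http'
require 'time'
require 'uri'

# Build (but do not send) a GET request asking for a resource as it
# existed at the given time. Accept-Datetime takes an HTTP-date in GMT.
def memento_request(url, as_of)
  request = Net::HTTP::Get.new(URI.parse(url))
  request['Accept-Datetime'] = as_of.getutc.httpdate
  request
end

request = memento_request(
  'http://localhost:8080/fedora/objects/demo:1/datastreams/DC/content',
  Time.utc(2010, 1, 18, 14, 13, 5)
)
puts request['Accept-Datetime']
</pre>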
<p>Other projects are more involved and require more work than just creating new modules for Fedora:</p>
<ul>
<li>BagIt and SWORD ingest and dissemination options to replace the custom Atom structure &lt;<a href="http://fedora-commons.org/confluence/display/FCSVCS/SWORD-Fedora+1.2">http://fedora-commons.org/confluence/display/FCSVCS/SWORD-Fedora+1.2</a>&gt;</li>
<li>Integration of arbitrary ingest of structured data (perhaps similar to CDL&#8217;s 7train &lt;<a href="http://seventrain.sourceforge.net/">http://seventrain.sourceforge.net/</a>&gt;?)</li>
<li>Pluggable authn/authz through the FESL project, where JAAS should provide a pluggable authentication backend &lt;<a href="http://www.fedora-commons.org/confluence/display/DEV/Fedora+Enhanced+Security+Layer">http://www.fedora-commons.org/confluence/display/DEV/Fedora+Enhanced+Security+Layer</a>&gt;</li>
<li>support for arbitrary RDF metadata, forget RELS-EXT/RELS-INT &#8212; force that kind of decision into a disseminator and use a seamless API to pull back RDF triples (/object/{pid}/relationships) &lt;<a href="http://www.fedora-commons.org/confluence/display/DEV/Supporting+the+Semantic+Web+and+Linked+Data">http://www.fedora-commons.org/confluence/display/DEV/Supporting+the+Semantic+Web+and+Linked+Data</a>&gt;</li>
</ul>
<p>More advanced microservices integration is highly involved and would require a major re-work of the application:</p>
<ul>
<li>Two-way messaging queues (or file alteration monitors, or database update hooks) to allow Fedora to receive updates</li>
<li>decreased reliance on self-generated registries; I think the situation is getting better, but I&#8217;m not sure it&#8217;s fully there</li>
<li>pluggable storage modules with intelligent filtering, routing, multiplexing, and rules mechanisms &#8212; the Akubra project may be doing (part of?) this &lt;<a href="http://www.fedora-commons.org/confluence/display/AKUBRA/Akubra+Project">http://www.fedora-commons.org/confluence/display/AKUBRA/Akubra+Project</a>&gt;</li>
<li>workflow support hooks, to allow integration and automation of workflow tools (possibly a result of Hydra?)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://authoritativeopinion.com/blog/2010/03/04/fedora-and-microservices/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>A Fedora in a Pairtree</title>
		<link>http://authoritativeopinion.com/blog/2010/01/18/a-fedora-in-a-pairtree/</link>
		<comments>http://authoritativeopinion.com/blog/2010/01/18/a-fedora-in-a-pairtree/#comments</comments>
		<pubDate>Mon, 18 Jan 2010 14:13:05 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://authoritativeopinion.com/blog/2010/01/18/a-fedora-in-a-pairtree/">chris</span></dc:creator>
				<category><![CDATA[Experiments]]></category>
		<category><![CDATA[Repository]]></category>
		<category><![CDATA[cdl]]></category>
		<category><![CDATA[digital library]]></category>
		<category><![CDATA[fedora]]></category>
		<category><![CDATA[micro-services]]></category>

		<guid isPermaLink="false">http://authoritativeopinion.com/blog/?p=289</guid>
		<description><![CDATA[The California Digital Library (CDL) has released a number of exciting micro-services specifications for digital libraries. The Fedora repository from DuraSpace takes an opposite approach and has a monolithic application composed of a number of modules. With the modular approach, it should be possible to slip micro-services under the hood of Fedora easily. Here is [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.cdlib.org/inside/diglib/">California Digital Library</a> (CDL) has released a number of exciting micro-services specifications for digital libraries. The <a href="http://fedora-commons.org/">Fedora</a> repository from DuraSpace takes an opposite approach and has a monolithic application composed of a number of modules. With the modular approach, it should be possible to slip micro-services under the hood of Fedora easily.</p>
<p>Here is a first attempt at implementing the <a href="http://www.cdlib.org/inside/diglib/pairtree/pairtreespec.html">Pairtree filesystem hierarchy</a> for Fedora:</p>
<pre name="code" class="java">
package fedora.server.storage.lowlevel;

import java.io.File;
import java.util.Map;

import fedora.server.errors.LowlevelStorageException;

/**
 * Maps Fedora PIDs onto a Pairtree directory hierarchy.
 *
 * @author Chris Beer
 */
class PairtreePathAlgorithm
        extends PathAlgorithm {

    private final String storeBase;

    private static final String SEP = File.separator;

    public PairtreePathAlgorithm(Map<String, ?> configuration) {
        super(configuration);
        storeBase = (String) configuration.get("storeBase");
    }

    @Override
    public final String get(String pid) throws LowlevelStorageException {
        return format(pid);
    }

    public String format(String pid) throws LowlevelStorageException {
        String pt = toPairtree(pid);
        return storeBase + pt + "obj" + SEP + pid;
    }

    // Split the escaped PID into two-character directory names; an
    // odd-length PID ends in a single-character directory.
    private String toPairtree(String s) {
        StringBuilder pt = new StringBuilder(SEP);
        String src = escape(s);

        int i = 0;
        // Only take full pairs here, so substring never runs past the end.
        while (i + 2 <= src.length()) {
            pt.append(src, i, i + 2).append(SEP);
            i += 2;
        }

        // Leftover character: it still needs its own directory level.
        if (i < src.length()) {
            pt.append(src.substring(i)).append(SEP);
        }

        return pt.toString();
    }

    private String escape(String s) {
        /*
         Fedora PIDs do not support non-visible ASCII or the characters below,
         so we skip hex encoding:
         "   hex 22           <   hex 3c           ?   hex 3f
         *   hex 2a           =   hex 3d           ^   hex 5e
         +   hex 2b           >   hex 3e           |   hex 7c
         ,   hex 2c
         */
        // Single-character substitutions from the Pairtree spec:
        // "/" becomes "=", ":" becomes "+", "." becomes ",".
        return s.replace("/", "=").replace(":", "+").replace(".", ",");
    }
}
</pre>
<p>See also: <a href="http://gist.github.com/280020">http://gist.github.com/280020</a></p>
<p>This basic service replaces the Timestamp Path algorithm for FOXML storage and creates a minimally compliant Pairtree. A better implementation could add:</p>
<ul>
<li>Splitting Fedora datastreams into individual files on the filesystem. A first step would be to implement an appropriate managed content mapper</li>
<li>Add the appropriate identifier cleaning specified in §3. Much of this was omitted in this implementation, with the assumption that the repository core would handle identifier validation</li>
<li>The implementation should support pairtree initialization (§4). The current assumption is the repository maintainer would pre-establish a pairtree hierarchy for Fedora to populate. To do this properly, I think one would need to override the DefaultLowlevelStorageModule to add an initialization step.</li>
</ul>
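<p>For the second item, the identifier cleaning is small enough to sketch in full. This is my reading of §3 in Ruby rather than a drop-in for the Java class above, and the constant and method names are my own: bytes outside visible ASCII, plus a short list of reserved characters, are hex-encoded as ^hh, and then three single-character substitutions are applied before the string is split into pairs.</p>
<pre name="code" class="ruby">
# Reserved bytes that get hex-encoded: double quote, asterisk, plus, comma,
# less-than, equals, greater-than, question mark, backslash, caret, pipe.
HEX_ENCODED_BYTES = [0x22, 0x2a, 0x2b, 0x2c, 0x3c, 0x3d,
                     0x3e, 0x3f, 0x5c, 0x5e, 0x7c].freeze
VISIBLE_ASCII = (0x21..0x7e).freeze

# Step 1: hex-encode reserved and non-visible bytes as ^hh.
# Step 2: substitute "/" with "=", ":" with "+", and "." with ",".
def pairtree_clean(id)
  hex_encoded = id.bytes.map do |byte|
    keep = VISIBLE_ASCII.cover?(byte) ? !HEX_ENCODED_BYTES.include?(byte) : false
    keep ? byte.chr : format('^%02x', byte)
  end.join
  hex_encoded.tr('/:.', '=+,')
end

# Split the cleaned identifier into two-character directory names,
# with a one-character directory at the end for odd lengths.
def pairtree_path(id)
  pairtree_clean(id).scan(/.{1,2}/).join('/')
end

puts pairtree_path('demo:123')
puts pairtree_path('ark:/13030/xt12t3')
</pre>
<p>The second call uses the ARK from the Pairtree specification&#8217;s own example, which makes it a handy check against the spec when porting this into the path algorithm.</p>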
]]></content:encoded>
			<wfw:commentRss>http://authoritativeopinion.com/blog/2010/01/18/a-fedora-in-a-pairtree/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fedora, Blacklight, and Ruby on Rails</title>
		<link>http://authoritativeopinion.com/blog/2009/10/04/fedora-blacklight-and-ruby-on-rails/</link>
		<comments>http://authoritativeopinion.com/blog/2009/10/04/fedora-blacklight-and-ruby-on-rails/#comments</comments>
		<pubDate>Sun, 04 Oct 2009 15:10:34 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://authoritativeopinion.com/blog/2009/10/04/fedora-blacklight-and-ruby-on-rails/">chris</span></dc:creator>
				<category><![CDATA[Repository]]></category>

		<guid isPermaLink="false">http://authoritativeopinion.com/blog/?p=211</guid>
		<description><![CDATA[I&#8217;ve been playing with Blacklight, a catalog interface built on solr, this weekend with fairly positive results. After some initial frustration trying to figure out the demo data, I switched gears and connected Blacklight to my own solr data source, populated by a Fedora repository. Two initial kinks here were: The unique identifier field `id` [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been playing with <a href="http://projectblacklight.org/">Blacklight</a>, a catalog interface built on Solr, this weekend with fairly positive results. After some initial frustration trying to figure out the demo data, I switched gears and connected Blacklight to my own Solr data source, populated by a Fedora repository.</p>
<p>Two initial kinks here were:</p>
<ul>
<li>The unique identifier field `id` is hard-coded into Blacklight, while my existing data used the field name `PID`; see <a href="http://jira.projectblacklight.org/jira/browse/CODEBASE-171">CODEBASE-171</a></li>
<li>The unique identifiers in my repository began with a qualified namespace in the form &#8220;org.example.repository&#8221;, which broke the Ruby on Rails default routing system</li>
</ul>
<p>My quick fix for the routing issue was to change the formatting requirements for the id field in the router, so my resource map now looks like:</p>
<pre class="ruby" name="code">

  map.resources(:catalog,
    :only => [:index, :show, :update],
  [&hellip;]
    :requirements => { :id => /([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|~|_|(%[0-9A-F]{2}))+/ }
  )
</pre>
<p>The regular expression is a copy of the Fedora PID regular expression, but I&#8217;ve disallowed periods in the identifier name (but they are still legal in the namespace, which I imagine is common practice).</p>
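<p>A quick way to sanity-check a pattern like this outside of Rails is to try it against a few sample identifiers. This is only a sketch: the sample PIDs and the routable_pid? helper are made up, and it assumes the router matches the requirement against the whole `id` segment, which I emulate with explicit anchors:</p>
<pre name="code" class="ruby">
# The :id requirement from the route above.
PID_REQUIREMENT = /([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|~|_|(%[0-9A-F]{2}))+/

# Anchor the pattern so it has to cover the entire identifier.
def routable_pid?(pid)
  /\A#{PID_REQUIREMENT}\z/.match?(pid)
end

samples = ['org.example.repository:1234', 'demo:a%3Ab', 'demo:12.34', 'no-namespace']
samples.each do |pid|
  puts "#{pid} routable? #{routable_pid?(pid)}"
end
</pre>
<p>The first two samples pass, while the identifier with a period after the colon and the one without a namespace are rejected, which is the behavior described above.</p>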
<p>There is still a fair bit of work hooking in object views, but the catalog + discovery portions were quickly and easily done.</p>
]]></content:encoded>
			<wfw:commentRss>http://authoritativeopinion.com/blog/2009/10/04/fedora-blacklight-and-ruby-on-rails/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
