Tuesday, March 15, 2016

Sitecore: Extract Indexed Content of Media Files using MediaItemContentExtractor

This is a follow-up to my previous post on indexing associated content.

Here is a common scenario:
Your custom index configuration is set up to crawl all the content for your website, which is then used by your site search (keyword search) to fetch search results. In addition to your content item crawlers, you add a crawler for Media Library items as well, and Sitecore does a great job of indexing PDF, DOCX, DOC, etc. files automatically, provided you have a valid IFilter installed. Your search now returns file items as search results too.

Now consider the following scenario:
One of the lookup fields on your page points to a file in the media library and the new requirement is to show the page item in the search result when the search phrase matches the content in the associated file.

Solution (Lucene & Solr): Create a computed field called "related_content" that stores the crawled content of the associated file, and extend the query to search both the "_content" field (in Solr; I think it's simply "content" in Lucene) and the "related_content" field.

Here is the code:

        
using System;
using System.Text;
using System.Xml;
using OConnell.Domain.Models.OConnell.Intranet.Components;
using OConnell.SC.Extensions;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data.Items;
using Sitecore.Diagnostics;

namespace OConnell.SC.Search.ComputedFields
{
    public class RelatedContent : IComputedIndexField
    {
        public string FieldName { get; set; }
        public string ReturnType { get; set; }

        // PDF, File, Docx, Document, Doc template IDs (Unversioned).
        // Declared at class level: a field cannot be declared inside a method.
        private readonly string _mediaItemTemplates =
            "{0603F166-35B8-469F-8123-E8D87BEDC171}|{962B53C4-F93B-4DF9-9821-415C867B8903}|{7BB0411F-50CD-4C21-AD8F-1FCDE7C3AFFE}|{777F0C76-D712-46EA-9F40-371ACDA18A1C}|{16692733-9A61-45E6-B0D4-4C0C06F8DD3C}";

        public object ComputeFieldValue(IIndexable indexable)
        {
            Item item = indexable as SitecoreIndexableItem;
            if (item == null)
            {
                Log.Debug(string.Concat("Rejected item at path: ", indexable.AbsolutePath));
                return null;
            }
            Log.Debug(string.Concat("Getting related content for item at path: ", indexable.AbsolutePath));
            var sb = new StringBuilder();

            try
            {
                //change the field name to your lookup field
                var mediaItem = item.Database.GetItem(item.Fields["Related File"].Value);
                if (mediaItem != null && _mediaItemTemplates.Contains(mediaItem.TemplateID.ToString()))
                {
                    var indexedContent = GetFileContent(mediaItem);
                    // GetFileContent never returns null, so no fallback is needed
                    sb.Append(indexedContent);
                }
            }
            catch (Exception ex)
            {
                // Pass the exception itself so the stack trace ends up in the log
                Log.Error(ex.Message, ex, this);
            }
           
            return string.IsNullOrEmpty(sb.ToString()) ? null : sb.ToString();
        }

        private string GetFileContent(SitecoreIndexableItem indexableMediaItem)
        {
            XmlNode configurationNode =
                Sitecore.Configuration.Factory.GetConfigNode(
                    "contentSearch/indexConfigurations/defaultSolrIndexConfiguration/mediaIndexing");

            //MediaItemContentExtractor expects the full XML
            //including the "mediaIndexing" node, and GetConfigNode seems to
            //be omitting the parent node, hence loading it as XML before
            //passing it to MediaItemContentExtractor
            var xmlDocument = new XmlDocument();
            xmlDocument.LoadXml(configurationNode.OuterXml);
            var extractor = new MediaItemContentExtractor(xmlDocument);
            var indexedContent = extractor.ComputeFieldValue(indexableMediaItem);
            return indexedContent == null ? string.Empty : indexedContent.ToString();
        }
    }
}

Don't forget to add the configuration element for your computed field, using the full type name and assembly of your own class:

 
<fields hint="raw:AddComputedIndexField">
    <field fieldName="related_content">Sitecore.SharedSource.ComputedFields.RelatedContent, Sitecore.SharedSource</field>
</fields>
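
With the computed field indexed, the query side still has to look at both fields. Here is a minimal sketch of what that could look like with the ContentSearch LINQ API. The PageSearchResult class, the SiteSearch helper and the index name "my_custom_index" are made up for illustration and are not part of the original solution; SearchResultItem.Content is the stock property that maps to "_content".

```csharp
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

// Hypothetical result type: adds the computed field to the stock SearchResultItem.
public class PageSearchResult : SearchResultItem
{
    [IndexField("related_content")]
    public string RelatedContent { get; set; }
}

public static class SiteSearch
{
    // "my_custom_index" is a placeholder; use the name of your own index.
    public static PageSearchResult[] Search(string phrase)
    {
        var index = ContentSearchManager.GetIndex("my_custom_index");
        using (var context = index.CreateSearchContext())
        {
            // Match the phrase against the item's own content OR the
            // content extracted from its related media file.
            return context.GetQueryable<PageSearchResult>()
                .Where(r => r.Content.Contains(phrase) || r.RelatedContent.Contains(phrase))
                .Take(20)
                .ToArray(); // execute the query before the context is disposed
        }
    }
}
```

Calling SiteSearch.Search("annual report") would then return page items whose own content or whose related file matches the phrase. Treat the result limit and the simple Contains matching as starting points, not recommendations.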
    

Monday, August 3, 2015

Sitecore: Indexing Associated Content

Setting up indexes and indexing content for site search in Sitecore is a pretty straightforward task, and there is an extensive knowledge base put together by the community with various examples. One useful construct we often find ourselves implementing for keyword search is a computed field. I typically use this construct to index any additional content referenced by a page, usually components (content blocks, promos, callouts) added to the page via presentation details.


The Need: Index content that a page item references externally
The Solution: Create a computed field to index the TextField and HtmlField type fields of the referenced items

This implementation fetches all the renderings for the current item's presentation for the default device and checks their datasource item for index-able content.
As a suggestion, first check whether the current item inherits from one of your page templates, and skip the computation otherwise.

Step 1: Create a class and implement the IComputedIndexField interface as shown below:

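    // Namespaces this class relies on
    using System.Linq;
    using System.Text;
    using Sitecore.Configuration;
    using Sitecore.ContentSearch;
    using Sitecore.ContentSearch.ComputedFields;
    using Sitecore.Data.Fields;
    using Sitecore.Data.Items;
    using Sitecore.Layouts;

    public class RelatedContent : IComputedIndexField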
    {
        public const string DefaultDeviceId = "{FE5D7FDF-89C0-4D99-9AA3-B5FBD009C9F3}";
        public string FieldName { get; set; }
        public string ReturnType { get; set; }

        public object ComputeFieldValue(IIndexable indexable)
        {
            Item item = indexable as SitecoreIndexableItem;
            if (item == null) return null;
            //add condition to skip if the current item does not belong to a page template
            var sb = new StringBuilder();
            //resolve the device from the item's own database so this also works
            //when the master database is not available (e.g. on content delivery)
            DeviceItem defaultDevice = item.Database.GetItem(DefaultDeviceId);
            RenderingReference[] renderings = item.Visualization.GetRenderings(defaultDevice, true);
            foreach (RenderingReference rendering in renderings)
            {
                if (string.IsNullOrEmpty(rendering.Settings.DataSource))
                    continue;

                Item datasourceItem = item.Database.GetItem(rendering.Settings.DataSource);
                if (datasourceItem == null) continue;

                //add an if condition to get indexable content for certain template types
                sb = GetIndexableContent(sb, datasourceItem);
            }
            return string.IsNullOrEmpty(sb.ToString())?null:sb.ToString();
        }

        private StringBuilder GetIndexableContent(StringBuilder sb, Item item,bool indexAllTextFields = true, string fieldName="")
        {
            if(sb==null) sb = new StringBuilder();

            if (!string.IsNullOrEmpty(fieldName))
            {
                if (item.Fields[fieldName] != null
                && !string.IsNullOrEmpty(item.Fields[fieldName].Value))
                    sb.Append(item.Fields[fieldName].Value + " ");
            }

            if (indexAllTextFields)
            {
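                //load every field defined on the template, not only those already read
                item.Fields.ReadAll();
                //skip standard fields by checking for "_" in name
                foreach (Field field in item.Fields.Where(x=>!x.Name.StartsWith("_")))
                {
                    var customField = FieldTypeManager.GetField(field);

                    if (customField != null && !string.IsNullOrEmpty(customField.Value))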
                    {
                        if (customField is TextField)
                            sb.Append(customField.Value.Trim() + " ");
                        else if(customField is HtmlField)
                            sb.Append(Sitecore.StringUtil.RemoveTags(customField.Value.Trim()) + " ");
                    }
                }
            }
            return sb;
        }
    }
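
The template check suggested earlier could be sketched roughly like this. PageTemplateCheck and the all-zero template ID are placeholders I made up for illustration; swap in the ID of your own base page template. As far as I know, TemplateManager.GetTemplate and Template.DescendsFromOrEquals are the standard Sitecore calls for this, but verify them against your Sitecore version.

```csharp
using Sitecore.Data;
using Sitecore.Data.Items;
using Sitecore.Data.Managers;

public static class PageTemplateCheck
{
    // Placeholder ID: replace with the ID of your own base page template.
    private static readonly ID BasePageTemplateId =
        new ID("{00000000-0000-0000-0000-000000000000}");

    // True when the item's template is, or inherits from, the base page template.
    public static bool IsPage(Item item)
    {
        if (item == null) return false;

        var template = TemplateManager.GetTemplate(item);
        return template != null && template.DescendsFromOrEquals(BasePageTemplateId);
    }
}
```

In ComputeFieldValue you would then bail out early, for example with "if (!PageTemplateCheck.IsPage(item)) return null;", so that non-page items never pay for the rendering lookup.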


Step 2: Add the following to your custom index configuration

For Sitecore Lucene:

<configuration ref="contentSearch/indexConfigurations/defaultLuceneIndexConfiguration" >
    <fields hint="raw:AddComputedIndexField">
        <field fieldName="related_content">[full class name], [assembly name]</field>
    </fields>
</configuration>

For Solr:

<configuration ref="contentSearch/indexConfigurations/defaultSolrIndexConfiguration" >
    <fields hint="raw:AddComputedIndexField">
        <field fieldName="related_content">[full class name], [assembly name]</field>
    </fields>
</configuration>


That's it! Oh, and don't forget to rebuild your index.

If you like this approach or would like to suggest a better approach, please leave a comment.

Tuesday, January 20, 2015

Sitecore Tabbed Select Rendering dialog

If you aren't already doing this, you should be!

I found an interesting module (I was unable to find it on the Sitecore Marketplace) that shows renderings grouped into tabs in the Select Rendering dialog opened from the Page Editor.

The idea is to go from the standard Select Rendering dialog to one that groups the renderings into tabs.


The approach is promising with a few updates:

  • I tried implementing it by inheriting from SelectRenderingForm and it did not work: no rendering items were listed. I had to decompile Sitecore.Client.dll and copy the code from Sitecore.Shell.Applications.Dialogs.SelectRendering.SelectRenderingForm.
  • I had to comment out the following line in OnLoad:
    this.Renderings.InnerHtml = this.RenderPreviews((IEnumerable<Item>)renderingOptions.Items);
  • I found a few issues with the styles, possibly caused by an updated style sheet in Sitecore 7.5, and had to add the following attributes to the tabstrip:
    Background="white" Padding="0px" style="position:absolute; width:100%;top:0px"
Here is a detailed description of my implementation.

Wednesday, July 23, 2014

Structuring Content Within A Site In Sitecore

Recently, a discussion with the team about how to structure content in Sitecore brought up some very interesting theories which I thought would be useful to share.

My theory was that enterprise content should be grouped to reflect the business structure (as I discuss here) and, taking it one step further, that the same logic should be used to structure content within a site. This idea was promptly met with rejection. The counter theory offered was that the content should follow the navigation paths used by visiting user personas. This immediately brought to light a rather large assumption:

      Do we assume that the content tree structure in Sitecore dictates the navigation paths for end users?
The answer is, not necessarily. In my view, the minimum requirement should be to facilitate logical grouping of content that always avoids content duplication.

I think there are two primary ways of implementing navigation paths:
  1. Navigation components such as Primary Nav and Left Rail Nav, which explicitly suggest to the user how the business assumes the site should be navigated.
  2. Navigation using calls to action, which is more implicit: right rail promos or other promo components give the user a choice every step of the way.

Creating a well-structured content tree, defined by the business, that reflects explicit (suggested) navigation paths does make it easy to facilitate that type of navigation via presentation components. More importantly, it offers more meaningful deep links for SEO. But it certainly does not mean that you cannot have more than one way to reach a content item.

In case you are thinking of using aliases, please read what John West has to say on his blog before you make that decision.


I think the best way forward is to have a good balance of explicit and implicit navigation paths. Let the business decide what the suggested navigation path for content should be and structure the content accordingly, but let the marketing folks come up with personalization and calls to action/promos by authoring content that creates implicit navigation paths for personas. Do let me know what you think and what has served you best.

Friday, July 11, 2014

Enterprise Inheritance and Content Hierarchy

This post is my humble attempt to share what I have found, from experience, to be a best practice for implementing IA (Information Architecture) and content hierarchy for an enterprise running multiple managed sites within a single Sitecore instance.

As this is my first Sitecore blog, not to mention my first blog post ever, please feel free to comment and critique my approach as I attempt to contribute to the incredibly talented Sitecore developer community.
So let's jump right in. Consider the following scenario: an enterprise has just acquired Sitecore as its CMS platform and wants to work out how to implement IA and content hierarchy so that it maximizes reuse and ends up with a Sitecore instance that scales really well.

It is unsurprisingly common for businesses to decide to manage the entire company's content within a single Sitecore instance. And why not, since this is one of the easiest selling points for Sitecore: "Buy a single instance and you can build as many web sites as you want".

The problem, however, is that, as everyone in the Sitecore community knows, Sitecore installs as a blank slate with a "Home" node marked as your site's landing page. So, intuitively, some eager .NET architects-turned-Sitecore architects tend to design a content hierarchy similar to:


  • Site 1
      • Home
          • Contact Us
          • Products
          • Services
          • …
      • Shared Content
  • Site 2
      • Home
          • Contact Us
          • Products
          • Services
          • …
      • Shared Content
  • …

This then follows an IA similar to:

  • Templates
      • User Defined
          • Site 1 templates
          • Site 2 templates
  • Sublayouts
      • Site 1
          • Sublayout 1
          • …
  • Layouts
      • Site 1
          • Layout 1
          • …
As more experienced Sitecore architects/developers would immediately realize, there are several problems with this approach: poor reusability, duplication of content, difficulty in content authoring, lack of scalability, etc.

So what is the solution? Well it depends. But the general rule should be “Let content structure reflect the company structure”

What do I mean by that? Take a look at the following content tree structure

The typical structure here is that an enterprise generally consists of one or more operating companies and each operating company manages its own digital content. 

Hence you see the content tree is structured similarly with “Reusable Content” typically housing taxonomy and component items that do not have presentation but can be reused at every level of the enterprise all the way down to an individual site. This way you introduce reusability, avoid duplication of content and keep some level of consistency with content.

What about IA? Well take a look at this:

If you ask a developer, he/she would say that this is even more important than content hierarchy, as it defines multiple inheritance, which is a major feature of Sitecore. The idea here is simple:
  • Start with creating base templates at the Enterprise level
  • Then create base templates for operating company that inherit from Enterprise Base Templates
  • Then create Site Base templates that inherit from operating company’s base templates

This is where the power of Sitecore lies. A change to a site's base template affects only that single site's items. To implement a change for every site in an operating company, you simply change the operating company's base template. Similarly, for a change to every site in every op-co, you simply change the base template of the enterprise.


A similar approach can be taken for Sublayouts and Renderings to group them by Enterprise, Op-Co and Site.

I think this is a great way to get organized quickly in Sitecore. I hope you take this approach and make it your own, or feel free to suggest a better one. This may seem very obvious to most, but you would be surprised at the number of Sitecore instances out there struggling with scalability and reuse.