Navigating through a used book sale at my local public library, I was struck by the disorganization evident on each shelf. Cookbooks were stacked with children's books, and sci-fi novels were mixed with historical biographies. After rifling through the mess for about an hour, I finally came out with a novel I wanted, and a lot of wasted time.
Just as a library does not spend much time categorizing the books in a used book sale, many enterprises treat content management as a second-tier priority. After all, the primary focus of most enterprises is to generate income, not to sift through and tag electronic documents for each employee, even though this process could save the company millions in lost revenue in the long run.
Consider that, according to a U.S. job retention poll, 75 percent of employees are seeking new jobs [1]. Furthermore, according to the U.S. Bureau of Labor Statistics, the average rate of employee turnover (in all non-farm companies in the U.S.) as a percentage of total employment is 3.3 percent [2]. At that rate, a company of 1,000 employees will see 33 leave by the end of the year. This illustrates a nation of people on the move. Organizations must ask themselves what happens to each departing employee's documents, and to all of the intellectual capital that employee generated for the company over the course of a tenure. Without an answer, companies are left with a disorganized collection of data, similar to a library's stack of used books.
An autocategorization metadata system, the backbone of a successful content management system (CMS), is the solution for better search and retrieval. It not only improves accuracy and efficiency, but also saves time, money and resources. Any enterprise, regardless of the nature of its business, can capitalize on these benefits. The following examples demonstrate the importance of an automatic metatagging system.
The Homeland Security Digital Library
Launched in September 2005, the Homeland Security Digital Library (HSDL) is the primary online research tool for faculty and students of the Center for Homeland Defense and Security (CHDS). It operates out of the Naval Postgraduate School and is sponsored by the Department of Homeland Security's (DHS) Office of Grants and Training (OG&T).
An electronic repository of scholarly works, relevant Web items and Department of Defense-written articles, the library receives a constant influx of new content. For example, when avian flu emerged, an entirely new set of metatags was created, and documents on biological outbreaks were updated with these new tags. Given the time-sensitive nature of Homeland Security topics, this tagging and retagging must be done quickly and accurately.
With 200 to 300 documents added per day, it is neither practical nor efficient to do this work manually. The HSDL sought an automated process that would let the library's technicians use a workflow tool to edit automatically generated metatags, or to add more rules, categories and taxonomies when new topics emerge. They found it in a semantic, rules-based model. The HSDL now automatically categorizes documents saved in multiple formats for easy search and retrieval. Sample topics include law and justice, borders and immigration, infrastructure protection, terrorism and society, weapons and weapons systems, emergency management and public health. Each day, HSDL content developers add new documents in PDF, video and audio formats to the system, ensuring that these documents are properly categorized and available to approved users.
The World Bank
The World Bank is a $20 billion global financial organization supporting 184 countries through financial or technical expertise to help reduce poverty. It is the world's largest funder of education and of the fight against AIDS and worldwide corruption, and it supports the basic needs of people in conflict. Much like the HSDL, the World Bank required a better way to organize and retrieve the millions of documents stored within its global repositories. By automatically applying metatags to documents in several different file formats, the World Bank created a system that also worked across multiple languages.
Locating the right information, in the right language, in real time can be an immense challenge. The first step of the process is to identify each document's language. When a document is read by a language ID program, it is automatically assigned a language project from one of the many language dictionaries the Bank has licensed. These include European, Eastern European, Asian and Arabic languages. The document is then ready for categorization, concept extraction and summarization.
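The language-identification step can be illustrated with a minimal Python sketch. It assumes a simple approach in which each language has a small word list and a document is assigned to the language whose list matches the most words. The word lists, the LANGUAGE_DICTIONARIES name and the identify_language function are hypothetical stand-ins for illustration, not the Bank's licensed dictionaries or its actual language ID program.

```python
import re

# Hypothetical, tiny stand-ins for licensed language dictionaries.
LANGUAGE_DICTIONARIES = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "fr": {"le", "la", "et", "les", "des", "est"},
    "es": {"el", "la", "y", "los", "de", "es"},
}


def identify_language(text):
    """Return the language whose word list matches the most words, or None."""
    # Split into lowercase alphabetic words (accented letters included).
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for word in words if word in vocabulary)
        for lang, vocabulary in LANGUAGE_DICTIONARIES.items()
    }
    best = max(scores, key=scores.get)
    # If no dictionary matched any word, report the language as unknown.
    return best if scores[best] > 0 else None


print(identify_language("The management of pollution is in the hands of the bank"))
print(identify_language("Le rapport et les données des pays"))
```

A production system would rely on far larger dictionaries or statistical models, but the routing idea is the same: once a language is assigned, the document can be handed to the categorization rules written for that language.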
Metadata tags are applied to documents based on pre-defined projects within the Bank's electronic infrastructure. This is often called data driving. The World Bank scours each document for more than 5,000 key words, mapping them to 1,000 category classes. One example of a Bank category is "environment." Within that category, key words such as "biodiversity" or "pollution management" help to categorize a document. But not all words fit so easily into a category. "Contagion," the noun for disease transmission, means something very different in the health category than it does in the financial category. Extracting the meaning requires a few more important steps. The document runs through a content extractor, which searches for key conceptual IDs, and a content summarizer then identifies the most important and relevant sentences within the document.
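The key-word-to-category mapping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Bank's rule set: the CATEGORY_KEYWORDS table, the match_keywords and best_category functions and the sample key words other than "biodiversity," "pollution management" and "contagion" are all hypothetical. The sketch resolves an ambiguous word such as "contagion" by counting which category's other key words appear alongside it, which is one simple stand-in for the semantic steps the article mentions.

```python
import re

# Hypothetical key-word lists; the real system uses more than 5,000 key words.
CATEGORY_KEYWORDS = {
    "environment": ["biodiversity", "pollution management"],
    "health": ["contagion", "outbreak", "vaccine"],
    "finance": ["contagion", "credit", "markets"],
}


def match_keywords(text):
    """Map each category to the key words from its list found in the text."""
    lowered = text.lower()
    matches = {}
    for category, keywords in CATEGORY_KEYWORDS.items():
        # Whole-word matching, so "credit" does not match "accredited".
        found = [
            keyword for keyword in keywords
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered)
        ]
        if found:
            matches[category] = found
    return matches


def best_category(text):
    """Pick the category with the most matched key words, or None."""
    matches = match_keywords(text)
    if not matches:
        return None
    return max(matches, key=lambda category: len(matches[category]))


print(best_category("Financial contagion spread through credit markets."))
print(best_category("Contagion during the outbreak slowed once the vaccine arrived."))
```

In this sketch a tie goes to whichever category is listed first, so a real rule set would need an explicit tie-breaking policy or a human review step like the workflow tool described for the HSDL.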
Prior to automatic metatagging, World Bank personnel categorized three electronic documents per hour. Now the Bank drives 50,000 PDF pages per hour through its platform, dramatically improving the processing rate while putting vital information into the hands of those who need it most, in real time.
Metadata at Work
Using these examples, it is easy to see why, in an enterprise with hundreds or even thousands of workers outputting knowledge, it may take years to tag each employee's electronic documents by hand. People tagging by hand may try to capture a document's actual concept (what the author was thinking when creating it), but their own subjectivity may begin to tear away at the core meaning of what's contained in the work. In many cases, especially the case of knowledge management in an enterprise, the objectivity provided by metatagging software is essential to a project's success.
It is clear that the key to managing a company's content quickly and easily is the ability to generate metadata automatically. Enterprises should treat their content the same way that online publishers do, creating metadata on the fly, the instant that content is added to the CMS. This prevents an unorganized collection of documents from piling up and enables businesses to operate more efficiently.
References:
[1] Wall Street Journal's CareerJournal.com and the Society of Human Resource Professionals, December 2006.
[2] U.S. Bureau of Labor Statistics, February 2006.

