
Tutorial given at WWW-2005 and WISE-2005: Web Content Mining

Author: unknown    Date: 2004-12-03

 

Web mining is a rapidly growing research area. It consists of Web usage mining, Web structure mining, and Web content mining. Web usage mining refers to the discovery of user access patterns from Web usage logs. Web structure mining tries to discover useful knowledge from the structure of hyperlinks. Web content mining aims to extract or mine useful information or knowledge from Web page contents. This tutorial focuses on Web content mining.

Web content mining is related to, but different from, data mining and text mining. It is related to data mining because many data mining techniques can be applied to Web content, and to text mining because much of the Web's content is text. However, it differs from data mining because Web data are mainly semi-structured or unstructured, while data mining deals primarily with structured data. It also differs from text mining because of the semi-structured nature of the Web, while text mining focuses on unstructured text. Web content mining thus requires creative applications of data mining and text mining techniques as well as its own unique approaches. The past few years have seen a rapid expansion of activity in the Web content mining area. This is not surprising given the phenomenal growth of Web content and the significant economic benefit of such mining. However, due to the heterogeneity and lack of structure of Web data, automated discovery of targeted or unexpected knowledge still presents many challenging research problems. In this tutorial, we will examine the following important Web content mining problems and discuss existing techniques for solving them. Some other emerging problems will also be surveyed.

  • Data/information extraction: Our focus will be on the extraction of structured data from Web pages, such as products and search results. Extracting such data allows one to provide services. Two main types of techniques, machine learning and automatic extraction, are covered.
  • Web information integration and schema matching: Although the Web contains a huge amount of data, each web site (or even page) represents similar information differently. How to identify or match semantically similar data is a very important problem with many practical applications. Some existing techniques and problems are examined.
  • Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking. We will introduce a few tasks and techniques to mine such sources.
  • Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications. However, generating them manually is very time consuming. A few existing methods that explore the information redundancy of the Web will be presented. The main application is to synthesize and organize the pieces of information on the Web to give the user a coherent picture of the topic domain.
  • Segmenting Web pages and detecting noise: In many Web applications, one only wants the main content of a Web page, without advertisements, navigation links, and copyright notices. Automatically segmenting Web pages to extract their main content is an interesting problem. A number of interesting techniques have been proposed in the past few years.
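
As a rough illustration of the last problem in the list, the sketch below splits a page into flat text blocks and drops blocks whose text is mostly link text. This link-density heuristic is a simple baseline chosen for illustration, not one of the techniques presented in the tutorial. The names BLOCK_TAGS, BlockCollector and main_content, the choice of block tags, and the 0.5 threshold are all assumptions.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as block boundaries.
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "ol", "table"}


class BlockCollector(HTMLParser):
    """Splits a page into flat text blocks and counts link characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars) tuples
        self._text = []         # text pieces of the block being built
        self._link_chars = 0    # characters that appear inside <a> in this block
        self._in_link = False

    def flush(self):
        """Close the current block and store it if it contains any text."""
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def main_content(html, max_link_density=0.5):
    """Return the text blocks whose share of link text is at most max_link_density."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # store any text that follows the last block tag
    return [
        text
        for text, link_chars in parser.blocks
        if link_chars / len(text) <= max_link_density
    ]


page = """
<div><a href="/">Home</a> | <a href="/about">About</a> | <a href="/contact">Contact</a></div>
<p>Web content mining extracts useful knowledge from page contents.
See the <a href="/t">tutorial</a> for details.</p>
<div><a href="/legal">Copyright notice</a></div>
"""

for block in main_content(page):
    print(block)
```

On this toy page the navigation bar and the copyright link are dropped because nearly all of their text sits inside links, while the paragraph is kept. Real pages would need a more careful notion of a block, for example one based on the DOM tree or the visual layout.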

All these tasks present major research challenges, and their solutions have immediate real-life applications. The tutorial will start with a short motivation for Web content mining. We then discuss the differences between Web content mining and text mining, and between Web content mining and data mining. This is followed by a presentation of the above problems and the current state-of-the-art techniques. Various examples will also be given to help participants better understand how this technology can be deployed to help businesses. All parts of the tutorial will have a mix of research and industry flavor, addressing seminal research concepts and looking at the technology from an industry angle.


