Your Next Data Warehouse? Size Up Column-Store Databases and Appliances

Data warehouses are growing fast. Nearly 40% of organizations say data volumes are increasing as much as 50% per year, while 18% say their warehouses are doubling in size annually, according to an IDC survey. To cope with the growth, the underlying databases have to be carefully tuned, but even then, many midsize and large companies now face six-, seven-, or even eight-figure upgrades of legacy data warehouses.

数据挖掘工具

Enter data warehouse appliances and column-store databases. Both have taken off in recent years, with venture capitalists placing bets on a slew of startups. They're safe bets, too, given the corporate desire to do more in-depth analysis of all available data.

"One of the reasons Wal-Mart, Staples, and Amazon.com have been so successful is that they analyze their data, and they have it at the fingertips of the entire enterprise," says Foster Hinshaw, CEO of upstart appliance vendor Dataupia and a co-founder of Netezza, the 8-year-old leader in the data warehouse appliance market. Companies must be able to drill into all their data to understand where to locate a new store, which products are selling, which need to be moved to different locations, and what programs to offer customers, he says. 数据挖掘工具

Boasting fast query performance, ease of deployment, and prices as low as $10,000 per terabyte, appliances have been wooing business away from the leading data warehouse vendors. It's no surprise that IBM and Teradata have responded with appliances of their own, or that Oracle has optimized reference configurations for third-party hardware. But the incumbents have yet to respond to the threat from column-store databases, which can deliver the industry's fastest query performance on complex analytic queries.

It sounds like a promising new era for data warehousing, but buyer beware. Appliances and column-store databases aren't always suitable replacements for a conventional enterprise data warehouse, or EDW. In fact, these alternatives are most often used for data marts that off-load data-intensive applications from the EDW, thereby avoiding (or at least delaying) the need to replace the main data warehouse. Whether you're looking for an analytic data mart or your next data warehouse, here's what to look for beyond the dazzling price, scalability, and performance claims.

 

数据挖掘实验室

Warehousing Alternatives
Vendor Product Column Store Database Data Warehouse Appliance
Calpont CNX Data Warehouse Platform
DATAllegro DATAllegro v3
Dataupia Dataupia Satori Server
Greenplum Greenplum Database G3
HP NeoView
IBM InfoSphere Balanced Warehouse E-Class
Infobright Brighthouse
Kognitio WX2
Netezza Netezza Performance Server
ParAccel* ParAccel Analytic
Sand Technology Sand/DNA Software
Sybase * Sybase IQ
Vertica* Vertica Database
* Column-store databases also offered in appliance configurations on third-party hardware
数据挖掘实验室

QUERIES BY COLUMN

Unlike a conventional database that stores data in rows, a column-store database looks at data vertically. In the case of a customer database, for example, the row-store database queries across each customer record, while the column-store database can query selected columns. The row approach is great for OLTP, with lots of writes for each new customer transaction. Column-store is ideal for OLAP, with lots of reads against particular attributes. If you want to explore sales by ZIP code and product, your query will interrogate only the ZIP code and product SKU columns rather than wading through names, addresses, and other irrelevant attributes. 数据挖掘研究院

Another advantage of the column-store database is that it can apply optimized compression--upwards of 10-to-1 compression--because the data in each column is consistent (all names, for example). As long as your query isn't record-intensive-- that is, it doesn't involve a lot of attributes--the amount of data coming from a column-store is always less than from a conventional database, so it delivers faster performance.

Column-store databases are ideal for many applications. Telecommunications companies have billions of call data records that need to be in an enterprise data warehouse for billing and historical purposes, but columnar extracts can be placed in a data mart for marketing analysis. Big retailers, banks, and insurance companies take much the same approach to look for up-sell and cross-sell opportunities, fraud, and mortality trends, respectively. 数据挖掘交友

The Internal Revenue Service launched one of the first column-store deployments in 1996, using Sybase IQ, in what began as a 2-TB warehouse. Back then, 2 TB was a very large database, says Jeff Butler, the IRS's director of research databases. The columnar database brought query times down from days to a matter of hours, "which was just a phenomenal boost in productivity," Butler says. "Today, we're at 158 TB, and the queries that once took hours now take seconds." 数据挖掘研究院

The IRS's analytic database stores population-level data (meaning no names or Social Security numbers) from every tax return over the last 10 years. The warehouse is used to explore filing and compliance behavior, as well as the potential impact of tax code changes. 数据挖掘研究院

The IRS went with a column-store database because it didn't know what data elements it would need to query from one question to the next, says Butler. With a static reporting application, you can use just about any database because you can do a lot of optimization, he says, "but that's not our situation."

数据挖掘交友

 

Impact Assessment: Column Store Databases and Warehouse Appliances

PERFORMANCE IN A BOX 数据挖掘论坛

What most data warehouse appliances have in common, whether built on row- or column-store databases, is a massively parallel processing, shared-nothing architecture. MPP means that the query load is spread across many processors, or nodes, usually on commodity hardware running Linux. Shared nothing means that each node is independent, with its own memory and storage. The result is high performance without the expense of the high-powered, symmetric multiprocessor servers that typically run conventional data warehouses.

数据挖掘论坛

Appliances also are gaining ground because they're easier to deploy and maintain than conventional warehouses, which have to be tuned, optimized, and, lately, clustered to perform in large-scale deployments. Trading on this appeal, column-store database vendors, including ParAccel, Sybase, and Vertica, have introduced software-hardware bundles built on third-party hardware. 数据挖掘实验室

In the case of Teradata, which was the first to bundle hardware and software in an MPP, shared-nothing architecture (without calling it an appliance), the upstarts are competing primarily on price. Teradata responded last month with its own appliances--one for data marts and one for small warehouses--while also upgrading the performance and scalability of its core EDW product.

数据挖掘论坛

WEIGH THE OPTIONS

When evaluating an appliance or column-store database, consider whether you're replacing the EDW or off-loading complex, data-intensive analytic queries to improve performance, delay an enterprise data warehouse upgrade, or both. If it's the former, keep in mind that column-store products aren't suitable for row-intensive queries involving many attributes. Does your EDW have to support thousands of users with a mix of query types? Not all appliances can do that. Teradata's flagship product, IBM's InfoSphere Balanced Warehouse E-Class, Hewlett-Packard's Neoview, DATAllegro v3, and Greenplum's G3 are all technically suitable for EDW deployment, though HP, DATAllegro, and Greenplum have to use partners to match Teradata's or IBM's data integration and information management software. 数据挖掘研究院

Appliances and column-store databases are great for analytic data marts and focused warehouses tackling complex queries and extreme data volumes. Case in point, NYSE Euronext, the parent company of the New York Stock Exchange, has replaced three aging 100-TB warehouses built on Oracle databases with three Netezza Performance Servers. Complex queries that took as long as 26 hours in the conventional warehouse now take two and a half minutes, while simple queries that took seven minutes now take five seconds, says Steve Hirsch, chief data officer.

However, these aren't typical EDWs. Loads are limited to hundreds of queries per day, he says, and the user community tops out at 150 per appliance, with only 20 concurrent users on a device. One of the broadest Netezza deployments has more than 600 users and 20,000 queries per day, Netezza says, but it's not unusual for large EDWs to have tens of thousands of concurrent users and hundreds of thousands of queries a day. 数据挖掘研究院

While not massive in terms of its user base, Euronext does do complex analytic calculations. "It's very possible that we could hit 40 to 50 TB of data in a single query," Hirsch says.

数据挖掘交友

Once you understand your data warehousing needs, including the number of users and the depth, diversity, and volume of queries, you'll be ready to wade through all the confusing vendor claims. And when you get down to the finalists, insist on a proof-of-concept deployment.

数据挖掘实验室

"Don't trust the FUD or the TCP-H benchmark statistics or anything else," Gartner analyst Donald Feinberg says. "You need to put your data out there and run your queries and see which alternative is faster in your application." That's the only way to know for sure that you've made the right choice.

 

  数据挖掘论坛

[数据挖掘专家] [数据挖掘研究院] [数据挖掘论坛] [数据挖掘实验室]
上一篇:用TOMCAT做连接池,发生这样的错误是什么原因
下一篇:数据库设计中的14个关键技巧
最新评论共有 0 位网友发表了评论 , 查看所有评论
发表评论( 不能超过250字,需审核,请自觉遵守互联网相关政策法规。 )
匿名?
数据挖掘网站导航 数据挖掘论坛导航
  • 数据挖掘工具
  • 数据挖掘论坛
  • DataCruncher - Cognos
  • MineSet - MathSoft
  • Intelligent Miner - GainSmarts
  • Sqlserver - SAS - Clementine
  • CART - Weka - WizSoft
  • NeuroShell - ModelQuest
  • data mining tools - Darwin
  • 数据挖掘交友
  • 数据挖掘博客
  • 数据挖掘工具
  • 数据挖掘资源
  • 数据挖掘技术算法
  • 数据挖掘相关期刊、会议
  • 研究院联盟合作专区
  • 数据挖掘基础与相关技术
  • 数据挖掘厂商与就业
  • 数据挖掘研究者乐园
  • 知名厂商数据挖掘工具资料
  • 国内数据挖掘实验室
  • Foreign Data Mining Lab
  • 热点关注
  • SQL与最短路径算法
  • 求一个数据库备份方案
  • 某商店数据仓库的原型分析和设计
  • 移动通信数据仓库联合实验室在北京成立
  • 数据仓库的规划构建策略
  • NCR Teradata数据仓库概述
  • 各位进来帮忙参考一下关于个人发展方向问题
  • 关于数据仓库的数据模型
  • 第五届机器学习及其应用研讨会日程表
  • 数据库归来——下一代数据库扫描简介
  • 论坛最新话题
  • Foundations of Statistical Natural Langu
  • Game Theory meet Data Mining: A Recent P
  • System Building: How does it help or hin
  • 数据挖掘与Clementine培训
  • 新手报到
  • 求 SASEM 客户流失预测分析
  • 数据挖掘工程师/搜索研究院—北京——无线
  • 数据挖掘入门介绍(如何着手数据挖掘)
  • Information Overload Survey Results
  • The INEX 2005 Workshop on Element Retrie
  • 相关资讯
  • 处理海量数据的经验和技巧
  • 数据仓库的新生
  • 什么是ETL
  • Data Warehousing for the Midsize Organiz
  • Data warehouse management strategies for
  • 第五届机器学习及其应用研讨会日程表
  • SQL Data Warehouse Analyst
  • Edge appliances and the evolution of dat
  • 动态数据仓库让BI走向一线
  • The OLAP Report
  • 数据挖掘实验室资料
  • 数据挖掘博客地址
  • 数据挖掘实验室网站地址
  • Prepare for Medicare audits by using dat
  • 注册成为SAS用户与爱好者俱乐部会员
  • 水南梅
  • 明日烟
  • 新人报道
  • 下载
  • 厦门服务器托管,450元/月—0592-5177319 高
  • 买空间送域名--0592-5177319 高静