Wednesday, February 26, 2014

I see data visualizations... Power BI, Power Map and Power Q&A [Oh my]

SQL Server Blog - Data Visualizations

A couple of weeks back was a really exciting time for us. Less than a year after we released Office 365 for Businesses, we announced the general availability of Power BI for Office 365. You may have read previous blog articles by Quentin Clark on Making Big Data Work for Everyone and Kamal Hathi on Simplifying Business Intelligence through Power BI for Office 365. In this article, we’ll outline how we think about visualizations.

Why Visualizations Matter

While a list of items is great for entering or auditing data, data visualizations are a great way to distill information down to what matters most, in a form that can be understood quickly.

...

Visualizations in Productivity Apps

We have the privilege of having the largest community of users of productivity applications in the world. Thanks...

...

Faster Creation of Visualizations

Excel 2007 introduced the ability to set the style of a chart with one click and leverage richer graphics such as shadows, anti-aliased lines, and transparency.

Office 2013 was one of our most ground-breaking releases.

...

Richer Interactivity

Part of my role at Microsoft involves presenting on various topics to stakeholders, and increasingly most of these include data visualizations. Only a few years back, I remember ...

...

Visualizations on All Data

Both the volumes and the types of data customers want to visualize have expanded as well.

Excel 2013 also introduced the Data Model, opening the door for workbooks that contained significantly larger datasets than before, with richer ways to express business logic directly within the workbook.

Increasingly, we have access to geospatial data, and the recently introduced Power Map brings a new 3D visualization tool for mapping, exploring, and interacting with geographical and temporal data to Excel, enabling people to discover and share new insights such as trends, patterns, and outliers in their data over time...

...

We are very excited to have introduced Power Q&A as part of the Power BI launch. This innovative experience makes it even easier to understand your data by providing a natural language experience that interprets your question and immediately serves up the correct answer on the fly in the form of an interactive chart or graph. These visualizations change dynamically as you modify the question, creating a truly interactive experience with your data.


...


Visualizations Everywhere

As customers create insights and share them, we have also invested in ensuring SharePoint 2013 and Office 365 provide the same full-fidelity rendering as the desktop client, so insights remain beautiful wherever they're consumed.

What’s Next?

..."

The Power Q&A looks interesting. I'd love to be able to provide that kind of thing in my apps. But let's see how it plays out over a version or two...

 

Related Past Post XRef:
Going with the GeoFlow for Excel 2013... Free 3D visualization add-in for mapping, exploring, and interacting with geographical/temporal data

Friday, January 17, 2014

SELECT * FROM StackExchange. There's the easy way and the hard, yet much more data fun, way...

Brent Ozar - How to Query the StackExchange Databases

During next week’s Watch Brent Tune Queries webcast, I’m using my favorite demo database: Stack Overflow. The Stack Exchange folks are kind enough to make all of their data available via BitTorrent for Creative Commons usage as long as you properly attribute the source.

There’s two ways you can get started writing queries against Stack’s databases – the easy way and the hard way.

The Easy Way to Query StackOverflow.com

Point your browser over to Data.StackExchange.com and the available database list shows the number of questions and answers, plus the date of the database you’ll be querying:

...
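The walkthrough itself is elided above, but to give a flavor of the T-SQL you can run there, here's a minimal query against the Stack Exchange schema (table and column names are my assumptions based on the public data dump; verify them against the live site):

-- Top 10 highest-scoring questions tagged sql-server.
-- PostTypeId 1 = question, 2 = answer; tags are stored inline as <tag1><tag2>...
SELECT TOP 10
    Id,
    Title,
    Score,
    CreationDate
FROM Posts
WHERE PostTypeId = 1
  AND Tags LIKE '%<sql-server>%'
ORDER BY Score DESC;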

The Hard Way to Query StackOverflow.com

First, you’ll need to download a copy of the most recent XML data dump. These files are pretty big – around 15GB total – so there’s no direct download for the entire repository. There’s two ways you can get the September 2013 export:

I strongly recommend working with a smaller site’s data first like DBA.StackExchange. If you decide to work with the monster StackOverflow.com’s data, you’re going to temporarily need:

  • ~15GB of space for the download
  • ~60GB after the StackOverflow.com exports are expanded with 7zip. They’re XML, so they compress extremely well for download, but holy cow, XML is wordy.
  • ~50GB for the SQL Server database (and this will stick around)

Next, you need a tool to load that XML into the database platform of your choosing. For Microsoft SQL Server, I use Jeremiah’s improved version of the old Sky Sanders’ SODDI. Sky stopped updating his version a few years ago, and it’s no longer compatible with the current Stack dumps. Jeremiah’s current download is here, and it works with the September 2013 data dump.

...


...

Why Go to All This Work?

When I’m teaching performance tuning of queries and indexes, there’s no substitute for a local copy of the database. I want to show the impact of new indexes, analyze execution plans with SQL Sentry Plan Explorer, and run load tests with HammerDB.

That’s what we do in our SQL Server Performance Troubleshooting class – specifically, in my modules on How to Think Like the Engine, What Queries are Killing My Server, T-SQL Anti-patterns, and My T-SQL Tuning Process. Forget AdventureWorks – it’s so much more fun to use real StackOverflow.com data to discover tag patterns, interesting questions, and helpful users.

Both Brent's post and, of course, the data itself are great resources for when you need some "safe" data in a large enough volume to be meaningful...

 

Related Past Post XRef:
Stacks and stacks of data - Your copy of the Stack Overflow’s (and family) public data is a download away

The Stack Family (StackOverflow, SuperUser, etc) gets OData’d via Stack Exchange Data Explorer
Build something awesome with the new StackExchange v2 API and win something awesome...
Stacking up the Open Source Projects, Stack Exchange is...

Tuesday, November 19, 2013

A word or two or 10 about Word Clouds

Beyond Search - Easily Generate Your Own Word Clouds

Word clouds have become inescapable, and it is easy to see why: many people find such a blending of text and visual information easy to understand. But how, exactly, can you generate one of these content confections? Smashing Apps shares its collection of “10 Amazing Word Cloud Generators.”

...

VocabGrabber is different. It doesn’t even make a particularly pretty picture. As the name implies, VocabGrabber uses your text to build a list of vocabulary words, complete with examples of usage pulled directly from the content. This could be a useful tool for students, or anyone learning something new that comes with specialized terminology. If your learning materials are digital, a simple cut-and-paste can generate a handy list of terms and in-context examples. A valuable find in a list full of fun and useful tools.

Smashing Apps - 10 Amazing Word Cloud Generators


In this session, we are presenting 10 amazing word cloud generators for you. Word cloud can be defined as a graphical representation of word frequency, whereas word cloud generators simply are the tools to map data, such as words and tags in a visual and engaging way. These generators come with different features that include different fonts, shapes, layouts and editing capabilities.

Without any further ado, here we are presenting a fine collection of 10 amazing and useful word cloud generators for you. Leave us a comment and let us know what you think of the proliferation of design inspiration in general on the web. Your comments are always more than welcome. Let us have a look. Enjoy!

 


Make sure you click through, as SmashingApps has done a great job with a blurb and snap for each one.

 

Related Past Post XRef:
Wordle’ing Terms of Service Agreements – How a ToS would look as a word/tag cloud
Bipin shows us that creating a tag cloud doesn't have to be hard to do (in ASP.Net)
Interactive WinForm Tag Cloud Control (Think “Cool, I can add a Word/Tag Cloud thing to my WinForm app!”)
"WordCloud - A Squarified Treemap of Word Frequency" - Something like this would be cool in a Feed Reader...
Feed Stream Analysis - Web Feed/Post Analysis to Group Like/Related Posts
WordNet
"Statistical parsing of English sentences"
"A Model for Weblog Research"

Tuesday, November 12, 2013

"The Field Guide to Data Science" Free eBook of the Day (Think "The non-Scientist Guide Data Science")

Booz Allen Hamilton - The Field Guide to Data Science


Understanding the DNA of Data Science

Data Science is the competitive advantage of the future for organizations interested in turning their data into a product through analytics. Industries from health, to national security, to finance, to energy can be improved by creating better data analytics through Data Science. The winners and the losers in the emerging data economy are going to be determined by their Data Science teams.

Booz Allen Hamilton created The Field Guide to Data Science to help organizations of all types and missions understand how to make use of data as a resource. The text spells out what Data Science is and why it matters to organizations as well as how to create Data Science teams. Along the way, our team of experts provides field-tested approaches, personal tips and tricks, and real-life case studies. Senior leaders will walk away with a deeper understanding of the concepts at the heart of Data Science. Practitioners will add to their toolboxes.

In The Field Guide to Data Science, our Booz Allen experts provide their insights in the following areas:

  • Start Here for the Basics provides an introduction to Data Science, including what makes Data Science unique from other analysis approaches. We will help you understand Data Science maturity within an organization and how to create a robust Data Science capability.
  • Take Off the Training Wheels is the practitioner’s guide to Data Science. We share our established processes, including our approach to decomposing complex Data Science problems, the Fractal Analytic Model. We conclude with the Guide to Analytic Selection to help you select the right analytic techniques to conquer your toughest challenges.
  • Life in the Trenches gives a first hand account of life as a Data Scientist. We share insights on a variety of Data Science topics through illustrative case studies. We provide tips and tricks from our own experiences on these real-life analytic challenges.
  • Putting it All Together highlights our successes creating Data Science solutions for our clients. It follows several projects from data to insights and shows the impact Data Science can have on your organization.

...


When I first saw this title, I thought it was going to be one of those "make my brain hurt" kind of books, but heck, even I can read it! It's actually not dry and is kind of entertaining! If you have "data" (and who doesn't anymore), this free ebook might be a good read for you. And really, it won't make your brain explode...

 

(via KDNuggets - Booz Allen "Field Guide to Data Science" - free download)

Friday, October 04, 2013

OpenGov.com, where your Local Government can get naked...(well, as in Budget Transparency, that is)

OpenGov.com - Simi Valley


What an awesome way to grok my home town's budget. While you'd think "budget = boring," this site makes it actually fun to look at, explore and spelunk the budget. It's very eye-opening to see where all the money is going...

Thursday, September 26, 2013

Get a big jump into Big Data with the "Getting Started with Microsoft Big Data" series

Channel 9 - Getting Started with Microsoft Big Data

Developers, take this course to get an overview of Microsoft Big Data tools as part of the Windows Azure HDInsight and Storage services. As a developer, you'll learn how to create map-reduce programs and automate the workflow of processing Big Data jobs. As a SQL developer, you'll learn how Hive can make you instantly productive with Hadoop data.


Added to the billion-and-one things I need to learn ASAP. When I find the time (and the "want to"), this series looks like a great way to get started. I've done a tiny bit of Hadoop, and I already know I'm going to need all the help I can get up this learning curve...

Monday, August 26, 2013

Cool LA Metro Rail Ridership Visualization (and developer.metro.net news too)

LA Metro Ridership


(via reddit/LosAngeles - Metro Rail Network Ridership - class project from last spring I've wanted to show off for awhile (crossposting from dataisbeautiful))

Also of note:

http://developer.metro.net/

APIs / Feeds / Data


Developer Resources

Welcome to Metro’s developer site – this is a website for technical individuals and entities who are using transportation and multi-modal data in interesting ways. Since first releasing our transit data in the summer of 2009, numerous developers have incorporated our data into their applications — you can see a list of featured applications here.

New Items!

Getting Started

Become a member: Joining is FREE and will allow you to comment and have direct communication with our developers responsible for each data set.

Get an API Key: You must have a valid API Key to utilize the Trip Planner Information Feed. You will be assigned a key at registration. Check your profile page to retrieve your API key.

Read the Policies: Please familiarize yourself with our Terms and Conditions, and Policies for using the various data and this website.

Read the Trip Planner Information Feed documentation: The web service offers data from 65+ Southern California transit agencies.

Read the FAQ: Questions about this site, the data or tools needed to utilize the data.

Monday, August 19, 2013

Fuzzy Lookup Add-In for Excel (Insert lame "Fuzzy, wuzzy was an Excel..." snip here)

Microsoft Downloads - Fuzzy Lookup Add-In for Excel

The Fuzzy Lookup Add-In for Excel performs fuzzy matching of textual data in Excel.

Version: 1.0.0.0

Date Published: 8/16/2013

FuzzyLookupAddInForExcel.zip, 1.5 MB

The Fuzzy Lookup Add-In for Excel was developed by Microsoft Research and performs fuzzy matching of textual data in Microsoft Excel. It can be used to identify fuzzy duplicate rows within a single table or to fuzzy join similar rows between two different tables. The matching is robust to a wide variety of errors including spelling mistakes, abbreviations, synonyms and added/missing data. For instance, it might detect that the rows “Mr. Andrew Hill”, “Hill, Andrew R.” and “Andy Hill” all refer to the same underlying entity, returning a similarity score along with each match. While the default configuration works well for a wide variety of textual data, such as product names or customer addresses, the matching may also be customized for specific domains or languages.
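The add-in itself runs entirely inside Excel, but for a rough database-side taste of what fuzzy matching means, SQL Server's built-in SOUNDEX and DIFFERENCE functions do a much cruder, phonetic-only comparison. A minimal sketch (to be clear, this is not the add-in's algorithm, which the description above says is far more robust):

-- Crude phonetic similarity using built-in T-SQL functions.
-- SOUNDEX returns a four-character phonetic code;
-- DIFFERENCE returns 0 (no similarity) through 4 (strong similarity).
SELECT
    SOUNDEX('Andrew Hill') AS SoundexA,
    SOUNDEX('Andy Hill') AS SoundexB,
    DIFFERENCE('Andrew Hill', 'Andy Hill') AS Similarity;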

Supported Operating System

Windows 7, Windows Server 2008, Windows Vista

  • Preinstalled Software (Prerequisites): Microsoft Excel 2010
  • ...

Sounds like something I might be able to use... Now it would be even better if this were a .Net assembly I could call from my own code. Will have to look at this and see what my programming options are...

Wednesday, July 31, 2013

Opening the U.S. Code, does the U.S. House, release in XML it does...

E Pluribus Unum - U.S. House of Representatives publishes U.S. Code as open government data

Three years on, Republicans in Congress continue to follow through on promises to embrace innovation and transparency in the legislative process. Today, the United States House of Representatives has made the United States Code available in bulk Extensible Markup Language (XML).

“Providing free and open access to the U.S. Code in XML is another win for open government,” said Speaker John Boehner and Majority Leader Eric Cantor, in a statement posted to Speaker.gov. “And we want to thank the Office of Law Revision Counsel for all of their work to make this project a reality. Whether it’s our ‘read the bill’ reforms, streaming debates and committee hearings live online, or providing unprecedented access to legislative data, we’re keeping our pledge to make Congress more transparent and accountable to the people we serve.”

House Democratic leaders praised the House of Representatives Office of the Law Revision Counsel (OLRC) for the release of the U.S. Code in XML, demonstrating strong bipartisan support for such measures.

“OLRC has taken an important step towards making our federal laws more open and transparent,” said Whip Steny H. Hoyer, in a statement.

...

“Just this morning, Josh Tauberer updated our public domain U.S. Code parser to make use of the new XML version of the US Code,” said Mill. “The XML version’s consistent design meant we could fix bugs and inaccuracies that will contribute directly to improving the quality of GovTrack’s and Sunlight’s work, and enables more new features going forward that weren’t possible before. The public will definitely benefit from the vastly more reliable understanding of our nation’s laws that today’s XML release enables.” (More from Tom Lee at the Sunlight Labs blog.)

...

“Last year, we reported that House Republicans had the transparency edge on Senate Democrats and the Obama administration,” he said. “(House Democrats support the Republican leadership’s efforts.) The release of the U.S. Code in XML joins projects like docs.house.gov and beta.congress.gov in producing actual forward motion on transparency in Congress’s deliberations, management, and results.

For over a year, I’ve been pointing out that there is no machine-readable federal government organization chart. Having one is elemental transparency, and there’s some chance that the Obama administration will materialize with the Federal Program Inventory. But we don’t know yet if agency and program identifiers will be published. The Obama administration could catch up or overtake House Republicans with a little effort in this area. Here’s hoping they do.”

House of Representatives - US Code Most Current Release Point

Public Law 113-21
(Titles in bold are updated at this release point)

Information about the currency of United States Code titles is available on the Currency page.

USC in XML

The United States Code in XML uses the USLM Schema. That schema is explained in greater detail in the USLM Schema User Guide. For rendering the XML files, a Stylesheet (CSS) file is provided.

Each update of the United States Code is a "release point". This page contains links to downloadable files for the most current release point. The available formats are XML, XHTML, and PCC (photocomposition codes, sometimes called GPO locators). Certain limitations currently exist. Although older PDF files (generated through Microcomp) are available on the Annual Historical Archives page, the new PDF files for this page (to be generated through XSL-FO) are not yet available. In addition, the five appendices contained in the United States Code are not yet available in the XML format.

Links to files for prior release points are available on the Prior Release Points page. Links to older files are available on the Annual Historical Archives page.


While pretty cool, I was expecting something different. Seems the XML is really pretty much XHTML. So while it IS XML, it's still a display markup schema...


Guess we'll have to wait for this to complete: Legislative Data Challenge - Win $5k challenge by helping the Library of Congress make US laws machine readable... Still, I applaud the effort!

 

Related Past Post XRef:
Legislative Data Challenge - Win $5k challenge by helping the Library of Congress make US laws machine readable...
From A to W... The US Gov goes Git (and API crazy too). There's an insane amount of data, API's and OSS projects from the US Government...

Monday, July 29, 2013

Building big bucks with big data... "Big Data, Analytics, and the Future of Marketing & Sales" Free eBook (With audio & video)

McKinsey - Chief Marketing & Sales Officer Forum - eBook: Big Data, Analytics, and the Future of Marketing & Sales


The goldmine of data available today represents a turning point for marketing and sales leaders

Table of Contents

Introduction

  • Putting big data and advanced analytics to work (& Article)

Business Opportunities

  • Use Big Data to find new micromarkets (Article)
  • Value of big data and advanced analytics (Video)
  • Big data, better decisions (Presentation)
  • Marketing’s $200 billion opportunity (Video)
  • Smart analytics: How marketing drives short-term and long-term growth (Article)
  • Know your customers wherever they are (Article)

Insights and action

  • Five steps to squeeze more ROI from your marketing (Article)
  • Case: advanced analytics disproves common wisdom (Video)
  • Getting to “the price is right” (Article)
  • Gilt Groupe: Using Big Data, mobile, and social media to reinvent shopping (Interview)
  • Under the retail microscope: Seeing your customers for the first time (Article)
  • The sales science behind Big Data (Video)
  • Name your price: The power of Big Data and analytics (Article)
  • Data: The real promise of social/local/mobile (Video)
  • Getting beyond the buzz: Is your social media working? (Article)
  • Big Data & advanced analytics: Success stories from the front lines (Article)

How to get organized and get started

  • Get started with Big Data: Tie strategy to performance (Article)
  • What you need to make Big Data work: The pencil (Article)
  • Need for speed: Algorithmic marketing and customer data overload (Article)
  • Simplify Big Data – or it’ll be useless for sales (Article)
  • The challenges of harnessing big data to better understand customers (Video)
  • Contributors
  • Connect with us

Not a dev thing, but still, big data is big, right?

Wednesday, July 17, 2013

Legislative Data Challenge - Win $5k challenge by helping the Library of Congress make US laws machine readable...

Nextgov - Contest Aims to Make Proposed U.S. Laws Machine Readable Worldwide

The Library of Congress is crowdsourcing an initiative to make it easier for software programs around the world to read, understand and categorize federal legislation.

The library is offering a $5,000 prize to the Challenge.gov contestant whose entry best fits U.S. legislation into Akoma Ntoso, an internationally-developed framework that aims to be the standard for presenting legislative data in machine-readable formats.

...

News from the Library of Congress - Library of Congress Announces Legislative Data Challenge

The Library of Congress, at the request of the U.S. House of Representatives, is utilizing the Challenge.gov platform to advance the exchange of legislative information worldwide.

Akoma Ntoso (www.akomantoso.org) is a framework used in many other countries around the world to annotate and format electronic versions of parliamentary, legislative and judiciary documents. The challenge, "Markup of U.S. Legislation in Akoma Ntoso", invites competitors to apply the Akoma Ntoso schema to U.S. federal legislative information so it can be more broadly accessed and analyzed alongside legislative documents created elsewhere.

"The Library works closely with the Congress and related agencies to make America’s federal legislative record more widely available through Congress.gov," said Robert Dizard Jr., Deputy Librarian of Congress. "This challenge will build on that accessibility goal by advancing the possibilities related to international frameworks. American legislators, analysts, and the public can benefit from international standards that reflect U.S. legislation, thereby allowing better comparative legislative information. We are initiating this effort as people around the world are working to share legislative information across nations and other jurisdictions."

Utilizing U.S. bill text, challenge participants would attempt to mark up the text into electronic versions using the Akoma Ntoso framework. Participants will be expected to identify any issues that appear when applying the Akoma Ntoso schema to U.S. bill text, recommend solutions to resolve those issues, and provide information on the tools used to create the markup.

The challenge, which opened today and closes Oct. 31, 2013, is extended to participants 18 years of age or older. For the official rules and more detailed information about the challenge or to enter a submission, visit akoma-ntoso-markup.challenge.gov.

The competition’s three judges are experts in either U.S. legislation XML standards or the Akoma Ntoso legal schema. The Library of Congress will announce the winner of the $5,000 prize on Dec. 19, 2013.

...

Akoma Ntoso

Akoma Ntoso (“linked hearts” in the Akan language of West Africa) defines a “machine readable” set of simple technology-neutral electronic representations (in XML format) of parliamentary, legislative and judiciary documents.

Akoma Ntoso XML schemas make “visible” the structure and semantic components of relevant digital documents so as to support the creation of high-value information services to deliver the power of ICTs to increase efficiency and accountability in the parliamentary, legislative and judiciary contexts.

Akoma Ntoso is an initiative of "Africa i-Parliament Action Plan" (www.parliaments.info) a programme of UN/DESA.


I'm trying really hard to be supportive of this and not be snarky (like at least with this, something will read the laws congress passes... OH darn, see what I mean? ;)

Monday, July 15, 2013

Gestalt your way to better data visualization by following the Gestalt Laws

Six Revisions - How to Make Data Visualization Better with Gestalt Laws

People love order. We love to make sense of the world around us.

The human mind’s affinity for making sense of the objects it sees can be explained in a theory called Gestalt psychology. Gestalt psychology, also referred to as gestaltism, is a set of laws that accounts for how we perceive or intuit patterns and conclusions from the things we see.

These laws can help designers produce better designs. For instance:

...

In this guide, we will talk about how to apply the principles of Gestalt to create better charts, graphs, and data visualization graphics.

For broader implementation tips of Gestalt laws, please read Gestalt Principles Applied in Design.

Introduction

Gestalt laws originate from the field of psychology. Today, however, this set of laws finds relevance in a multitude of disciplines and industries like design, linguistics, musicology, architecture, visual communication, and more.

These laws provide us a framework for explaining how human perception works.

Understanding and applying these laws within the scope of charting and data visualization can help our users identify patterns that matter, quickly and efficiently.

None of the Gestalt laws work in isolation, and in any given scenario, you can find the interplay of two or more of these laws.

Let us cover some of the Gestalt laws that are relevant to enhancing data visualization graphics.

...

Summary

To sum up the lessons we can derive from these Gestalt laws:

  1. Law of Prägnanz: Keep it simple. Arrange data logically wherever possible.
  2. Law of Continuity: Arrange objects in a line to facilitate grouping and comparison.
  3. Law of Similarity: Use similar characteristics (color, size, shape, etc.) to establish relationships and to encourage groupings of objects.
  4. Law of Focal Point: Use distinctive characteristics (like a different color or a different shape) to highlight and create focal points.
  5. Law of Proximity: Know what your chart’s information priority is, and then create groupings through proximity to support that priority.
  6. Law of Isomorphic Correspondence: Keep in mind your user and their preconceived notions and experiences. Stick to well-established conventions and best practices.
  7. Law of Figure/Ground: Ensure there is enough contrast between your foreground and background so that charts and graphs are more legible.
  8. Law of Common Fate: Use direction and/or movement to establish or negate relationships.


The title of my post should have been "Break the Gestalt Laws, go directly to the Data Visualization jail, do not..." Anyway, great write up, advice and guidance...

Thursday, July 11, 2013

A little Hadoop, HDInsight, Mahout, some .Net and a little StackOverflow and you have...

Amazedsaint's Tech Journal - Building A Recommendation Engine - Machine Learning Using Windows Azure HDInsight, Hadoop And Mahout

Feel like helping some one today?

Let us help the Stack Exchange guys to suggest questions to a user that he can answer, based on his answering history, much like the way Amazon suggests you products based on your previous purchase history.  If you don’t know what Stack Exchange does – they run a number of Q&A sites including the massively popular Stack Overflow. 

Our objective here is to see how we can analyze the past answers of a user to predict questions that he may answer in the future. Stack Exchange's current recommendation logic may work better than ours, but that won't prevent us from helping them for our own learning purposes.

We’ll be doing the following tasks.

  • Extracting the required information from Stack Exchange data set
  • Using the required information to build a Recommender

But let us start with the basics. If you are totally new to Apache Hadoop and Hadoop On Azure, I recommend you read these introductory articles before you begin, where I explain HDInsight and the Map Reduce model a bit in detail.

...

Conclusion

In this example, we were doing a lot of manual work to upload the required input files to HDFS, and triggering the Recommender Job manually. In fact, you could automate this entire workflow leveraging the Hadoop For Azure SDK. But that is for another post, stay tuned. Real-life analysis has much more to do, including writing map/reducers for extracting and dumping data to HDFS, automating creation of Hive tables, performing operations using HiveQL or PIG, etc. However, we just examined the steps involved in doing something meaningful with Azure, Hadoop and Mahout.

You may also access this data in your Mobile App or ASP.NET Web application, either by using Sqoop to export this to SQL Server, or by loading it to a Hive table as I explained earlier. Happy Coding and Machine Learning!! Also, if you are interested in scenarios where you could tie your existing applications with HDInsight to build end-to-end workflows, get in touch with me.


Just the article I've been looking for. It provides a nice start to finish view of playing with HDInsight and Mahout, which is something I was pulling my hair out over a few months ago...

Thursday, June 13, 2013

Getting into the flow, surfing restaurant inspections with GeoFlow and Microsoft Data Explorer (Think "Web Data + Excel + 3D = Good Food")

Microsoft Business Intelligence - Surfing Restaurant Inspections with Microsoft Data Explorer and GeoFlow

Father’s Day is approaching and you might be thinking about a good place to have a nice lunch with your Dad… We would like to show you how Data Explorer and Geoflow can help you gather some insights to make a good decision.

In order to achieve this, we will look at publicly available data about Food Establishment Inspections for the past 7 years and we will also leverage the Yelp API to bring ratings and reviews for restaurants. For the purpose of this post, we will focus on the King County area (WA) but you can try to find local data about Food Establishment inspections for your area too.

What you will need:

...

What you will learn in this post:

  • Import data from the Yelp Web API (JSON) using Data Explorer.
  • Import public data about Food Establishment Inspections from a CSV file.
  • Reshape the data in your queries.
  • Parameterize the Yelp query by turning it into a function, using the Data Explorer formula language, so you can reuse it to retrieve information about different types of restaurants as well as different geographical locations.
  • Invoke a function given a set of user-defined inputs in an Excel table.
  • Combine (Merge) two queries.
  • Load the final query into the Data Model.
  • Visualize the results in Geoflow.

...


You know you want to play with this... Just admit it. Makes me want to install Office 2013 just so I can...  :)

 

Related Past Post XRef:
Going with the GeoFlow for Excel 2013... Free 3D visualization add-in for mapping, exploring, and interacting with geographical/temporal data

Friday, May 24, 2013

From A to W... The US Gov goes Git (and API crazy too). There's an insane amount of data, API's and OSS projects from the US Government...

Nextgov - White House Releases New Tools for Digital Strategy Anniversary

The White House marked the one-year anniversary of its digital government strategy Thursday with a slate of new releases, including a catalog of government APIs, a toolkit for developing government mobile apps and a new framework for ensuring the security of government mobile devices.

Those releases correspond with three main goals for the digital strategy: make more information available to the public; serve customers better; and improve the security of federal computing.

...

DATA.Gov - Developer Resources

Government Open Source Projects

That list of API's and projects just blows my mind... I mean... wow. If you're looking to wander through some code, there HAS to be something here that you'll find interesting. There's something for every language, platform and interest, I think...

 

Related Past Post XRef:
Happy Birthday Data.gov. You’ve grown so in the last year… (from 47 to 272,677 datasets)

Thursday, May 09, 2013

And Data for All... President Obama signs Executive Order to make government-held data more accessible (in machine readable form by default)

The White House Blog - Landmark Steps to Liberate Open Data

Today, as he heads to Austin, Texas, for a Middle Class Jobs and Opportunity Tour, President Obama signed an Executive Order directing historic steps to make government-held data more accessible to the public and to entrepreneurs and others as fuel for innovation and economic growth. The Executive Order declares that information is a valuable resource and strategic asset for the Nation. We couldn’t agree more.

Under the terms of the Executive Order and a new Open Data Policy released today by the Office of Science and Technology Policy and the Office of Management and Budget, all newly generated government data will be required to be made available in open, machine-readable formats, greatly enhancing their accessibility and usefulness, while ensuring privacy and security.

...


We're making a lot more data open to the public [Received Email]

Hi, all --

Earlier today, President Obama signed an Executive Order directing his administration to take historic steps to make government-held data more accessible to the public and to entrepreneurs and others as fuel for innovation and economic growth.

Here's what you need to know:

  • The Executive Order declares that information is a valuable resource and strategic asset for the nation.
  • Newly generated government data will be required to be made available in open, machine-readable format by default -- enhancing their accessibility and usefulness, and ensuring privacy and security.
  • These executive actions will allow entrepreneurs and companies to take advantage of this information -- fueling economic growth in communities across the Nation.

Data, data, data! I love me some data!

Now, to turn it into information, knowledge and maybe even wisdom, that's the hard part.

Thursday, December 13, 2012

Rest up with the REST JSON/JSONP "Open Beer Database" API

Visual Studio Magazine - Beer? There's an API for that!

I've been fooling around with REST services, getting JSON data back from free online sources and displaying it in Web or Windows Store apps via a ListView or FlipView, and so on.

After experimenting with the Windows Azure Mobile Services, which simplifies the back-end data-access process and lets you easily set up your own services, I was trying out other APIs and just had to pass on my latest discovery: beer.

Yup, there's an Open Beer Database, described as "a free, public database and API for beer information." Now, that's my kind of information ...

Anyway, note that the Open Beer API "is currently a work-in-progress and is subject to change without notice." It returns data in JSON or JSONP (to work around cross-domain calls). It provides the usual CRUD operations via HTTP verbs GET, POST, PUT and DELETE and lets you retrieve breweries or beers, both as aggregates or singly by ID number.

..."

Open Beer Database

The API is currently a work-in-progress and is subject to change without notice.

Overview

Caching

Currently no requests are cached.

Rate Limit

The API is not currently rate limited. Please use good judgment when designing your application.

Responses

Currently the only supported formats are JSON (application/json) and JSONP (text/javascript).

Public Token

A public token has read-only access.

Private Token

A private token has read and write access.

Open Beer Database - Displaying Breweries via JSONP

Beer! REST API! Beer!

Sunday, August 12, 2012

Free Big Data eBook of the Day, "Mining of Massive Datasets"

Mining of Massive Datasets

The book has now been published by Cambridge University Press. The publisher is offering a 20% discount to anyone who buys the hardcopy here. By agreement with the publisher, you can still download it free from this page. Cambridge Press does, however, retain copyright on the work, and we expect that you will obtain their permission and acknowledge our authorship if you republish parts or all of it. We are sorry to have to mention this point, but we have evidence that other items we have published on the Web have been appropriated and republished under other names. It is easy to detect such misuse, by the way, as you will learn in Chapter 3.

Download Version 1.0

The following materials are equivalent to the published book, with errata corrected to July 4, 2012. It has been frozen as we revise the book. The evolving book can be downloaded as "Version 1.1" below.

Download the Complete Book (340 pages, approximately 2MB) [GD: Click through for all the downloads]

Download chapters of the book:

Preface and Table of Contents
Chapter 1 Data Mining
Chapter 2 Large-Scale File Systems and Map-Reduce
Chapter 3 Finding Similar Items
Chapter 4 Mining Data Streams
Chapter 5 Link Analysis
Chapter 6 Frequent Itemsets
Chapter 7 Clustering
Chapter 8 Advertising on the Web
Chapter 9 Recommendation Systems
Index

Download Version 1.1

Below is a draft, evolving version of the MMDS book. We have added Jure Leskovec as a coauthor, and at this point added only one new chapter, on mining large graphs. However, we will be making available new chapters on large-scale machine-learning algorithms and dimensionality reduction, as well as expanding Chapter 2 on map-reduce algorithm design.

Download the Complete Book (395 pages, approximately 2.4MB)

Download chapters of the book:

Preface and Table of Contents
Chapter 1 Data Mining
Chapter 2 Large-Scale File Systems and Map-Reduce
Chapter 3 Finding Similar Items
Chapter 4 Mining Data Streams
Chapter 5 Link Analysis
Chapter 6 Frequent Itemsets
Chapter 7 Clustering
Chapter 8 Advertising on the Web
Chapter 9 Recommendation Systems
Chapter 10 Mining Social-Network Graphs
Index

From the Preface of v1.1

This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course CS345A, titled “Web Mining,” was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates. When Jure Leskovec joined the Stanford faculty, we reorganized the material considerably. He introduced a new course CS224W on network analysis and added material to CS345A, which was renumbered CS246. The three authors also introduced a large-scale data-mining project course, CS341. The book now contains material taught in all three courses.

What the Book Is About
At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort. The principal topics covered are:

1. Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
2. Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
4. The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach.
5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
6. Algorithms for clustering very large, high-dimensional datasets.
7. Two key problems for Web applications: managing advertising and recommendation systems.
8. Algorithms for analyzing and mining the structure of very large graphs, especially social-network graphs.

If you're really big into big data, or wanna-be, this eBook looks to be just for you.

(via Jason Haley - Interesting Finds: August 12, 2012)

Saturday, August 04, 2012

Creating the Complex [databases for testing] Series

SamLester - Creating Complex Test Databases - Intro

As a very brief intro, I have worked as a tester in SQL Server for the past 10+ years on many different features. Along the way, we develop and test features and release them to the public only to discover some customers inevitably encounter bugs when they run the features against their databases. How can this happen when we have amazing PMs and developers, devoted and talented test teams, and thousands of automated test cases per feature? The answer often lies in the incredible complexity of customer databases running on SQL Server and the evolution of those databases as they have grown from small to very complex databases over the years. As testers, we have a few different options to try to mitigate this problem and represent "all possible databases" in our testing, but it is impossible to test every possible permutation of databases based on this complexity. In practice, we do all of these to an extent and are constantly working on improving each of them.

Some options are to:

  • Acquire real customer databases - these are often the best test databases, but pose many challenges to acquire due to size, network, security, PII, NDAs, etc. We often work with our internal Microsoft product teams who run large scale database applications to leverage their DBs. (Dreaming out loud: I'd love to try to work with customers to figure out a way to get more "privacy scrubbed" customer DBs into our test environment. Microsoft products get better coverage, the customer applications are guaranteed to work, and we all win. I'll blog more on this later, but send me a message if you're interested in working together to get your scrubbed databases in our test bed.)
  • Programmatically write tools that can create many permutations of databases with various objects, properties, relationships, etc. Feed various inputs into this tool to create different test databases. We have had pretty good success with this model as we're able to use many smart testing techniques and create some great test databases that uncover some great bugs.
  • Maintain a database of "interesting" syntax and write automated data-driven test cases based on this object level syntax. As we encounter any new bug, distill the bug down to the problematic syntax and add that to our existing syntax database.
  • Handcraft complex databases with very specific requirements based on the testing needed for a particular feature/sign-off.

The last option (handcrafted databases) is often our last resort, but results in the most effective method for ensuring that specific features work for specific test cases. Our dev, test, and PM team spent some time recently, for a feature we are working on, to come up with the list of "complex" databases that we do not have in our test environment but would like to add. Over the next few blog posts, I'll cover some of the interesting databases and the techniques I used to create them. Here are a few of the DBs we had in mind:

SamLester - Creating Complex Test Databases - One Table for each of the 2,397 supported Collations

As a follow up to my post on complex test databases, this article will cover one of the more interesting test DBs I recently created.

Goal: Create a database that includes one table for each supported collation. Each table contains a single column with the various column level collations supported by SQL Server 2012 (nearly 2,400 different collations supported).

The first step here is to determine where we can find the exhaustive list of supported collations. The answer comes from the built-in table-valued function, fn_helpcollations, that returns the list of supported collations in SQL Server 2012. Once we have the exhaustive list of supported collations, we need to determine how we will leverage this list to create one table for each collation. If we were to do this manually, we would write out the following CREATE TABLE statements:

create table T1 (c1 nvarchar(50) collate Albanian_100_BIN)
create table T2 (c1 nvarchar(50) collate Albanian_100_BIN2)
create table T3 (c1 nvarchar(50) collate Albanian_100_CI_AI)
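-- ... (tables T4 through T2394 follow the same pattern) ...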

create table T2395 (c1 nvarchar(50) collate Yakut_100_CS_AS_KS)
create table T2396 (c1 nvarchar(50) collate Yakut_100_CS_AS_KS_WS)
create table T2397 (c1 nvarchar(50) collate Yakut_100_CS_AS_WS)

The repetition of these statements makes them good candidates for scripting using T-SQL. By leveraging the ROW_NUMBER function as the table numeric identifier, we're able to put together the following statement:

..."

SamLester - Creating Complex Test Databases - Creating a Database with 1 Billion Random Rows

"As part of my series on creating databases with interesting characteristics for testing purposes, today we'll create a database containing a large number of tables, each with a large number of rows inserted.

Goal: Create a database that contains 1,000 tables. Each table contains 5 integer columns and should contain 1,000,000 random rows of data.

1,000 tables x 1,000,000 rows/table = 1,000,000,000 rows of data

One billion rows? Really? Yes, 1 billion rows! And random data, please.

For this task, we can break it down into two steps. The first is to create the 1,000 tables, which we can easily accomplish through a TSQL script. The next part is to populate the tables with data. For this task, we will leverage the Data Generation tool in Visual Studio 2010.

..."

A co-worker mentioned a bit ago that his team needed something like this, safe yet complex data at a production scale. While this series isn't a perfect problem fit yet, it looks like it might evolve into one. Worth shooting off an email to him anyway... :)

Wednesday, August 01, 2012

Shining a light on your State Government data with help from the Sunlight Foundation and Open States site

Sunlight Foundation - Check Out the Open States Beta Site

"If you don't read the Sunlight Labs blog as religiously as I do (you should!), you might miss that our Open States project now has a public beta site up and running. Users can find who their state reps are, their voting record and contact information, the most recent actions taken by the legislature and search the full text of current and past bills. The OpenStates.org site is now the best place to find information on the activities in these 20 state legislatures: Alaska, Arizona, California, District of Columbia, Delaware, Florida, Hawaii, Idaho, Illinois, Louisiana, Maryland, Minnesota, Montana, North Carolina, New Hampshire, New Jersey, Ohio, Texas, Utah and Wisconsin with more coming soon.

For more information on the new site read my colleague James Turk's post here. Also be sure to play with the iOS app or, if you're a developer, use the API and join the Google Group. Our Scout project incorporates the Open States API to allow users to follow and search bills in all 50 states.

..."

Open States - California

The best part?

An API, bulk downloads, RSS feeds, and even the source to the entire site!

If you're looking for data about what your state government is doing, or want to write an app that provides that data, Open States looks like a must-use resource.