Showing posts with label EDD. Show all posts

Thursday, October 02, 2014

Getting started with the free (for 1000 calls) Text Analysis API from AYLIEN

Text Analysis blog | Aylien - How to Get Started with AYLIEN Text Analysis API


Getting up and running with AYLIEN’s Text Analysis APIs couldn’t be easier. It’s a simple 3-part process from signing up to calling the API. This blog will take you through the complete process of creating an account, retrieving your API Key and Application ID, and making your first call to the API.

Part 1: Signing up for a free account

Navigate to and click on the “Subscribe For free” button. This will bring you to a sign-up form which will ask for your details in order to set up your account and generate your credentials.

By signing up, you will get access to our basic plan which will allow you to make 1,000 API calls per day for free. Note: There is no credit card needed to get access to our basic plan. ;)


Part 3: Creating your first application
Our getting started guide is designed to get you up and running with the API and making calls as quickly and as easily as possible. Here you will find the API documentation, features, links to a demo and some code snippets.

We have included sample code snippets for you to use in the following languages:

  • Java
  • Node.js
  • Python
  • Go
  • PHP
  • C#
  • Ruby

To start making calls, while you’re on the getting started page, scroll down to the “Calling the API” section. Choose which language you wish to use and take a copy of the code snippet. In this example, we are going to use Node.js.
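Whichever language you pick, each of those snippets boils down to the same authenticated HTTP call. As a rough sketch in Python: the endpoint URL below is a placeholder rather than the real API address, and the header names are my recollection of what the snippets use, so verify both against the getting started guide:

```python
# Build (but don't send) an authenticated request for a hypothetical
# sentiment endpoint. The base URL is a placeholder; the X-AYLIEN-* header
# names are assumptions to be checked against the documentation.
from urllib.parse import urlencode

def build_sentiment_request(app_id, app_key, text):
    """Return the URL and headers for a hypothetical sentiment call."""
    base = "https://api.example.com/v1/sentiment"  # placeholder endpoint
    query = urlencode({"text": text})
    headers = {
        "X-AYLIEN-TextAPI-Application-ID": app_id,   # assumed header name
        "X-AYLIEN-TextAPI-Application-Key": app_key,  # assumed header name
    }
    return f"{base}?{query}", headers

url, headers = build_sentiment_request("my-app-id", "my-app-key",
                                       "I love this API")
print(url)
```

From there it is just a GET/POST with your HTTP client of choice, counting each call against the 1,000-per-day limit.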



Okay, 1,000 calls is not enough to build a biz on (not that you would) but it is more than enough to play with and still do some cool things. Imagine using this in your blogging, where you gather some cool text analysis info automagically from your post. Or spread out over time, analysis of all your posts. Or maybe a means to help you filter down your news stream. Or... or... or... There's a ton of stuff you can do with an API like this and being free'ish, you can play for, well, free.

Thursday, September 25, 2014

"Email Forgery Analysis..."

Email Forgery Analysis in Computer Forensics

Emails are usually at the top of the list when it comes to potentially relevant electronically stored information (ESI) sources. They often capture critical business correspondence, agreements, business documents, internal company discussions, etc. They are also one of the most frequently forged document types. They can be altered in many ways, such as by backdating or by changing the sender, recipients or message contents. Fortunately, email servers and client computers often contain various metadata which can be used for forensic email forgery analysis.

One of these metadata fields is the Conversation Index property. I previously wrote about E-mail Conversation Index Analysis and how it can be useful in forensic analysis of e-mails, particularly email forgery analysis. In this post, we will put that weapon to use — along with other computer forensics techniques — and take a close look at a sample fraudulent email message.
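To give a feel for why the Conversation Index is such a useful weapon, here is a minimal Python sketch of the structural check it enables. It assumes the layout documented for PidTagConversationIndex (a 22-byte header block followed by one 5-byte block per reply); this is a simplification for illustration, not a full parser:

```python
# Sanity-check a PidTagConversationIndex value. Per the documented layout,
# the header block is 22 bytes (1 reserved byte, 5 bytes of FILETIME,
# a 16-byte GUID) and each reply appends a 5-byte child block.
def conversation_depth(index_bytes):
    """Return the number of reply blocks, or None if the value is malformed."""
    if len(index_bytes) < 22 or (len(index_bytes) - 22) % 5 != 0:
        return None  # a forged or corrupted index often fails this check
    return (len(index_bytes) - 22) // 5

# A 22-byte header plus two 5-byte reply blocks: two replies deep.
sample = bytes(22) + bytes(5) + bytes(5)
print(conversation_depth(sample))  # -> 2
```

A reply that claims to be deep in a thread but carries a header-only (or malformed) index is exactly the kind of inconsistency a forgery examiner looks for.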




As the use of electronic documents as evidence in legal proceedings is becoming more and more popular, so is email forgery, electronic document date forgery and other electronic fraud. However, electronic documents usually contain numerous metadata fields, rendering most forgery attempts discoverable. Email transport headers and other metadata such as the Conversation Index, Sent Time and Delivery Time Microsoft Outlook Messaging API (MAPI) Properties are just a few of the numerous metadata fields computer forensics experts can use during email forgery analysis."

There's some great information in this post (and the linked ones) and it's perfect for you CSI guys (and those of you in the Legal/ESI/eDiscovery world).

Understanding a Sentiment Analysis Engine

Microsoft Lystavlen - the Online display board - Understanding the Sentiment Engine in Microsoft Social Listening

Sentiment Analysis

If you want to see how the public perceives your company or product, you can use sentiment analysis, which determines people’s attitudes toward a topic. Sentiment analysis reflects the public perception of a post’s content in relation to the keywords that were used to find the post (a post is, e.g., a Twitter post or a Facebook comment).

Each post that results from your defined search queries is processed by the sentiment engine in the original language and annotated with a calculated sentiment value. Sentiment values are provided for the following languages:

  1. English
  2. German
  3. French
  4. Spanish
  5. Portuguese
  6. Italian

The sentiment value results in a positive, negative, or neutral sentiment for a post. Occasionally, the algorithm identifies positive and negative parts of a sentence and still rates the post as neutral. This happens because the amounts of a post’s text identified as positive and negative cancel each other out. A post is also classified as neutral if there are no positive or negative statements detected in it.
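A toy illustration of that cancellation, just counting detected fragments (this is not Microsoft's actual algorithm, only a sketch of the described behavior):

```python
# Score each detected fragment +1 (positive) or -1 (negative); a sum of
# zero, whether from no fragments at all or a perfect cancel, is neutral.
def overall_sentiment(fragment_scores):
    """fragment_scores: +1 per positive fragment, -1 per negative one."""
    total = sum(fragment_scores)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(overall_sentiment([+1, -1]))  # one positive, one negative: neutral
print(overall_sentiment([]))        # nothing detected: also neutral
```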

Note that the sentiment algorithm is not a self-learning system, even if you can edit any post’s sentiment value in the post list.

Understanding the Sentiment Engine

Let’s take a closer look at the sentiment engine using the example post below, in the context of the search topic "Windows Phone".



I thought this post a great explanation of Sentiment Analysis, which comes up in my day job now and then. While we don't use Social Listening or the related products, we do use a library that lets us integrate like functionality into our LOB apps, and explaining how it works quickly has always been "fun". This post and explanation will come in handy next time...


Related Past Post XRef:
Comparing Sentiment Analysis REST API's
10 Professionals, 10 views on the coming trends in text analytics

Tuesday, September 16, 2014

Logikcull is trying to help make eDiscovery logical... and giving you a free eDiscovery Education and sandbox to play in too!

Logikcull - Logikcull Launches On-line e-Discovery Education for the Legal Community. And it’s Free.

WASHINGTON, DC, September 15, 2014 - Logikcull announced today the launch of its e-Discovery Education for Everyone initiative. With this initiative, lawyers around the world will be able to learn e-discovery by performing e-discovery, for free, and without the need for special software or hardware. Simply log in and Logikcull will automatically teach you how to do e-discovery. Each free Logikcull account comes with a set of “sandbox-data” that lawyers can use to learn about metadata, de-duplication, and many other technical aspects of e-discovery.

Understanding e-discovery is becoming increasingly important to the legal community. For instance, California has an open ethics opinion stating that lawyers who don’t attempt to learn about e-discovery may face malpractice claims. The Logikcull e-Discovery Education for Everyone initiative hopes to mitigate that risk for attorneys everywhere.

“E-discovery education is vital for lawyers practicing in the 21st century—which is to say, all of us. The time is nigh when not understanding e-discovery will be malpractice; but why wait until then? If you want to remain competitive and competent, you’d better know about it now,” said Mark Wilson, a writer who recently wrote about “Is Not Understanding e-Discovery Unethical?”

One of the problems with learning about e-discovery is just that: you learn about it in the very traditional CLE-way. But usually, you don’t actually get to practice e-discovery until it’s too late. And if it’s too late, you may end up making poor e-discovery choices that can result in professional misconduct. With Logikcull, lawyers can learn e-discovery by doing e-discovery in a safe and real-world environment. This learn-by-doing environment will better prepare legal professionals when e-discovery demands arise.


This is a first I think, a public eDiscovery sandbox to learn and play in. You know I've said, over and over, that even if you're not in a legal firm/department/etc., even if you're just an IT guy or gal, if you have "customers" then there's a chance your firm might be involved in a legal matter one day. The more you know about what this "eDiscovery" thing is, the better you'll be able to help your firm (or at least understand what your lawyers are asking for and why...)

One thing to understand, this is a "request" for a sandbox. It's not immediate (though the automated emails come in fast...). I'm hoping this is sales-hands-off and just a site/service I can use, learn from and play with. If this is just a "demo-with-data" that "we'll walk you through..." i.e. salesware, I'll be disappointed. But these guys seem to be pretty unusual for a firm (see the XRef's below), so I'm keeping my fingers crossed.


Related Past Post XRef:
Something we all need sometimes, some “Logik Redaction”
Every industry deserves its own apparel, doesn’t it? Now there’s apparel for the EDD Guy or Gal in you…

Wednesday, July 30, 2014

"The Art of Memory Forensics"

Windows Incident Response - Book Review: "The Art of Memory Forensics"


I recently received a copy of The Art of Memory Forensics (thanks, Jamie!!), with a request that I write a review of the book.  Being a somewhat outspoken proponent of constructive and thoughtful feedback within the DFIR community, I agreed.

This is the seminal resource/tome on memory analysis, brought to you by THE top minds in the field.  The book covers Windows, Linux, and Mac memory analysis, and as such must be part of every DFIR analyst's reading and reference list.  The book is 858 pages (not including the ToC, Introduction, and index), and is quite literally packed with valuable information.


If you have an interest in memory analysis, this is THE MUST-HAVE resource!  If you or anyone on your team is analyzing Windows systems and doesn't have this book on your shelf, that is wholly wrong.  Do NOT keep this book on a shelf...keep it on your desk, and open!  Within the first two weeks of this book arriving in your hands, it should have a well-worn spine, and dirty fingerprints and stains on the pages!  If you have a team of analysts, purchase multiple copies and engage the analysts in discussions.  If one of your analysts receives a laptop system for analysis and the report does not include information regarding the analysis of the hibernation file, I would recommend asking them why - they may have a perfectly legitimate reason for not analyzing this file, but if you had read even just a few chapters of this book, you'd understand why memory analysis is too important to ignore. "

Not something I really need right now, nor probably many of you, but I still think it's pretty darn cool looking, and talk about a geek level-up tool! :)

Tuesday, June 10, 2014

MAPI ain't dead, it's MAPI/HTTP!

A few years ago I reblogged about a post that implied MAPI was dead, Exchange 2013 says "See ya MAPI and goodbye Outlook 2003!" Exchange 2013 drops MAPI support.

Well it ain't. MAPI over TCP is (dead'ish), but MAPI itself is alive and well and moving forward into a more connected world...

João Ribeiro - What is MAPI over HTTP ?

MAPI over HTTP is a new transport used to connect Outlook and Exchange. MAPI/HTTP was first delivered with Exchange 2013 SP1 and Outlook 2013 SP1 and begins gradually rolling out in Office 365 in May. It is the long term replacement for RPC over HTTP connectivity (commonly referred to as Outlook Anywhere). MAPI/HTTP removes the complexity of Outlook Anywhere’s dependency on the legacy RPC technology.


The Exchange Team Blog - Outlook Connectivity with MAPI over HTTP

Among the many new features delivered in Exchange 2013 SP1 is a new method of connectivity to Outlook we refer to as MAPI over HTTP (or MAPI/HTTP for short). We’ve seen a lot of interest about this new connection method and today we’ll give you a full explanation of what it is, what it provides, where it will take us in the future, and finally some tips of how and where to get started enabling this for your users.

What is MAPI over HTTP?

MAPI over HTTP is a new transport used to connect Outlook and Exchange. MAPI/HTTP was first delivered with Exchange 2013 SP1 and Outlook 2013 SP1 and begins gradually rolling out in Office 365 in May. It is the long term replacement for RPC over HTTP connectivity (commonly referred to as Outlook Anywhere). MAPI/HTTP removes the complexity of Outlook Anywhere’s dependency on the legacy RPC technology. Let’s compare the architectures.


MAPI/HTTP moves connectivity to a true HTTP request/response pattern and no longer requires two long-lived TCP connections to be open for each session between Outlook and Exchange. Gone are the twin RPC_DATA_IN and RPC_DATA_OUT connections required in the past for each RPC/HTTP session. This change will reduce the number of concurrent TCP connections established between the client and server. MAPI/HTTP will generate a maximum of two concurrent connections: one long-lived connection and an additional on-demand, short-lived connection.

Outlook Anywhere also essentially double-wrapped all of the communications with Exchange, adding to the complexity. MAPI/HTTP removes the RPC encapsulation within HTTP packets sent across the network, making MAPI/HTTP a better-understood and more predictable HTTP payload.

An additional network level change is that MAPI/HTTP decouples the client/server session from the underlying network connection. With Outlook Anywhere connectivity, if a network connection was lost between client and server, the session was invalidated and had to be reestablished all over again, which is a time-consuming and expensive operation. In MAPI/HTTP when a network connection is lost the session itself is not reset for 15 minutes and the client can simply reconnect and continue where it left off before the network level interruption took place. This is extremely helpful for users who might be connecting from low quality networks. Additionally in the past, an unexpected server-side network blip would result in all client sessions being invalidated and a surge of reconnections being made to a mailbox server. Depending on the number of Outlook clients reconnecting, the re-establishing of so many RPC/HTTP connections might strain the resources of the mailbox server, and possibly extend the outage in scope (to Outlook clients connected to multiple servers) and time, caused by a single server-side network blip.
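The session/connection decoupling described above can be sketched in a few lines of Python. The 15-minute grace window comes from the post; the function itself is a toy model of the client-side decision, not Outlook's actual logic:

```python
# With MAPI/HTTP, a dropped TCP connection does not kill the session:
# the server keeps it alive for a grace period, so the client can
# resume rather than rebuild from scratch (the expensive RPC/HTTP path).
GRACE_SECONDS = 15 * 60  # the 15-minute window the post describes

def can_resume(session_last_seen, now, grace=GRACE_SECONDS):
    """True if the server-side session is still alive after a network blip."""
    return (now - session_last_seen) <= grace

# Connection dropped 5 minutes ago: resume. 20 minutes ago: full reconnect.
print(can_resume(0, 5 * 60))   # True
print(can_resume(0, 20 * 60))  # False
```

That single check is why a server-side network blip no longer triggers a surge of full session re-establishments against the mailbox server.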

Why MAPI over HTTP?


settings. This makes it easier to roll out changes in authentication settings for Outlook.

The future

MAPI/HTTP puts the Exchange team in a position to innovate more quickly. It simplifies the architecture, removing the dependency on RPC technologies which are no longer evolving as quickly as customers demand. It provides the path for extensibility of the connection capabilities. A new capability on the roadmap for Outlook is to enable multi-factor authentication for users in Office 365. This capability is made possible with the use of MAPI/HTTP and is targeted to be delivered later this year. For a deeper look at this upcoming feature you can review the recent Multi-Factor Authentication for Office 365 blog post. This won’t stop with Office 365 MFA, but provides the extensibility foundation for 3rd-party identity providers.

How does MAPI/HTTP work?

Let’s walk through the scenario of an Outlook 2013 SP1 client connecting to Exchange Server 2013 SP1 after MAPI/HTTP has been enabled.


What’s required?

So now that we have a clear set of advantages you can offer users, let’s review the requirements to enable MAPI/HTTP.


Now deploy MAPI/HTTP

Now that you have prepared your servers with SP1, updated your clients, and reviewed potential sizing impacts, you are ready to get on with implementing MAPI/HTTP. It is disabled by default in SP1 and you must take explicit actions to configure and enable it. These steps are well covered in the MAPI over HTTP TechNet article.

A few important things to remember in your deployment.


How do I know it is working?

There are a few quick ways to verify your configuration is working as expected.



MAPI/HTTP provides a simplified transport and resulting architecture for Outlook to connect with Exchange. It enables improved user experiences, giving users faster access to mail, and improves the resilience of their Outlook connections. These investments are the foundation for future capabilities such as multi-factor authentication in Outlook. It also helps IT support and troubleshoot client connection issues using standard HTTP protocol tools.

As with all things new, you must properly plan your implementation. Use the deployment guidance available on TechNet and the updated sizing recommendations in the calculator before you start your deployment. With proper use it will guide you to a smooth deployment of MAPI/HTTP.

Special thanks to Brian Day and Abdel Bahgat for extensive contributions to this blog post.

Brian Shiers | Technical Product Manager


We collected a number of questions which frequently came up during the development, internal dogfooding, and customer TAP testing of MAPI/HTTP. We hope these answer most of the questions you may have about MAPI/HTTP.


So there, MAPI ain't dead, but is instead better than ever!

Monday, May 19, 2014

V3 EDRM Diagram gets more IG (Information Governance)

EDRM - New EDRM Diagram Emphasizes Information Governance

SAINT PAUL, Minn. – May 19, 2014 – EDRM, the leading standards organization for the e-discovery market, today announced the release of Version 3 of the Electronic Discovery Reference Model (EDRM) diagram. Originally published in 2006, the framework is a popular tool used by legal professionals and others involved in e-discovery to help clarify processes and expectations among project stakeholders.

Version 3 of the EDRM diagram offers significant updates, primarily to express the importance of information governance (IG) as a key piece of the electronic discovery process. The new model is as follows:


The leftmost item in the model has been renamed “Information Governance” and its shape has been changed from a rectangle to a circle. These edits better align this diagram with EDRM’s Information Governance Reference Model (IGRM). The adoption of a circle also is meant to show that every well-managed e-discovery process should start and end with sound information governance.


Look, it's my day job, you know, the one that pays the bills? So sometimes I have to blog about day job related stuff (that and I've been following this project since 2005). This diagram is one such thing. Since it came out, it's become a standard part of any EDD/ESI presentation and vendor display, and part of the vernacular. Any change, even a small one, is news of note...


Related Past Post XRef:
There's a new eDiscovery diagram in town... "Electronic Discovery Best Practices" at

Time ENF? "ENF, a New Standard for Managing Native Files"
PII Problems in the Public Enron Data Set (aka "Industry Ouch")
And even more Enron (PST’s that is) We’re talking 107GB, compressed, of data…
EDRM Enron Reference Data v2 now available
Need a ton of email data (10’s of gig’s)? Need it in PST form? Need it to be public data? Want to look behind the curtain into Enron? The EDRM Data Set Project is for you…
EDRM - Electronic Discovery Reference Model

Thursday, May 15, 2014

Go direct to... SMB Direct - If you're accessing large, heavily accessed files via SMB...

Tip of the Day - Tip of the Day: SMB Direct

Today’s Tip…

Windows Server 2012 includes a new feature called SMB Direct, which supports the use of network adapters that have Remote Direct Memory Access (RDMA) capability. Network adapters that have RDMA can function at full speed with very low latency, while using very little CPU. For workloads such as Hyper-V or Microsoft SQL Server, this enables a remote file server to resemble local storage. SMB Direct includes:

  • Increased throughput: Leverages the full throughput of high speed networks where the network adapters coordinate the transfer of large amounts of data at line speed.
  • Low latency: Provides extremely fast responses to network requests, and, as a result, makes remote file storage feel as if it is directly attached block storage.
  • Low CPU utilization: Uses fewer CPU cycles when transferring data over the network, which leaves more power available to server applications.

SMB Direct is automatically configured by Windows Server 2012. [GD: Post Leached in full]

You'll want to check this out. Say you're accessing really large PSTs via a network share and it's not working out real well; this might be something you should run to check out. The problem might be, though, that this is a Server 2012 feature and you're accessing those resources from a Win7 box... hum... Will have to think about that.


Related Past Post XRef:
Pst... Storing PST's on a network share? Still a no-no...

Thursday, April 24, 2014

Andy Warhol Amiga Love... Lost art retrieved from Amiga Floppy disks

ars technica - Lost Warhol works uncovered from old Amiga floppy disks


A collection of Warhol works were uncovered in March on a set of old Amiga floppy disks, according to a press release by the Studio for Creative Inquiry (via BoingBoing). The files were eased off of the disks with help from the Carnegie Mellon Computer Club, a collective that specializes in dealing with old computer hardware.

The works were obtained from hardware that was sitting dormant in the Warhol Museum, including "two Amiga 1000 computers in pristine condition," an "early drawing tablet," and "a large collection of floppy diskettes comprised of mostly commercial software."

The fact that the floppy disks contained commercial software as opposed to saved works initially disappointed the team. However, they soon discovered some original and signed works on a GRAPHICRAFT floppy after using a Kickstart ROM to boot the emulator.


A fuller description of the technical process is available in PDF form, and a documentary film about the project will screen at the Carnegie Library Lecture Hall in Pittsburgh on May 10.

Awesome Amiga news. Amiga lives! :)

This also kind of relates to my day job in the eDiscovery world, as every so often we have to deal with stuff kind of like this. I remember trying to hunt down a 5 1/4-inch drive so we could try to read some real floppies... lol

Tuesday, April 08, 2014

Here comes the Sun? The Cloud is becoming less scary for businesses, according to this survey at least...

eDiscovery Daily Blog - Cloud Security Fears Diminish With Experience - eDiscovery Trends

One of the more common trends identified by thought leaders in our recently concluded thought leader series was the continued emergence of the cloud as a viable solution to manage corporate big data.  One reason for that appears to be greater acceptance of cloud security.  Now, there’s a survey that seems to confirm that trend.


Forbes - Cloud Security Fears Diminish With Experience, Survey Shows

Security is always the leading fear among companies just starting to dip their toes into the cloud computing realm. However, as time passes and they gain experience, their security worries vanish.

That’s one of the takeaways from a recent survey of 1,068 companies conducted by RightScale, Inc. The survey’s authors report that while the benefits of the cloud increase with experience, the challenges of cloud show a sharp decrease as organizations gain expertise with cloud. Close to one-third of executives and professionals who have not yet implemented cloud say security is their top concern, a number that diminishes to 13 percent of seasoned, heavy users of cloud services (and is only the fifth-ranked concern on their list).

One-fourth of respondents did not have clouds in place, while another 22 percent were seasoned cloud veterans, the survey finds. The reduced concern about security reflects a comfort level that increases as the time spent with cloud engagements increases. That doesn’t mean slacking off on security, of course — ultimately, security is the responsibility of the end-user company.


Rightscale - 2014 State of the Cloud Report: See the Latest Trends on Cloud Adoption

The RightScale 2014 State of the Cloud Report includes data and analysis on cloud adoption by enterprises and SMEs in a dozen industries.

Download the report to find out:

  • How you compare in cloud adoption relative to other companies.
  • What progress enterprises are making in the journey to hybrid cloud.
  • Key challenges in enterprise cloud strategy and governance.
  • How DevOps and Self-Service IT align with cloud initiatives.
  • Why competition among cloud providers is heating up and how you can benefit.


Executive Summary

In February 2014, RightScale surveyed 1068 technical professionals across a broad cross-section of organizations about their adoption of cloud computing.

The 2014 State of the Cloud Survey identified several key findings:

Cloud adoption reaches ubiquity.
• 94 percent of organizations surveyed are running applications or experimenting with infrastructure-as-a-service.
• 87 percent of organizations are using public cloud.

Hybrid cloud is the approach of choice.
• 74 percent of enterprises have a hybrid cloud strategy and more than half of those are already using both public and private cloud.

Enterprise cloud governance lags adoption.
• Less than a third of organizations have defined such critical aspects of governance as which clouds can be used, disaster recovery approaches, and cost management.

The challenge of cloud security is abating.
• The number of respondents who regard cloud security as a significant challenge has decreased among both cloud beginners and cloud pros.

Next-generation IT shapes up as Cloud + DevOps + Self-Service IT.
• Cloud-focused companies embrace DevOps (71 percent) and Self-Service IT (68 percent).

Amazon Web Services (AWS) continues to dominate public cloud adoption, while other vendors battle for second place. Key findings include:
• AWS adoption is 54 percent – 4x the nearest competitor.
• Rackspace Public Cloud is second within the SMB segment.
• IaaS offerings from Google and Microsoft are gaining the interest of cloud users, with Azure leading among enterprises and Google Cloud Platform among small and midsize organizations.

The battle among private cloud technologies is shaping up as a clash of cultures between the open-source OpenStack and proprietary solutions from VMware. Findings include:
• Thirty-one percent of enterprise respondents view their VMware vSphere/vCenter environments as a private cloud.
• OpenStack is well positioned to unseat vSphere in private cloud – coming in first in interest and second in current usage.
• Microsoft System Center is waiting in the wings with a strong third position among enterprise users.


Key for this post: "The challenge of cloud security is abating." The interesting thing is that I got the same feeling in talking with my co-attendees at Build, that there's a growth in acceptance, usage and interest. Interest was VERY high at the individual level, with many talking about how they are going to use their MSDN Azure allowance to at least play with it...

Thursday, February 27, 2014

Making Relativity relatively faster... Partition it, baby (sometimes)

Brent Ozar - How to Use Partitioning to Make kCura Relativity Faster

kCura Relativity is an e-discovery program used by law firms to find evidence quickly. I’ve blogged about performance tuning Relativity, and today I’m going to go a little deeper to explain why DBAs have to be aware of Relativity database contents.

In Relativity, every workspace (case) lives in its own SQL Server database. That one database houses:

  • Document metadata – where the document was found, what type of document it is
  • Extracted text from each document – the content of emails, spreadsheets, files
  • Document tagging and highlighting – things the lawyers discovered about the documents and noted for later review
  • Workspace configuration – permissions data about who’s allowed to see what documents
  • Auditing trails – who’s searched for what terms, what documents they’ve looked at, and what changes they made

For performance tuners like me, that last one is kinda interesting. I totally understand that we have to capture every activity in Relativity and log it to a table, but log-sourced data has different performance and recoverability requirements than other e-discovery data.


However, I don’t recommend doing this by default across all your databases. This technique is going to instantly double the number of databases you have and make your management much more complex. However, I do recommend reviewing your largest workspaces to see if AuditRecord is consuming half or more of the database space. If so, consider partitioning their AuditRecord tables to get faster backups, database maintenance jobs, and restores.
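As a back-of-the-envelope version of the review suggested above, you could flag the workspaces worth partitioning. The workspace names and sizes here are made-up sample data, not anything from the post:

```python
# Flag workspace databases where the AuditRecord table consumes half or
# more of the total space -- the candidates for a partitioned AuditRecord.
def partition_candidates(workspaces, threshold=0.5):
    """workspaces: {name: (audit_record_gb, total_gb)} -> names worth partitioning."""
    return [name for name, (audit_gb, total_gb) in workspaces.items()
            if total_gb > 0 and audit_gb / total_gb >= threshold]

sizes = {
    "EDDS1001": (120, 200),  # 60% audit data: candidate
    "EDDS1002": (10, 150),   # ~7% audit data: leave alone
}
print(partition_candidates(sizes))  # -> ['EDDS1001']
```

In practice you'd pull those sizes from the SQL Server catalog views rather than a hand-built dictionary, but the decision rule is the same.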

At the risk of sounding like a fanboy, this is one of the reasons I love working with the kCura folks. They really care about database performance, they take suggestions like this, and they implement it in a way that makes a real difference for customers.

This is also why database administrators need to:

  1. Understand the real business purpose of the biggest tables in their databases
  2. Build working, productive relationships with their software vendors
  3. Come up with creative approaches to ease SQL Server pains
  4. Help the vendors implement these approaches in software


If you're a Relativity shop, Brent's one of those "must go to" DBA guys, which this post makes very apparent...


Related Past Post XRef:
Making SQL Server a happy kCura Relativity camper (and your users too)

sp_AskBrent - Your new, "OMG, my SQL Server is sooo slow" free uber SP from Brent Ozar
Two SQL Server Resources that you might want to take another look at...

Free Training SQL Server Training DVD’s (or online) from Quest (reg-ware) - 12 Sessions, Two DVD’s, Zero cost…

"How to Develop Your DBA Career" Free eBook (and posters and whitepapers and more [oh my])

Friday, February 21, 2014

Windows File System and Whitespace characters, do you know the rules?

Support for Whitespace characters in File and Folder names for Windows 8, Windows RT and Windows Server 2012


File and Folder names that begin or end with the ASCII Space (0x20) will be saved without these characters. File and Folder names that end with the ASCII Period (0x2E) character will also be saved without this character. All other trailing or leading whitespace characters are retained.
For example:

  • If a file is saved as ' Foo.txt', where the leading character(s) is an ASCII Space (0x20), it will be saved to the file system as 'Foo.txt'.
  • If a file is saved as 'Foo.txt ', where the trailing character(s) is an ASCII Space (0x20), it will be saved to the file system as 'Foo.txt'.
  • If a file is saved as '.Foo.txt', where the leading character(s) is an ASCII Period (0x2E), it will be saved to the file system as '.Foo.txt'.
  • If a file is saved as 'Foo.txt.', where the trailing character(s) is an ASCII Period (0x2E), it will be saved to the file system as 'Foo.txt'.
  • If a file is saved as ' Foo.txt', where the leading character(s) is an alternate whitespace character, such as the Ideographic Space (0x3000), it will be saved to the file system as ' Foo.txt '. The leading whitespace characters are not removed.
  • If a file is saved as 'Foo.txt ', where the trailing character(s) is an alternate whitespace character, such as the Ideographic Space (0x3000), it will be saved to the file system as 'Foo.txt '. The trailing whitespace characters are not removed.
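The rules in that list can be sketched as a small Python function. This is a deliberate simplification of the Object Manager behavior, covering only the cases enumerated above:

```python
# Simulate how the Object Manager normalizes a requested file name:
# leading/trailing ASCII spaces (0x20) and trailing ASCII periods (0x2E)
# are dropped; all other whitespace (e.g. Ideographic Space, U+3000) stays.
def stored_name(requested):
    """Return the name as the file system would actually store it."""
    name = requested.lstrip(" ")  # leading ASCII spaces removed
    return name.rstrip(" .")      # trailing ASCII spaces and periods removed

print(repr(stored_name(" Foo.txt")))              # 'Foo.txt'
print(repr(stored_name("Foo.txt.")))              # 'Foo.txt'
print(repr(stored_name(".Foo.txt")))              # '.Foo.txt' (leading period kept)
print(repr(stored_name("\u3000Foo.txt\u3000")))   # ideographic spaces retained
```

Note that a leading period survives (that's why `.gitignore`-style names work), while a trailing one does not.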

File and Folder names that begin or end with a whitespace character are enumerated differently by the Win32 and WinRT APIs due to ecosystem requirements.

Whitespace Characters
There are various whitespace characters representing various 'space' widths (glyphs). Only the ASCII Space (0x20) and ASCII Period (0x2E) characters are handled specially by the Object Manager. Although the Ideographic Space character (0x3000) is also generated by using the Spacebar (when IME is enabled), it is not handled specially.
  • 0x0020 SPACE
  • 0x2000 EN QUAD
  • 0x2001 EM QUAD
  • 0x2002 EN SPACE
  • 0x2003 EM SPACE
  • 0x2005 FOUR-PER-EM SPACE
  • 0x2006 SIX-PER-EM SPACE
  • 0x2007 FIGURE SPACE
  • 0x2009 THIN SPACE
  • 0x200A HAIR SPACE
Object Manager
ASCII Space (0x20) characters at the beginning or end of a file or folder name are removed by the Object Manager upon creation.
ASCII Period (0x2E) characters at the end of a file or folder name are removed by the Object Manager upon creation.
All other leading or trailing whitespace characters are retained by the Object Manager.
API Enumeration
Win32 API
The Win32 API (CreateFile, FindFirstFile, etc.) uses a direct method to enumerate the files and folders on a local or remote file system. All files and folders are discoverable regardless of the inclusion or location of whitespace characters.
WinRT API
The WinRT API is designed to support multiple data providers (Physical Drives, OneDrive (formerly SkyDrive), Facebook, etc.). To achieve this, the WinRT API uses a search engine to enumerate files and folders. Due to this search-based approach, the WinRT API (StorageFile, StorageFolder, etc.) does not handle file and folder names with trailing whitespace characters other than ASCII Space (0x20) and ASCII Period (0x2E) residing on a local or remote file system. It does handle leading non-ASCII whitespace characters.
Observed Behavior
File Explorer and Desktop applications
All files and folders are visible within File Explorer and Desktop applications regardless of inclusion or location of whitespace characters.
Windows Store applications

When using the File Picker, files with a trailing non-ASCII whitespace character do not appear. The contents of sub-folders with trailing non-ASCII whitespace characters are not displayed in the File Picker. Files or folders containing a leading non-ASCII whitespace character are displayed.


This is something I run into all the time: Windows' automagic handling of leading/trailing whitespace, and code that doesn't honor it (cough... like mine sometimes).

What the heck am I talking about?

Imagine you're writing an email export app, you're using the subject line as the file name, and you're recording that path in a DB somewhere. Sure, you already know to handle special characters like colons, asterisks, etc. But you "know" spaces are okay in a file name, so you don't sweat them. And usually you're right... But do you know how many subject lines begin with a space? Yeah, enough to screw you up...

If you are taking human-created strings and using them as folder or file names, you need to review this KB.

Tuesday, January 14, 2014

How Many "Documents" in a Gigabyte? It depends (and it's going up)

E-Discovery Search Blog - How Many Documents in a Gigabyte? An Updated Answer to that Vexing Question

For an industry that lives by the doc but pays by the gig, one of the perennial questions is: “How many documents are in a gigabyte?” Readers may recall that I attempted to answer this question in a post I wrote in 2011, “Shedding Light on an E-Discovery Mystery: How Many Docs in a Gigabyte.”

At the time, most people put the number at 10,000 documents per gigabyte, with a range of between 5,000 and 15,000. We took a look at just over 18 million documents (5+ terabytes) from our repository and found that our numbers were much lower. Despite variations among different file types, our average across all files was closer to 2,500. Many readers told us their experience was similar.

Just for fun, I decided to take another look. I was curious to see what the numbers might be in 2014 with new files and perhaps new file sizes.  So I asked my team to help me with an update. Here is a report on the process we followed and what we learned.[1]

How Many Docs 2014?

For this round, we collected over 10 million native files (“documents” or “docs”) from 44 different cases....


Including all files gets us awfully close to 5,000 documents per gigabyte, which was the lower range of the industry estimates I found. If you pull out the EML files, the number drops to 3,594.39, which is midway between our 2011 estimate (2,500) and 5,000 documents per gigabyte.

Which is the right number for you? That depends on the type of files you have and what you are trying to estimate. What I can say is that for the types of office files typically seen in a review, the number isn't 10,000 or anything close. We use a figure closer to 3,000 for our estimates.
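The conversion itself is trivial; it's the ratio you plug in that's the judgment call. A quick helper, for illustration (using binary gigabytes is my assumption; some vendors bill on decimal GB, i.e. 10^9 bytes):

```python
GIB = 1024 ** 3  # binary gigabyte; swap in 10**9 for decimal GB billing

def docs_per_gigabyte(doc_count: int, total_bytes: int) -> float:
    """Observed document density of a collection."""
    return doc_count / (total_bytes / GIB)

def estimated_docs(gigabytes: float, docs_per_gb: float = 3000.0) -> int:
    """Rough doc-count estimate; default ratio is the ~3,000/GB
    figure the post above settles on for typical office files."""
    return round(gigabytes * docs_per_gb)
```

So a 250 GB collection pencils out to roughly 750,000 documents at the 3,000/GB figure, versus 2.5 million at the old 10,000/GB folklore number. That gap is why the ratio matters.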


If you're in my industry, you'll have heard this question a thousand times and seen about a million calculators and a zillion charts attempting to answer it, and in the end the answer is usually "it depends." Yet we've been doing this for a decade-plus now and are getting better at answering it. This post does a great, vendor-neutral job of attempting to answer it.

You're not in the eDiscovery/ESI/LitSupport biz? I still think you might find this data interesting as it's something you might not have normally asked or considered...

Hold on there! Exchange Online's Litigation Hold versus In-Place Hold

Welcome to the US SMB&D TS2 Team Blog - Litigation Hold versus In-Place Hold in Exchange Online

I frequently get questions about the compliance archiving capabilities available in Exchange Online and Office 365.  One area that causes a lot of confusion is around Litigation Hold versus In-Place Hold.  Is Litigation Hold the same as In-Place Hold?  If not, then when and why would I choose to use one versus the other?  Is Litigation Hold going away in favor of In-Place Hold?

First, a little background…  In Exchange 2010 and Exchange Online (pre-service upgrade), Litigation Hold was introduced to allow customers to immutably preserve mailbox content to meet long term preservation and eDiscovery requirements. When a mailbox was placed on Litigation Hold, mailbox content was preserved indefinitely.

In Exchange 2013 and the new Exchange Online, In-Place Hold was introduced which allowed more flexibility in preserving your data.  It allowed you to preserve items matching your query parameters, known as a query-based In-Place Hold, preserve items for a specified period, known as a time-based In-Place Hold, and also preserve everything indefinitely, which emulated the Litigation Hold feature.

After the release of Exchange 2013 and the new Exchange Online, there were initial references in the documentation and in the product itself that Litigation Hold was being deprecated, and included recommendations to use In-Place Hold instead, which added to the confusion.

I want to clarify that Litigation Hold is not being deprecated, and the references to that have been cleaned up in the product and in the documentation.  Both types are available for use and you should use the hold feature that best meets your needs.  Here are some scenarios to help you choose between the two holds.



Message Policy, Recovery and Compliance

Archiving Exchange Online-based mailboxes

Exchange Online mailboxes reside in the cloud, and archiving them requires unique hosting environments. In some cases, Exchange Online can also be used to archive on-premises mailboxes in the cloud. The options for archiving with Exchange Online are described in this section.

Exchange Online provides built-in archiving capabilities for cloud-based mailboxes, including an In-Place Archive that gives users a convenient place to store older email messages. An In-Place Archive is a special type of mailbox that appears alongside a user’s primary mailbox folders in Outlook and Outlook Web App. Users can access and search the archive in the same way they access and search their primary mailboxes. Available functionality depends on the client in use:

  • Outlook 2013, Outlook 2010, and Outlook Web App   Users have access to the full features of the archive, as well as related compliance features like control over retention and archive policies.
  • Outlook 2007   Users have basic support for the In-Place Archive, but not all archiving and compliance features are available. For example, users cannot apply retention or archive policies to mailbox items and must rely on administrator-provisioned policies instead.

Administrators use the Exchange admin center or remote Windows PowerShell to enable the personal archive feature for specific users.

For more information about In-Place Archives, see In-Place Archiving.

The Exchange Team Blog - Litigation Hold and In-Place Hold in Exchange 2013 and Exchange Online

In Exchange 2010 and Exchange Online, we introduced Litigation Hold to allow you to immutably preserve mailbox content to meet long term preservation and eDiscovery requirements. When a mailbox is placed on Litigation Hold, mailbox content is preserved indefinitely.

Placing a mailbox on Litigation Hold
You can place a mailbox on Litigation Hold by using the Exchange Administration Center (EAC) or the Shell (set the LitigationHoldEnabled parameter). In Exchange 2010, you can also use the Exchange Management Console (EMC) to do this.


Preserving items for a specified duration
To preserve items for a specified period, we added the LitigationHoldDuration parameter to Exchange Online. This helps you meet your compliance needs by preserving all items in a mailbox for the specified duration, calculated from the date the item was created (date received in case of inbound email). For example, if your organization needs to preserve all mailbox data for seven years, you can place all mailboxes on Litigation Hold and set the LitigationHoldDuration to 7 years (in days).
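Since LitigationHoldDuration is specified in days, that "7 years" needs converting, and a naive 7 × 365 undercounts leap days. A small helper sketch (the function is mine; the parameter and its unit come from the quoted post):

```python
from datetime import date

def retention_years_to_days(years: int, start: date) -> int:
    """Day count covering `years` full calendar years from `start`,
    leap years included (start must not be Feb 29)."""
    end = date(start.year + years, start.month, start.day)
    return (end - start).days
```

A seven-year hold starting January 1, 2014 works out to 2,557 days (two leap days in the window), not the naive 2,555.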

This functionality is also available in Exchange 2013, allowing you to preserve items for a specified duration in your on-premises organization – one example of how developments in Exchange Online benefit Exchange Server on-premises.

In-Place Hold in Exchange 2013 and Exchange Online

In Exchange 2013 and the new Exchange Online, we introduced In-Place Hold, which allows more flexibility in preserving your data. Hold functionality is integrated with In-Place eDiscovery to allow you to search and preserve using a single wizard or a single cmdlet (New-MailboxSearch). You can use the In-Place eDiscovery & Hold wizard or the cmdlet to search for and preserve items matching your query parameters, known as a query-based In-Place Hold, preserve items for a specified period, known as a time-based hold, and also preserve everything indefinitely, which emulates the old Litigation Hold feature. Check out In-Place eDiscovery and In-Place Hold in the New Exchange - Part I and Part II for more info.


Yeah, I know there's maybe 0.57 of a reader who will find this interesting or useful, but hey, those 0.57 rock! And this is something from my day job that might come in handy one day... so there.

That said, do I really need to make the ESI/Litigation speech? No? Because if you're in IT and have any kind of data storage in your realm, you already know it's only a matter of time? Okay, good...


Related Past Post XRef:
Exchange Online getting serious about helping with eDiscovery

Monday, January 06, 2014

PDF's on PDF's... The complete Acrobat PDF Reference Library

Yep, I'm back from my holiday hiatus, and what better way to celebrate than to highlight a bunch of PDF's about PDF's... :)

Inside PDF  - Now Available: Complete Collection of Adobe PDF References

This is a project that I’ve had on my “To Do” list for a while now and I finally had some time today to complete it.

The complete set of Adobe PDF References (1.0-1.7) as well as all associated errata and addenda can now be found on the Acrobat Engineering site.  I’ve also included the Adobe version of ISO 32000-1:2008 (the ISO standard for PDF) and some relevant tech notes as well.


Adobe PDF References

This page contains links to every version of the PDF Reference published by Adobe as well as associated errata and addenda to the document.


This is one of those pages you'll never be able to find in the future, though you know you saw it "somewhere"...


Related Past Post XRef:
Spelunk the technical details of the PDF format with "PDF Succinctly" from Syncfusion (Free/reg-ware PDF/Mobi ebook)
PDF 1.7 Released to ISO for Standardization

Thursday, November 21, 2013

Office/Exchange File Format, Specification and Protocol Documentation refreshed

Microsoft Office File Formats Documentation

The Microsoft Office file formats documentation provides detailed technical specifications for Microsoft proprietary file formats.

The documentation includes a set of companion overview and reference documents that supplement the technical specifications with conceptual background, overviews of file format relationships and interactions, and technical reference information.


Microsoft Office Protocol Documentation

The Office protocol documentation provides detailed technical specifications for Microsoft proprietary protocols (including extensions to industry-standard or other published protocols) that are implemented and used in Microsoft Office client programs to interoperate or communicate with Microsoft products.

The documentation includes a set of companion overview and reference documents that supplement the technical specifications with conceptual background, overviews of inter-protocol relationships and interactions, and technical reference information.


Word, Excel, and PowerPoint Standards Support

This documentation provides detailed support information for the Open Document Format (ODF) and Open XML (ECMA-376 and ISO/IEC-29500) file formats implemented in Microsoft Word, Microsoft Excel, and Microsoft PowerPoint.


Microsoft Exchange and Microsoft Outlook Standards Documentation

The Microsoft Exchange and Microsoft Outlook standards documentation describes how Exchange and Outlook support industry messaging standards and Requests for Comments (RFC) documents, including those covering iCalendar, Internet Message Access Protocol – Version 4 (IMAP4), and Post Office Protocol – Version 3 (POP3).



That's some lite reading for the coming holidays... :)


Related Past Post XRef:
Microsoft Format and Specification Documentation 0712 Refresh (Think Office 2013 CP update). Oh and some SharePoint Doc's too
Microsoft Format and Specification Documentation Refresh ("Significantly changed technical content") [Updated: Includes updates for Office 15 Technical Preview ]
Microsoft Office File Formats and Microsoft Office Protocols Documentation Refreshed
Microsoft Office File Formats and Protocols documentation updated for Office 2010 (Think “Now with added ‘X’ flavor… DocX, PptX, XlsX, etc”)

Microsoft Open Specifications Poster

XAML Language Specification (as in the in the full XAML, WPF and Silverlight XAML Specs)

"Microsoft SQL Server Data Portability Documentation"

MS-PST file format specification released. Yep, the full and complete specification for Outlook PST’s is now just a download away.
Microsoft Office (DOC, XLS, PPT) Binary File Format Specifications Released – We’re talking the full technical specification… (The [MS-DOC].pdf alone is 553 pages of very dense specification information)
DOC, XLS and PPT Binary File Format Specifications Released (plus WMF, Windows Compound File [aka OLE 2.0 Structured Storage] and Ink Serialized Format Specifications and Translator to XML news)