x

Agile Insider Blog

file size banner

Comparing File Sizes for Data Stored in CSV, Excel, JSON, and XML

For many organizations and people that produce data for a variety of reasons, there is often a decision to be made regarding which file format to put it in. While some of the decisions are made for you because of the receiving ends’ requirement, there are times when you need to make the decision on your own.

TL;DR

Excel file format is the smaller file size.

We’ll provide more details, but for now, here is a comparison of 50,000 rows of information where they each have the exact same data.

file size comparison chart

As you can see here, Excel is far and away the winner.

Let’s Start From The Beginning

This all started as we were creating a PowerShell script that extracted data from local Active Directory. Simple enough. In order to make the script more portable and allow it to run on a Windows Server 2008 R2 Domain Controller, we needed to run in a PowerShell version that we knew would be there, work, and remove the need to install additional software. Thus, PowerShell version 2.0 was the target (gross, I know).

In doing so, we needed to leverage a file format that would also give us some depth to handle arrays without having to handle those separately. Using the “Export-Clixml” command, we could pass it a collection of objects (e.g. users) and it would write everything out for us.

We have another utility where we push this data into a Cosmos DB so we can better interrogate and export data using SQL versus a pile of different scripts. That said, we needed to convert XML to JSON in this current example.

Using PowerShell 2.0 running on the DC, and writing directly into an XML format is all non-wonderful and not the ideal final direction. However, this is what presented the problem in file format.

Where the issue really presented itself is when we encountered a local Active Directory that produced extremely large file exports. While the script and creation of the XML file went well, the downstream scripts and tools didn’t like working with files so large. This also prevents other PowerShell scripts from working with the data or even opening it up in another tool such as Microsoft VS Code or Notepad++.

Taking Another Approach to Data Exports

At this point, we thought maybe we should export the data in smaller batches. Instead of one large file, we’ll have many files batched by item count. Sounds like a good start, but then we’d need to do this for item types we want to export. With that as the next problem, now we’ll need to compress the file in a zip format to transport that grouping of files. Even in this case, we still have a large file to pass around AND we need to update our up-stream scripts and tools to handle this new structure.

While the overall approach isn’t terrible, it seems like a lot of work just to get over this file size problem.

We had others as that instead of XML or JSON, perhaps something like a CSV format would better. It seems like a lightweight file format as it’s just text and some commas.

The Experiment

At this point, we can see that an experiment needed to be done to help answer the question of which file format with the same data inside would be better in size and allow to be opened by an application directly to interact with it.

To do this, a PowerShell script was used to create a dummy data set of 18 columns of data and would export a number of rows that are equal for each file type. This would allow us to see the difference in size as we ramp up from 1000, 5000, to 50000.

The Results

The following table shows the results from the experiment. Yes, a PowerShell command was used to export the file name, type, and size from the OS. Then a little Excel magic to bring it together. As you can see, Excel is the winner in all cases. The last three columns show the file size difference between Excel and the same row count for the file type.

excel file size comparison

A Little More Excel Information

Since Excel can have multiple worksheets within the same file, we wanted to see how large the files would be when they have additional worksheets (aka Tabs) within the same file.

Rows Tabs Extension Rows kiloBytes
100 1 .xlsx 100 12.36
100 2 .xlsx 200 25.67
500 1 .xlsx 500 46.6
100 10 .xlsx 1000 344.23
500 2 .xlsx 1000 107.76
1000 1 .xlsx 1000 88.81
1000 2 .xlsx 2000 209.73
500 10 .xlsx 5000 1669.93
5000 1 .xlsx 5000 429.48
1000 10 .xlsx 10000 3322.16
5000 2 .xlsx 10000 1034.78
5000 10 .xlsx 50000 16579.98
50000 1 .xlsx 50000 4,245.87
50000 2 .xlsx 100000 10254.04
50000 10 .xlsx 500000 166150.03
500000 1 .xlsx 500000 42351.86

What’s most interesting here is if you have a total row count that is equal between two files but one has only one worksheet and the other is spread across ten worksheets, the multiple worksheets file is much larger in size. This tells us that it’s better to add columns to a worksheet if you can do it versus spreading data into other worksheets to keep the file size down. This isn’t always possible or even the right decision if the actual data is different. Perhaps you need to put that into a different Excel file.

Need Assistance Managing or Migrating your Data Estate?

Agile IT has delivered some of the most complex migrations, mergers, and divestitures around. If you need to reduce technical debt, reign in an overly complex active directory, or gain control over expansive identity problems, Agile IT has the tooling, engineers, and expertise to help. Find out how we can help reduce cost and complexity by telling us a bit about your environment.



 

Leave a comment

Learn More Today

Have questions or want to learn more about the services and solutions Agile IT has to offer?

Schedule a call with us today!

Schedule a Call
or

Request a Quote