Best Practices · 5 min read · by the syncopio Team

The File Count Lie: Why 1.2 Million Files Doesn't Tell You How Long Migration Takes

File count is useless for migration planning. What matters is the size distribution: a million tiny files and a million large files take completely different amounts of time.

Your boss asks: “How long will the migration take?” You check the file count. 1.2 million files. You estimate four hours. It takes nineteen.

Or the reverse. You tell your boss nineteen hours, block the weekend, send the maintenance notice. It finishes in two. You look like you don’t know what you’re doing.

Both happen because file count, on its own, is a useless number. It tells you almost nothing about how long the migration will actually take.

A million files could be 5GB or 480TB

Here’s the problem. “1.2 million files” could mean:

  • 1.2 million tiny config files averaging 4KB each. Total: about 5GB. Mostly metadata overhead.
  • 1.2 million video renders averaging 400MB each. Total: about 480TB. Mostly raw throughput.

Same file count. Completely different workloads. Completely different timelines. Completely different bottlenecks.

The number that actually matters is the size distribution: how many files fall into each size bracket, and how many bytes each bracket accounts for.

The bimodal trap

Real datasets are almost never uniform. A typical NAS has millions of tiny files (thumbnails, metadata, logs) AND a handful of enormous ones (database dumps, video archives, VM images). The tiny files dominate the count. The huge files dominate the bytes. Planning for one and ignoring the other guarantees a bad estimate.

Small files: death by a thousand opens

When most of your data is small files (under 128KB), the actual data transfer is trivial. A gigabit link can push 5GB in under a minute. That’s not where the time goes.

Every file requires:

  1. open() on source
  2. stat() for metadata
  3. open() + create() on destination
  4. write() the data (fast, it’s tiny)
  5. chmod() to set permissions
  6. utimes() to preserve timestamps
  7. close() on both sides

Seven operations minimum per file, and several of them happen on both sides. On NFS, each one is a network round-trip. At 1ms per round-trip, that's at least 7ms per file. Multiply by a million files: almost two hours of pure overhead, moving barely any data.

Small file workloads are IOPS-bound. The disk and network can handle the bytes easily. They choke on the operations. Your NAS might do 10,000 IOPS. Your million tiny files need 7 million operations. That’s nearly 12 minutes of IOPS alone, assuming zero contention. In practice, with metadata journaling and directory updates, triple it.
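You can get a feel for this per-file cost with a quick micro-benchmark. This is a rough sketch, not a rigorous test: the 10,000-file count is arbitrary, and the `mktemp` directory should land on the filesystem you actually plan to migrate from.

```shell
# Micro-benchmark: per-file creation cost on this filesystem.
# Writes 10,000 tiny files and reports wall time; run it on the
# source mount to approximate the metadata overhead discussed above.
dir=$(mktemp -d)
start=$(date +%s%N)
for i in $(seq 1 10000); do echo x > "$dir/f$i"; done
end=$(date +%s%N)
echo "created $(ls "$dir" | wc -l) files in $(( (end - start) / 1000000 )) ms"
rm -rf "$dir"
```

Divide the milliseconds by 10,000 to get a per-file floor, then remember that a real copy does several more operations per file than a bare create.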

Quick diagnostic

If du -sh says 50GB but find | wc -l says 2 million files, you’re IOPS-bound. Parallel workers help more than bandwidth here.
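A third quick check is the average file size, which tells you which regime you are in. The ~1MB threshold below is a rough rule of thumb, not a hard cutoff:

```shell
# Average file size for a tree: total bytes / file count.
# A small average (well under ~1MB) suggests an IOPS-bound migration;
# a large one suggests throughput-bound.
find /path/to/data -type f -printf '%s\n' \
  | awk '{ n++; s += $1 }
         END { if (n) printf "%d files, avg %.1f KB\n", n, s / n / 1024 }'
```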

Large files: the bandwidth wall

Flip it around. You have 50,000 files averaging 2GB each. That’s 100TB. The per-file overhead is negligible: 50,000 opens and closes are nothing. The bottleneck is raw throughput.

On a 10Gbps link, theoretical max is about 1.1GB/s. In practice, with NFS overhead, you’ll see 400 to 700 MB/s. At 500MB/s, your 100TB takes about 55 hours. No amount of IOPS optimization changes that number. You need faster links, or you need multiple parallel streams per file.

Large file workloads are throughput-bound. More workers don’t help much unless each worker can saturate its own stream. What helps is bigger TCP windows, larger NFS read/write sizes (rsize/wsize), and making sure your network isn’t the bottleneck.
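As an illustration of the rsize/wsize tuning, here is what such a mount might look like. The server name, export path, and mount point are placeholders, and the server must also support 1MB transfers or the client will negotiate down:

```shell
# Illustrative NFS mount with 1MB read/write sizes (placeholder paths).
mount -t nfs -o rsize=1048576,wsize=1048576,tcp,hard nas:/export /mnt/src

# Confirm what was actually negotiated:
nfsstat -m
```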

The distribution is what matters

Here’s a realistic breakdown from a mid-size engineering NAS:

| Size bracket   | Files   | % of count | Bytes  | % of bytes |
|----------------|--------:|-----------:|-------:|-----------:|
| < 4KB          | 380,000 | 31.7%      | 0.8 GB | 0.1%       |
| 4KB to 128KB   | 290,000 | 24.2%      | 12 GB  | 1.0%       |
| 128KB to 10MB  | 340,000 | 28.3%      | 420 GB | 34.4%      |
| 10MB to 1GB    | 185,000 | 15.4%      | 490 GB | 40.2%      |
| 1GB to 100GB   | 4,800   | 0.4%       | 296 GB | 24.3%      |
| > 100GB        | 12      | 0.001%     | —      | —          |

Look at those numbers. 56% of the files are under 128KB, accounting for barely 1% of the data. Meanwhile, 0.4% of files (the 1GB+ bracket) account for nearly a quarter of the bytes. Your time estimate needs to account for both workloads running simultaneously.

Profile your dataset before you estimate

Before you commit to a cutover window, run these commands on the source. Takes a few minutes, saves you from a bad estimate.

Quick file count by size bracket:

find /path/to/data -type f -printf '%s\n' | awk '
  { s=$1;
    if (s < 4096) a++;
    else if (s < 131072) b++;
    else if (s < 10485760) c++;
    else if (s < 1073741824) d++;
    else e++;
    total += s;
  }
  END {
    printf "< 4KB:    %8d files\n", a;
    printf "4K-128K:  %8d files\n", b;
    printf "128K-10M: %8d files\n", c;
    printf "10M-1G:   %8d files\n", d;
    printf "> 1GB:    %8d files\n", e;
    printf "\nTotal: %.1f GB\n", total/1073741824;
  }'
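A variant of the same scan that also sums bytes per bracket is worth the extra lines, since "% of bytes" is what drives the throughput side of the estimate:

```shell
# Same bracketing as above, but tracking bytes per bracket as well,
# so you can see both file counts and data volume at a glance.
find /path/to/data -type f -printf '%s\n' | awk '
  { s = $1
    if      (s < 4096)       { a++; ab += s }
    else if (s < 131072)     { b++; bb += s }
    else if (s < 10485760)   { c++; cb += s }
    else if (s < 1073741824) { d++; db += s }
    else                     { e++; eb += s }
  }
  END {
    g = 1073741824
    printf "< 4KB:    %8d files  %10.2f GB\n", a, ab / g
    printf "4K-128K:  %8d files  %10.2f GB\n", b, bb / g
    printf "128K-10M: %8d files  %10.2f GB\n", c, cb / g
    printf "10M-1G:   %8d files  %10.2f GB\n", d, db / g
    printf "> 1GB:    %8d files  %10.2f GB\n", e, eb / g
  }'
```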

Top 20 largest files (your throughput bottleneck):

find /path/to/data -type f -printf '%s %p\n' | sort -rn | head -20

Directory with the most files (your IOPS hotspot):

find /path/to/data -type d -exec sh -c 'echo "$(find "$1" -maxdepth 1 -type f | wc -l) $1"' _ {} \; | sort -rn | head -10

Why directories matter too

A single directory with 500,000 files in it is dramatically slower to process than 500 directories with 1,000 files each. Directory lookups in large directories are not O(1). On ext4/XFS, they’re roughly O(n) for the first access before the dentry cache warms up. On NFS, each readdir call returns a limited batch, so listing a 500K-entry directory requires hundreds of round-trips.

How distribution changes your tool configuration

Knowing your distribution lets you tune your migration properly.

Mostly small files (IOPS-bound):

  • More parallel workers (8 to 16). Each worker handles a different directory.
  • Smaller buffer sizes. No point allocating 64MB buffers for 4KB files.
  • Group files by directory to exploit dentry cache locality.
  • Consider noatime mount options to reduce metadata writes.
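The "one worker per directory" idea can be sketched with xargs. Paths here are placeholders, and a real migration tool (or rsync) adds retries and metadata handling that plain cp -a does not:

```shell
# Fan out one copy job per top-level source directory, 8 at a time.
# /mnt/src and /mnt/dst are placeholder paths.
find /mnt/src -mindepth 1 -maxdepth 1 -type d -print0 \
  | xargs -0 -P 8 -I {} cp -a {} /mnt/dst/
```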

Mostly large files (throughput-bound):

  • Fewer workers (2 to 4) with larger stream buffers.
  • Increase NFS rsize/wsize to 1MB (default is often 64KB or 256KB).
  • Check your network with iperf3 first. If the link can’t sustain the rate, more workers just cause congestion.
  • Enable jumbo frames if your switches support it. 9000 MTU reduces per-packet overhead significantly for large sequential writes.

Mixed (bimodal):

  • This is most real datasets. You need workers that can handle both.
  • Separate the workload if possible: small-file directories on one set of workers, large-file directories on another.
  • The total time is roughly: max(small_file_time, large_file_time), not the sum, if you can parallelize them.

A rough estimation formula

This won’t be perfect, but it’s better than guessing from file count alone.

For the small-file portion:

time_small = (file_count × ops_per_file × latency_per_op) / parallel_workers

Example: 500,000 small files, 7 ops each, 1.5ms average latency (NFS), 8 workers:

(500,000 × 7 × 0.0015) / 8 = 656 seconds ≈ 11 minutes

For the large-file portion:

time_large = total_bytes / (throughput_per_stream × parallel_streams)

Example: 800GB in large files, 400MB/s per stream, 4 streams:

800,000 MB / (400 × 4) = 500 seconds ≈ 8 minutes

Total estimate: max(11, 8) = about 11 minutes if both run in parallel. Add 20 to 30% for overhead (directory creation, retries, verification). Call it 15 minutes.
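The whole estimate fits in a few lines of shell. The inputs below are the worked example's assumptions, not measured values; swap in your own profile numbers:

```shell
# Inputs from the worked example above (adjust to your own profile).
files=500000; ops=7; latency=0.0015; workers=8   # small-file side
mb=800000; per_stream=400; streams=4             # large-file side

small=$(awk -v f=$files -v o=$ops -v l=$latency -v w=$workers \
  'BEGIN { printf "%d", f * o * l / w }')
large=$(awk -v m=$mb -v t=$per_stream -v s=$streams \
  'BEGIN { printf "%d", m / (t * s) }')

total=$(( small > large ? small : large ))       # parallel, so max not sum
echo "estimate: $(( total * 13 / 10 )) seconds including 30% overhead"
```

With these inputs it prints an estimate of 852 seconds, about 14 minutes, which matches the "call it 15 minutes" figure above.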

Compare that to the naive estimate: “1.2 million files at 2,000 files/sec = 10 minutes.” Same ballpark here, but on a different dataset the naive estimate could be off by 5x or more.

syncopio advantage

syncopio’s discovery scan profiles your dataset before transfer starts. It shows file count by size bracket, identifies IOPS hotspots (directories with high file counts), and calculates per-bracket transfer estimates. You see the distribution before committing to a cutover window, not after.

The real answer to “how long?”

Next time your boss asks, don’t say “1.2 million files.” Say: “The dataset is 1.2 million files, but 800GB of it is in 5,000 large files and the rest is small-file overhead. Based on the distribution, I’m estimating three hours with our current network, plus a buffer for verification. I’ll have a tighter number after the discovery scan.”

That’s an answer you can defend. File count alone is not.



Ready to simplify your migrations?

See how syncopio can save you hours on every migration project.

Request a Demo