Best Practices · 5 min read · by the syncopio Team

The File Count Lie: Why 1.2 Million Files Doesn't Tell You How Long Migration Takes

File count is useless for migration planning. What matters is the size distribution: a million tiny files and a million large files take completely different amounts of time.

Your boss asks: “How long will the migration take?” You check the file count. 1.2 million files. You estimate four hours. It takes nineteen.

Or the reverse. You tell your boss nineteen hours, block the weekend, send the maintenance notice. It finishes in two. You look like you don’t know what you’re doing.

Both happen because file count, on its own, is a useless number. It tells you almost nothing about how long the migration will actually take.

A million files could be 5GB or 480TB

Here’s the problem. “1.2 million files” could mean:

  • 1.2 million tiny config files averaging 4KB each. Total: about 5GB. Mostly metadata overhead.
  • 1.2 million video renders averaging 400MB each. Total: about 480TB. Mostly raw throughput.

Same file count. Completely different workloads. Completely different timelines. Completely different bottlenecks.

The number that actually matters is the size distribution: how many files fall into each size bracket, and how many bytes each bracket accounts for.

The bimodal trap

Real datasets are almost never uniform. A typical NAS has millions of tiny files (thumbnails, metadata, logs) AND a handful of enormous ones (database dumps, video archives, VM images). The tiny files dominate the count. The huge files dominate the bytes. Planning for one and ignoring the other guarantees a bad estimate.

Small files: death by a thousand opens

When most of your data is small files (under 128KB), the actual data transfer is trivial. A gigabit link can push 5GB in under a minute. That’s not where the time goes.

Every file requires:

  1. open() on source
  2. stat() for metadata
  3. open() + create() on destination
  4. write() the data (fast, it’s tiny)
  5. chmod() to set permissions
  6. utimes() to preserve timestamps
  7. close() on both sides

Seven operations minimum per file, and several of them happen on both sides. On NFS, each one is a network round-trip. At 1ms per round-trip, that's at least 7ms per file. Multiply by a million files: almost two hours of pure overhead, moving barely any data.

Small file workloads are IOPS-bound. The disk and network can handle the bytes easily. They choke on the operations. Your NAS might do 10,000 IOPS. Your million tiny files need 7 million operations. That’s nearly 12 minutes of IOPS alone, assuming zero contention. In practice, with metadata journaling and directory updates, triple it.
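You can get a feel for this per-file cost with a quick micro-benchmark. This is a rough sketch, not a rigorous test: the 10,000-file count is arbitrary, and the `mktemp` directory should land on the filesystem you actually plan to migrate from.

```shell
# Micro-benchmark: per-file creation cost on this filesystem.
# Writes 10,000 tiny files and reports wall time; run it on the
# source mount to approximate the metadata overhead discussed above.
dir=$(mktemp -d)
start=$(date +%s%N)
for i in $(seq 1 10000); do echo x > "$dir/f$i"; done
end=$(date +%s%N)
echo "created $(ls "$dir" | wc -l) files in $(( (end - start) / 1000000 )) ms"
rm -rf "$dir"
```

Divide the milliseconds by 10,000 to get a per-file floor, then remember that a real copy does several more operations per file than a bare create.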

Quick diagnostic

If du -sh says 50GB but find | wc -l says 2 million files, you’re IOPS-bound. Parallel workers help more than bandwidth here.
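A third quick check is the average file size, which tells you which regime you are in. The ~1MB threshold below is a rough rule of thumb, not a hard cutoff:

```shell
# Average file size for a tree: total bytes / file count.
# A small average (well under ~1MB) suggests an IOPS-bound migration;
# a large one suggests throughput-bound.
find /path/to/data -type f -printf '%s\n' \
  | awk '{ n++; s += $1 }
         END { if (n) printf "%d files, avg %.1f KB\n", n, s / n / 1024 }'
```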

Large files: the bandwidth wall

Flip it around. You have 50,000 files averaging 2GB each. That’s 100TB. The per-file overhead is negligible: 50,000 opens and closes are nothing. The bottleneck is raw throughput.

On a 10Gbps link, theoretical max is about 1.1GB/s. In practice, with NFS overhead, you’ll see 400 to 700 MB/s. At 500MB/s, your 100TB takes about 55 hours. No amount of IOPS optimization changes that number. You need faster links, or you need multiple parallel streams per file.

Large file workloads are throughput-bound. More workers don’t help much unless each worker can saturate its own stream. What helps is bigger TCP windows, larger NFS read/write sizes (rsize/wsize), and making sure your network isn’t the bottleneck.
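As an illustration of the rsize/wsize tuning, here is what such a mount might look like. The server name, export path, and mount point are placeholders, and the server must also support 1MB transfers or the client will negotiate down:

```shell
# Illustrative NFS mount with 1MB read/write sizes (placeholder paths).
mount -t nfs -o rsize=1048576,wsize=1048576,tcp,hard nas:/export /mnt/src

# Confirm what was actually negotiated:
nfsstat -m
```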

The distribution is what matters

Here’s a realistic breakdown from a mid-size engineering NAS:

| Size bracket   | Files   | % of count | Bytes  | % of bytes |
|----------------|--------:|-----------:|-------:|-----------:|
| < 4KB          | 380,000 | 31.7%      | 0.8 GB | 0.1%       |
| 4KB to 128KB   | 290,000 | 24.2%      | 12 GB  | 1.0%       |
| 128KB to 10MB  | 340,000 | 28.3%      | 420 GB | 34.4%      |
| 10MB to 1GB    | 185,000 | 15.4%      | 490 GB | 40.2%      |
| 1GB to 100GB   | 4,800   | 0.4%       | 296 GB | 24.3%      |
| > 100GB        | 12      | 0.001%     | —      | —          |

Look at those numbers. 56% of the files are under 128KB, accounting for barely 1% of the data. Meanwhile, 0.4% of files (the 1GB+ bracket) account for nearly a quarter of the bytes. Your time estimate needs to account for both workloads running simultaneously.

Profile your dataset before you estimate

Before you commit to a cutover window, run these commands on the source. Takes a few minutes, saves you from a bad estimate.

Quick file count by size bracket:

find /path/to/data -type f -printf '%s\n' | awk '
  { s=$1;
    if (s < 4096) a++;
    else if (s < 131072) b++;
    else if (s < 10485760) c++;
    else if (s < 1073741824) d++;
    else e++;
    total += s;
  }
  END {
    printf "< 4KB:    %8d files\n", a;
    printf "4K-128K:  %8d files\n", b;
    printf "128K-10M: %8d files\n", c;
    printf "10M-1G:   %8d files\n", d;
    printf "> 1GB:    %8d files\n", e;
    printf "\nTotal: %.1f GB\n", total/1073741824;
  }'
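A variant of the same scan that also sums bytes per bracket is worth the extra lines, since "% of bytes" is what drives the throughput side of the estimate:

```shell
# Same bracketing as above, but tracking bytes per bracket as well,
# so you can see both file counts and data volume at a glance.
find /path/to/data -type f -printf '%s\n' | awk '
  { s = $1
    if      (s < 4096)       { a++; ab += s }
    else if (s < 131072)     { b++; bb += s }
    else if (s < 10485760)   { c++; cb += s }
    else if (s < 1073741824) { d++; db += s }
    else                     { e++; eb += s }
  }
  END {
    g = 1073741824
    printf "< 4KB:    %8d files  %10.2f GB\n", a, ab / g
    printf "4K-128K:  %8d files  %10.2f GB\n", b, bb / g
    printf "128K-10M: %8d files  %10.2f GB\n", c, cb / g
    printf "10M-1G:   %8d files  %10.2f GB\n", d, db / g
    printf "> 1GB:    %8d files  %10.2f GB\n", e, eb / g
  }'
```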

Top 20 largest files (your throughput bottleneck):

find /path/to/data -type f -printf '%s %p\n' | sort -rn | head -20

Directory with the most files (your IOPS hotspot):

find /path/to/data -type d -exec sh -c 'echo "$(find "$1" -maxdepth 1 -type f | wc -l) $1"' _ {} \; | sort -rn | head -10

Why directories matter too

A single directory with 500,000 files in it is dramatically slower to process than 500 directories with 1,000 files each. Directory lookups in large directories are not O(1). On ext4/XFS, they’re roughly O(n) for the first access before the dentry cache warms up. On NFS, each readdir call returns a limited batch, so listing a 500K-entry directory requires hundreds of round-trips.

How distribution changes your tool configuration

Knowing your distribution lets you tune your migration properly.

Mostly small files (IOPS-bound):

  • More parallel workers (8 to 16). Each worker handles a different directory.
  • Smaller buffer sizes. No point allocating 64MB buffers for 4KB files.
  • Group files by directory to exploit dentry cache locality.
  • Consider noatime mount options to reduce metadata writes.
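The "one worker per directory" idea can be sketched with xargs. Paths here are placeholders, and a real migration tool (or rsync) adds retries and metadata handling that plain cp -a does not:

```shell
# Fan out one copy job per top-level source directory, 8 at a time.
# /mnt/src and /mnt/dst are placeholder paths.
find /mnt/src -mindepth 1 -maxdepth 1 -type d -print0 \
  | xargs -0 -P 8 -I {} cp -a {} /mnt/dst/
```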

Mostly large files (throughput-bound):

  • Fewer workers (2 to 4) with larger stream buffers.
  • Increase NFS rsize/wsize to 1MB (default is often 64KB or 256KB).
  • Check your network with iperf3 first. If the link can’t sustain the rate, more workers just cause congestion.
  • Enable jumbo frames if your switches support it. 9000 MTU reduces per-packet overhead significantly for large sequential writes.

Mixed (bimodal):

  • This is most real datasets. You need workers that can handle both.
  • Separate the workload if possible: small-file directories on one set of workers, large-file directories on another.
  • The total time is roughly: max(small_file_time, large_file_time), not the sum, if you can parallelize them.

A rough estimation formula

This won’t be perfect, but it’s better than guessing from file count alone.

For the small-file portion:

time_small = (file_count × ops_per_file × latency_per_op) / parallel_workers

Example: 500,000 small files, 7 ops each, 1.5ms average latency (NFS), 8 workers:

(500,000 × 7 × 0.0015) / 8 = 656 seconds ≈ 11 minutes

For the large-file portion:

time_large = total_bytes / (throughput_per_stream × parallel_streams)

Example: 800GB in large files, 400MB/s per stream, 4 streams:

800,000 MB / (400 × 4) = 500 seconds ≈ 8 minutes

Total estimate: max(11, 8) = about 11 minutes if both run in parallel. Add 20 to 30% for overhead (directory creation, retries, verification). Call it 15 minutes.
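The whole estimate fits in a few lines of shell. The inputs below are the worked example's assumptions, not measured values; swap in your own profile numbers:

```shell
# Inputs from the worked example above (adjust to your own profile).
files=500000; ops=7; latency=0.0015; workers=8   # small-file side
mb=800000; per_stream=400; streams=4             # large-file side

small=$(awk -v f=$files -v o=$ops -v l=$latency -v w=$workers \
  'BEGIN { printf "%d", f * o * l / w }')
large=$(awk -v m=$mb -v t=$per_stream -v s=$streams \
  'BEGIN { printf "%d", m / (t * s) }')

total=$(( small > large ? small : large ))       # parallel, so max not sum
echo "estimate: $(( total * 13 / 10 )) seconds including 30% overhead"
```

With these inputs it prints an estimate of 852 seconds, about 14 minutes, which matches the "call it 15 minutes" figure above.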

Compare that to the naive estimate: “1.2 million files at 2,000 files/sec = 10 minutes.” Same ballpark here, but on a different dataset the naive estimate could be off by 5x or more.

syncopio advantage

syncopio’s discovery scan profiles your dataset before transfer starts. It shows file count by size bracket, identifies IOPS hotspots (directories with high file counts), and calculates per-bracket transfer estimates. You see the distribution before committing to a cutover window, not after.

The real answer to “how long?”

Next time your boss asks, don’t say “1.2 million files.” Say: “The dataset is 1.2 million files, but 800GB of it is in 5,000 large files and the rest is small-file overhead. Based on the distribution, I’m estimating three hours with our current network, plus a buffer for verification. I’ll have a tighter number after the discovery scan.”

That’s an answer you can defend. File count alone is not.



Ready to simplify your migrations?

See how syncopio can save you hours on every migration project.

Request a Demo