File Splitting Strategies for Large-File Code Conversion

You have a 50,000-line legacy file. It's a COBOL copybook, a PHP god-class, or a Fortran simulation module that nobody has dared to refactor in fifteen years. You need to convert it, and no LLM on earth can process 50,000 lines in one shot.

That's the context window problem, and how you solve it determines whether your conversion produces usable output or garbled fragments.

Why You Can't Just Feed It All In

LLMs have context windows — the maximum amount of text they can process at once. Even the largest models (200K+ tokens as of 2025) can't handle a 50,000-line file because:

  • The file itself consumes most of the context window, leaving little room for the conversion instructions and the generated output
  • Output quality degrades significantly as context length increases
  • The model needs headroom for the converted output, which may be longer than the input (especially when converting from terse languages to verbose ones)

In practice, a budget of roughly 35,000 tokens per batch, covering both input and expected output, produces the best results.
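Batching against that budget requires estimating token counts before calling the model. A minimal sketch, assuming the common rough heuristic of about four characters per token for code (a real pipeline would use the provider's tokenizer; these function names are illustrative, not from any specific library):

```python
def estimate_tokens(text: str) -> int:
    """Approximate token count: roughly 4 characters per token for code."""
    return max(1, len(text) // 4)

def batch_chunks(chunks: list[str], budget: int = 35_000) -> list[list[str]]:
    """Greedily group chunks so each batch's estimated cost stays under budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        # Start a new batch when adding this chunk would overflow the budget.
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(chunk)
        used += cost
    if current:
        batches.append(current)
    return batches
```

A chunk larger than the whole budget still gets its own batch here; in practice such a chunk would need to be split further first.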

Naive Splitting vs. Intelligent Splitting

Naive splitting chops the file at fixed line counts (split every 500 lines). This is fast and predictable, but it routinely cuts through the middle of functions, classes, or logical blocks. The converter gets half a function as input and produces half a function as output. Reassembly is a nightmare.

Intelligent splitting cuts at logical boundaries: between function definitions, between class declarations, between module sections. Each chunk is a complete semantic unit that can be converted independently.

# Naive: splits mid-function
def calculate_risk():   # chunk 1 ends here
    ... 200 lines ...
    return result        # chunk 2 starts here — converter has no context

# Intelligent: splits between functions
def calculate_risk():   # complete function in chunk 1
    ... 200 lines ...
    return result

def validate_input():   # complete function in chunk 2
    ...
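For Python sources like the example above, boundary-aware splitting can lean on the standard library's parser, which records where each top-level definition starts and ends. A minimal sketch (the function name is an assumption for illustration; other source languages would need their own parsers):

```python
import ast

def split_at_function_boundaries(source: str) -> list[str]:
    """Return one chunk per top-level definition, each a complete semantic unit."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # lineno/end_lineno are 1-based and inclusive on Python 3.8+ AST nodes,
        # so this slice captures the whole definition, body included.
        chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks
```

Each returned chunk is a complete function or class, so the converter never sees a dangling `return result` with no surrounding definition.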

The Reassembly Challenge

Even with intelligent splitting, reassembly isn't trivial. Each converted chunk needs to be merged into a single cohesive output file. The problems:

Duplicate imports. If chunk 1 and chunk 3 both reference HashMap, the converter adds import java.util.HashMap to both chunks. Reassembly needs to deduplicate.

Shared state. If chunk 1 defines a class-level variable and chunk 2 uses it, the reassembled file must preserve the definition order and scope.

Consistent naming. The converter might translate the same variable name differently in different chunks if it doesn't see the full context. Chunk 1 might produce userList while chunk 3 produces users for the same variable.

The solution: keep chunks of the same file together as atomic units during processing, share context (at minimum, the file's import section and class signatures) across chunks, and run a deduplication and consistency pass during reassembly.
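The deduplication part of that reassembly pass can be sketched as follows, assuming Java-style `import ...;` lines as in the `HashMap` example (the function name is illustrative; a production merger would also handle packages, comments, and multi-line constructs):

```python
def reassemble(converted_chunks: list[str]) -> str:
    """Merge converted chunks, hoisting import lines and deduplicating them."""
    imports: list[str] = []   # preserves first-seen order
    seen: set[str] = set()
    bodies: list[str] = []
    for chunk in converted_chunks:
        body_lines = []
        for line in chunk.splitlines():
            stripped = line.strip()
            if stripped.startswith("import "):
                # Hoist imports to the top; skip ones already emitted.
                if stripped not in seen:
                    seen.add(stripped)
                    imports.append(stripped)
            else:
                body_lines.append(line)
        bodies.append("\n".join(body_lines).strip())
    return "\n".join(imports) + "\n\n" + "\n\n".join(b for b in bodies if b)
```

A consistency pass for naming is harder to sketch mechanically; it typically means feeding the converter a shared symbol table so the same variable gets the same translated name in every chunk.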

B&G CodeFoundry handles this with a splitter module that cuts large files at function/class boundaries, batches by token budget (35K tokens per batch), and keeps chunks from the same file together as atomic processing units.

Token Budgeting

A common mistake: allocating the entire token budget to the input. You need to reserve space for:

  • System prompt / instructions (~500-1,000 tokens)
  • Input code (the chunk being converted)
  • Output code (often 1.2-1.5x the input length, more for terse → verbose conversions like Python → Java)

If your budget is 35K tokens, roughly 15K goes to input, 15K is reserved for output, and 5K covers instructions and overhead. Adjust the ratio based on the language pair — COBOL → Java tends to produce longer output than the input.
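That allocation can be expressed as a small helper, with an expansion factor to shift the input/output split for verbose target languages (the function and its parameters are an illustrative sketch, not part of any tool described here):

```python
def allocate_budget(total: int = 35_000, overhead: int = 5_000,
                    expansion: float = 1.0) -> tuple[int, int]:
    """Split a token budget into (input, output) shares after overhead.

    expansion > 1.0 reserves proportionally more room for the output,
    e.g. for COBOL -> Java where the result is longer than the source.
    The article's 15K/15K/5K split corresponds to expansion == 1.0.
    """
    usable = total - overhead
    input_share = int(usable / (1 + expansion))
    return input_share, usable - input_share
```

With the defaults this reproduces the 15K/15K split; `allocate_budget(expansion=1.5)` shifts it to 12K in, 18K out.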


References: Anthropic and OpenAI context window documentation; research on code chunking strategies; token estimation heuristics for programming languages.