When your customers stopped sending clean CSVs: what AI makes possible with unstructured files today

By Alain Tiemblo, Co-founder & CTO at WeTransform

The format shift: AI makes format multiplication possible

For years, the CSV was the lingua franca of B2B data exchange. Every B2B SaaS company built its import layer around it: define a template, communicate it to customers, validate on upload. Imperfect, but manageable.

Then something shifted as those companies grew. The operations manager at a mid-size distributor does not export from a database. She photographs purchase orders with her phone. The accounting team at a logistics partner sends supplier invoices as PDF attachments. The new retail customer you just signed uses a proprietary Excel format: merged cells, color-coded rows, and headers spanning three columns.

The template you defined in 2021 now covers perhaps 60% of what arrives at your import layer. The rest is unstructured: PDFs, scanned documents, images, and Excel files designed for humans to read, not machines to parse.

This is not an edge case. It is the natural shape of growth. As you onboard more customers across more verticals, file format diversity grows faster than your headcount.

Understanding how AI handles messy CSV files is a starting point for the structured case. The challenge with unstructured documents runs deeper.

Why extraction alone solves only part of the problem

The common response to 'our customers send PDFs' is to reach for OCR. Reasonable, but it frames the problem too narrowly. Modern AI document understanding handles one step of a five-step problem.

The full chain required to turn a document into usable data:

Extract: pull text and table data from the document
Map: match extracted fields to your target schema
Transform: apply business rules (normalization, derived fields, conditional logic)
Validate: check constraints (required fields, format rules, value ranges)
Deliver: output in the format and via the channel your system expects

The ecosystem has largely solved step 1. The hard work, the real source of error and manual labor, lives in steps 2 through 5.

Extracted text with no schema mapping is not useful data. It is raw output. Someone still has to connect 'Prix unitaire HT' to unit_price_excl_tax, apply normalization rules, and catch the grand total a supplier buried in the line items.

The bottleneck did not disappear with better extraction. It shifted downstream.

This is fundamentally different from the structured file import problem, where structure is already declared. With unstructured files, structure must be inferred, guided, and validated at every step.

Five-step B2B data transformation chain: extract, map, transform, validate, deliver — steps 2 to 5 are where manual work concentrates

What a real AI transformation pipeline looks like

WeTransform's Autoclean feature was built for exactly this problem. When a PDF, image, or poorly structured Excel file arrives in WeTransform, Autoclean triggers automatically, with no configuration required on the upload side.

Phase 1: Extraction with instruction context

The AI reads the document. If you have configured an instruction prompt for this document class, it uses it to guide the extraction:

This is a supplier invoice. Extract only the line items from the table starting with the column 'Ref. article'. Ignore rows where the unit quantity is zero. Normalize all prices to euros using two decimal places.

That instruction is in plain language. It runs automatically on every subsequent document matching this supplier template. No engineering involvement required.

Phase 2: Standard pipeline takes over

Once Autoclean produces structured output, the data flows into the standard pipeline: AI field mapping, business rules, validation, and error correction in Finalize. Anomalies surface for human review before delivery.

Phase 3: Delivery

Clean, mapped, validated records exit via webhook or file download in your target format. Configuration takes minutes the first time. The same supplier next month: fully automatic.

WeTransform Autoclean: column-header detection step with 231 rows detected and the Try AI Autoclean panel highlighted

See Autoclean handle a real document →

Complex instructions: telling the AI exactly what you need

What separates a genuine transformation pipeline from a document parser is the ability to express business logic, not just extraction intent.

In practice, production configurations involve instructions like these:

"This document has two tables. Extract from the second table only, which begins after 'Détail des commandes'. The first table is a summary; ignore it."
"If the 'Remise' column contains a percentage, apply that discount to the unit price and write the result in the 'net_price' field. If 'Remise' is empty, copy 'unit_price' to 'net_price' unchanged."
"The document may be in French or English depending on the supplier. Target field names are always in English."
"Treat any row where 'Statut' reads 'Annulé' or 'En attente' as a cancelled line. Do not include it in the output."

None of this requires code. It requires someone who understands the document and can describe what they need. The instructions travel with the template. Every time a matching document arrives, the same logic runs consistently, without supervision.

This is the meaningful distinction between tools that extract and tools that transform. Extraction produces raw output. Transformation produces data your system can actually use.

A complete workflow: PDF invoice arriving as an ERP-ready record

Consider a B2B services company receiving PDF invoices from 180 suppliers. Each uses a different layout: different column names, different table positioning, some in French, others in English. The target system is an ERP expecting a JSON record with a fixed field schema.

Before a transformation pipeline

Support team downloads each invoice PDF
Line items transcribed manually into a spreadsheet
Column names reformatted to match ERP field names
Result uploaded; validation errors corrected manually
Average time: 25 minutes per invoice × 180 per month = 75 hours of manual work monthly

With WeTransform Autoclean

Supplier sends PDF (via email forwarded to WeTransform, direct upload, or FTP)
Autoclean triggers automatically and reads the document
Instruction prompt for this supplier class guides field extraction
AI mapping connects extracted fields to the target ERP JSON schema
Validation surfaces anomalies: zero-price rows, missing mandatory fields
Webhook delivers the clean JSON record to the ERP
Human review covers flagged anomalies only, typically under 5% of total volume

Initial configuration: one afternoon to define templates for the main supplier classes. A new supplier with a different layout: a new template, 15 minutes.

Browse B2B data import use cases to see how this pattern applies across logistics, accounting, and marketplace onboarding.

Manual PDF invoice processing vs. WeTransform Autoclean: 75 monthly hours reduced to review of under 5% of flagged records

Why hard cases make easy cases free

There is a useful principle in systems design: optimize for the hardest realistic case your pipeline will encounter, and everything easier becomes trivially handled.

A pipeline designed for a 40-page logistics manifest, multi-table structure, mixed languages, nested line items, handles a 2-page supplier invoice with no extra configuration. Same extraction architecture. Same instruction framework. Same validation layer.

The inverse does not hold. A pipeline tuned for simple, predictable documents breaks the moment a customer sends something outside its assumptions. The engineering team patches it. Support escalates. Onboarding stalls. The cost was invisible in the original build estimate, but it shows up clearly in the maintenance log.

Autoclean was designed with document complexity as the baseline. Multi-table PDFs. Scanned documents with variable image quality. Excel files with merged cells and conditional formatting. For simple, clean documents (a single-table PDF with clear column headers), Autoclean handles them automatically, without an instruction prompt. Complexity was the design target. Simplicity is included by default.

Autoclean covers three document complexity levels — simple PDFs, formatted Excel files, and complex scanned manifests — with the same pipeline, no extra configuration

What this means for your engineering team's build scope

The question for a B2B SaaS team evaluating this architecture is not whether to solve the unstructured file problem. It will need to be solved. The question is whether to build the solution or embed it.

Building provides control, but creates a surface area to maintain indefinitely: document parsing logic, layout variation handling, field mapping infrastructure, validation rules, language normalization. Teams that have built this recognize the pattern: it never feels complete, because the documents never stop varying.

Embedding places that surface area with a platform designed to maintain it continuously. Your engineering team owns the webhook endpoint, the downstream schema, and the logic specific to your product. The extraction and transformation layer is not your competitive differentiator. It is infrastructure.

What to look for when evaluating a data import solution covers the questions worth asking before choosing a path.

The decision depends on scale. Ten suppliers with stable, known PDF formats? Building is viable. Fifty customers sending documents in formats that are unknown and growing toward two hundred? Engineering time compounds quickly, and the opportunity cost against your core product compounds with it.

Your competitive advantage is not how you process supplier invoices. It is what your product does with the data once it arrives clean.

See WeTransform pricing →

If you want to see how this works on documents your team actually receives, book a 20-minute demo. We will run Autoclean on a real file from your context.

TagsAI Technical Data formats Tutorial