
Building an AI Agent That Enters Sales Orders in NetSuite

Tech · March 5, 2026
Tymon Terlikiewicz, CTO at Gralio

How we automated a 15-minute manual process into a 1-minute AI workflow

At Gralio we build AI transformation for businesses. Not chatbots. Not copilots. Real process automation that replaces repetitive human work end-to-end. Our approach starts with process mapping, task mining, and gathering decision context before writing a single line of agent code.

This post walks through how we built an AI agent that reads customer emails, resolves entities in NetSuite, and creates complete sales orders -- a process that typically takes a human 15 minutes for a complex order. The agent does it in under a minute. For a 15-person order entry team, that's over 150 hours per week returned to higher-value work.

We'll share real code, real gotchas, and real design decisions.

The Problem

A wholesale distribution company receives hundreds of B2B orders daily via email. Each order follows roughly the same process: read the email (sometimes with Excel/PDF attachments), identify the customer in NetSuite (complicated by duplicate accounts across store locations), look up item IDs for each product, determine the correct warehouse location, and enter the sales order with the right PO number, memo, and shipping details. Oh, and handle the edge cases: drop-ship items, quotes, will-call pickups, ...

A trained employee takes about 15 minutes per complex order. The work is cognitively demanding -- not because any single step is hard, but because the decision space is wide and mistakes are costly (wrong customer, wrong warehouse, wrong quantities).

This is exactly the kind of process AI agents excel at: structured but variable, requiring search and judgment, with clear success criteria.

We started by mapping the processes using Gralio. Our tool watched a team of 15 professionals perform various tasks over a period of two weeks. We then automatically generated detailed process maps and documents covering exceptions, edge cases and tribal knowledge.

Architecture Overview

The system has four layers:

Email Inbox (Microsoft 365)
   ↓
Email Classifier (determines if email is an order)
   ↓
AI Agent (Gemini with tool-calling)
   ↓
NetSuite REST API (SuiteTalk + SuiteQL)


The agent is a tool-calling loop powered by Google Gemini. It receives the email as a prompt, uses tools to search NetSuite for entities and items, reviews past orders for context, and submits the final sales order. The entire flow is orchestrated with the Vercel AI SDK.

const agent = new ToolLoopAgent({
    model: geminiModel,
    tools: await getTools(state),
    stopWhen: [stepCountIs(60), hasToolCall('done')],
    timeout: {
        totalMs: ms('28 minutes'),
        stepMs: ms('100 seconds'),
    },
    providerOptions: {
        google: {
            thinkingConfig: {
                thinkingLevel: 'low',
            },
        },
    },
})


A few things worth noting here. The step limit of 60 is high -- complex orders with drop-ship splitting genuinely need 40+ tool calls. The per-step timeout of 100 seconds catches LLM hangs without killing legitimate long-running operations. And thinkingLevel: 'low' was a deliberate optimization: we tested all levels and found that low produced equivalent accuracy at 2.2x the speed for this task. More on evaluation later.

Authenticating with NetSuite

Before you can build anything useful on NetSuite, you need to survive OAuth 1.0 authentication with HMAC-SHA256. This is not straightforward.

Critical gotcha: Query parameters must be included in the OAuth signature base string, but only oauth_* parameters belong in the Authorization header. This one cost us a full day of debugging. Our solution:

async makeAuthenticatedRequest(method, endpoint, queryParams?, body?) {
    const url = `${this.config.baseUrl}${endpoint}`
    const oauthParams: Record<string, string> = {
        oauth_consumer_key: this.config.consumerKey,
        oauth_token: this.config.accessToken,
        oauth_signature_method: 'HMAC-SHA256',
        oauth_timestamp: getTimestamp(),
        oauth_nonce: generateNonce(),
        oauth_version: '1.0',
    }

    // Query params must be included in signature computation
    if (queryParams) {
        Object.entries(queryParams).forEach(([key, value]) => {
            oauthParams[key] = value
        })
    }

    const normalizedParams = normalizeParameters(oauthParams)
    const baseString = buildSignatureBaseString(method, url, normalizedParams)
    const signature = generateSignature(
        baseString,
        this.config.consumerSecret,
        this.config.accessTokenSecret,
    )

    oauthParams.oauth_signature = signature

    // But only oauth_* params go in the Authorization header
    const authHeader = buildAuthorizationHeader(oauthParams, this.config.accountId)

    const queryString = queryParams
        ? '?' + new URLSearchParams(queryParams).toString()
        : ''
    return fetch(url + queryString, {
        method,
        headers: {
            Authorization: authHeader,
            'Content-Type': 'application/json',
        },
        body: body ? JSON.stringify(body) : undefined,
    })
}
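For reference, here is a sketch of the helper functions that snippet assumes, following RFC 5849's rules for HMAC-SHA256 signing (percent-encode, sort, join). Our production versions differ only in detail:

```typescript
import { createHmac } from 'node:crypto'

// RFC 3986 percent-encoding -- stricter than encodeURIComponent,
// which leaves !'()* unencoded.
function rfc3986Encode(value: string): string {
    return encodeURIComponent(value).replace(/[!'()*]/g, c =>
        '%' + c.charCodeAt(0).toString(16).toUpperCase()
    )
}

// Encode, sort by key, and join into key=value pairs.
function normalizeParameters(params: Record<string, string>): string {
    return Object.entries(params)
        .map(([k, v]) => [rfc3986Encode(k), rfc3986Encode(v)] as const)
        .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
        .map(([k, v]) => `${k}=${v}`)
        .join('&')
}

// METHOD&encoded-url&encoded-params, per the OAuth 1.0 spec.
function buildSignatureBaseString(method: string, url: string, normalized: string): string {
    return [method.toUpperCase(), rfc3986Encode(url), rfc3986Encode(normalized)].join('&')
}

// HMAC-SHA256 over the base string; the key is both secrets joined by '&'.
function generateSignature(baseString: string, consumerSecret: string, tokenSecret: string): string {
    const key = `${rfc3986Encode(consumerSecret)}&${rfc3986Encode(tokenSecret)}`
    return createHmac('sha256', key).update(baseString).digest('base64')
}
```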


More NetSuite surprises

A POST that creates a sales order returns HTTP 204 with no body. The created record ID is in the Location header. If you only parse response.json(), you'll miss it entirely.
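A minimal sketch of handling that (the helper name is ours, not NetSuite's):

```typescript
// The new record's internal ID is the last path segment of the
// Location header, e.g. ".../record/v1/salesOrder/12345" -> "12345".
function extractRecordId(response: Response): string {
    const location = response.headers.get('Location')
    if (!location) {
        throw new Error('Expected a Location header on the 204 response')
    }
    return location.split('/').pop()!
}
```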

We wrote a comprehensive test suite that validates our signature generation against NetSuite's documented reference values. If you're building a NetSuite integration, write these tests first.

Designing Tools for the Agent

The tools are the most important part of the system. Each tool is a function the LLM can call, with a Zod schema defining its inputs and a natural language description guiding when and how to use it.

Our agent has seven tools; among them:

Tool            Purpose
find_entity     Search NetSuite entities by name, email, or address
submit_order    Create a sales order in NetSuite
done            Signal task completion with success status and reasoning

Design Decision: Letting the LLM Write SQL

For search tools, we made a deliberate choice: the LLM writes raw SuiteQL conditions rather than filling structured filter parameters.

export function findEntitiesTool(netsuiteClient: NetSuiteClient) {
    return tool({
        description: `Find NetSuite entities by email address or any
            other suiteql condition. The condition will be inserted into
            the where clause of the query. Queryable fields include:
            (e.email, e.entityid, e.fullname, e.phone, ea.state,
            ea.city, ea.zip). Do not use e.companyname with this tool.
            Use the findCustomers tool for companyname searches.
            The results are limited to 20 records.`,
        inputSchema: z.object({
            suiteqlCondition: z.string().describe(
                "The suiteql condition to find the contact by. " +
                "Example: e.fullname like '%John%' and " +
                "e.email like '%@example.com'"
            ),
        }),
        execute: async ({ suiteqlCondition }) => {
            // make a POST request against /services/rest/query/v1/suiteql
        },
    })
}

Why give an LLM raw SQL access? Because customer lookup is the hardest part of the process. Customers have multiple accounts, inconsistent naming, and various email addresses. The agent needs to construct complex search conditions: e.fullname like '%Houston%' and ea.state = 'TX', or search by email domain when the name doesn't match. A structured filter API would be too restrictive.

This is safer than it sounds. The SuiteQL endpoint (/services/rest/query/v1/suiteql) only accepts SELECT queries -- no mutations are possible regardless of what the LLM generates. When the agent writes an invalid query, NetSuite returns an error, which flows back to the LLM as a tool output. The agent reads the error, adjusts its query, and retries. We've watched it recover from typos in column names, incorrect JOIN syntax, and overly broad LIKE patterns. It is quite satisfying to watch honestly. Results are capped at 20 rows to preserve the context window. If the agent doesn't find what it's looking for, it simply issues a different query with adjusted conditions -- often narrowing or broadening the search based on what came back.
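The recovery loop works because the tool's execute never throws past the agent -- failures come back as strings the model can read. A minimal sketch (the helper and its wiring are assumptions):

```typescript
// Run a SuiteQL query and convert failures into tool output the LLM can
// act on, instead of aborting the agent loop.
async function safeSuiteQL(
    run: (condition: string) => Promise<unknown>,
    condition: string,
): Promise<string> {
    try {
        return JSON.stringify(await run(condition))
    } catch (err) {
        // The model reads this message and issues a corrected query.
        return `Query failed: ${(err as Error).message}. Adjust the condition and retry.`
    }
}
```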

The query selects specific columns rather than SELECT *. This is a context window optimization: NetSuite item records have 100+ fields, but the agent only needs ~15 to make decisions. Every token matters when the agent is processing 40+ tool calls.

Design Decision: Past Orders as Context

The most impactful tool is the one that fetches recent sales orders for a customer. This single tool solves three problems at once: entity disambiguation, warehouse selection, and duplicate detection.

export function findSalesOrdersTool(client: NetSuiteClient) {
    return tool({
        description: 'Get the 5 most recent sales orders for a given entity ID.',
        inputSchema: z.object({
            entityId: z.string().describe(
                'The NetSuite entity ID to search for sales orders.'
            ),
        }),
        execute: async ({ entityId }) => {
            return await findLast5SalesOrdersByEntity(entityId, client)
        },
    })
}

The implementation fetches the 5 most recent sales orders, then normalizes each one down to essential fields.

A raw NetSuite sales order has 200+ fields. The normalized version has 10. This normalization is critical -- without it, 5 sales orders would consume most of the agent's context window, leaving no room for reasoning.
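That normalization might look something like this -- the raw field names below are illustrative assumptions rather than NetSuite's exact schema:

```typescript
// Keep only the ~10 fields the agent needs for disambiguation,
// warehouse selection, and duplicate detection.
type NormalizedOrder = {
    tranId: string
    tranDate: string
    status: string
    otherRefNum: string
    memo: string
    locationId: number
    total: number
    items: { itemId: number; quantity: number }[]
}

function normalizeSalesOrder(raw: Record<string, any>): NormalizedOrder {
    return {
        tranId: raw.tranId,
        tranDate: raw.tranDate,
        status: raw.status?.refName ?? raw.status,
        otherRefNum: raw.otherRefNum ?? '',
        memo: raw.memo ?? '',
        locationId: raw.location?.id,
        total: raw.total,
        items: (raw.item?.items ?? []).map((line: any) => ({
            itemId: line.item?.id,
            quantity: line.quantity,
        })),
    }
}
```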

Design Decision: Typed Submission with Guardrails

The sales order submission tool uses a strict Zod schema that acts as both validation and documentation for the LLM:

export const createSalesOrderInputSchema = z.object({
    entity: z.object({
        id: z.number().int().positive()
            .describe('The customer / contact / company entity ID.'),
    }),
    item: z.object({
        items: z.array(z.object({
            item: z.object({
                id: z.number().int().positive(),
            }),
            quantity: z.number().positive().describe(
                'The quantity of the item. Must be accurately based on the email.'
            ),
        })),
    }),
    location: z.object({
        id: z.number().int().positive().describe(
            'The warehouse location ID. Based on previous orders of this client.'
        ),
    }),
    otherRefNum: z.string().describe(
        'The PO number from the client. If no PO, add a single word ' +
        'about the order contents, like "Tees".'
    ),
    memo: z.string().describe(
        'Fill this if there are special instructions for the warehouse.'
    ),
    shippingAddress: z.object({
        addressee: z.string(),
        addr1: z.string(),
        addr2: z.string().optional(),
        city: z.string(),
        state: z.string(),
        zip: z.string(),
        country: z.string().default('US'),
    }).optional().describe(
        'Override shipping address for drop ship orders only.'
    ),
})

The .describe() annotations on each field are not documentation for developers -- they're instructions for the LLM. The description on location.id says "Based on previous orders of this client," which teaches the agent the decision-making process, not just the data format. This is a subtle but important point: your Zod schemas become part of the prompt.

Design Decision: The done Tool as a Structured Exit

Every agent run must end by calling the done tool:

done: tool({
    description:
        'Call the done tool when you have completed the task by either ' +
        'submitting the sales order successfully or by failing to find ' +
        'required information.',
    inputSchema: z.object({
        success: z.boolean().describe('Whether the task was completed successfully.'),
        summary: z.string().describe(
            'A summary of the task and the actions taken. ' +
            'Explain your decisions (why did you pick a certain entity ID ' +
            'or location ID given multiple candidates).'
        ),
    }),
    execute: async ({ success, summary }) => {
        await db.email_agent_log.update({
            where: { trace_id: state.traceId },
            data: { agent_success: success, agent_reasoning: summary },
        })
        return 'thanks'
    },
})

If the agent hits the 60-step limit without calling done, we throw an error:

throw new Error(`Agent completed without calling done tool (traceId: ${state.traceId})`)

This error triggers a retry via our task queue. The done tool pattern gives us three things: guaranteed structured output (every run produces a success flag and reasoning), auditability (the agent explains why it made each decision), and retry semantics (failures are explicit and retryable, not silent).
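The retry itself can be as simple as a wrapper in the task queue worker -- a sketch with assumed names (real queues add backoff and dead-lettering):

```typescript
// Re-run an agent invocation that throws, e.g. because it finished
// without calling the done tool. The last error propagates if all
// attempts fail.
async function withRetries<T>(
    run: () => Promise<T>,
    maxAttempts = 3,
): Promise<T> {
    let lastError: unknown
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await run()
        } catch (err) {
            lastError = err
        }
    }
    throw lastError
}
```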

The System Prompt: A Living Business Rules Engine

The system prompt started as two sentences. After two months of production use, it's 50+ lines.

Some rules are better expressed in the system prompt (business logic the LLM needs to reason about). Others are better enforced in code (invariants that should never be violated regardless of LLM behavior). The line between them is a judgment call, but the principle is: if the LLM needs to decide, put it in the prompt. If it must always happen, put it in code.
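As a concrete example of the code side of that split, here is a sketch of an invariant check run on every submission payload regardless of what the model produced (field names mirror the submission schema shown earlier):

```typescript
// A code-enforced invariant: line quantities must be positive no matter
// what the model emits. This never belongs in the prompt alone.
function assertValidOrder(order: { item: { items: { quantity: number }[] } }): void {
    for (const line of order.item.items) {
        if (line.quantity <= 0) {
            throw new Error('Invariant violated: non-positive line quantity')
        }
    }
}
```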

Side-Channel State: Working Around Framework Limitations

The Vercel AI SDK's tool-calling loop treats tool results as opaque -- the orchestrator can't easily inspect what tools returned. We needed a way to extract structured data (created order IDs, warnings) from tool executions. The solution is a mutable state object passed into tools:

type AgentState = {
    salesOrderIds: string[]
    traceId: string
    netsuiteWarning?: string
    isQuote: boolean
}

// In the submit tool:
execute: async (order) => {
    const result = await submitSalesOrder(client, order, state)
    if (result.salesOrderId) {
        state.salesOrderIds.push(result.salesOrderId)
    }
    return result.message  // LLM sees this
}

// After agent completes:
return {
    ...doneToolInput,
    salesOrderIds: state.salesOrderIds,  // orchestrator sees this
    netsuiteWarning: state.netsuiteWarning,
}

The LLM receives a string message ("Sales order created successfully with ID: 12345"). The orchestrator receives the actual structured data via the state object. This separation prevents the LLM from hallucinating order IDs while still giving the outer system reliable data to work with. Simple pattern but it saved us a lot of headaches.

The Feedback Loop: Human Corrections as Agent Memory

After the agent creates an order, a human reviews it. If they spot an issue -- wrong entity, wrong location, missed item -- they can leave feedback. This feedback is stored and injected into future runs:

const memory = await getAgentMemory()

const systemPrompt = `...
## Shared memory
This section contains feedback items added by employees on past agent runs.
They might contain hints on how to reply or how to deal with certain edge cases.
Treat them with caution as they might be outdated or ambiguous and they were
not reviewed by your developers.
<memory>
${memory}
</memory>
`


This is lightweight RLHF in spirit: the model isn't fine-tuned, but it does learn from corrections via in-context learning. The caveat in the prompt ("treat them with caution") is important -- employee feedback can be contradictory or outdated, and the model needs to weigh it against its other instructions.

The most surprising evolution was building a separate feedback agent that sits between the employee and the memory store. When an employee leaves a correction like "this customer always ships from warehouse 6," the feedback agent reads the existing memory entries, the agent's source code, and the current system prompt to phrase the new memory entry in a way the order-entry agent will actually follow. It understands the schema, the field names, the tool structure. Kind of meta but it works remarkably well.

We took this further: if the feedback agent identifies a pattern that looks like a bug or a missing capability -- something that can't be fixed with a memory entry alone -- it automatically creates a GitHub issue with a description of the problem and a suggested code change. The system is effectively iterating on itself. Employees provide business-level feedback ("you picked the wrong warehouse"), the feedback agent translates that into either a memory entry the order agent can use, or an engineering ticket for the development team. The boundary between runtime configuration and code changes becomes a judgment call made by an LLM. It sounds crazy when I write it out but it just works.

Evaluation: How Do You Know It's Working?

This is a topic that deserves its own post (and will get one), but I want to mention it briefly because it's a question we get asked a lot. How do you test an AI agent that makes decisions?

We've experimented with various approaches. Unit tests for individual tools, integration tests against a NetSuite sandbox, replay tests using real email inputs. All useful, all insufficient on their own. The most interesting thing we built is an LLM-based evaluation suite where a separate LLM acts as judge of correctness -- it looks at the input email, the agent's actions, and the resulting sales order, and evaluates whether the agent made the right decisions. Think of it as an AI grading another AI's homework. There are subtleties around making this reliable and avoiding the judge having the same blind spots as the agent. We'll write that up properly in a separate article.

What We Learned About Frontier LLMs in Production

After two months of processing real orders, some observations.

What LLMs are good at: parsing unstructured emails with wildly inconsistent formatting, constructing complex search queries to disambiguate entities, following multi-step business processes with branching logic, and reasoning about which of several candidate entities is the correct one.

What surprised us: thinkingLevel: 'low' on Gemini produces equivalent accuracy at half the latency for structured tool-calling tasks. Extended thinking helps more for open-ended reasoning. The agent handles Excel attachments (converted to CSV/markdown) as reliably as plain-text emails. Customer disambiguation -- not order entry -- is the hard problem. 70% of our prompt engineering went into entity resolution. The 60-step limit has never been hit on a legitimate order. When the agent exceeds 40 steps, it's almost always stuck in a search loop, and the retry usually succeeds.

The feedback-agent-to-GitHub-issue pipeline was an accident that became a feature. Employee corrections now regularly surface edge cases we'd never have found through testing alone. The system improves faster than we can write code, because the feedback loop runs on every single order.

The Numbers

Before: ~15 minutes per complex order, manual entry.
After: ~1 minute per order, fully automated.
Team: 15 order entry specialists, 150+ hours per week saved.
Token cost: less than $20 per month (compare that to hourly rates).
Every order reviewed by humans, with a feedback loop for continuous improvement.

The humans haven't been removed from the process -- they've been promoted from data entry to quality review and customer success. They review agent output, provide feedback, and handle the genuinely ambiguous cases that require customer communication.

Getting Started

If you're considering building a NetSuite agent, here is what I'd tell you.

Start by mapping your process in detail. Make sure you have a deep understanding of edge cases, exceptions, gaps in documentation. Gralio AI does exactly that.

Design your tools around the decision process, not the API surface. Our tools don't mirror NetSuite's REST API -- they mirror how a human thinks about the task (find customer, check their history, look up items, submit).

Let the LLM write queries. Structured filter parameters are safer but too limiting for real-world entity search. SuiteQL conditions give the agent the flexibility it needs.

Remove capabilities the LLM misuses. Don't validate bad output -- eliminate the possibility of bad output. If the model can't calculate prices correctly, don't let it try.

Log everything before submission. Save the full payload to your database before calling NetSuite. When something goes wrong (it will), you'll need the audit trail.
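In sketch form (the logger and submit function here are stand-ins for your own persistence layer and NetSuite client):

```typescript
// Write the audit record first, then submit. If the process dies between
// the two calls, the payload is still on disk for forensics.
async function submitWithAudit(
    payload: object,
    writeLog: (entry: string) => Promise<void>,
    createOrder: (p: object) => Promise<string>,
): Promise<string> {
    await writeLog(JSON.stringify({ at: new Date().toISOString(), payload }))
    return createOrder(payload)
}
```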

Build the feedback loop from day one. Human review isn't a crutch -- it's the training signal that makes the system improve over time. This was probably our single best architectural decision.

Written by Tymon Terlikiewicz, CTO of Gralio. Gralio focuses on mapping processes in detail, so that you can start building agentic automations with confidence.
