What Is A/B Testing? A Practical Guide to Split-Testing Features in AI-Built Apps
A/B testing shows different versions of your app to different users so you can measure what actually works — not just guess. When you ask AI to build one, here's what it does, what it misses, and how to make it production-ready.
TL;DR
A/B testing means splitting your users into two groups and showing each group a different version of something — a button, a headline, an entire page layout. You track which version gets more signups, clicks, or purchases. The winning version becomes the default. AI can scaffold A/B tests quickly, but it almost always forgets to persist user assignments (so users bounce between variants), skips sample size calculations (so you draw conclusions from 47 visitors), and ignores statistical significance (so you "pick a winner" based on random noise). Fix those three things and you have a real experiment.
Why AI Coders Need to Know This
You built something. It's live. People are using it. Now the question every builder hits: should the signup button say "Get Started" or "Try It Free"? Should the pricing page show monthly or annual pricing first? Should the hero section have a video or a screenshot?
Your gut has an opinion. Your gut is wrong about 60% of the time. That's not a made-up number — Google, Netflix, and Microsoft have all published research showing that most ideas people think will improve metrics actually don't. Microsoft found that only about a third of A/B tested ideas at Bing actually improved the metrics they targeted.
A/B testing replaces gut feelings with data. Instead of arguing about which headline is better, you show both to real users and let the numbers decide.
Here's why this matters specifically for AI-built apps:
- AI generates options fast. You can prompt Claude to create five different landing page variants in minutes. But which one actually converts? Without A/B testing, you're just picking the one that looks best to you.
- AI-built apps iterate fast. You're shipping changes weekly, maybe daily. A/B testing tells you whether each change actually improved things or made them worse.
- You're not a UX researcher. And you don't need to be. A/B testing gives you a structured way to make design decisions without a decade of UX experience.
- Your users are telling you what works — you just need a way to listen. A/B testing is that listening mechanism.
The concept is simple. The implementation has gotchas. Let's walk through what happens when you ask AI to build one.
Real Scenario: You Ask AI to Add A/B Testing to Your Signup Flow
You've got a SaaS app. The signup page converts at about 3% — meaning 3 out of every 100 visitors actually create an account. You want to test whether a shorter signup form (just email, no name field) performs better.
Here's the prompt you'd give your AI:
Prompt
I want to A/B test my signup form. Version A is the current form (email + name fields). Version B is just an email field. I need to: randomly assign each visitor to A or B, make sure they always see the same version, track how many people in each group actually sign up, and show me the results. I'm using Node.js with Express and PostgreSQL.
This is a solid prompt because it specifies: the two variants, the requirement for consistent assignment, the metric to track (signups), and the tech stack. Let's see what AI generates.
What AI Generated
When you give Claude this prompt, it produces a complete A/B testing system. Here's the core of what it generates:
-- Database table to track experiments and assignments
CREATE TABLE ab_experiments (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL UNIQUE,
variant_a_name VARCHAR(100) DEFAULT 'control',
variant_b_name VARCHAR(100) DEFAULT 'variant',
status VARCHAR(20) DEFAULT 'active', -- active, paused, completed
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE ab_assignments (
id SERIAL PRIMARY KEY,
experiment_id INTEGER REFERENCES ab_experiments(id),
user_identifier VARCHAR(255) NOT NULL, -- cookie ID or user ID
variant VARCHAR(1) NOT NULL, -- 'A' or 'B'
assigned_at TIMESTAMP DEFAULT NOW(),
UNIQUE(experiment_id, user_identifier)
);
CREATE TABLE ab_events (
id SERIAL PRIMARY KEY,
experiment_id INTEGER REFERENCES ab_experiments(id),
user_identifier VARCHAR(255) NOT NULL,
variant VARCHAR(1) NOT NULL,
event_type VARCHAR(50) NOT NULL, -- 'view', 'signup', 'click'
created_at TIMESTAMP DEFAULT NOW()
);
-- Insert our experiment
INSERT INTO ab_experiments (name) VALUES ('signup-form-short');
// ab-testing.js — A/B test middleware and helpers
const crypto = require('crypto');
const { Pool } = require('pg');
const pool = new Pool({
connectionString: process.env.DATABASE_URL
});
// Get or create a visitor ID from cookies
function getVisitorId(req, res) {
let visitorId = req.cookies?.ab_visitor;
if (!visitorId) {
visitorId = crypto.randomUUID();
res.cookie('ab_visitor', visitorId, {
maxAge: 90 * 24 * 60 * 60 * 1000, // 90 days
httpOnly: true,
secure: process.env.NODE_ENV === 'production',
sameSite: 'lax'
});
}
return visitorId;
}
// Assign a visitor to a variant (or return existing assignment)
async function getVariant(experimentName, visitorId) {
// Check for existing assignment first
const existing = await pool.query(
`SELECT variant FROM ab_assignments a
JOIN ab_experiments e ON e.id = a.experiment_id
WHERE e.name = $1 AND a.user_identifier = $2`,
[experimentName, visitorId]
);
if (existing.rows.length > 0) {
return existing.rows[0].variant;
}
// No existing assignment — randomly assign
const variant = Math.random() < 0.5 ? 'A' : 'B';
const inserted = await pool.query(
`INSERT INTO ab_assignments (experiment_id, user_identifier, variant)
SELECT id, $2, $3 FROM ab_experiments WHERE name = $1
ON CONFLICT (experiment_id, user_identifier) DO NOTHING
RETURNING variant`,
[experimentName, visitorId, variant]
);
// If a concurrent request won the insert race, return the variant it stored
if (inserted.rows.length === 0) {
const stored = await pool.query(
`SELECT a.variant FROM ab_assignments a
JOIN ab_experiments e ON e.id = a.experiment_id
WHERE e.name = $1 AND a.user_identifier = $2`,
[experimentName, visitorId]
);
return stored.rows[0]?.variant ?? variant;
}
return variant;
}
// Track an event (page view, signup, etc.)
async function trackEvent(experimentName, visitorId, eventType) {
await pool.query(
`INSERT INTO ab_events (experiment_id, user_identifier, variant, event_type)
SELECT e.id, $2, a.variant, $3
FROM ab_experiments e
JOIN ab_assignments a ON a.experiment_id = e.id
WHERE e.name = $1 AND a.user_identifier = $2`,
[experimentName, visitorId, eventType]
);
}
// Get results for an experiment
async function getResults(experimentName) {
const results = await pool.query(
`SELECT
a.variant,
COUNT(DISTINCT a.user_identifier) AS total_assigned,
COUNT(DISTINCT CASE WHEN ev.event_type = 'view'
THEN ev.user_identifier END) AS views,
COUNT(DISTINCT CASE WHEN ev.event_type = 'signup'
THEN ev.user_identifier END) AS conversions
FROM ab_assignments a
JOIN ab_experiments e ON e.id = a.experiment_id
LEFT JOIN ab_events ev ON ev.experiment_id = e.id
AND ev.user_identifier = a.user_identifier
WHERE e.name = $1
GROUP BY a.variant
ORDER BY a.variant`,
[experimentName]
);
return results.rows.map(row => ({
variant: row.variant,
assigned: parseInt(row.total_assigned),
views: parseInt(row.views),
conversions: parseInt(row.conversions),
conversionRate: row.views > 0
? (parseInt(row.conversions) / parseInt(row.views) * 100).toFixed(2) + '%'
: '0%'
}));
}
module.exports = { getVisitorId, getVariant, trackEvent, getResults };
// In your Express routes
const { getVisitorId, getVariant, trackEvent, getResults } = require('./ab-testing');
const cookieParser = require('cookie-parser');
app.use(cookieParser());
// Signup page — serves the right variant
app.get('/signup', async (req, res) => {
const visitorId = getVisitorId(req, res);
const variant = await getVariant('signup-form-short', visitorId);
// Track the page view
await trackEvent('signup-form-short', visitorId, 'view');
// Render the appropriate variant
res.render('signup', { variant });
});
// Signup submission — track the conversion
app.post('/signup', async (req, res) => {
const visitorId = getVisitorId(req, res);
// Track the conversion event
await trackEvent('signup-form-short', visitorId, 'signup');
// ... actual signup logic here
});
// Results dashboard (protect this in production!)
app.get('/admin/ab-results', async (req, res) => {
const results = await getResults('signup-form-short');
res.json(results);
});
That's a complete, working A/B testing system. Three database tables, a helper module, and route integration. Let's break down what each piece does.
Understanding Each Part
The Database Layer: Your Experiment's Memory
Three tables work together here, and understanding the split matters:
- ab_experiments — The experiment registry. Each row is one test you're running. The status column lets you pause or stop experiments without deleting data. Think of this as the "what are we testing?" table.
- ab_assignments — The "who sees what?" table. When a visitor first hits your signup page, they get randomly assigned to A or B, and that assignment is stored here. The UNIQUE constraint on (experiment_id, user_identifier) is critical — it prevents a user from being assigned to both groups. This is the piece AI sometimes forgets.
- ab_events — The "what happened?" table. Every page view and every signup gets logged here with the visitor's variant. This is your raw data for calculating conversion rates.
If you've read our guide to PostgreSQL, you'll recognize the pattern — structured tables with foreign key relationships. The database does the heavy lifting of remembering who's in which group.
Visitor Identity: The Cookie
The getVisitorId function solves a fundamental problem: how do you know if this is the same person who visited yesterday?
The answer is a cookie. When someone visits for the first time, you generate a random ID (using crypto.randomUUID()) and store it in their browser as a cookie. Next time they visit, the cookie is sent back automatically, and you know it's the same person.
The 90-day expiration means the test can run for up to three months before cookies start expiring. This ties directly into how session management works — A/B testing is essentially a specialized form of "remembering a visitor across requests."
The Assignment Logic: Fair Random Splits
The getVariant function has a critical two-step flow:
- Check if this visitor already has an assignment. If yes, return it. This ensures they always see the same variant.
- If no existing assignment, randomly pick A or B. Math.random() < 0.5 gives a roughly 50/50 split. Store the assignment so future visits return the same variant.
The ON CONFLICT DO NOTHING clause handles race conditions — if two requests come in simultaneously for the same visitor (which happens more than you'd think), the database keeps only the first assignment, and getVariant falls back to returning whichever variant actually got stored.
Event Tracking: Counting What Matters
The trackEvent function logs two types of events: views (someone saw the signup page) and signups (someone actually signed up). The conversion rate is signups divided by views.
Why track views separately instead of just using assignments? Because not every assigned user actually sees the page. Someone might get a cookie on the homepage but never navigate to signup. Tracking views gives you the accurate denominator for your conversion rate.
The Results Query: Making Sense of the Data
The getResults function runs a single SQL query that groups everything by variant and counts unique users for each event type. You get output like:
[
{ "variant": "A", "assigned": 1247, "views": 1089, "conversions": 33, "conversionRate": "3.03%" },
{ "variant": "B", "assigned": 1253, "views": 1102, "conversions": 48, "conversionRate": "4.36%" }
]
In this example, the shorter form (B) converts at 4.36% vs the original's 3.03% — a 44% improvement. But is that real, or just random chance? That's where statistical significance comes in (and where AI drops the ball).
What AI Gets Wrong About A/B Testing
AI can scaffold A/B test code in minutes. But it consistently makes the same mistakes that lead to false conclusions and wasted effort. Here are the big ones:
1. No Statistical Significance Check
This is the biggest one. The code above tells you the conversion rates, but it doesn't tell you whether the difference is statistically significant — meaning, is the difference real, or could it just be random variation?
If you flip a coin 10 times and get 7 heads, that doesn't mean the coin is rigged. You need enough flips. Same with A/B testing — you need enough visitors.
What AI Should Have Included
A statistical significance calculator. The standard approach uses a chi-squared test or z-test for proportions. You want a p-value below 0.05 (roughly: if there were truly no difference between the variants, you'd see a gap this large less than 5% of the time by chance). Ask your AI: "Add a statistical significance calculation to the results endpoint using a z-test for proportions."
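If you'd rather compute it inside your own results endpoint, here's a minimal sketch of a two-proportion z-test in plain JavaScript. It's a sketch, not a vetted stats library: the normalCdf helper is a standard textbook approximation (Abramowitz and Stegun), and the example numbers are the ones from the results section above.

// significance.js — two-proportion z-test (a sketch; sanity-check against a stats tool)
// Returns the two-sided p-value for the difference between two conversion rates.
function twoProportionZTest(conversionsA, viewsA, conversionsB, viewsB) {
  const pA = conversionsA / viewsA;
  const pB = conversionsB / viewsB;
  const pPooled = (conversionsA + conversionsB) / (viewsA + viewsB);
  const standardError = Math.sqrt(pPooled * (1 - pPooled) * (1 / viewsA + 1 / viewsB));
  const z = (pB - pA) / standardError;
  return 2 * (1 - normalCdf(Math.abs(z))); // two-sided p-value
}

// Standard normal CDF, Abramowitz & Stegun polynomial approximation (x >= 0)
function normalCdf(x) {
  const t = 1 / (1 + 0.2316419 * x);
  const d = 0.3989422804 * Math.exp(-x * x / 2);
  const poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
  return 1 - d * poly;
}

// The numbers from the example results above: 33/1089 vs 48/1102
console.log(twoProportionZTest(33, 1089, 48, 1102)); // ≈ 0.10 — the 44% lift is not yet significant

Note what that example shows: a 44% relative improvement can still fail the 0.05 threshold when the sample is small, which is exactly why the raw conversion rates alone aren't enough.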
2. No Minimum Sample Size Calculation
AI will let you "pick a winner" after 50 visitors. That's not a test — that's a coin flip. For a baseline conversion rate of 3% and a minimum detectable effect of 1 percentage point (3% → 4%), you need approximately 4,800 visitors per variant.
For a small app getting 100 visitors per day, that's 96 days of testing. AI never mentions this. It lets you believe you can run a test over a weekend.
Prompt to Fix This
"Before I launch this A/B test, calculate the minimum sample size I need. My current conversion rate is 3%, and I want to detect at least a 1 percentage point improvement with 95% confidence and 80% power. How many visitors per variant do I need, and how long will that take at 200 visitors per day?"
3. Assignment Leakage (The Flickering Problem)
Sometimes AI generates A/B tests where the variant is determined on the client side with JavaScript. This causes a visible "flicker" — the page loads with variant A, then JavaScript swaps it to variant B. Users see the switch. It's jarring, it pollutes your data, and it makes the test unreliable.
The fix: always determine the variant on the server before rendering the page. The code above does this correctly by assigning variants in the Express route handler before calling res.render().
4. No Test Duration Limits
AI generates code that runs tests forever. In practice, you should set a maximum duration (typically 2–4 weeks) and a minimum sample size. If you haven't reached significance after the max duration, the test is inconclusive — not a failure, just not enough signal.
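One lightweight way to enforce a duration limit is a scheduled query that flips the status column the schema already has. This is a sketch that assumes it lives in ab-testing.js next to the existing pool; you'd also want getVariant to stop assigning new visitors once status is no longer 'active'.

// Auto-complete experiments that have run past the maximum duration (sketch)
const MAX_TEST_DAYS = 28; // 4 weeks

async function completeExpiredExperiments() {
  await pool.query(
    `UPDATE ab_experiments
     SET status = 'completed'
     WHERE status = 'active'
       AND created_at < NOW() - ($1::int * INTERVAL '1 day')`,
    [MAX_TEST_DAYS]
  );
}

// Run once a day: from a cron job, or a setInterval on server startup
setInterval(completeExpiredExperiments, 24 * 60 * 60 * 1000);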
5. Testing Too Many Things at Once
If you change the button color AND the headline AND the form fields AND the page layout, and variant B wins — what actually caused the improvement? You have no idea. AI will happily generate a "variant B" that changes fifteen things. Test one change at a time.
6. No Consideration for Logged-In vs. Anonymous Users
The cookie-based approach works for anonymous visitors. But what if someone visits on their phone (cookie A), then signs up on their laptop (new cookie, might get B)? They're now in both groups.
For logged-in users, you should assign variants based on the user ID, not a cookie. A common approach is hashing the user ID: variant = hash(userId + experimentName) % 2 === 0 ? 'A' : 'B'. This is deterministic — the same user always gets the same variant, across devices.
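Here's what that looks like in practice: a sketch using Node's built-in crypto module instead of the inline pseudo-hash above, with a made-up user ID for illustration.

// Deterministic assignment for logged-in users: hash the user ID plus the
// experiment name, then use one byte of the digest for a stable ~50/50 split.
const crypto = require('crypto');

function variantForUser(userId, experimentName) {
  const digest = crypto
    .createHash('sha256')
    .update(`${userId}:${experimentName}`)
    .digest();
  return digest[0] % 2 === 0 ? 'A' : 'B';
}

// The same user gets the same variant on every device, no cookie involved
variantForUser('user_8421', 'signup-form-short'); // always the same answer for this pair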
How to Debug A/B Tests with AI
A/B tests can fail silently. The code runs, numbers appear, but the data is garbage. Here's how to catch problems early.
Problem: Uneven Split
If variant A has 2,000 users and variant B has 500, something is broken in the assignment logic.
Debug Prompt
"My A/B test has 2,147 users in variant A but only 489 in variant B. The assignment uses Math.random() < 0.5. Here's my getVariant function: [paste code]. Why is the split uneven?"
Common cause: the existing-assignment check is failing, so visitors get re-assigned on every visit. With a genuine 50/50 random draw, thousands of visitors land very close to even, so a 4:1 split usually means something non-random is creeping in: bot traffic that never sends the cookie back, or the assignment insert silently failing on one code path.
Problem: Conversion Rate Is Zero for Both Variants
You see views being tracked but no conversions.
Debug Prompt
"My A/B test shows 1,200 views but 0 conversions for both variants. Here's my signup route: [paste code]. The trackEvent call is in the POST handler. Is the visitor ID being correctly passed from the form view to the form submission?"
Common cause: the cookie isn't being read on the POST request, so trackEvent uses a new visitor ID that has no matching assignment. Add cookieParser() middleware if it's not already included, and verify the cookie is being sent with the POST request (check the sameSite attribute).
Problem: Results Look Too Good to Be True
Variant B shows a 300% improvement. Before you celebrate:
Debug Prompt
"Variant B of my A/B test shows a 15% conversion rate vs variant A's 3%. That's a 5x improvement, which seems unrealistic. Here's the results query: [paste SQL]. Could there be a counting error? Are we double-counting conversions? Is one variant getting bot traffic?"
Common causes: duplicate event tracking (the conversion fires twice), bot traffic hitting one variant more than the other, or a wrong view count that throws off the denominator. Check your ab_events table for duplicate entries per user per event type.
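A quick way to run that duplicate check, as a sketch that reuses the same pool connection from ab-testing.js:

// Debug helper: list visitors with more than one recorded signup for an experiment
async function findDuplicateConversions(experimentName) {
  const { rows } = await pool.query(
    `SELECT ev.user_identifier, ev.variant, COUNT(*) AS times_recorded
     FROM ab_events ev
     JOIN ab_experiments e ON e.id = ev.experiment_id
     WHERE e.name = $1 AND ev.event_type = 'signup'
     GROUP BY ev.user_identifier, ev.variant
     HAVING COUNT(*) > 1`,
    [experimentName]
  );
  return rows; // any row here means a conversion was recorded more than once
}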
Problem: Users Report Seeing Different Versions on Different Visits
This means assignments aren't persisting. Check three things:
- Is the ab_visitor cookie being set with a long enough maxAge?
- Is the cookie being cleared or capped by browser privacy features? (Safari's Intelligent Tracking Prevention caps cookies created from client-side JavaScript at 7 days; the server-set cookie above avoids that cap, but users can still clear cookies or browse privately.)
- Is the existing-assignment database query running before the random assignment?
If you're using error monitoring, add a log entry whenever a visitor gets re-assigned. That'll tell you exactly when and how often assignments are being lost.
Using AI to Analyze Results
Once your test has enough data, you can ask AI to analyze the results instead of building statistical analysis into the app:
Analysis Prompt
"Here are my A/B test results after 3 weeks: Variant A: 4,231 views, 127 signups (3.0%). Variant B: 4,198 views, 176 signups (4.19%). Is this statistically significant? Calculate the p-value using a two-proportion z-test. Should I declare a winner or keep running the test?"
This is a good use of AI: a two-proportion z-test is a standard, well-documented calculation, and you can sanity-check the result against any online significance calculator. You get an answer in seconds instead of building a statistics library into your app.
Beyond Basic A/B: What Comes Next
Once you've run your first A/B test, you'll naturally want to do more sophisticated experiments. Here's the vocabulary so you know what to ask AI for:
- Multivariate testing (MVT) — Testing more than two variants. A/B/C/D testing. Requires proportionally more traffic.
- Feature flags — A broader system where A/B testing is one use case. Tools like LaunchDarkly, PostHog, and Unleash provide this. Ask AI: "Add PostHog feature flags to my Express app."
- Segmented testing — Running different tests for different user groups (mobile vs desktop, new vs returning). Uses the same infrastructure but adds targeting rules.
- Bandit algorithms — Instead of a fixed 50/50 split, automatically send more traffic to the winning variant over time. This is more complex but wastes less traffic on the losing variant (see the sketch just after this list).
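To give a feel for how small the core of the bandit idea is, here's an epsilon-greedy sketch. It's not production code: the stats argument is assumed to be shaped like the getResults() output from earlier, and real bandit systems need guardrails this leaves out.

// Epsilon-greedy sketch: most traffic goes to the current best variant,
// a small slice keeps exploring so the "loser" still gets a chance.
function pickVariantBandit(stats, epsilon = 0.1) {
  // Explore: with probability epsilon, pick a variant uniformly at random
  if (Math.random() < epsilon) {
    return stats[Math.floor(Math.random() * stats.length)].variant;
  }
  // Exploit: otherwise pick the variant with the best conversion rate so far
  const best = stats.reduce((a, b) =>
    (b.conversions / Math.max(b.views, 1)) > (a.conversions / Math.max(a.views, 1)) ? b : a
  );
  return best.variant;
}

// With the example numbers from earlier, this returns 'B' about 95% of the time
pickVariantBandit([
  { variant: 'A', views: 1089, conversions: 33 },
  { variant: 'B', views: 1102, conversions: 48 },
]);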
For most AI-built apps, the basic setup shown above is more than enough. You can always upgrade to a third-party service when your traffic justifies it.
Rate Limiting Your Tracking Events
One thing to watch out for: if your app gets significant traffic, writing to the database on every single page view can strain your system. Consider batching events or using rate limiting on your tracking endpoints. A simple approach is to track events in memory and flush to the database every 10 seconds, or use a message queue for high-traffic apps.
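Here's a minimal sketch of the in-memory approach, reusing the trackEvent helper from earlier. The trade-off is stated in the comments: events buffered in memory are lost if the process restarts, and a higher-traffic version would combine each batch into a single multi-row INSERT or push to a queue.

// batched-tracking.js — buffer tracking events and flush them off the request path
const { trackEvent } = require('./ab-testing');

const eventBuffer = [];

// Call this from your routes instead of awaiting trackEvent directly
function trackEventBuffered(experimentName, visitorId, eventType) {
  eventBuffer.push({ experimentName, visitorId, eventType });
}

// Flush every 10 seconds; request handlers never wait on these writes.
// Note: anything still in the buffer is lost if the process crashes.
setInterval(() => {
  const batch = eventBuffer.splice(0, eventBuffer.length);
  for (const { experimentName, visitorId, eventType } of batch) {
    trackEvent(experimentName, visitorId, eventType).catch(console.error);
  }
}, 10_000);

module.exports = { trackEventBuffered };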
Using RAG for Smarter Analysis
If you're building AI-powered features into your app, you can use RAG (Retrieval-Augmented Generation) to build an internal analytics assistant. Feed your A/B test results into a vector database, and you can ask natural-language questions like "Which experiments improved signup conversion the most in Q1?" This is advanced, but it's a powerful pattern for data-driven teams.
Frequently Asked Questions
What is A/B testing in simple terms?
A/B testing means showing two different versions of something (a button color, a headline, a pricing page) to different users at the same time, then measuring which version gets better results. Version A is the control (what you have now), and Version B is the variant (what you're testing). The version that performs better wins. It's like having two doors into your store with different signs — you count which sign brings more people in.
How many users do I need for a valid A/B test?
For statistically meaningful results, you generally need at least 1,000 visitors per variant, though the exact number depends on your baseline conversion rate and the size of the improvement you're trying to detect. A common rule of thumb: if your conversion rate is around 3% and you want to detect a 1-percentage-point improvement, you'll need roughly 4,800 visitors per variant. Small apps with under 100 daily users may need to run tests for weeks or even months. AI almost never mentions sample size — always ask it to calculate before launching.
Can I A/B test without a third-party service?
Yes. A basic A/B test only needs three things: a way to assign users to groups (a cookie or user ID hash), logic to show different content based on the group, and a way to track which group converts. You can build this with a simple backend endpoint and a database table — exactly what the code in this article shows. Third-party services like LaunchDarkly, PostHog, or Optimizely add advanced features like statistical analysis, visual editors, and targeting rules, but they're not required to get started.
What is the difference between A/B testing and feature flags?
Feature flags control whether a feature is turned on or off for certain users. A/B testing specifically measures the impact of a change by comparing two groups and tracking conversion metrics. Feature flags are the mechanism; A/B testing is the experiment. In practice, A/B tests are often implemented using feature flags — the flag controls which variant a user sees, and analytics track the outcome. You can have feature flags without A/B testing (e.g., gradual rollouts), but A/B testing almost always uses some form of feature flagging.
Why does my A/B test show different results every time I refresh?
If you see different variants on each page load, your assignment isn't being persisted. The user needs to stay in the same group for the entire experiment. Check three things: (1) you're setting a cookie or storing the assignment in your database, (2) the cookie isn't expiring between visits or being blocked by browser privacy features, and (3) your code reads the stored assignment before generating a new random one. This is the most common bug in AI-generated A/B test code — the "check existing assignment" step either doesn't work or isn't there at all.
Last updated: March 26, 2026. Code examples tested with Node.js 22, Express 5, and PostgreSQL 16.