Every major design decision in melune traces back to applied linguistics or language assessment research. This page explains the most important findings in plain language — what they mean, where the evidence is strong, and where it is still developing. The goal is not to dress up product choices in academic language. It is to explain the actual tradeoffs.
Second-language acquisition research is consistent on one thing: interaction is necessary but not sufficient. Two major meta-analyses from 2006 and 2007 find large effect sizes for interaction-based L2 development — the question is no longer whether conversation helps, but which parts of it work for which learners on which targets. Not all conversation is equal.
The interactionist framework describes a chain: input → noticing → intake → integration → output. A learner has to notice a form before it can be acquired (Schmidt, 1990; Gass, 1997). Morphosyntactic forms — particles, case marking, verb position, agreement — are especially likely to slip by unnoticed during real-time speech because they carry low communicative weight. The conversation makes sense without them. This is why spontaneous conversation alone produces less grammar acquisition than structured interaction with a deliberate noticing mechanism built in.
The practical question this raises is: what should the tutor do inside the conversation? How it pushes, corrects, sequences tasks, and scaffolds determines whether a session produces learning or just elapsed time. melune is a system that can make those decisions consistently — which is precisely why the research on what those decisions should be matters so much.
melune treats the live session as an evidence-collection pass, not the main learning event. The conversation produces performance data; the replay, diagnosis, and pattern model are where that data becomes actionable. The tutor behavior in curriculum sessions is designed to push output on focus categories while preserving conversational flow — not to interrupt constantly, but not to let errors pass unaddressed either.
Schmidt's noticing hypothesis argues that learners cannot acquire a form they haven't consciously noticed. Gass's (1997) model adds the mechanism: input is apperceived when a learner recognizes a gap between what they know and what they're hearing; apperceived input becomes comprehended when they parse it for both meaning and form; comprehended input becomes intake through further psycholinguistic processing. The problem is that noticing is hard to achieve in real time during a fast conversation.
Research on what learners actually notice during interaction shows a consistent asymmetry: lexical and phonological features are noticed accurately; morphosyntactic features are often missed (Mackey, Gass & McDonough, 2000). A learner can communicate the past tense through context words without ever attending to the past-tense morpheme. This means that the correction most worth making is often the one least likely to be noticed if it arrives during live speech — and most likely to be noticed when it appears in a salient, unhurried post-session review.
There is also an individual-differences layer. Working memory and attention control moderate what gets noticed (Loewen & Sato, 2018). The same input can become intake for one learner and exit the system entirely for another. Post-session replay does not solve this, but it does give the learner a second, third, and fourth pass at forms that the real-time session couldn't reliably surface.
Replay is melune's principal noticing instrument. It exists to bring attention to exactly the forms that tend to slip by during real-time speech: particles, case endings, verb position, agreement morphology. Ranked corrections, replayable audio of the original moment, IPA comparisons, and metalinguistic explanations are all noticing aids — not polish. The replay section "Moments that mattered" surfaces the highest-diagnostic observations above the full transcript, so a learner can identify what to work on without reading every line.
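As a rough illustration of how a "Moments that mattered" list could be assembled, the sketch below ranks observations by a diagnostic score and keeps the top few. The `Observation` fields, the 0-to-1 `diagnostic_score`, and the `limit` parameter are assumptions made for the example, not melune's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    transcript_index: int    # position of the moment in the full transcript
    category: str            # e.g. "case_ending", "particle", "verb_position"
    diagnostic_score: float  # hypothetical 0-to-1 estimate of how informative the moment is


def moments_that_mattered(observations, limit=3):
    """Return the most diagnostic observations, best first.

    Ties are broken by transcript order so the result is deterministic.
    """
    ranked = sorted(
        observations,
        key=lambda o: (-o.diagnostic_score, o.transcript_index),
    )
    return ranked[:limit]


session = [
    Observation(4, "vocabulary", 0.20),
    Observation(9, "case_ending", 0.90),
    Observation(15, "verb_position", 0.70),
    Observation(21, "particle", 0.90),
]
top = moments_that_mattered(session, limit=2)
print([o.category for o in top])  # ['case_ending', 'particle']
```

The full transcript stays available underneath; the ranking only decides what the learner sees first.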
Oral corrective feedback has solid meta-analytic support. Russell & Spada (2006) found a very large effect size (d ≈ 1.16) for CF on grammatical development. Lyster & Saito (2010) found that prompts — output-eliciting moves like clarification requests and metalinguistic clues — produced larger effects than recasts (reformulations the tutor provides) in classroom studies. Effects are also more pronounced on delayed posttests than immediate ones, suggesting the benefits accumulate over time rather than appearing instantly (Mackey & Goo, 2007).
There is an important caveat from Nobuyoshi & Ellis (1993). In their small study, learners who received clarification requests contingent on specific past-tense errors showed large, durable gains. But one of the three pushed learners, described as functionally oriented, responded to clarification requests by reformulating the meaning, not the form. Structurally oriented learners take a nudge as a cue to repair the grammar; functionally oriented learners take it as a cue to paraphrase. A system that only tracks whether an error was followed by a different surface form will conflate these two very different outcomes.
Post-session feedback arrives after the learner has left real-time production pressure. Because it is metalinguistic and explicit, with explanations, it sits at the more salient end of the feedback spectrum. This is a deliberate design choice. In-session correction is available but is not the default, because preserving conversational flow matters for producing authentic performance evidence.
melune tracks repair attempts after tutor nudges, distinguishing "repaired after a small hint" from "repaired after explicit explanation" from "paraphrased around the problem." These outcomes carry different diagnostic weights. A learner who repairs form after a clarification request is in a different instructional position than one who consistently reroutes meaning to avoid the target structure. Both are useful signals; only one suggests the pattern is stabilizing.
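One way to picture the three repair outcomes and their different diagnostic weights is the sketch below. The enum names and the numeric weights are illustrative assumptions; the text above only says the outcomes are weighted differently, not what the weights are.

```python
from enum import Enum


class RepairOutcome(Enum):
    REPAIRED_AFTER_HINT = "repaired_after_hint"
    REPAIRED_AFTER_EXPLANATION = "repaired_after_explanation"
    PARAPHRASED_AROUND = "paraphrased_around"


# Hypothetical weights: how strongly each outcome suggests the pattern is stabilizing.
STABILIZATION_WEIGHT = {
    RepairOutcome.REPAIRED_AFTER_HINT: 1.0,
    RepairOutcome.REPAIRED_AFTER_EXPLANATION: 0.5,
    RepairOutcome.PARAPHRASED_AROUND: 0.0,
}


def stabilization_signal(outcomes):
    """Average the weights of the observed outcomes; None if there is no evidence yet."""
    if not outcomes:
        return None
    total = sum(STABILIZATION_WEIGHT[o] for o in outcomes)
    return total / len(outcomes)


history = [RepairOutcome.REPAIRED_AFTER_HINT, RepairOutcome.PARAPHRASED_AROUND]
print(stabilization_signal(history))  # 0.5
```

Keeping paraphrase as its own outcome, rather than counting it as "error followed by a different surface form," is what prevents the conflation described above.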
Task-based language teaching has a positive meta-analytic record across learning outcomes (Bryfonski & McKay, 2019). More specifically, information-gap tasks — where each participant holds information the other needs — produce more pushed output than opinion or topic chat, because genuine meaning negotiation is required. You cannot paraphrase your way to success if the other person needs a specific piece of information.
Task complexity research adds nuance. Varying resource-directing complexity — reasoning demand, displaced reference, number of elements — and resource-dispersing complexity — planning time, topic familiarity — independently affect language production. Abdi Tabari et al.'s (2024) meta-analysis of task sequencing found that simple-to-complex progressions improve syntactic accuracy and complexity in L2 production. The implication is that the same communicative goal, approached with gradually increasing cognitive demand, produces more durable grammatical improvement than topic chat at a fixed difficulty level.
Communicative success matters as a distinct outcome. A learner can accomplish a task — book a table, ask for directions, negotiate a price — with imperfect grammar. That success is meaningful evidence. An app that only counts correction events misses it entirely, and an app that blocks milestones for grammar errors when the communication worked is optimizing for the wrong thing.
Curriculum and scenario sessions are built around information-gap tasks with defined communicative outcomes, not generic conversation topics. Task sequencing takes account of both the learner's linguistic readiness and the task's cognitive complexity. Sessions are classified as challenge, confidence, or review, with scaffold fading as patterns stabilize. A learner can earn a milestone after repeatedly succeeding at a goal scenario; minor grammar errors don't block it if the communication worked.
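The milestone rule can be sketched as a check that counts communicative successes and deliberately ignores grammar error counts. The `ScenarioAttempt` fields and the `required_successes` threshold of three are assumptions for illustration; the text above says only "repeatedly."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioAttempt:
    goal_achieved: bool  # did the communication work (table booked, directions obtained)?
    grammar_errors: int  # recorded for diagnosis, but not used to gate the milestone


def milestone_earned(attempts, required_successes=3):
    """True once the learner has achieved the scenario goal often enough.

    Grammar errors are intentionally not consulted here.
    """
    successes = sum(1 for a in attempts if a.goal_achieved)
    return successes >= required_successes


attempts = [
    ScenarioAttempt(goal_achieved=True, grammar_errors=2),
    ScenarioAttempt(goal_achieved=False, grammar_errors=0),
    ScenarioAttempt(goal_achieved=True, grammar_errors=1),
    ScenarioAttempt(goal_achieved=True, grammar_errors=3),
]
print(milestone_earned(attempts))  # True
```

The grammar errors are still stored on each attempt, so they can feed diagnosis and replay without blocking the milestone.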
The spacing effect — distributed practice produces more durable retention than massed practice — is one of the most replicated findings in cognitive psychology, and it holds for second-language learning specifically. Kim & Webb's (2022) meta-analysis of 48 studies (N = 3,411) finds spaced practice beats massed at a medium-to-large effect size, with longer spacing especially effective on delayed posttests. Suzuki (2017) extends this to L2 morphology; Saito & Chen (2025) find a roughly two-to-one effect advantage for spaced phonetic training over massed equivalents.
The mechanism, described by Cepeda et al. (2006) and elaborated in cognitive load research (Chen, Paas & Sweller, 2021), is straightforward: working memory depletes within a session; rest and sleep restore it and support consolidation. Massed practice can produce short-term fluency gains that don't survive the week. Daily-to-weekly intervals are the range where L2 spacing effects are clearest. Ren et al.'s (2023) pragmatics meta-analysis adds a lower bound: treatments under three hours produce the smallest pragmatic gains; treatments over eighteen hours produce the largest. Four to six sessions per week at fifteen minutes each comes to roughly four to six and a half hours per month, which is above the smallest-effects floor and well into the meaningful range.
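The session arithmetic is easy to check. The sketch below converts a weekly schedule into monthly hours, assuming an average month of 52/12 weeks; the function name and that averaging convention are choices made for this example.

```python
WEEKS_PER_MONTH = 52 / 12  # average calendar month, about 4.33 weeks


def hours_per_month(sessions_per_week, minutes_per_session=15):
    """Conversation time per average month, in hours."""
    minutes = sessions_per_week * minutes_per_session * WEEKS_PER_MONTH
    return minutes / 60


low = hours_per_month(4)   # four 15-minute sessions per week
high = hours_per_month(6)  # six 15-minute sessions per week
print(round(low, 1), round(high, 1))  # 4.3 6.5
```

At fifteen minutes per session, four sessions a week gives about 4.3 hours a month and six gives about 6.5, so the whole range sits above the three-hour mark.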
The implication for subscription design is uncomfortable but honest: unlimited conversation minutes do not align with what the literature supports. The evidence favors capped, distributed practice with ample time for the between-session modes — drills, shadowing, vocabulary review — that serve as spaced rehearsal between conversation sessions.
Sessions cap at fifteen minutes by design, not as a platform limitation. Drills, shadowing, and vocabulary review are uncapped in all tiers because they are the between-session spaced-rehearsal mechanism. Managed subscription tiers will cap conversation minutes but not practice modes. When the app says "see you tomorrow," it is a learning design choice. The retest loop — scheduling follow-up tasks for important patterns at two-to-four days and then one-to-two weeks — is built around the same intervals the spacing literature identifies as most effective for L2 retention.
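A minimal sketch of the retest loop's two windows follows. It assumes both windows are measured from the day the pattern was observed, which the text above does not specify; the constant and function names are hypothetical.

```python
from datetime import date, timedelta

# (earliest, latest) offsets in days from the original observation.
FIRST_RETEST_DAYS = (2, 4)    # two to four days
SECOND_RETEST_DAYS = (7, 14)  # one to two weeks


def retest_windows(observed_on):
    """Return [(start, end), (start, end)] date windows for the two follow-ups."""
    return [
        (observed_on + timedelta(days=lo), observed_on + timedelta(days=hi))
        for lo, hi in (FIRST_RETEST_DAYS, SECOND_RETEST_DAYS)
    ]


for start, end in retest_windows(date(2025, 1, 1)):
    print(start.isoformat(), "to", end.isoformat())
# 2025-01-03 to 2025-01-05
# 2025-01-08 to 2025-01-15
```

Using windows instead of fixed dates leaves room to place the follow-up task in whichever session the learner actually shows up for.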
Pronunciation instruction has a complicated meta-analytic profile. Saito & Plonsky's (2019) analysis of 77 studies finds reliable gains in controlled, feature-specific production — the learner improves on the target feature when asked to produce it deliberately. But transfer to spontaneous speech is, in their words, "relatively unclear." Saito & Akiyama's (2017) semester-long video-interaction study — the closest analog in the literature to melune's free-conversation loop — found gains in comprehensibility, fluency, and lexicogrammar, but not in accentedness or fine-grained segmental features in spontaneous production.
There is a technical constraint that reinforces this. Williams (2024) documents that forced-alignment tools — the standard acoustic measurement pipeline for phoneme-level scoring — degrade significantly on spontaneous speech: hesitations, voice qualities, and ASR transcription errors all introduce noise that makes reliable phoneme-level scoring impractical. The same pipeline that works well on scripted sentences performs much worse on natural, unplanned speech. Running the analyzer on free conversation would produce unreliable results even if the pedagogical rationale supported it — and the Saito & Plonsky evidence suggests the pedagogical return would be small anyway.
Pronunciation instruction does work in controlled, scripted contexts. Modes where the target utterance is known in advance — shadowing, read-aloud, drills, and the placement test's sentence-repetition phase — produce the kinds of gains the meta-analysis finds. The scope is narrow by design, not by limitation.
Pronunciation analysis in melune runs only in scripted modes: shadowing, read-aloud, drills, and the placement test's spoken phase. Free conversation, scenarios, and curriculum sessions do not run a pronunciation pass. This is explained in the app as a trust signal, not apologized for: "Pronunciation feedback comes from shadowing and drills, where it can be measured reliably. Free conversation focuses on grammar, vocabulary, pragmatics, and fluency." This also reduces per-session cost significantly and keeps the analysis pipeline honest about what it can and cannot claim.
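The scoping rule amounts to a gate on session mode. In the sketch below the mode identifiers are hypothetical strings chosen to mirror the modes named above; they are not melune's actual identifiers.

```python
# Modes where the target utterance is known in advance (hypothetical identifiers).
SCRIPTED_MODES = frozenset({"shadowing", "read_aloud", "drill", "placement_spoken"})


def runs_pronunciation_pass(mode):
    """Pronunciation analysis runs only when the expected utterance is scripted."""
    return mode in SCRIPTED_MODES


for mode in ("shadowing", "free_conversation", "scenario", "curriculum"):
    print(mode, runs_pronunciation_pass(mode))
# shadowing True
# free_conversation False
# scenario False
# curriculum False
```

Listing the scripted modes explicitly, and not the excluded ones, means any mode added later gets no pronunciation pass until someone opts it in.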
AI-enabled language assessment shows a positive effect in Chen et al.'s (2024) systematic review and meta-analysis — but effectiveness depends on implementation design, specifically on whether the system communicates uncertainty, grounds claims in evidence, and avoids overconfident labels. Suzuki et al.'s (2025) diagnostic speaking-assessment study found that AI-driven weakness identification plus contextualized feedback improved learning more than task repetition alone and increased learners' awareness of their weaknesses. The literature supports AI as a useful assessment signal. It does not support AI as an oracle.
Several challenges complicate proficiency measurement. Self-assessment correlates moderately with measured performance at the population level (Li & Zhang, 2019), but individual self-assessments are noisy and should not drive a public level label. One good session should not jump the public label — level changes should be based on converging evidence from multiple sources over time (Park et al., 2021). The AI scoring pipeline also needs internal validity checks: do session-level diagnoses agree with drill performance, vocabulary retention, and placement retake results? Until systematic agreement evidence exists, strong learner-facing claims require appropriate hedging.
Elicited imitation — sentence repetition as a proficiency proxy — has reasonable construct validity evidence for distinguishing proficiency levels across multiple studies (Yan et al., 2021). But published, validated EIT instruments were developed under specific psychometric conditions. A locally generated short form using the same design principles should be labeled "EIT-style" or "construct-inspired" until there is local validation evidence. melune's placement test uses this framing.
Dynamic assessment adds one more distinction worth preserving: the difference between what a learner can do independently and what they can do with a nudge (Anton, 2009; Kamrood et al., 2019). These are not the same proficiency. A learner who repairs a case error after a gentle clarification request is in a different instructional position than a learner who needs explicit metalinguistic explanation. Both states are useful diagnostic signals, but the public level label should wait for independent performance.
melune separates the internal adaptive estimate from the public proficiency label. The internal estimate updates every session; the public label moves slowly and only when the moving average clears a confidence threshold. Labels start hedged (for example, "B1 area, calibrating") and carry a visible confidence band. The basis for every claim is accessible to the learner: what evidence it rests on, how many observations support it, how fresh they are, and whether they came from spontaneous conversation, controlled practice, a drill result, or a self-check. Self-assessment is visible as evidence but never the primary source.
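The split between a fast internal estimate and a slow public label can be sketched as a moving average with a margin. Everything specific here is an assumption for illustration: the numeric level scale (for example A1 = 1 through C2 = 6), the window size, the margin, and the reading of "clears a confidence threshold" as "the moving average sits at least `margin` away from the current label."

```python
from collections import deque


class LevelTracker:
    def __init__(self, public_level, window=3, margin=0.75):
        self.public_level = public_level    # slow-moving label shown to the learner
        self.margin = margin                # how far the average must move to shift the label
        self.recent = deque(maxlen=window)  # per-session internal estimates

    def record_session(self, estimate):
        """Store the session estimate; shift the public label by at most one step."""
        self.recent.append(estimate)
        if len(self.recent) < self.recent.maxlen:
            return self.public_level  # not enough evidence yet
        average = sum(self.recent) / len(self.recent)
        if average >= self.public_level + self.margin:
            self.public_level += 1
        elif average <= self.public_level - self.margin:
            self.public_level -= 1
        return self.public_level


tracker = LevelTracker(public_level=3)
labels = [tracker.record_session(e) for e in (5.0, 3.0, 3.0, 4.0, 4.0, 4.0)]
print(labels)  # [3, 3, 3, 3, 3, 4]
```

In this run the single strong session (5.0) never moves the label on its own; the label shifts only after three consecutive sessions agree.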
Key papers cited on this page. The full research synthesis — including pedagogy, assessment, task design, and pronunciation — is maintained in the project's internal documentation. Citations here are limited to sources directly referenced in the sections above.