|
Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction |
Soundness: 3: good
Presentation: 3: good
Contribution: 2: fair
Rating: 4: marginally below the acceptance threshold
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
The paper targets speech‑to‑speech (S2S) dialogue and proposes a Turing‑style evaluation. It builds a dataset that includes human–human, human–machine, and pseudo‑human (TTS‑synthesized) conversations, and uses a game‑based human study to judge whether current systems “sound human.” The headline finding is that none of the evaluated systems pass. To explain why, the authors introduce a fine‑grained human‑likeness taxonomy and crowd annotations, showing that shortcomings lie in paralinguistics (rhythm, intonation, stress, fillers, breath), emotional expressivity, and a “mechanical persona,” rather than semantics. Off‑the‑shelf AI judges are unreliable; the authors therefore propose an interpretable evaluator that first produces ordinal scores on the taxonomy dimensions and then makes a transparent linear human‑vs‑machine decision.
1. Focused problem and clear protocol. The work targets the central question for speech‑to‑speech (S2S) systems: do they actually sound human in multi‑turn dialogue? Instead of testing isolated sub‑skills, the study frames evaluation as a Turing‑style decision under realistic interaction. The three‑part setup (human–human, human–machine, and a TTS‑based pseudo‑human control) gives a clean yardstick for what “human‑like” means. Bilingual coverage and multiple everyday topics reduce overfitting to any single style and make results more comparable across systems. The recording and interaction procedures are standardized, improving internal validity and making cross‑system contrasts meaningful.
2. Interpretable automatic judge. The proposed evaluator is intentionally two‑stage: it first maps dialogs to ordinal scores on the human‑likeness dimensions, then applies a linear decision with symmetry regularization (a minimal sketch of this two‑stage design is given after this list). This keeps the prediction space aligned with how humans actually rate speech (ordered categories), while the linear head provides transparent attribution: which dimensions pushed a sample toward “human” vs. “machine.” The design is modular, portable across collections, and produces diagnostics that engineers can act on (e.g., prosody shaping, disfluency modeling, persona calibration).
3. Memorable headline result. The paper lands a crisp, communicable takeaway: contemporary S2S systems still fail a Turing‑style test. That single sentence is easy for the community to remember and cite, and it reframes progress: sounding human is not simply a by‑product of better recognition or text generation. Because the result was obtained under a matched protocol with both human–human and synthesized controls, it carries weight beyond a one‑off demo and can serve as a reference point for future work.
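To make the attribution argument concrete, here is a minimal sketch (not the authors' implementation) of what such a two-stage judge could look like: ordinal scores per dimension via monotone cutpoints, followed by a transparent linear human-vs-machine head. All class and function names, shapes, and the stand-in "symmetry" penalty are assumptions for illustration only.

```python
# Minimal sketch of a two-stage interpretable judge (assumed design, not the paper's code):
# stage 1 maps a pooled dialog embedding to ordinal scores on K human-likeness dimensions
# via cumulative-link style thresholds; stage 2 applies a linear human-vs-machine head
# whose weights can be read as per-dimension attributions.
import torch
import torch.nn as nn

class OrdinalJudge(nn.Module):
    def __init__(self, embed_dim: int, num_dims: int = 18, num_levels: int = 5):
        super().__init__()
        self.latent = nn.Linear(embed_dim, num_dims)            # one latent score per dimension
        # ordered cutpoints per dimension, parameterized by a base plus positive gaps
        self.cut_base = nn.Parameter(torch.zeros(num_dims, 1))
        self.cut_gaps = nn.Parameter(torch.ones(num_dims, num_levels - 2))
        self.decision = nn.Linear(num_dims, 1)                   # transparent linear head

    def cutpoints(self) -> torch.Tensor:
        # monotone cutpoints: base, base+gap1, base+gap1+gap2, ...
        gaps = nn.functional.softplus(self.cut_gaps)
        return torch.cat([self.cut_base, self.cut_base + gaps.cumsum(dim=-1)], dim=-1)

    def forward(self, dialog_embedding: torch.Tensor):
        z = self.latent(dialog_embedding)                        # (batch, num_dims)
        cum = torch.sigmoid(z.unsqueeze(-1) - self.cutpoints())  # P(level > k) per cutpoint
        scores = 1.0 + cum.sum(dim=-1)                           # expected level in [1, num_levels]
        logit = self.decision(scores)                            # human (+) vs. machine (-)
        return scores, logit

def symmetry_penalty(model: OrdinalJudge) -> torch.Tensor:
    # assumed stand-in for the paper's symmetry regularizer: discourage the linear
    # head from concentrating its weight on a few dimensions
    w = model.decision.weight
    return ((w - w.mean()) ** 2).mean()
```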
1. Application‑heavy, limited theoretical novelty. The main novelty lies in system integration rather than theory. The core claim, that semantics alone cannot sustain effective speech interaction, is treated as an empirical observation, not a theoretical insight. Adding paralinguistic cues (prosody, affect, persona) targets known gaps, long discussed in TTS and affective computing. The work validates their importance but does not explain underlying mechanisms or interactions, nor does it offer a general theoretical framework.
2. HCI‑leaning narrative. The main text emphasizes HCI logistics (recruitment, demographics, task design) more than theory, ablations, and generalization, while skimming key ML details (architecture ablations, training dynamics, hyperparameter sensitivity). This balance may be a poor fit for ICLR.
3. Limited external generalization and statistics. Training and evaluation are tightly coupled to the same data collection protocol, raising distribution-shift concerns. Evaluation relies on a single-threshold binary decision without calibration, uncertainty quantification, or decision-boundary analysis. Missing ablations across acoustic conditions, speaker populations, and interaction settings leave generalization in doubt. Statistical reporting should add uncertainty intervals, significance tests, calibration curves, and failure-mode taxonomies to substantiate reliability and scope. More systematic ablations are also needed to clarify the theoretical footing of ODL, justify the 18‑dimension design, and strengthen label reliability.
Q1. Theoretical Positioning of ODL
What is the precise formal correspondence between ODL and classical ordinal regression models (cumulative-link, threshold models)? What does ODL add beyond these baselines in parameterization or inductive bias? Which properties are guaranteed by construction versus empirically observed?
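For reference, the classical cumulative-link (proportional-odds) baseline against which ODL should be positioned can be written as follows (standard formulation, not taken from the paper): for ordered levels $k = 1, \dots, K-1$ with thresholds $\theta_1 < \dots < \theta_{K-1}$ and features $x$,

$$P(Y \le k \mid x) = \sigma\!\left(\theta_k - \mathbf{w}^{\top} x\right), \qquad
P(Y = k \mid x) = \sigma\!\left(\theta_k - \mathbf{w}^{\top} x\right) - \sigma\!\left(\theta_{k-1} - \mathbf{w}^{\top} x\right).$$

Stating ODL relative to this form would make clear which parameterization choices and inductive biases are new.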
Q2. Necessity of 18 Dimensions
Why exactly 18 dimensions rather than a single score or a reduced, factorized set? Provide dimension-wise correlation analysis, clustering structure, and direct comparisons with (i) a single-score model and (ii) a low-factor model to demonstrate irreducibility.
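As an illustration of the kind of analysis requested (not from the paper; `ratings` and its contents are hypothetical placeholders), a dimension-wise correlation matrix plus a comparison of single-factor versus low-factor fits could be computed along these lines:

```python
# Illustrative sketch: check whether the 18 annotated dimensions are redundant by
# inspecting their correlation structure and how well low-factor models explain them.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=(500, 18)).astype(float)   # placeholder (num_dialogs, 18) annotations

corr = np.corrcoef(ratings, rowvar=False)                     # 18 x 18 correlation matrix
print("max off-diagonal correlation:", np.abs(corr - np.eye(18)).max())

for k in (1, 3, 6):                                           # single-score vs. low-factor models
    fa = FactorAnalysis(n_components=k, random_state=0).fit(ratings)
    print(f"{k}-factor average log-likelihood:", fa.score(ratings))  # higher = better fit
```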
Q3. Annotation Reliability and Expert Impact
Please report inter-rater reliability for each dimension. Quantify how expert edits change label distributions and downstream judge performance. Test measurement invariance across languages and subgroups. |
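A possible form of the requested reliability report (illustrative only; the variable `labels` and its layout are hypothetical) is the mean pairwise quadratic-weighted kappa per dimension:

```python
# Illustrative sketch: pairwise quadratic-weighted Cohen's kappa for ordinal labels.
# `labels` maps a dimension name to an (num_items, num_raters) array of 1-5 ratings.
import itertools
import numpy as np
from sklearn.metrics import cohen_kappa_score

def dimension_reliability(labels: dict) -> dict:
    kappas = {}
    for dim, mat in labels.items():
        pairs = itertools.combinations(range(mat.shape[1]), 2)
        scores = [cohen_kappa_score(mat[:, i], mat[:, j], weights="quadratic")
                  for i, j in pairs]
        kappas[dim] = float(np.mean(scores))   # mean pairwise weighted kappa per dimension
    return kappas
```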
Fully AI-generated |
|
Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction |
Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Rating: 2: reject
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. |
This paper presents a study testing whether current speech-to-speech (S2S) models, e.g., GPT-4o's advanced voice mode, can pass a Turing test of conducting human-like conversations. The authors first constructed S2S dialog datasets from human–model conversations, as well as speech synthesized with TTS models. Approximately 3K human judgements were then collected to measure the human-likeness of current S2S models along 18 defined dimensions, such as memory consistency and use of fillers.
The authors show that humans can easily identify human–model dialogs, i.e., current S2S models cannot pass the Turing test. They also show that off-the-shelf AI models cannot serve as judges for the test, whereas fine-tuning them on the authors' dialog datasets enables them to better judge human–model dialogs.
- This is the first evaluation of S2S models from the Turing test perspective, which is new and interesting.
- Evaluation details are well presented with clear takeaway messages, and the paper is easy to follow.
- It seems the conclusion, i.e., that current S2S models cannot pass the Turing test, is very much expected. For example, VoiceBench [1] has shown that current S2S models still largely lag behind their text counterparts. The main contribution then seems to be the "Turing test" framing. I think the authors should more clearly demonstrate how this paradigm provides more insight than existing S2S benchmarks like VoiceBench.
- The generalization ability of the judge model finetuned on the authors' datasets is unknown. It is very much expected that an in-domain finetuned judge (e.g., trained on the 18 evaluation dimensions) performs best when judging these dialogs. The authors should test the correlation of the judge model with humans on other real-world, out-of-domain (OOD) dialogs, beyond the pseudo-human datasets, to establish generalization (a minimal sketch of such a check appears after the reference below).
- Some claims seem contradictory (cf. Q2).
[1] VoiceBench: Benchmarking LLM-Based Voice Assistants
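As an illustration of the OOD check suggested above (not an experiment from the paper; all numbers are placeholders), the judge's scores could be correlated with human vote rates on held-out real-world dialogs:

```python
# Illustrative sketch: rank correlation between the finetuned judge's human-likeness
# scores and human votes on out-of-domain dialogs, aligned by dialog.
import numpy as np
from scipy.stats import spearmanr

judge_scores = np.array([0.82, 0.31, 0.65, 0.12, 0.91])   # judge's P(human) per dialog (placeholder)
human_votes  = np.array([0.75, 0.40, 0.70, 0.05, 0.95])   # fraction of humans voting "human" (placeholder)

rho, pval = spearmanr(judge_scores, human_votes)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```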
- Line 140: The dataset seems to contain both English and Chinese. Do all 28 participants from the 10 countries speak both languages?
- Lines 278-279 state that current S2S models are limited in aspects such as topic understanding, while Line 332 claims that S2S models have largely solved the foundational challenges of understanding and generating clear and coherent dialogue turns. Aren't these claims contradictory? How do you define semantic versus paralinguistic tasks? |
Fully human-written |
|
Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction |
Soundness: 4: excellent
Presentation: 4: excellent
Contribution: 4: excellent
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully. |
This paper performs a Turing test experiment with multiple speech-to-speech (S2S) models. Multiple versions of the experiment are tested using controlled conditions that attempt to minimize potential confounds. To collect more data, a mobile app is released as well. Conversations were also annotated along 18 dimensions. The results show key gaps in all S2S models. Using these insights, a smaller speech model is fine-tuned to predict these ratings and then classify whether a given dialog is human-to-human or human-to-S2S, attaining very high accuracy while still being interpretable.
- Clean experimental setup that controls for multiple confounds and tests a diverse suite of S2S models, including TTS models using LLM-generated text. Experiments demonstrate key gaps.
- Rich analysis of why models fail the test through a multifaceted analysis of conversation qualities. This analysis involved collecting human perceptions via crowdsourcing.
- Experiments to see if other audio models could pass the test, revealing key gaps in existing models. Proposes a new model and design to obtain an interpretable explanation, showing that these features (when correctly identified) are reliable indicators of gaps in current systems.
- Released platform and game to collect new annotations, helping grow data and potentially incorporate more models in the future
- Very well written paper with clear motivation, analysis, and visuals.
- The biggest gap to me was in the lack of details around the annotation for conversation qualities. These are barely mentioned in text, so I was expecting to see a much more detailed report in B.5. However, important questions are hard to answer, such as who annotated (which platform?), how many annotators were there, did annotators agree on these qualities, how much were annotators paid, or what quality controls were present, if any. Given the importance of this data for your results and later test-taking model, more details are needed to assess the quality of the data and for future replicability.
- The paper itself is very dense (though well written). However, the space constraint has pushed many details to the appendix which hinders readability at times.
- How was the annotation performed? (See the questions raised above.) |
Fully human-written |
|
Human or Machine? A Preliminary Turing Test for Speech-to-Speech Interaction |
Soundness: 3: good
Presentation: 4: excellent
Contribution: 3: good
Rating: 8: accept, good paper
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked. |
This paper introduces a Turing test study for speech-to-speech (S2S) dialogue systems, evaluating 9 LLMs against human speakers. Using a gamified online platform, the authors collect 2,968 human judgments across 1,486 dialogues in English and Chinese. None of the tested systems passed the Turing test, revealing a persistent gap in human-likeness. To understand the causes, the paper develops a taxonomy of 18 human-likeness dimensions, spanning semantic, paralinguistic, emotional, and persona-related traits. Crowd annotations show that while S2S systems perform near human levels in semantic coherence, they fall short in prosody, emotional expression, and conversational naturalness. Finally, the paper proposes an interpretable AI judge, which is a finetuned LLM that predicts human-likeness with strong transparency and accuracy, outperforming both humans and baseline AI judges, either prompted or LoRA-finetuned.
- This paper presents the first formal Turing test for S2S dialogue systems, extending evaluation beyond text to spoken interaction, which is an impactful direction given recent advances in conversational AI.
- The paper convincingly shows that the bottleneck of S2S dialogue systems is no longer semantic understanding but rather paralinguistic and emotional expressivity, which is an under-explored dimension in S2S research, offering valuable insights for improving S2S design.
- The interpretable AI judge is a standout contribution, which provides a reproducible and scalable framework for automatic S2S evaluation, with decent human-machine discrimination accuracy.
- The gamified Turing test platform may attract casual participants who do not perform the human-machine discrimination carefully. The paper does not clearly describe quality-control mechanisms for participants, such as attention checks or response-time filtering. This could bias the Turing test results.
- It is unclear how many failed Turing test cases arise because LLMs avoid human disfluency cues (or otherwise "fix" deficiencies of human speech). This is an easy-to-detect feature but a minor issue, since superior speech fluency may actually be preferred by users in real-time applications. It would be better to track the cause of each Turing test failure and analyze the rate of such minor causes versus more substantive ones.
- Are there any additional discussions or experimental results that address the above weaknesses?
- Would shorter dialogues (e.g., 20-second versus 60-second dialogues) have a better chance of passing the Turing test? |
Fully human-written |