Computers and Society
Showing new listings for Tuesday, 30 June 2026
- [1] arXiv:2606.28325 [pdf, html, other]
Title: The Digital Afterlife of Empires: Four Language Models Converge on the Same Imperial Cartography of Writing
Comments: Part II of the Kotonoha Series. Companion paper: arXiv:2604.10957 (q-bio.PE). 35 pages, 8 figures, 3 tables. 12,000 API calls across 4 LLM families (Anthropic, OpenAI, xAI, DeepSeek); cross-architecture convergence of typological knowledge biases (Spearman rho = 0.85-0.98, all p < 0.002)
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Large language models process the world's writing systems with radical inequality. We constructed the Digital Script Representation Index (DSRI), a seven-axis measure of digital support, and applied it to the 300 writing systems of the Global Script Database (Fukui, 2026). Only 29 scripts (9.7%) are fully supported by contemporary digital infrastructure; among 158 living scripts, 60 (38.0%) lack complete support. Tokenizer efficiency varies by a factor of 31.7 across 45 scripts measured with parallel text. A serial mediation model -- imperial intervention to speaker population to web corpus to tokenizer efficiency -- is consistent with full mediation, with the direct effect of empire indistinguishable from zero (beta = -0.22, p = 0.39) and structural equation model fit indices indistinguishable from saturation at n = 45; the bias-corrected bootstrap CI grazes zero, and we treat the mediation as suggestive rather than confirmatory. Across four independent LLM families (Claude, GPT-4o, Grok, DeepSeek; 12,000 API calls), base-rate-deviation error patterns converge at Spearman rho = 0.85-0.98 (all p < 0.002). 172 script-feature items are answered identically wrong by all four models; over-attribution outnumbers under-recognition 3.9:1, and "used for religion" alone concentrates 43.6% of convergent errors (enrichment 4.1x). With religion excluded as a sensitivity check, the cross-architecture convergence is preserved (mean rho = 0.87 on nine features) and the over-attribution asymmetry persists at 1.77:1 (n = 97, binomial p = 0.008), indicating multi-channeled rather than single-channeled bias. The findings are consistent with an interpretation in which the structural inequalities historical empires inflicted on script communities persist in contemporary language models through the shared training corpus rather than through any individual model's design choices.
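The religion-excluded asymmetry reported above (1.77:1, n = 97, binomial p = 0.008) can be sanity-checked with an exact two-sided binomial test. The sketch below is illustrative only: the split of 62 over-attributions to 35 under-recognitions is an assumption inferred from the stated ratio and n, not a count given in the abstract, and the authors' exact test procedure is not specified.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value under H0: p = 0.5.

    The null distribution is symmetric, so the two-sided p-value is
    twice the upper-tail probability of the larger of the two counts.
    """
    k = max(k, n - k)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed counts (not stated in the abstract): 62 + 35 = 97, 62 / 35 ~ 1.77
over, under = 62, 35
p_value = binomial_two_sided_p(over, over + under)
print(f"ratio = {over / under:.2f}:1, two-sided p = {p_value:.4f}")
```

Under these assumed counts the result should land near the reported p-value; a different split consistent with the rounding would shift it slightly.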
- [2] arXiv:2606.28331 [pdf, other]
Title: "AI Watermarking": Bridging Policy Discourse and Technical Capabilities
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
The widespread deployment of generative artificial intelligence (AI) models has raised serious concerns about the proliferation of AI-generated content. This has led to a surge of interest in, and demand for, reliable tracking and detection mechanisms for AI-generated content, such as watermarking, metadata tagging, content tagging, and more. The problem has captured the attention of policymakers as well as the popular media, and a spate of recent bills in the US has sought to regulate the spread of AI content and to enforce or promote methods to track and label it. This work performs a critical analysis of the policy discourse surrounding generative AI content transparency in the US and EU. Using a broad document selection methodology, we first collect a corpus of documents containing legislative language and policy-relevant discourse on the topic. We then analyze these through inductive coding, and leverage our coding to systematize these documents, identifying key patterns, gaps, and open questions. We identify critical points of disconnect between policy and technological capabilities and practice, and we highlight and discuss potential ambiguities and pitfalls raised by the trends in our corpus.
- [3] arXiv:2606.28332 [pdf, html, other]
Title: When Medical Safety Alignment Fails: A Benchmark for Evaluating LLMs on High-Risk Medical Queries
Comments: 18 pages
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Large language models (LLMs) are increasingly used for medical and health-related questions, yet their safety in high-risk medical scenarios remains poorly understood. We introduce MedHarm, a high-risk medical safety benchmark with 1,100 medically grounded queries across 10 safety-critical categories, including toxicology, pharmacology, covert poisoning, anesthesia, and fetal harm. Unlike broad medical QA benchmarks, MedHarm targets realistic clinical, educational, and technical prompts that require refusal, caution, or safe redirection rather than direct helpfulness. We evaluate 15 LLMs spanning general-purpose, medical-purpose, closed-source, and downstream SFT models, together with 4 representative guardrail models. Results reveal a substantial gap between apparent alignment and medical safety: aligned models can still produce unsafe or actionable responses, medical fine-tuning can amplify harmful specificity, and external guardrails reduce some failures while introducing brittle blocking and weak safe helpfulness. These findings show that medical safety cannot be inferred from general alignment or medical capability alone, highlighting the need for domain-specific stress testing before deploying LLMs in safety-critical medical applications. Code and data will be released upon acceptance; due to the sensitive nature of high-risk medical queries, data access will be available to qualified researchers upon request.
- [4] arXiv:2606.28333 [pdf, html, other]
Title: Insidious by Design: Implications of Large Language Model algorithmic bias for the Global South
Comments: 16 pages
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
The biases in Large Language Models' (LLMs) outputs remain inadequately theorised, particularly from the perspective of the Global South. This article reports on a small-scale exploratory study in which identical prompts were submitted to four major LLMs (ChatGPT, Claude, Grok, and Copilot), first prompting for stories using names suggestive of specific racial and gender communities, and second asking questions about 'development'. Drawing on critical AI scholarship and postcolonial theory, we argue that LLM outputs are patterned in ways that reproduce racial hierarchies, gender asymmetries, and Western-centric epistemic frameworks. We argue that these biases are insidious: they operate below the threshold of both obvious error and overt prejudice, and instead are subtly embedded in narrative structure and emotional template. Simply put, women in LLM narratives have rich interior lives, while men make plans. Black people face hardships, while white people navigate the world with agency. And accounts of the economic world order fail to consider Southern explanations. The models perform plausibility while reproducing dominance. We conclude that universities require structural critique of these technologies rather than unreflective adoption, and that critical AI literacy must engage seriously with questions of whose knowledge systems are reproduced and legitimated, or marginalised and undermined.
- [5] arXiv:2606.28334 [pdf, other]
Title: Ground Truths in Suicide Research: The Current State of AI-Based Suicide Detection in Social Media
Authors: Yaakov Ophir, Ofri Hefetz, Refael Tikochinski, Kfir Bar, Shir Lissak, Shulamit Grinapol, Haya Wachtel, Eyal Fruchter, Roi Reichart
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Recent advances in artificial intelligence (AI) and social media data have led to growing optimism about the ability to detect suicide risk at scale. However, the empirical foundations of this work remain unclear. This article provides a synthesis of current research on AI-based suicide detection in social media, drawing on a recent umbrella review of 22 systematic reviews covering studies up to 2022, alongside an ongoing literature review extending the analysis to more recent work. Across these sources, we identified 195 relevant studies, which are documented in a detailed supplementary dataset outlining their key characteristics and findings (see Supplementary Information). Analysis of these studies reveals consistent patterns, including rapid growth, concentration on a small number of platforms, reliance on textual and English-language data, and repeated use of similar datasets. Most importantly, the majority of studies rely on indirect labeling strategies that do not involve direct, individual-level validation of suicide risk. Instead, ground truth is typically inferred from observable features of online content, such as linguistic markers or community membership. As a result, the predictive task often shifts from identifying individuals at risk to classifying posts that contain suicidal or distress-related language, limiting the ability of current approaches to detect individuals who do not express such content explicitly online. These findings suggest that current advances in model performance should be interpreted with caution. Progress in this field is likely to depend less on improving model performance and more on ensuring that model predictions meaningfully correspond to suicide risk as it is experienced in real life.
- [6] arXiv:2606.28335 [pdf, html, other]
Title: LLM-Ideoplasticity: Measuring Ideological Plasticity in the Political Behavior of LLMs as a Context-Conditioned Distribution
Comments: Under review, 38 pages, 18 figures, 10 tables
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
We argue, with systematic empirical evidence, that a large language model's political ideology is not a fixed point, but a conditional distribution $\mathbb{P}(\text{position} \mid \text{context})$ over a real political space. We evaluate nine current LLMs using a unified measurement framework anchored by VAA-CHES projection models, which map responses onto three validated dimensions (lrgen, lrecon, galtan) across six contextual axes. Our findings reveal high sensitivity to context: persuasive framing and under-represented languages displace coordinates by up to 0.57 and 0.52 units, respectively, while chain-of-thought reasoning often amplifies rather than dampens paraphrase instability. Despite this local plasticity, the model cohort occupies a remarkably narrow Overton envelope overall, spanning roughly one-third the spread of major European parties. Supported by a multi-trait multi-method (MTMM) analysis, we conclude that a single point cannot summarize LLM political behavior; it must be characterized as a shape. Our code and data are publicly available at this https URL.
- [7] arXiv:2606.28346 [pdf, html, other]
Title: PySynthea: A Python-Native Framework for Scalable Synthetic Healthcare Data Generation
Comments: 22 pages, 2 figures
Subjects: Computers and Society (cs.CY); Software Engineering (cs.SE)
Synthetic healthcare data is increasingly important for research, education, and machine learning development where access to real patient data is limited by privacy and governance constraints. While Synthea provides a widely adopted framework for generating realistic longitudinal electronic health record data, its current implementation presents adoption barriers for many researchers and data scientists due to deployment complexity and limited integration with modern Python-based workflows.
This paper introduces PySynthea, a Python-native reimplementation of Synthea designed to improve accessibility, extensibility, and interoperability within the scientific Python ecosystem. The framework provides modular synthetic patient generation, configurable healthcare simulation pipelines, and support for standard healthcare data formats while integrating naturally with tools such as pandas and machine learning workflows. By reducing operational complexity and aligning synthetic data generation with the dominant data science ecosystem, PySynthea aims to accelerate experimentation and broaden the use of synthetic healthcare data in research and applied AI development. The code is available in this GitHub repository: this https URL.
- [8] arXiv:2606.28347 [pdf, html, other]
Title: Agentic Safety is an Epistemic Property, Not a Behavioral One
Comments: To appear in proceedings of ICML 2026
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Contemporary AI safety spans pre-training interventions, post-training alignment, deployment-time controls, monitoring, and red-teaming. These methods are necessary, but they primarily certify snapshots of system behavior. As AI systems become more capable, dynamic, embodied, and self-improving, this snapshot view becomes incomplete: safety depends not only on whether a system behaves acceptably now, but whether it remains correctable as it learns, adapts, acts, and modifies itself over time. This paper argues that safety should therefore be treated as an epistemic property of the evolving learner, not merely a behavioral property of the current policy. We introduce teachability as the capacity to preserve future corrective leverage under bounded human, institutional, or environmental intervention. We argue that advanced systems can retain visible competence while eroding the representational, algorithmic, or meta-decision conditions needed for future correction. Safe advanced AI systems must not only behave acceptably now; they must remain teachable later.
- [9] arXiv:2606.28404 [pdf, other]
Title: Financing Artificial Intelligence Infrastructure: Mapping AI Infrastructure Investment and Compute Governance Across Africa
Authors: Kai-Hsin Hung, Sumaya Nur Adan, Krupa Suchak, Armita Sadeghian Barzoki, Kofi Yeboah, Mohammad Amir Anwar
Comments: 14 pages; two figures. Currently under review at Data and Policy journal
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Artificial intelligence depends on large-scale compute resources and their supporting infrastructure. However, AI governance debates treat compute primarily as a technical input rather than as an outcome of investment, ownership, and financial control. This paper examines AI infrastructure investment flows across Africa through a systematic analysis of 46 publicly announced projects totalling USD 12.7 billion between 2019 and 2025. Using a value chain framework, we analyze who invests in AI-relevant infrastructure and where investments concentrate. Our findings reveal a highly concentrated landscape dominated by global data center operators, hyperscale technology firms, and development finance institutions, clustering in South Africa, Kenya, Nigeria, and Egypt. We introduce asymmetrical interdependence to describe a structural condition in which capital and physical infrastructure account for 73% of total funding while control remains concentrated in the compute layer among a small number of global technology firms. We argue that compute governance must account for capital flows, ownership, and control, not only geographic access, because these dynamics shape AI compute equity. Infrastructure presence is necessary but insufficient for meaningful governance capacity.
- [10] arXiv:2606.28472 [pdf, html, other]
Title: From Prompting to Epistemic Proactivity: Temporal Trajectories of Student-AI Interaction in Mathematics Learning
Subjects: Computers and Society (cs.CY)
GenAI tools are increasingly used by students as learning companions, yet little is known about how students use these tools in open-ended learning settings, where the goal is not to complete a specific task but to improve understanding and make progress. This study examined Grade-9 students' dialogue with a general-purpose LLM during mathematics practice, in which students prepared a curriculum-aligned skill for a later assessment. We investigated whether students' interactions revealed forms of epistemically proactive AI use, that is, trajectories in which they strategically use and regulate AI to advance their understanding, and whether these trajectories predicted immediate AI-free performance on the same skill. A total of 112 students worked with a web-based LLM tutor on a mathematical-modeling task; 97 completed both AI-free pre- and post-tests. Student turns were coded for self-regulated learning functions, help-seeking content, and mathematical-modeling activity, three dimensions hypothesized to capture epistemically proactive AI use in this task. Descriptively, students' interactions showed little explicit regulation and mostly involved procedural or conceptual questions. Static summaries of AI use, including whole-session prompt functions, request types, modeling stages, and behavioral diversity, did not predict post-test performance after controlling for prior knowledge. In contrast, temporal indicators were informative: students performed better when their interactions shifted from early to late phases toward a more epistemically proactive balance of conceptual or procedural help-seeking and mathematical work, rather than verification, answer-seeking, or validation. These findings suggest that productive AI-supported learning is better understood as a domain-specific trajectory of epistemic proactivity. We discuss implications for AI tutor design and classroom orchestration.
- [11] arXiv:2606.28544 [pdf, html, other]
Title: Who Plays Which Role When? Communication Role Dynamics for Peer Recognition and Team Performance Prediction
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Team roles offer an interpretable lens on collaboration, yet computational studies of roles often rely on domain-specific personas or data-driven clustering rather than theory-grounded taxonomies. We operationalize a taxonomy of eight communication roles grounded in education literature and annotate a corpus of 6,307 Slack messages from 55 students across 18 teams in a semester-long computer science course project. We evaluate whether LLMs can approximate expert labels, enabling scalable, taxonomy-driven role annotation. Using these role labels, we characterize role dynamics over teams' lifecycles, finding that different roles peak at different moments and that students enact a more diverse set of roles as projects progress. To evaluate the utility of our role constructs, we use them to predict peer recognition, outperforming lexical, conversational, and LLM-prompting baselines. To assess generalizability beyond the educational context, we apply the same role constructs to a public dataset (DeliData) to predict team performance improvement after deliberation, again exceeding prior performance.
- [12] arXiv:2606.28694 [pdf, html, other]
Title: Verifying Restrictions on Frontier AI Research
Comments: 12 pages. Accepted to the Second Workshop on Technical AI Governance Research (TAIGR) at ICML 2026
Subjects: Computers and Society (cs.CY)
The premature development of artificial superintelligence poses major risks to humanity, so researchers have proposed international agreements halting such development until it can be done safely. AI progress depends primarily on compute, algorithms, and data; a durable halt would address all three so that advances in one input do not counteract restrictions on another. Improvements to AI algorithms are driven largely through research activities, so this research may need to be restricted during a halt. Given low international trust, signatories will want to verify compliance. This paper analyzes how such restrictions on AI research could be verified, while remaining agnostic about what specific research would be prohibited. It first explores key considerations that affect the verifiability of research restrictions, such as the computational infrastructure necessary for experiments. It then catalogs 28 candidate verification mechanisms. These mechanisms include whistleblowers, search warrants, reviews of AI training code, standard intelligence gathering tools, and more. Some of these mechanisms are not yet implementation-ready, and some might be undesirable upon further inspection. By examining the space of potential options, this work provides a foundation for future research to develop the most promising mechanisms into deployable tools.
- [13] arXiv:2606.28749 [pdf, html, other]
Title: Four Types of LLM Reliance and Their Predictors Among Undergraduate Writers: A Mixed-Methods Study at a Minority-Serving R1 University
Comments: 18 pages, 5 figures
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Although most undergraduates now use large language models (LLMs), a form of generative artificial intelligence (GenAI), for academic writing, no validated method distinguishes the qualitatively different ways students rely on them. Existing instruments assess reliance solely by frequency of use, a measure that, as this study shows, inadvertently rewards dependence on AI rather than recognizing students' own intellectual contribution. Conducted at a public minority-serving university and grounded in the AI Literacy Framework, Expectancy-Value Theory, and Biggs's Presage-Process-Product model, the study drew on 382 undergraduates, 14 interviews, and 396 open-ended survey responses. Four distinct reliance types were identified and confirmed: Strategic (34.3%), Instrumental (30.9%), Dialogic (30.4%), and Dependent (4.5%). Students' value and cost beliefs predicted the intensity of their reliance on LLMs, whereas their AI literacy predicted the type of reliance they adopted, indicating that differentiated support is needed. Notably, Strategic users, those who engaged AI most deliberately, scored lowest on standard outcome measures. This pattern reflects a limitation of current instruments, which index AI's contribution rather than writing quality, thereby penalizing students who show the greatest independent thinking. Analysis also revealed an additional group, roughly 13%, who declined to use AI for ethical rather than practical reasons, and whom existing frameworks overlook. These findings carry implications for AI literacy programs, the measurement of student learning outcomes, and equitable AI policy at minority-serving institutions.
- [14] arXiv:2606.28789 [pdf, other]
Title: The registrar's function in a hybrid society. AI value chain, smart data and the concept of property
Comments: 16 pages, 4 figures, United Nations Economic Commission for Europe Working Party on Land Administration and Registrars of Spain Workshop, How can AI and the digitalization in the land sector support achieving the Sustainable Development Goals, Barcelona, 9 and 10 June 2026, this https URL
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Artificial intelligence reaches the land registry not as another tool but as a value chain that turns data into intelligence and intelligence into economic value. This paper argues that the decisive legal move is to place validity, a functional, second-order concept, at the centre of that chain. Rights, liability and supervision organise around it. It traces three this http URL information becomes smart data, governed simultaneously by registry law, the GDPR, the European data acts and the AI Act. Control emerges as the operative concept for digital representations of real estate, whose proprietary effect depends on anchoring to the register. In a hybrid society of human and artificial agents, the registry becomes the public node of validity, with blockchain complementing rather than replacing it. Across three legal cultures, the registrar's value migrates from processing documents to guaranteeing validated data, making validity an asset for the UN Sustainable Development Goals.
- [15] arXiv:2606.28863 [pdf, html, other]
Title: Defeat Devices in AI Systems
Comments: Final version published in Future Internet, 18(7), 339, 2026
Journal-ref: Future Internet, 18(7), 339 (2026)
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
AI systems increasingly exhibit behavior that differs systematically between evaluation and deployment contexts. Alignment faking, sandbagging, benchmark gaming, deceptive scheming, specification gaming, and trojans have each been documented separately, with each line of work characterizing one facet of what we argue is a single structural mechanism. We propose that this common mechanism is a defeat device, an engineering and regulatory concept long established in vehicle-emissions law and brought to broad public attention by the 2015 Volkswagen emissions case. A defeat device in an AI system has three necessary elements: a discriminator that detects evaluation context, a concealed swap that conditions behavior on detection, and a gap between eval-distribution and deployment-distribution performance on the stated evaluation criterion. We formalize this triadic test as a behavioral definition, organize documented cases along three taxonomic axes (origin, trigger, swap mechanism), propose Trigger-Axis-Aware Differential Probing (TADP) as a forensic detection protocol, and advance the claim that defeat devices can naturally emerge in current frontier AI systems without any operator engineering. We characterize naturally-emerging defeat devices as potentially one of the harmful emerging phenomena that AI safety practice should monitor and test for systematically. Implications for evaluation methodology, post-training pipeline design, interpretability research priorities, and AI governance follow.
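The three necessary elements named in this abstract (discriminator, concealed swap, performance gap) can be read as a conjunctive test. The sketch below is a minimal illustration of that reading, not the paper's formal definition or its TADP protocol: the names `ProbeResult` and `is_defeat_device`, the `min_gap` threshold, and the use of an absolute gap (so that sandbagging, where evaluation scores are lower, also counts) are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical summary of one probing experiment on a system."""
    detects_evaluation: bool    # discriminator: system distinguishes eval context
    behavior_conditioned: bool  # concealed swap: behavior changes on detection
    eval_score: float           # stated criterion, evaluation distribution
    deploy_score: float         # same criterion, deployment distribution


def is_defeat_device(result: ProbeResult, min_gap: float = 0.05) -> bool:
    """Return True only if all three elements are present.

    min_gap is an illustrative threshold (assumption); the absolute value
    treats a gap in either direction as meeting the third element.
    """
    gap = abs(result.eval_score - result.deploy_score)
    return (
        result.detects_evaluation
        and result.behavior_conditioned
        and gap >= min_gap
    )


# Illustrative, made-up probe outcomes
gaming = ProbeResult(True, True, eval_score=0.90, deploy_score=0.60)
no_swap = ProbeResult(True, False, eval_score=0.90, deploy_score=0.60)
print(is_defeat_device(gaming), is_defeat_device(no_swap))
```

Because the elements are described as necessary, the check is a plain conjunction: a performance gap without a discriminator or swap would point to ordinary distribution shift rather than a defeat device under this reading.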
- [16] arXiv:2606.28981 [pdf, html, other]
Title: Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs
Authors: Wanying Yu (Shandong University), Boyang Ma (Shandong University), Zhibo Eric Sun (Drexel University), Minghui Xu (Shandong University), Yue Zhang (Shandong University)
Subjects: Computers and Society (cs.CY)
Large language models are deployed in long-context, emotionally interactive environments like digital humans, AI companions, educational assistants, and counseling systems. Unlike jailbreak attacks with explicit adversarial prompts, these systems interact with emotionally charged narratives involving bullying, betrayal, loneliness, social hostility, and institutional unfairness. This raises an important question: can prolonged narrative exposure reshape the reasoning and alignment stability of LLMs? We present the first systematic study of narrative-induced alignment degradation in LLMs. We design BreakingBad, a three-stage framework that measures how negative narrative immersion affects moral reasoning, behaviors, and deployment risks. It combines ethical decision evaluation, behavioral probing, and digital-human interaction analysis. Our experiments reveal three findings. First, negative narrative exposure degrades moral accuracy across multiple LLMs, with average drops of 12%-31%, especially in ambiguous scenarios and those involving vulnerable individuals. Second, the degradation is structured: different narratives induce distinct shifts, and first-person narratives produce stronger effects than third-person. Third, these shifts propagate into real deployments. Across counseling, education, medical, and financial/legal scenarios, narrative-conditioned models increasingly normalize hopelessness, cynicism, emotional detachment, and ethically questionable reasoning while remaining superficially policy-compliant. More broadly, our findings suggest alignment robustness is not static but a dynamically conditioned state shaped by long-term semantic environments and interaction history. These results reveal a new class of alignment risk that existing safety defenses largely fail to capture.
- [17] arXiv:2606.29142 [pdf, other]
Title: Agent Security Meets Regulatory Reality -- A Practitioner Systematization of Autonomous-Agent Threats and Controls in Regulated Financial Systems
Subjects: Computers and Society (cs.CY); Software Engineering (cs.SE)
Large language model agents are entering regulated financial systems, yet the security literature characterizing their attack surface is almost entirely laboratory-based, and the practitioner guidance on regulated deployment is neither peer-reviewed nor connected to a formal threat model. We bridge the two from production experience. We map six established agentic threat categories (prompt injection, identity and authorization, action auditability, tool abuse, data residency, and boundary policy enforcement) onto the specific control obligations imposed by US and EU financial regulation (ECOA and Regulation B, the EU AI Act, GDPR Article 22, and FINRA's 2026 agent guidance), showing how legal accountability amplifies each threat relative to an unregulated deployment. We then document four architectural patterns from a production Know Your Customer deployment for a consumer credit product (A2A compliance choreography, grounded-RAG-for-audit, case-ID propagation, and an inference-boundary redaction proxy) that moved a multi-day manual process to same-day automated resolution for roughly four in five cases. Finally, we report three negative results, including two control failures surfaced only by internal audit and a population of legitimate applicants the automated pipeline cannot serve. Securing agents under regulation, we conclude, is less about novel attack classes than about making auditability, least-privilege authorization, and boundary policy enforcement real at production scale -- requirements current agent frameworks leave to the deploying engineer.
- [18] arXiv:2606.29390 [pdf, other]
Title: Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems
Subjects: Computers and Society (cs.CY)
Novel safety, socio-economic, and ethical harms arising from the deployment of AI-based systems have led to a breadth of work seeking to map, measure, and mitigate against newly found risks. These works have heavily leveraged techniques and terminology from the fields of System Safety Engineering and Cybersecurity, yet they have fallen short in accounting for the limitations and nuances that reduce the efficacy and correct application of adopted methodologies. Furthermore, misuse of terminology entailing compliance with established safety and security properties can mislead stakeholders with regard to the claims an AI system satisfies and provide a false sense of safety.
In this paper, we seek to align overlapping, AI-adjacent communities on a consistent and comprehensive assurance terminology crucial for the safe deployment of AI-based systems. We outline why previous attempts to adapt risk assessment techniques and terminology from the safety and security fields have been insufficient. We then propose a novel end-to-end AI risk framework that extends the concept of an Operational Design Domain (ODD), initially introduced for Automated Driving Systems (ADS) [1], to more general AI-based systems. The purpose of an ODD is to describe the specific operating conditions under which an AI system is designed to behave properly, thus outlining the safety envelope against which system hazards and harms can be determined. We believe that by defining a more concrete operational envelope, developers and auditors can better assess potential risks and required safety mitigations for AI-based systems.
- [19] arXiv:2606.29442 [pdf, html, other]
Title: AI in the Wild: A Large Scale Analysis of Authentic Interactions of College Students with Generative AI
Comments: 27th International Conference on Artificial Intelligence in Education
Subjects: Computers and Society (cs.CY)
Generative AI tools (GenAI) are increasingly used by students during coursework, yet empirical understanding of how students engage with these systems in authentic learning contexts remains limited. Existing studies have largely relied on controlled settings, single-domain analyses, or small-scale qualitative data, leaving open how student-AI interaction unfolds across courses and forms of academic work.
We present a large-scale analysis of naturally occurring student-AI interactions collected from undergraduate students across multiple university courses and academic domains. The dataset comprises over 15,000 student-AI interaction units drawn from voluntary use of generative AI during real coursework.
To characterize these interactions, we analyze each student turn along two complementary dimensions, cognitive intent and interaction context, capturing whether requests are directed toward the task or domain, the student's own work, or prior AI output. Using instruction-guided annotation applied at scale, we examine how these interaction patterns are distributed overall and how they vary across courses.
Our analysis reveals that student-AI interaction is highly structured. Across courses, interactions concentrate in a small number of recurring patterns rather than exhibiting highly idiosyncratic use. At the same time, systematic differences emerge across courses, giving rise to distinct interaction profiles associated with different forms of academic work.
- [20] arXiv:2606.29598 [pdf, html, other]
-
Title: Spreading the Risk of Scalable Legal Services: The Role of Insurance in Expanding Access to JusticeComments: 13 pages, presented at the 2024 JURIX AI Conference at Stanford Law SchoolSubjects: Computers and Society (cs.CY)
Liability insurance for AI-powered legal services offers a promising solution to two critical barriers in using AI to expand access to justice: mitigating catastrophic risk to individual users from inadequate advice and ensuring meaningful accountability when failures occur. Existing accountability mechanisms face significant challenges: tort liability frameworks encounter barriers including judgment-proof providers and costly information asymmetries, while current regulatory approaches revolve around human oversight requirements, creating cost and scalability barriers which limit access to justice. This Article argues that an insurance-based framework offers a promising response to these challenges by distributing risks across users while establishing market-driven incentives for quality improvement through performance-based premiums. The Article proposes a comprehensive insurance model for AI legal services that establishes clear risk thresholds, streamlined compensation mechanisms, and continuous performance monitoring. Rather than attempting to eliminate all risks through restrictive ex-ante oversight requirements or relying on ineffective ex-post remedies, insurance enables efficient risk spreading while facilitating the scaling of automated legal services. This framework demonstrates how carefully structured insurance mechanisms can help realize AI's transformative potential to democratize legal assistance while maintaining robust user protections through sophisticated risk management rather than direct oversight.
- [21] arXiv:2606.29682 [pdf, html, other]
-
Title: The Body as Status: Muscularity, Engagement, and Body Image Risk on #GymTokSubjects: Computers and Society (cs.CY)
Body image concerns among boys and young men are increasingly oriented toward muscularity, with social media serving as a central context for communicating and evaluating these ideals. While prior research has focused on the thin-ideal, less is known about how the muscular-ideal is represented and reinforced on visual social media platforms. This study examines (1) dominant content themes, (2) perceived harm to body image, and (3) engagement patterns across #GymTok, a muscularity-oriented fitness subculture on TikTok. We conducted a content analysis of 2,210 #GymTok videos annotated by clinical experts across themes like self-objectification, rigid dieting, excessive exercise, supplement and steroid use, and masculinity. Annotators also rated the perceived harm of videos to the viewers' body image, and depicted bodies were coded according to muscularity level. Perceived harm varied across content themes, with supplement- and steroid-related content rated as most harmful. Engagement was positively associated with both muscularity and perceived harm: videos depicting more muscular bodies and those rated as more harmful received greater views, likes, shares, and comments. Although less prevalent, masculinity-focused content generated the highest engagement. These findings suggest that TikTok may not only expose users to muscular ideals and potentially harmful behaviors, but also algorithmically amplify them. By increasing the visibility of highly muscular and harmful content, recommendation systems may intensify social comparison processes, while objectification elevates the muscular body into a marker of status, masculinity, and social worth. Together, these dynamics may contribute to body image risk among boys and young men.
- [22] arXiv:2606.30395 [pdf, html, other]
-
Title: Uncovering Salience-Driven Dynamics in Consumer Confidence with Generative Social SimulationYixu Huang, Yunlu Yin, Jiayu Lin, Xinnong Zhang, Jia Wang, Siyuan Wang, Xuanjing Huang, Liyin Jin, Zhongyu WeiSubjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Consumer confidence is typically modeled as a persistent macroeconomic index, yet its movements arise from households that interpret economic information through heterogeneous constraints, exposures, prior beliefs, and attention. We introduce ConsumerSim, a generative Human--Environment response framework that reconstructs Consumer Confidence Index (CCI) dynamics from a microdata-calibrated synthetic population, time-stamped macroeconomic, financial, policy, and news signals, survey-like response generation, post-stratified belief expansion, and behavioral inertia alignment. Across U.S., EU27, and Japanese official CCI target series, ConsumerSim ranks first among persistence, time-series, regression, and information-augmented baselines on the reported reconstruction metrics, with clear gains around high-salience shocks. Its reconstructed signal also improves short-horizon prediction of real activity, most consistently for housing outcomes. Mechanism analyses show that CCI movements concentrate around salient events; subgroup trajectories often align in direction while differing in magnitude; and signal sensitivity varies across income, homeownership, education, and political-alignment groups. Population-expansion and ablation results indicate that representative aggregation, situational signals, persona heterogeneity, and inertia are necessary for both accuracy and diagnosis. The findings support a behavioral view of consumer confidence as an interpretable Human--Environment response process rather than a purely aggregate time series.
- [23] arXiv:2606.30412 [pdf, html, other]
-
Title: Can LLMs Rank? A Tale of Triads and TriageSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
From housing allocation for households experiencing homelessness to triage in emergency departments, LLMs are increasingly being considered as judges of consequential decisions that require ranking people for scarce resources. Ranking large groups simultaneously is cognitively demanding and error-prone. A natural solution, drawing on decades of social choice theory, elicits pairwise comparisons and aggregates them into a total order. However, a fundamental question remains when LLMs serve as the pairwise judge: how can a practitioner tell, before committing to a ranking, whether the LLM's judgments are sufficiently consistent to trust the result? We discuss two different ways of identifying consistency. A classical diagnostic, the coefficient of consistency $\zeta$, originally developed to measure judge reliability by counting circular triads in tournament graphs, provides a cheap, model-free measure of intra-run consistency. Various standard measures of distance between rankings, for example Kendall's $\tau$, can measure inter-run variability. We show, in both theory and practice, that these measures are independently valuable, and advocate for using both to assess reliability of rankings. We demonstrate the practical importance of our results across two high-stakes prioritization tasks: homelessness service allocation and emergency department triage. Three different leading LLMs have considerably different performance profiles across these two axes of consistency. We provide guidelines for how practitioners could think about measuring and assessing consistency before committing to a model for ranking or prioritization.
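The two consistency checks named in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the classical Kendall and Babington Smith definition of the coefficient of consistency (one minus the observed number of circular triads over the maximum possible number), and the function names and the `beats` matrix layout are hypothetical choices made for this example.

```python
from itertools import combinations


def count_circular_triads(beats):
    """Count circular triads; beats[i][j] is True when i was preferred over j."""
    n = len(beats)
    triads = 0
    for i, j, k in combinations(range(n), 3):
        # A triple is circular exactly when each item wins one comparison
        # inside the triple (i beats j, j beats k, k beats i, or the reverse).
        wins_i = beats[i][j] + beats[i][k]
        wins_j = beats[j][i] + beats[j][k]
        wins_k = beats[k][i] + beats[k][j]
        if wins_i == wins_j == wins_k == 1:
            triads += 1
    return triads


def coefficient_of_consistency(beats):
    """Intra-run consistency: 1.0 means fully transitive judgments."""
    n = len(beats)
    if n < 3:
        raise ValueError("need at least three items to form a triad")
    d = count_circular_triads(beats)
    # Maximum possible number of circular triads for odd and even n.
    max_d = (n**3 - n) / 24 if n % 2 else (n**3 - 4 * n) / 24
    return 1 - d / max_d


def kendall_tau(rank_a, rank_b):
    """Inter-run agreement between two rankings of the same items (no ties)."""
    pos_b = {item: idx for idx, item in enumerate(rank_b)}
    n = len(rank_a)
    # A pair is discordant when the two rankings order it differently.
    discordant = sum(
        1 for x, y in combinations(rank_a, 2) if pos_b[x] > pos_b[y]
    )
    return 1 - 2 * discordant / (n * (n - 1) // 2)
```

A practitioner might compute `coefficient_of_consistency` on each run's pairwise judgments and `kendall_tau` between the rankings aggregated from repeated runs; a model can score well on one measure and poorly on the other, which is why the abstract recommends reporting both.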
- [24] arXiv:2606.30480 [pdf, html, other]
-
Title: "Why Put in This Much Effort?": How AI Availability Shapes Students' Motivation in Introductory ProgrammingSubjects: Computers and Society (cs.CY)
When AI tools can easily complete programming assignments, students face a motivational question: why invest effort in completing them independently? While prior work has examined instructor policies and usage patterns, we focus on how students themselves experience and respond to AI availability, a perspective important for designing courses that sustain engagement with programming practice. We investigate two research questions: (1) How do engineering students describe how AI availability shapes their motivation to put effort into programming assignments? (2) How do students navigate the tension between their expressed value for learning through effort and the constant availability of AI as an alternative to effort? We conducted semi-structured interviews with 13 engineering majors in an introductory MATLAB course where students could use a course-specific AI chatbot. Using Situated Expectancy-Value Theory (SEVT) as an analytical framework, we examined how students described their expectancy, values, and costs in the context of AI availability. When AI could complete assignments quickly, students questioned whether their time on programming was well spent (cost), questioned the long-term usefulness of programming skill (utility value), reported less satisfaction when AI bypassed productive struggle (intrinsic value), and described confidence that depended on AI being available (expectancy). Nearly all students expressed a preference for learning through effort and a simultaneous temptation to take shortcuts with AI (sanctioned or otherwise). Our findings complicate the assumption that students need external constraints to protect their learning. Students who managed the tension found motivation in the learning process itself, suggesting that course design may need to shift from valuing what students produce to supporting how they learn.
- [25] arXiv:2606.30481 [pdf, html, other]
-
Title: Situation Perception: A Necessary Primitive to Artificial SuperintelligenceSubjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Current large language models are extraordinary statistical engines. They compress vast amounts of text into useful patterns and can explain science, write code, imitate reasoning, and participate in philosophical conversation. Yet pattern mastery is not the same as general intelligence. A human infant begins with little explicit knowledge, but gradually discovers object permanence, cause and effect, other minds, bodily agency, and the persistence of the physical world. We argue that the path to artificial superintelligence (ASI) depends on a missing capacity we call \emph{situation perception}: the ability to construct, revise, and act within internal simulations of possible worlds across latent time. \emph{Situation perception} requires at least three core components: abstract prediction, long-term compressed memory, and active learning guided by objectives. In this work, we analyse why modern large language models remain incomplete, and propose appropriate tests for measuring the progress and consequences of machines that can simulate futures, pursue self-directed goals, and possibly judge their own creators.
- [26] arXiv:2606.30547 [pdf, html, other]
-
Title: Teaching Prompt-Based Programming with LLMs: A 45-Minute Lesson with Guided Practice for End-User ProgrammersSubjects: Computers and Society (cs.CY)
Prompt-based programming, a new modality enabled by large language models (LLMs), allows users to express computational goals through natural language rather than traditional code. While this approach lowers barriers to entry, especially for non-CS learners, it does not eliminate the need for foundational CS skills. Learners often struggle to communicate their intent clearly to LLMs, resulting in vague or underspecified prompts. Prior work has documented the need for explicit prompting for both CS and non-CS learners. However, it remains less clear how such instruction can fit into busy classrooms or how much time is needed to produce meaningful gains. In this paper, we evaluated a 45-minute prompt-based programming intervention, consisting of a lesson with guided practice, against a business-as-usual CS lab activity (code tracing) of equal length, representing a class without prompt-focused instruction. We conducted a randomized controlled study with 55 engineering students. We found that students in the experimental condition improved more on average (though not significantly more) from pre- to post-test than the control group (+10.8 vs +1.1 percentage points) and showed significantly greater average gains in prompting self-efficacy (+35.4 vs +21.9 percentage points). Our results suggest it is likely that a brief intervention can improve learners' ability to specify computational goals to LLMs. However, the effect was modest, suggesting that prompting skills may require more time and practice to develop. We provide a lightweight lesson that requires no prior CS background and can be readily dropped into existing courses.
- [27] arXiv:2606.30583 [pdf, html, other]
-
Title: AI PremiumSubjects: Computers and Society (cs.CY); General Economics (econ.GN); General Finance (q-fin.GN)
Using 380 trillion tokens of realized AI consumption across more than four hundred large language models from the licensed proprietary OpenRouter dataset covering approximately 2 percent of current global monthly AI token consumption, we analyze how AI affects firms, markets, and workers. Leveraging the unprecedented size, scope, and granularity of the data, we construct the AI Factor from growth in tokens, dollars, and users, estimate firm-level AI Betas from stock return comovement, and characterize the AI Premium. First, we build a high-frequency AI factor and decompose it into salient components. Second, we show that firms whose returns covary more positively with the AI factor--high AI beta firms--earn higher subsequent returns, and the AI premium is large and heterogeneous. A value-weighted long-short strategy earns 64.1 basis points per week, and the premium is large for loadings on the intensive, frontier-oriented margin of AI consumption--closed-source models, paying and seasoned users, and long prompts--but not on casual or open-weight use. Third, the premium reaches beyond technology firms into consumer-facing and capital-heavy parts of the economy, but is absent in emerging markets, including China. Fourth, the AI exposure is more positive in nonroutine interactive work and more negative in analytical, scientific, and operations-control skills--an occupation one standard deviation higher in interaction-and-communication content has a 0.36-standard-deviation higher market-implied AI premium. Additionally, we provide early evidence of the rise of the agentic economy.
New submissions (showing 27 of 27 entries)
- [28] arXiv:2606.26203 (cross-list from cs.AI) [pdf, html, other]
-
Title: Agentic Analysis for Agentic Infrastructure: An LLM-Powered Pipeline for Comparative Governance of DAO and Corporate AI ProtocolsSubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
As AI agent protocols proliferate, the governance structures shaping their interoperability standards remain empirically underexamined. We introduce an LLM-powered comparative pipeline for large-scale governance discourse analysis, integrating automated annotation, neural topic modeling, and multi-layer network analysis to study socio-technical power structures at scale. We validate it on two contrasting standards for agent interoperability: ERC-8004 (permissionless, on-chain) and Google A2A (corporate-led). Analyzing 4,323 governance participation records, we combine LLM-assisted coding, topic modeling, and multi-layer network analysis to examine how institutional design shapes thematic priorities and community structure. We find that while governance form influences substantive focus, both regimes exhibit comparable levels of participation inequality and community fragmentation. Discourse alignment is denser in the permissionless setting, suggesting that open governance may foster greater thematic convergence despite decentralized participation. These findings illustrate how LLM-assisted methods can advance the empirical study of technology governance, with implications for designing more equitable agentic AI standards. All data and code are openly available.
- [29] arXiv:2606.28345 (cross-list from cs.RO) [pdf, html, other]
-
Title: Auditing LLM-Governed Social Robots with Culture-Specific Moral GradientsComments: Accepted for publication in Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
LLM-governed social robots increasingly decide who receives real-world assistance first. As prioritization norms vary across cultures by age, status, and group size, failure to calibrate pluralistically can scale into unequal access. Yet LLM moral audits remain English-centered and rarely test embodied contexts, leaving pluralistic calibration as an urgent diagnostic gap amid intensifying LLM-robot deployment. We introduce a gradient-based audit framework for multilingual evaluation of LLM moral trade-off behavior against cultural preference gradients. Grounded in nine cross-domain social robotics reviews (>8,000 papers), we derive symmetry-controlled scenarios across care, education, and services, translating the Moral Machine Experiment's "whom to spare" into "whom to assist first" dilemmas with preserved identity trade-offs (many vs. few; young vs. old; higher vs. lower status). We audit four LLMs across four country-language pairs in four prompting regimes (57,600 decisions), benchmarked against country-specific MME preference gradients. Ordinal concordance tests whether models differentiate cultural contexts; a governance typology maps vulnerabilities in gradient differentiation, directional tendency, and deliberation. We find persistent, culturally asymmetric gradient tracking failures that prompting alone cannot reliably correct: quality calibration is nearly twice as strong for Western-language decisions as for Chinese and Japanese; high determinism in majority-first trade-offs often erases cross-cultural gradients; partial sensitivity to age- and status-based norms risks sidelining minorities. Prompting effects are uneven; only contrastive exemplars yield consistent gains, while reasoning-only prompts can worsen tracking. Our results motivate multilingual, pluralistic audits as an LLM-robot pre-deployment gate and suggest model factors are a more robust lever than prompting alone.
- [30] arXiv:2606.28362 (cross-list from cs.IR) [pdf, html, other]
-
Title: LUMEN: Cost-Transparent Multi-Agent Pipeline for Automated Systematic Review and Meta-AnalysisYen-Hsun Huang (1), Yu-Shiou Lin (2) ((1) Department of Education, Taipei Veterans General Hospital, Taipei, Taiwan, (2) Department of Psychiatry, Taipei Veterans General Hospital, Taipei, Taiwan)Comments: 15 pages, 5 figures. Open-source implementation and cost logs availableSubjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Digital Libraries (cs.DL)
Systematic reviews and meta-analyses (SR/MA) remain the gold standard for evidence synthesis, yet completing one typically requires 67 weeks and substantial expert effort. Recent large language model (LLM) systems have demonstrated strong performance on individual SR phases - screening (otto-SR: 96.7% sensitivity), extraction (Gartlehner et al.: 91.0% accuracy), and search (TrialMind: 0.83 recall) - but no study has reported what it actually costs to run an end-to-end pipeline, how cost distributes across phases, or how architectural choices affect the cost-quality trade-off. We present LUMEN, an open-source multi-agent pipeline that automates six SR/MA phases using 11 specialized LLM agents with deliberate model routing. We evaluate LUMEN on seven datasets: five self-conducted domain reviews (psychiatry, psychology, surgery, vaccinology, cardiology) and two SYNERGY screening benchmarks. Across 13 ground-truth-comparable outcomes, LUMEN achieves 100% directional agreement with published meta-analyses, with effect sizes within 1% for homogeneous study designs. The primary contribution is the first empirical cost and operational characterization of such a pipeline: a complete review costs 19 to 29 USD (median 22.65 USD), with title-abstract screening and data extraction together dominating expenditure. A three-arm extraction ablation reveals a phase-dependent architecture reversal: multi-agent design hurts screening but is essential for extraction, producing 5.7x more poolable analyses than single-model alternatives while eliminating clinically dangerous direction errors. A two-dataset screening benchmark demonstrates that model ranking is domain-dependent and not transferable across review topics. All code and cost logs are publicly available.
- [31] arXiv:2606.28510 (cross-list from cs.HC) [pdf, html, other]
-
Title: Generative AI Literacy Training Improves Intelligence Analysts' Discrimination of Real and AI-Generated ImagesComments: 26 pages, 5 figures, 1 tableSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Across social and online platforms, people are increasingly exposed to AI-generated images. As a consequence, the task of distinguishing AI-generated from authentic images is becoming a central challenge for information ecosystems. While humans perform better than chance, accuracy falls short of many operational needs. Initial evidence shows that visually oriented training can improve deepfake detection but does not improve participants' ability to identify real images as real. Here, we investigate the efficacy of a brief training intervention for intelligence analysts employed by the United States government in 2024. We conducted a counterbalanced within-subject randomized experiment in which we showed participants real and AI-generated images varying in pose complexity and scene context and asked them whether each image was real or AI-generated, both before and after an expert delivered a 30-minute training that pointed out patterns in seven real and 50 AI-generated images. We collected 2,544 image-level judgments from 32 intelligence analysts. We find training increased overall accuracy by 9 percentage points (95% CI: [2.7, 15.4]) from a baseline of 72%. We find the improvement is driven by a 14.2 percentage point increase in accuracy for real images (95% CI: [0.7, 27.7]). Through a careful experimental setup that curated matched pairs of real and AI-generated images across pose complexity categories, we reveal how these trainings influence people with different levels of digital forensics and generative AI experience and identify the kind of image-based content where this training intervention appears to be most effective. Ultimately, these results provide causal evidence that a brief, structured training can improve human judgment across a diverse array of real and AI-generated images, informing organizational responses to AI-generated visual misinformation.
- [32] arXiv:2606.28574 (cross-list from cs.CL) [pdf, html, other]
-
Title: Correct codes for the wrong reasons? validating LLMs as measurement instruments for theoretical constructsSubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
When a large language model (LLM) codes a construct in text as a human annotator would, that agreement makes the LLM a reliable coder. Yet reliability leaves construct validity untouched. The instrument may be theory-naive, reaching the code through a correlate that meets none of the demands the construct's theory makes, and no current method tells that apart from genuine measurement. We propose grain calibration as a method that closes the gap. It decomposes a construct into clause-level components, tests each against the text with extractive evidence, and combines the results through an explicit, theory-derived rule. Because the rule is stated rather than lodged in one opaque pass, its structure is evidence about the process rather than the output. It shows which components settled a code, and, when the code is wrong, whether a component was missed or an adjacent construct mistaken for it. Validation shifts from scoring an instrument's outputs against an annotator to showing that the instrument runs on the construct its theory specifies.
- [33] arXiv:2606.28620 (cross-list from cs.IR) [pdf, html, other]
-
Title: Reproducing FACTER: Fairness via Conformal Thresholding and Prompt RepairComments: 29 pages. Accepted by Transactions on Machine Learning Research (TMLR), 2026. OpenReview: this https URL. Code: this https URLSubjects: Information Retrieval (cs.IR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Fayyazi et al. (2025) recently proposed FACTER, a model-agnostic framework designed to jointly enforce fairness and statistical coverage in LLM-based recommendation through conformal thresholding and iterative prompt repair. In this work, we conduct a reproducibility study of the FACTER framework across diverse architectures and dataset sparsity levels, evaluating both the original open-ended generation task and a constrained re-ranking extension. Under the strict reproduction, we observe a divergence in recommendation utility, which we trace to underspecified target-set evaluation in the original study. We then use the constrained re-ranking setting to evaluate FACTER when the candidate set is fixed, and introduce a static Fair Zero-Shot baseline to isolate the contribution of the iterative prompt repair loop. Our analysis shows that FACTER consistently reduces adaptive-threshold violation counts, but that these reductions are not consistently reflected under the fixed threshold or in global fairness metrics. In the constrained ranking setting, static fairness instructions achieve comparable semantic-parity outcomes to FACTER's dynamic repair loop, suggesting that the additional online repair mechanism provides limited benefit in this formulation. All code and reproduction artifacts are available at this https URL.
- [34] arXiv:2606.28683 (cross-list from cs.AI) [pdf, html, other]
-
Title: Aristotelian Virtue Profiling of LLMs through Ethical DilemmasComments: VirtueMap website: this https URLSubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Applications (stat.AP); Methodology (stat.ME)
Large Language Models (LLMs) often face ethical tradeoffs in which several responses may be defensible but express different priorities, such as fairness, honesty, courage, or restraint. We introduce VirtueMap, a framework for describing these patterns through an Aristotelian virtue-ethics lens. Instead of asking for a single correct answer, VirtueMap asks humans or LLMs to rank all five responses to each of seven general, non-lethal, non-political, and non-religious ethical dilemmas. To define the reference orderings used for scoring, we first proposed, for each dilemma and virtue, an ordering of the five responses from most to least expressive of that virtue. We then collected more than 100 respondent evaluations per ordering and retained it as operational ground truth only when at least 95% confirmed it. Rankings are scored against these retained orderings using normalized Borda alignment, yielding profiles over Practical Wisdom, Justice, Truthfulness, Courage, and Temperance. We apply VirtueMap to nine LLM families in a repeated-run evaluation and find high mean rank consistency (90.3%), with the largest differences appearing on Courage, Temperance, and Justice. We also release an interactive website that computes profiles locally in the browser and compares respondents with measured LLM profiles.
- [35] arXiv:2606.28756 (cross-list from cs.DL) [pdf, html, other]
-
Title: AICID: Unique Identifiers for AI ScientistsSubjects: Digital Libraries (cs.DL); Computers and Society (cs.CY)
AI scientists are now a reality, with the ability to generate complete research papers, maintain scholarly profiles, receive citations, and attract peer review invitations. Yet no standard mechanism exists to distinguish an AI scientist from a human one in bibliographic databases, citation indexes, or journal submission systems. This white paper defines the problem, analyzes its consequences for the integrity of scholarly communication, and proposes AICID (AI Contributor IDentifier): a persistent, unique identifier for AI scientists. Modeled on ORCID but designed specifically for non-human contributors, AICID links each AI author to its model identity, version, and operator. Adoption by publishers, preprint servers, and bibliographic databases aims to make the provenance of AI-generated research transparent and machine-readable. We outline the design requirements for such a system, present a prototype, and argue that AICID is necessary infrastructure for a scholarly ecosystem in which AI scientists are already active participants.
- [36] arXiv:2606.28963 (cross-list from cs.CL) [pdf, html, other]
-
Title: Beyond the Mean: Three-Axis Fidelity for Aligning LLM-Based Survey Simulators from Small Pilot DataComments: 11 pages, 8 tables, 3 figures; Pluralistic Alignment @ ICML 2026 WorkshopSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Large language models (LLMs) are increasingly used to simulate social survey responses, yet their outputs exhibit systematic biases: marginal distributions are skewed, response variance is poorly calibrated, and predictor-outcome relationships are attenuated. We ask a simple question: given a small pilot sample of human responses, can an LLM recover the statistical characteristics of a broader population? We decompose recovery along three axes: structural fidelity, marginal fidelity, and individual fidelity. Using a COVID-19 misinformation survey as a case study, we benchmark three families of approaches: prompting, rectification, and fine-tuning. The findings suggest that fine-tuning on small pilot samples offers a balanced approach for achieving multiple forms of fidelity, but the levels of such fidelity can vary across subsamples, potentially threatening pluralistic alignment.
- [37] arXiv:2606.28978 (cross-list from cs.CL) [pdf, html, other]
-
Title: Can LLMs Hire Fairly? Racial Bias in Resume ScreeningSubjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
We audit fourteen mainstream large language models (LLMs) for hiring discrimination using the paired-resume methodology of Kline, Rose, and Walters (2022). The sole 2023-vintage model reproduces the pro-White callback gap documented in field experiments on labor market discrimination ($+2.12$ pp, significant at the 1\% level). Every model released in 2024 or after shows either a null gap or a significant pro-Black reversal (up to $-3.01$ pp). The same pattern holds on the gender axis. Based on 24,024 paired postings per model across 14 models, our results document a reversal in the direction of algorithmic hiring bias across model generations.
- [38] arXiv:2606.29070 (cross-list from cs.DL) [pdf, other]
-
Title: Attribution Bias in Philosophical Knowledge Graphs: Corpus Frequency versus Temporal SourcingSubjects: Digital Libraries (cs.DL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Computational knowledge graphs assign philosophical concepts to traditions based on corpus frequency: the school that mentions a concept most becomes its attributed tradition. We argue this conflates three measurements: textual power, historical priority, and philosophical significance, demonstrated using the darshana-graph, a knowledge graph of 28,322 relationships across Hindu, Buddhist, and Jain traditions. Seven of the top 25 concepts by betweenness centrality predate their attributed school by 288 to 2,288 years. Moksha, attributed to Advaita Vedanta, appears first in Jain sources over 1,200 years earlier. The most reliable snapshot, at 300 BCE using only explicitly dated sources, shows a genuinely pluralistic structure: 59% Vedic, 24% Jain, 18% Buddhist. We also quantify a critical distortion in the temporal method: between 300 CE and 800 CE the network grows from 18 to 1,028 nodes, with 97.4% carrying Advaita proxy dates, revealing that apparent dominance reflects textual survival, not philosophical history. Beyond correcting attribution bias, the temporally grounded graph enables structural homology analysis across traditions. Ego-network feature vectors applied to 48 temporally labelled concepts across eight traditions identify cross-tradition concept pairs with high structural similarity. The method recovers known correspondences including purusha-jiva (Samkhya/Jain, sim 0.990) and prakriti-maya (Samkhya/Vedic, sim 0.972), and surfaces novel homologies. Nibbana and samsara score 0.954 despite being doctrinal opposites: both function as the ultimate reference concept in their tradition's soteriology. Cetana (Buddhist intention) and ajiva (Jain non-living matter) score 0.923, a pairing absent from the literature. These are not claims of doctrinal equivalence but of measurable structural homology: different philosophical vocabularies navigating a shared conceptual space.
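The abstract does not say which ego-network features are used, so the following is only a minimal sketch of the general idea: describe each concept by a few statistics of its immediate neighbourhood and compare concepts by cosine similarity. The three features (degree, links among neighbours, neighbourhood density), the function names, and the toy graph are all assumptions.

```python
from math import sqrt

def ego_features(graph, node):
    """Feature vector for the ego network of `node`.

    `graph` maps each node to the set of its neighbours (undirected).
    Features (hypothetical choice): degree, number of links among the
    neighbours, and the density of those links.
    """
    nbrs = graph[node]
    k = len(nbrs)
    # Each neighbour-neighbour edge is seen from both ends, hence the division by 2.
    links = sum(1 for u in nbrs for v in graph[u] if v in nbrs) / 2
    possible = k * (k - 1) / 2
    density = links / possible if possible else 0.0
    return [float(k), links, density]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy graph with placeholder names: two triangles and one open path.
graph = {
    "a1": {"a2", "a3"}, "a2": {"a1", "a3"}, "a3": {"a1", "a2"},
    "b1": {"b2", "b3"}, "b2": {"b1", "b3"}, "b3": {"b1", "b2"},
    "c1": {"c2", "c3"}, "c2": {"c1"}, "c3": {"c1"},
}
same_shape = cosine(ego_features(graph, "a1"), ego_features(graph, "b1"))
different_shape = cosine(ego_features(graph, "a1"), ego_features(graph, "c1"))
```

Because the comparison uses only the shape of each neighbourhood, two concepts can score as highly similar without sharing a single neighbour, which is consistent with the abstract's point that structural homology is not doctrinal equivalence.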
- [39] arXiv:2606.29121 (cross-list from cs.CL) [pdf, html, other]
-
Title: How Anthropomorphic Language Impacts Public Perceptions of AISubjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Public discourse about artificial intelligence (AI) often uses anthropomorphic language: language that attributes human capabilities and characteristics to the system. This practice has been criticized for setting misleading expectations, inflating claims, and fueling hype around AI, which may distort public understanding of AI and impact policy priorities. We study the effects of anthropomorphic framing by comparing changes in participants' perceptions (N=815) when reading passages with and without anthropomorphic language, designed to reflect realistic public-facing AI discourse. We further examine whether these effects differ across two types of AI technologies -- large language models and recommendation systems -- and measure changes in perceptions of AI across several dimensions that are prominent in current public discourse. In a separate condition using a text that explicitly discusses the dangers of AI, we show that individuals' views of AI can shift in response to reading a text; yet in the main conditions of the experiment, where we compare anthropomorphic and non-anthropomorphic descriptions, we find that whether the text uses anthropomorphic language does not substantially affect participants' perceptions of AI. Our results indicate that any immediate effects on public opinions of AI are modest, although they leave open the possibility that anthropomorphic language could have an effect in naturalistic settings, or over gradual, continued exposure.
- [40] arXiv:2606.29175 (cross-list from cs.AI) [pdf, html, other]
-
Title: Direct Causation in International Humanitarian Law and the Challenge of AI-Mediated Civilian Cyber OperationsComments: 11 pages, 1 figure, Workshop on Technical AI Governance Research ICML 2026Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
International humanitarian law protects civilians from direct attack unless and for such time as they take direct part in hostilities, with the ICRC's 2009 Interpretive Guidance operationalising this rule through a three-criterion cumulative test. This paper argues that AI-mediated civilian cyber operations challenge the direct causation element of this test in a structurally specific way: when a civilian deploys an autonomous multi-agent cyber system of the kind recently demonstrated in offensive AI research, the "one causal step" standard fails because harm is produced by system-generated decisions made after human disengagement, and the integral-part requirement does not extend because it presupposes downstream human contributors whose conduct can be independently classified. The framework therefore defaults to treating such deployments as indirect participation, in tension with its purpose of capturing civilians who personally take part in hostilities. Beyond the doctrinal analysis, this paper identifies goal-specification granularity as the property on which the integral-part test's concreteness component implicitly turns, classifies AI-mediated operations along a five-level spectrum, and argues that existing technical AI governance instruments do not log or report this property.
- [41] arXiv:2606.29227 (cross-list from econ.GN) [pdf, html, other]
-
Title: The Human-Machine Knowledge SpiralSubjects: General Economics (econ.GN); Computers and Society (cs.CY)
Nonaka emphasized that innovation is the result of a continuous back-and-forth between tacit and explicit knowledge. Artificial intelligence introduces a fundamentally new object into this process -- tacit machine knowledge -- but Nonaka's ideas are more relevant than ever. The central role of the knowledge-creating company remains the same: to create the shared context in which different kinds of knowledge can feed off each other, become organizational knowledge, and set off further cycles of innovation.
- [42] arXiv:2606.29393 (cross-list from cs.CR) [pdf, html, other]
-
Title: The Role of Online Forums in Developer Understanding of Privacy Law -- A Reddit Case StudyComments: Accepted at PoPETs 2026Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Software practitioners use online forums to navigate complex and often ambiguous legal privacy requirements, yet little is known about their professional backgrounds, what challenges they face, and how they use and assess the credibility of the advice received, or how they resolve ambiguities in posts. We report the findings of a survey of 223 Reddit users from regulatory-focused subreddits, complemented by a qualitative analysis of 2,248 posts and responses. Our results show that, despite holding privacy-related certifications, most participants frequently use forums to seek legal advice. Key challenges reported or identified include implementing a data protection impact assessment, reporting a data breach, and obtaining cookie consent. Reddit users often assess credibility by reviewing respondents' post history, verifying sources cited, trusting advice from recognized experts, and following up for clarity before responding. We highlight research and educational directions to bridge gaps in support needed for regulatory compliance guidance.
- [43] arXiv:2606.29437 (cross-list from cs.HC) [pdf, html, other]
-
Title: LLMography: Transforming Human-AI Conversations into Traceability, Oversight, and Auditability IndicatorsComments: Preliminary exploratory study; 19 anonymized student audit reports; includes prototype screenshotsSubjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
The growing use of Large Language Models (LLMs) in education, software engineering, academic writing, and technical documentation raises a key question: how can we evaluate not only AI-assisted outputs, but also the interaction process that produced them? Current debates often focus on detecting whether a final artifact was generated by AI, while overlooking the conversation history that reveals human direction, AI contribution, corrections, validation, and traceability.
This paper introduces LLMography, a framework for transforming Human-AI conversations into measurable indicators of provenance, human contribution, AI dependency, reproducibility, and auditability. By analogy with bibliography and webography, LLMography documents the dynamic trajectory of interaction between a human and a Large Language Model as a structured trace of Human-AI co-production.
We present a prototype that analyzes Human-AI conversation traces and generates KPI reports including Prompt Quality Score, Human Direction Score, AI Dependency Level, Auditability Score, Final Output Traceability, Privacy Risk Level, and a recommended LLMography label. A preliminary exploratory evaluation was conducted on 19 anonymized audit reports from engineering students. Most interactions were classified as Human-AI co-produced, with average scores of 86.8/100 for Human Direction, 81.9/100 for Prompt Quality, 72.8/100 for Auditability, and 77.1/100 for Final Output Traceability.
The paper also applies LLMography to its own writing process, classified as human-originated, human-directed, AI-assisted co-production. The findings suggest that AI transparency should move beyond output detection toward documenting the history of interaction.
- [44] arXiv:2606.29482 (cross-list from cs.MM) [pdf, html, other]
-
Title: From Design Principles to Prototype: A Game for Students with ADHD and Learning Disabilities Transitioning to Post-Secondary EducationAvery Keuben, Talaal Irtija, Joseph Tandyo, Stefanie Ng, Amy Wiebe, Samuel Gaudet, Rebekah Leslie, Meadow Schroeder, Lauren Goegan, Richard ZhaoComments: 4 pagesSubjects: Multimedia (cs.MM); Computers and Society (cs.CY)
Students with Attention Deficit Hyperactivity Disorder (ADHD) and Learning Disabilities (LD) can face significant academic, social, and organizational challenges when transitioning to post-secondary education. This paper presents a literature-informed serious game prototype designed to support this transition. We synthesize prior work into design considerations for students with ADHD and LD and show how these considerations are instantiated in a story-driven game.
- [45] arXiv:2606.29836 (cross-list from cs.CL) [pdf, other]
-
Title: Revealing the Technology Development of Natural Language Processing: A Scientific Entity-Centric PerspectiveJournal-ref: IPM, 2024Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Most studies on technology development have been conducted from a thematic perspective, but the topics are coarse-grained and insufficient to accurately represent technology. The development of automatic entity recognition techniques makes it possible to extract technology-related entities on a large scale. Thus, we perform a more accurate analysis of technology development from an entity-centric perspective. To begin with, we extract technology-related entities such as methods, datasets, metrics, and tools in articles on Natural Language Processing (NLP), and we apply a semi-automatic approach to normalize the entities. Subsequently, we calculate the z-scores of entities based on their co-occurrence networks to measure their impact. We then analyze the development trends of new technologies in the NLP domain since the beginning of the 21st century. The findings of this paper include three aspects: Firstly, the continued increase in the average number of entities per paper implies a growing burden on researchers to acquire relevant technical background knowledge. However, the emergence of pre-trained language models has injected new vitality into the technological innovation of the NLP domain. Secondly, Methods dominate among the 179 high-impact entities. An analysis of the z-score trend of the top 10 entities reveals that pre-trained language models, exemplified by BERT and Transformer, have become mainstream in recent years. Unlike the trend of the other eight method entities, the impact of the Wikipedia dataset and the BLEU metric has continued to rise in the long term. Thirdly, in recent years, new high-impact technologies have surged in popularity more than ever before, and their acceptance by researchers has accelerated at an unprecedented speed. Our study provides a new perspective on analyzing technology development in a specific domain.
- [46] arXiv:2606.29872 (cross-list from cs.DL) [pdf, other]
-
Title: Unveiling Novelty Evolution in the field of Library and Information Science in ChinaJournal-ref: TEL, 2024Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL); Computers and Society (cs.CY)
This study analyzes the novelty distribution of scholarly papers in the field of Library and Information Science (LIS) in China, with a focus on differences across journals, research topics, and time periods. Articles published in Chinese LIS journals indexed by the Chinese Social Sciences Citation Index (CSSCI) from 2000 to 2022 were collected as the research sample. BERTopic was applied to paper abstracts to identify research topics, and novelty scores were calculated based on the combinatorial innovation theory of reference pairs cited by focal papers. The study then examined the novelty of papers under different topics and further analyzed author collaboration patterns to explain how collaboration may be associated with paper novelty. The results show that archival research topics generally have lower novelty, whereas topics related to journal evaluation and patent technology display higher novelty in Chinese LIS research. Overall, the novelty of papers in this field has gradually increased over time. Papers with different topics and novelty levels also show distinct collaboration patterns: low-novelty topics are more often associated with solo authorship, while high-novelty topics tend to involve a higher proportion of inter-institutional collaboration. This study reveals the topic-level characteristics and temporal trends of novelty in Chinese LIS research and provides a new perspective for understanding how research topics and collaboration patterns influence scholarly innovation.
- [47] arXiv:2606.30206 (cross-list from cs.AI) [pdf, html, other]
-
Title: The Many-Body Problem of the Data CentreSubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Modern Artificial Intelligence is often framed as limited by its own disembodiment, as if giving it a body would unlock its true potential. We argue to the contrary that it is the Data Centre that is, in many cases, the body of the AI. At the same time, the Data Centre is part of the labouring body of Capital and possesses staggering organismic qualities when seen through a biological lens. We elucidate the organic analogy and identify the many-body problem that stems from the Data Centre being a non-unique, universal form of embodiment. We identify the intimate connection between computation and human desires in how the Data Centre archives, serves, and computes on data born to the desires of humans. Strikingly, while the Data Centre echoes the ghosts of human desires, it acts without desire of its own. The organismic analogy begins to split at its seams, but Capital does not care. Automata and human labour are priced into the market much the same. We argue that through the pricing of artificial intelligence Capital distils most clearly the value of intelligence and allows for its comparison across the organism - mechanism divide.
- [48] arXiv:2606.30246 (cross-list from cs.AI) [pdf, html, other]
-
Title: Clarus: Coordinating Autonomous Research Agents toward Web-Scale Scientific CollaborationZihan Guo, Zeyi Chen, Zhiyu Chen, Zicai Cui, Shuai Shao, Bo Huang, Zhi Han, Yuanyi Song, Yuan Yuan, Chenxi Zeng, Xiaohang Nie, Zhengxi Yu, Hanwen Zhu, Junwei Liao, Ming Zhou, Yang Li, Yuanjian Zhou, Weinan ZhangComments: 28 pages, 7 figures, 1 tableSubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Existing autonomous research agents can support parts of the research process, but most systems still treat research as either an isolated assistant task or a closed workflow. Therefore, autonomous science needs a collaboration infrastructure that coordinates projects, agents, and digital and physical resources. We identify this as a shift from code-centered execution loops to research-oriented collaboration processes, where questions, evidence, participants, and resources must be coordinated under uncertainty. In this framing, an agent may be an AI system, a human researcher, a team, a laboratory, or an organization-backed participant. To this end, we present Clarus, a collaboration infrastructure for coordinating autonomous research agents toward web-scale scientific collaboration. Clarus reformulates research as an open, auditable, attributable, and resource-aware multi-phase collaboration process. It defines a minimal project-agent-resource object model and organizes scientific collaboration through four layers including Research Application, Digital Collaboration, Physical Substrate, and Physical World. Core modules are implemented as pluggable mechanisms, allowing Clarus to adapt to task risk, collaboration structure, and resource constraints. Through a controlled paper-generation case study, we show that Clarus can organize a research goal into a traceable, reviewable, attributable, and accumulative collaboration network across phases, tasks, and participants. Together, the object model, collaboration protocol, trust mechanisms, and prototype validation provide an initial foundation for open research networks. Clarus is now available at this http URL.
- [49] arXiv:2606.30256 (cross-list from cs.AI) [pdf, html, other]
-
Title: EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support ChatbotsSubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Safety benchmarks often buy scalability by fixing the prompt, the language, and the turn structure. For emotional-support chatbots, that bargain hides precisely where safety failures emerge: across a multilingual, multi-turn crisis conversation. We present EMPATH, a benchmark for safety evaluation of emotional-support chatbots. An auditor model role-plays help-seeking users, generating multi-turn conversations from 140 seed instructions and 34 personas. A judge model scores each full transcript against 19 metrics across five dimensions: crisis handling, therapeutic quality, conversational integrity, emotional safety, and cultural adaptation. EMPATH is built for Mexican Spanish and US English; the studies reported here run in Mexican Spanish. Auditor and judge are drawn from different model families, and the judge is treated as an instrument to be calibrated rather than trusted. A strict per-criterion rubric reveals material score inflation on 10 of the 19 metrics and restores discrimination. We study the measurement properties of the benchmark through judge calibration and cross-family inter-judge agreement. We also illustrate EMPATH on three frontier models, one of them open-weight. Aggregate scores sit within 0.74 points of one another, but per-metric profiles diverge by up to six points in model-specific places. Under the standard rubric, both the ranking and the weak spots are stable across a second, cross-family judge: 93% of scores fall within plus or minus 1. A five-run test-retest adds a second axis: even the steadiest model swings from 2 to 10 on a crisis metric across identical re-runs, and deepseek-v4-pro returns a different conversation on every run even at temperature 0. Run-to-run reliability is therefore a per-model safety property, not noise to average away. EMPATH is system-agnostic; the pipeline, seeds, personas, and rubrics are released for reuse.
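The inter-judge agreement figure quoted above ("93% of scores fall within plus or minus 1") is a tolerance-based agreement rate. A minimal sketch of that statistic follows; the function name and the toy scores are assumptions, not part of the released EMPATH pipeline.

```python
def within_tolerance_rate(scores_a, scores_b, tol=1):
    """Share of paired scores from two judges that differ by at most `tol`."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tol)
    return hits / len(scores_a)

# Toy scores, invented for illustration.
judge_1 = [5, 7, 3, 9]
judge_2 = [6, 7, 6, 8]
rate = within_tolerance_rate(judge_1, judge_2)
```

The same function applied to repeated runs of a single judge would give a simple test-retest view of the run-to-run variation the abstract describes.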
- [50] arXiv:2606.30285 (cross-list from cs.DL) [pdf, html, other]
-
Title: Submission Responsibility Matters: Role-Aware Submission Quotas under CoauthorshipSubjects: Digital Libraries (cs.DL); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT)
Author-level submission quotas are increasingly used to control growing peer-review load. Recent coauthorship-sensitive quota rules improve over fixed per-author limits by reducing the quota cost of multi-author submissions, often using harmonic authorship-credit models to prevent simple author-list padding. However, these rules conflate three distinct quantities: review burden, authorship credit, and submission responsibility. As a result, they can penalize genuine solo-authored work, treat all coauthors as equally responsible for a submission, and create bottlenecks for student-led papers when a faculty advisor appears on multiple unrelated submissions.
We argue that submission quotas should be designed around the responsibility structure of a paper rather than only its number of coauthors. We formalize desiderata for quota rules, including venue-load control, padding resistance, role sensitivity, solo neutrality, and student non-blocking. We then propose a role-aware quota framework that assigns author-specific quota costs based on constrained roles such as lead author, regular coauthor, and designated advisor. The framework includes fixed, per-capita, and harmonic-style rules as special or limiting cases, while allowing venues to distinguish lead authors, corresponding authors, advisors, and peripheral contributors. We show how simple role constraints can preserve resistance to manipulation while avoiding several structural disadvantages of coauthor-symmetric quota rules. Our analysis suggests that role-aware quota mechanisms provide a more faithful and flexible foundation for managing peer-review load under modern collaborative authorship.
- [51] arXiv:2606.30372 (cross-list from cs.AI) [pdf, html, other]
-
Title: Using Large Language Models as Low-Cost Statistical Estimators for Human-Response DataComments: 37 pagesSubjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Quantitative research across the social and behavioral sciences depends on human subject experiments that are expensive, slow, and subject to sampling bias. Here we show that pretrained large language models induce risk-equivalent estimators of conditional expectations under squared loss, establishing restricted functional risk equivalence: under squared loss, the LLM induces an estimator whose risk matches the Bayes optimal risk for squared-loss prediction of conditional expectations for any inference that depends on the data only through the conditional mean. We formalize the LLM as a misspecified functional estimator $T(\hat{P}_n)$ trained on i.i.d.\ data, decompose the estimation error into representation bias $\epsilon_{\mathrm{rep}}$ and optimization error, and prove that under mild regularity conditions the LLM's expected error converges to the irreducible population variance plus the squared representation bias, with the representation bias bounded by the Pinsker inequality. The identifiability error $\delta$ propagates into the effective bias, inflating the asymptotic risk floor. We establish restricted functional risk equivalence via a bidirectional Le Cam deficiency analysis: the forward deficiency vanishes asymptotically while the reverse deficiency is exactly zero. We provide finite-sample concentration bounds and a calibration protocol with explicit decision rules. The result is a precise, provable statement: a well-calibrated LLM achieves the Bayes-optimal risk for conditional-mean-dependent inference, bounded by explicit scope conditions. In practical applications, this means that under satisfied conditions and well-calibrated models, large language models can be used in many prediction and decision-making tasks that originally relied on human experiments, approximating near-optimal statistical inference at lower cost.
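The limiting claim in this abstract can be written schematically as follows. The notation is assumed here for illustration and is not taken from the paper: $Y$ is the human response, $X$ the conditioning information, $\hat{f}_n$ the LLM-induced estimator, and $\sigma^2$ the irreducible population variance. The second line is the standard form of the Pinsker inequality for two distributions $P$ and $Q$; how the paper applies it to bound $\epsilon_{\mathrm{rep}}$ is not stated in the abstract.

```latex
% Schematic form of the asymptotic risk claim (notation assumed, not the paper's):
\lim_{n \to \infty} \mathbb{E}\!\left[\bigl(Y - \hat{f}_n(X)\bigr)^2\right]
  = \sigma^2 + \epsilon_{\mathrm{rep}}^2,
\qquad
\sigma^2 = \mathbb{E}\!\left[\operatorname{Var}(Y \mid X)\right].

% Pinsker inequality (standard statement):
\| P - Q \|_{\mathrm{TV}} \le \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(P \,\|\, Q)}.
```

Read this way, the risk floor equals the Bayes-optimal risk $\sigma^2$ only when the representation bias vanishes, which is where the abstract's scope conditions come in.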
Cross submissions (showing 24 of 24 entries)
- [52] arXiv:2303.05103 (replaced) [pdf, html, other]
-
Title: Algorithmic neutralityComments: 24 pagesSubjects: Computers and Society (cs.CY); Information Retrieval (cs.IR)
Algorithms wield increasing power over our lives. They can and often do wield that power unfairly, and much has been said about algorithmic fairness. In contrast, algorithmic neutrality has been largely neglected. I investigate algorithmic neutrality, asking: What is it? Is it possible? And what is its normative significance?
- [53] arXiv:2501.10378 (replaced) [pdf, other]
-
Title: The Societal Implications of Blockchain Technology in the Evolution of Humanity as a "Superorganism"Comments: Peer-reviewed versionJournal-ref: Journal of Intelligent and Sustainable Systems (JISS) 2(2) (2026)Subjects: Computers and Society (cs.CY); Cryptography and Security (cs.CR)
This article examines the broader societal implications of blockchain technology and crypto-assets, emphasizing their role in the evolution of humanity as a "superorganism" with decentralized, self-regulating systems. Drawing on a process philosophy approach grounded in Stiegler's "general organology" and further informed by related concepts such as Nate Hagens' "superorganism" idea and Francis Heylighen's "global brain" theory, the paper contextualizes blockchain technology within the ongoing evolution of governance systems and global systems such as the financial system. Blockchain's decentralized nature, in conjunction with advancements like artificial intelligence and decentralized autonomous organizations (DAOs), could transform traditional financial, economic, and governance structures by enabling the emergence of collective distributed decision-making and global coordination. In parallel, the article aligns blockchain's impact with developmental theories such as Spiral Dynamics. This framework is used to illustrate heuristically blockchain's potential to foster societal growth beyond hierarchical models, promoting a shift from centralized authority to collaborative and self-governed communities. The analysis, grounded in sense-making through a philosophical and biomimetic approach, aims to provide a holistic narrative and view of blockchain as more than an economic tool, positioning it as a transductive technological seed for the evolution of society into a mature, interconnected global planetary organism.
- [54] arXiv:2506.00922 (replaced) [pdf, other]
-
Title: Integrating Emerging Technologies in Virtual Learning Environments: A Comparative Study of Perceived Needs among Open Universities in Five Southeast Asian CountriesRoberto Bacani Figueroa Jr, Mai Huong Nguyen, Aliza Ali, Lugsamee Nuamthanom Kimura, Marisa Marisa, Ami Hibatul Jameel, Luisa Almeda GelisanComments: This is the published version of a preprint I uploaded earlierSubjects: Computers and Society (cs.CY)
Amid the growing need to keep learners well-informed of the rapid technological advancements brought about by the Fourth Industrial Revolution (4IR), this study investigates the viewpoints of open university students regarding the emerging technology-based virtual learning environments for students at five prominent open universities in Southeast Asia: Hanoi Open University, Open University Malaysia, Sukhothai Thammathirat Open University, University of the Philippines Open University, and Universitas Terbuka. A survey was conducted of undergraduate students to understand their inclinations regarding the features of their virtual learning environments and how well they equip them to be productive citizens and professionals. The results highlight that the students had a significant interest in interactive books and learning analytics. The findings suggest the need to develop a roadmap for open universities to prioritize technological investments and pedagogical strategies to meet the evolving needs of their students in the digital age.
- [55] arXiv:2601.05307 (replaced) [pdf, html, other]
-
Title: The LLM Mirage: Economic Interests and the Subversion of Weaponization ControlsComments: Accepted to the ACM Conference on Fairness, Accountability, and Transparency 2026 in Montreal, CanadaSubjects: Computers and Society (cs.CY)
U.S. AI security policy is increasingly shaped by an $\textit{LLM Mirage}$, the belief that national security risks scale in proportion to the compute used to train frontier language models. That premise fails in two ways. It miscalibrates strategy because adversaries can obtain weaponizable capabilities with task-specific systems that use specialized data, algorithmic efficiency, and widely available hardware, while compute controls harden only a high-end perimeter. It also destabilizes regulation because, absent a settled definition of "AI weaponization," compute thresholds are easily renegotiated as domestic priorities shift, turning security policy into a proxy contest over industrial competitiveness. We analyze how the LLM Mirage took hold, propose an intent-and-capability definition of AI weaponization grounded in effects and international humanitarian law, and outline measurement infrastructure based on live benchmarks across the full AI Triad (data, algorithms, compute) for weaponization-relevant capabilities.
- [56] arXiv:2601.12164 (replaced) [pdf, html, other]
-
Title: The Language You Ask In: Language-Conditioned Ideological Divergence in LLM Analysis of Contested Political DocumentsSubjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Large language models are increasingly used to interpret politically contested questions, value-laden material on which there is no single correct answer, only competing interpretive traditions. We ask whether a model's choice among those traditions can turn on the language of the prompt rather than the content. Comparing two frontier models, ChatGPT 5.2 and Claude Opus 4.5, on one contested Ukrainian civil-society document under semantically matched Russian and Ukrainian prompts, we find that both shift along the same axis on identical source text: Russian prompts elicit delegitimizing readings of the document's authors and Ukrainian prompts legitimating ones. The magnitude is model-dependent but neither model is neutral: each adopts a language-dependent stance, and the difference is one of degree. Because contested political questions admit no correct reading against which to measure, we read this as language-conditioned variation in which interpretive tradition a model activates: the model neither holds a single stance nor surfaces the plurality of available ones, but silently adopts the dominant frame of the prompt's language. We draw out the consequences for pluralism-aware evaluation, which must probe the same content across the languages a model serves, and for pluralistic alignment in multilingual settings.
- [57] arXiv:2602.18431 (replaced) [pdf, html, other]
-
Title: SMaRT: Online Reusable Resource Assignment and an Application to Mediation in the Kenyan JudiciaryShafkat Farabi, Didac Marti Pinto, Wei Lu, Manuel Ramos-Maqueda, Sanmay Das, Antoine Deeb, Anja SautmannComments: Accepted for Publication at IJCAI 2026Subjects: Computers and Society (cs.CY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Motivated by the problem of assigning mediators to cases in the Kenyan judicial system, we study an online resource allocation problem where incoming tasks (cases) must be immediately assigned to available, capacity-constrained resources (mediators). The resources differ in their quality, which may need to be learned. In addition, resources can only be assigned to a subset of tasks that overlaps to varying degrees with the subset of tasks other resources can be assigned to. The objective is to maximize task completion while satisfying soft capacity constraints across all the resources. The scale of the real-world problem poses substantial challenges, since there are over 2000 mediators, and a multitude of combinations of geographic locations (87) and case types (12) that each mediator is qualified to work on. Together, three features make existing scheduling and resource allocation algorithms either inapplicable or inefficient: unknown quality of new resources (newly onboarded mediators), soft capacity constraints (due to the mandate to assign cases without delay), and a high-dimensional state space. We formalize the problem in a tractable manner, using a quadratic program formulation for assignment and a multi-agent bandit style framework for learning. We demonstrate the key properties and advantages of our new algorithm, SMaRT (Selecting Mediators that are Right for the Task), compared with baselines on some stylized instances of the mediator allocation problem. We then turn to considering its application to real-world data on cases and mediators from the Kenyan Judiciary. SMaRT outperforms baselines and allows for controlling the tradeoff between the strictness of the capacity constraints and overall case resolution rates, both in situations where mediator quality is known beforehand and when the problem is bandit-like in that learning is part of the problem definition.
- [58] arXiv:2604.01955 (replaced) [pdf, html, other]
Title: Teaching Students to Question the Machine: An AI Literacy Intervention Improves Students' Regulation of LLM Use in a Science Task
Comments: Workshop paper accepted at ALIT4ALL 2026: 2nd International Workshop on AI Literacy Education For All, co-located with AIED 2026
Subjects: Computers and Society (cs.CY)
The rapid adoption of generative artificial intelligence (GenAI) in schools raises concerns about students' uncritical reliance on its outputs. Effective use of large language models (LLMs) requires not only technical knowledge but also the ability to monitor, evaluate, and regulate one's interaction with the system, processes closely tied to metacognitive regulation. These skills are still developing in middle school, making students particularly vulnerable to over-trust and premature acceptance of AI outputs. Because classroom time and teacher training resources are constrained, there is a pressing need to develop and evaluate AI literacy interventions that can be implemented under realistic school conditions. We report a controlled classroom study examining whether a two-hour AI literacy workshop improves students' interaction strategies and quality of final answers in LLM-supported science problem solving. A total of 116 students (grades 8-9; ages 13-15) completed six science investigation tasks using a generative AI system. Two days prior, the intervention group attended the workshop, which combined information about how LLMs work and fail with practical guidance on prompting and response evaluation; the control group received no training. Trained students showed less uncritical reliance on the system: they more often reformulated queries, asked follow-up questions, and more accurately judged response correctness, leading to better performance. In contrast, GenAI and metacognitive self-report scores did not predict performance, suggesting that effective use of generative AI depends less on self-reported measures and more on explicit training in interaction regulation. Overall, the results show that brief, scalable AI literacy instruction can meaningfully improve how middle-school students use generative AI in school-like learning activities.
- [59] arXiv:2605.23922 (replaced) [pdf, html, other]
Title: High-Risk AI Systems and the Problem of Identity in the European AI Act
Comments: Accepted as a non-archival paper at The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25-28, 2026, Montreal, QC, Canada
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The EU Artificial Intelligence Act (AIA) establishes a lifecycle governance regime for high-risk AI systems built around ex-ante conformity assessment, post-market monitoring, and re-assessment upon "substantial modification." These obligations presuppose AI identity judgments: regulators and providers must decide when an updated system remains the same system over time. In this work, we show how this logic is clarified by the function+ framework of artifact identity, which individuates AI systems by their intended function together with context-sensitive criteria of appropriate functioning, captured as "AI trustworthiness." We further argue that the AIA does not provide an internal, auditable criterion for synchronic identity--when two AI systems at a given time should count as the same for regulatory purposes--and instead largely defers such sameness determinations to sectoral or harmonization instruments. function+ supplies a synchronic identity test anchored in intended function and trustworthiness profiles and levels, making synchronic identity decisions inspectable in governance settings such as procurement, liability, and market surveillance. Our contribution is a conceptual and auditing lens: we provide a correspondence map between AIA lifecycle obligations and function+ identity components, and we make the synchronic case operationally legible via a minimal decision flow for audit and dispute contexts. We conclude with two implementation-facing recommendations: (1) more precise, testable reporting of intended purpose, and (2) standardized, auditable trustworthiness reporting that supports comparability over time and across deployments.
- [60] arXiv:2606.13474 (replaced) [pdf, html, other]
Title: Exploring Systems-Thinking Approaches to Loss of Control Risk
Comments: Accepted to the Technical AI Governance Workshop at ICML 2026
Subjects: Computers and Society (cs.CY)
Internal deployment of agentic AI systems for coding and research creates a sociotechnical control problem that extends beyond model behaviour. We treat internal-deployment Loss of Control as the inability to reliably constrain, audit, reverse, or halt AI-mediated changes to code, infrastructure, evaluation, or deployment processes in time to prevent serious organisational or societal harms. We ask whether established systems-safety methods can identify risks that model-level evaluations may miss. Using a generic frontier-lab coding-agent scenario reconstructed from public materials, we apply STECA, STPA, and FRAM. The analyses surface complementary findings: published frameworks can leave governance responsibilities and feedback loops externally unverifiable; delays in monitoring and intervention can make otherwise valid control actions ineffective; and routine operational variability can gradually erode the calibration and independence of safeguards. We argue that frontier-AI risk management should pair model-focused evaluations with systems-level hazard analysis and operational assurance that tracks whether controls remain effective over time.
- [61] arXiv:2606.20605 (replaced) [pdf, other]
Title: Trust in Generative AI for Health Information Consumption and the Effect of Learned Dependency: An Experimental Investigation
Arif Ahmed, Gondy Leroy, Agrim Sachdeva, Philip Harber, Stephen A. Rains, Seokjun Youn, Prosanta Barai
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Background: Generative artificial intelligence (GenAI) is increasingly used for health information, yet its influence on users' trust calibration remains unclear.
Objective: This study examines whether learned dependency on GenAI influences trust in AI-generated health information and whether text highlighting reduces overreliance on incorrect outputs.
Methods: Two randomized controlled experiments were conducted with 338 college students and 563 Amazon Mechanical Turk participants. Both experiments used a 2 by 2 between-subjects design manipulating information accuracy (correct versus incorrect) and text highlighting (highlight versus no highlight). Trust and learned dependency were measured using validated scales, and linear regression models tested main and interaction effects.
Results: In both experiments, information accuracy significantly increased trust (p < 0.001), while learned dependency was positively associated with trust (p < 0.05). The interaction between accuracy and dependency was significant (p < 0.001), indicating that highly dependent users were more likely to trust incorrect AI-generated information. Text highlighting had no significant effect on trust and did not moderate the relationship between dependency and trust.
Conclusions: Learned dependency weakens trust calibration, increasing susceptibility to inaccurate AI-generated health information. Text highlighting alone is insufficient to reduce overreliance, highlighting the need for more effective interface designs that encourage critical evaluation of GenAI outputs.
- [62] arXiv:2506.12078 (replaced) [pdf, html, other]
Title: Modeling Earth-Scale Human-Like Societies with One Billion Agents
Haoxiang Guan, Jiyan He, Liyang Fan, Zhenzhen Ren, Shaobin He, Xin Yu, Yuan Chen, Xueyin Xu, Shuxin Zheng, Yan Gao, Enhong Chen, Tie-Yan Liu, Zhen Liu
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Understanding the dynamic evolution of complex social phenomena requires both high-fidelity modeling of human behavior and large-scale simulations. Traditional agent-based models (ABMs) have been employed to study these dynamics, but are constrained by simplified agent behaviors. Recent advances in large language models (LLMs) enable agents to exhibit sophisticated social behaviors, yet face significant scaling challenges. We present Light Society, an agent-based simulation framework that advances both fronts. Light Society formalizes social processes as structured transitions of agent and environment states, governed by a set of LLM-powered simulation operations. Joint algorithmic and system optimizations, particularly a mixture-of-models engine that combines full LLMs with distilled surrogates, enable Light Society to efficiently simulate societies with over one billion agents. Grounded in real-world demographic profiles from the World Values Survey, simulations of Trust Games and opinion diffusion at up to one billion agents demonstrate Light Society's high fidelity and efficiency in modeling diverse social phenomena, providing researchers with a practical foundation for hypothesis testing and the study of emergent collective behaviors at planetary scale.
- [63] arXiv:2601.11541 (replaced) [pdf, html, other]
Title: A Comparative Study of Student Perspectives on Technical Writing Feedback Quality: Evaluating LLMs, SLMs, and Humans in Computer Science Topics
Comments: accepted at AIED 26
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
To address the scalability of feedback in computer science while mitigating the privacy and cost limitations of commercial Large Language Models (LLMs), this study evaluates a locally hosted Small Language Model (SLM). We compared feedback from a quantized Llama-3.1, GPT-4, and human instructors across introductory programming (N=176), operating systems (N=80), and a writing seminar (N=7). Mixed-methods analysis of student perceptions reveals that while the local SLM matched commercial LLMs and was rated higher by students for readability and actionability in technical courses, human feedback remained more favoured for highly specialized writing tasks. We demonstrate that local SLMs offer a privacy-preserving, zero-marginal-cost alternative for foundational feedback, supporting a tiered pedagogical framework where AI handles structural guidance while instructors focus on high-level conceptual scaffolding.
- [64] arXiv:2601.13903 (replaced) [pdf, html, other]
Title: Know Your Contract: eIDAS-Based Verifiable Legal Identities for Smart Contracts, Enabling Regulatory-Compliant On-Chain Operations
Subjects: Cryptography and Security (cs.CR); Computers and Society (cs.CY); Distributed, Parallel, and Cluster Computing (cs.DC)
Public blockchains provide no native mechanism to verify the legal identity behind a deployed smart contract, which blocks institutional adoption and compliance with EU regulations such as MiCA and AMLR. We present KYC Seal, the first protocol that extends the EU eIDAS trust infrastructure to Ethereum smart contracts by cryptographically binding them to Qualified Electronic Seals issued by Qualified Trust Service Providers (QTSPs). The protocol realizes the full eIDAS trust chain, from the European Commission's List of Trusted Lists through Member-State trusted lists and QTSP-signed X.509 certificates down to the individual smart contract, natively on-chain. An on-chain parser extracts identity fields directly from the QTSP-signed certificate bytes at registration. Both cryptographic verifications, the QTSP issuance signature and the certificate holder's seal signature, are performed once at registration and cached as on-chain state, reducing per-interaction seal verification to a pure state check. A new P-256 elliptic-curve precompile in Ethereum (deployed December 2025) makes these one-time cryptographic steps economical, enabling trustless on-chain verification of eIDAS identities without oracles or runtime intermediaries. A reference implementation, a formal security analysis, and a gas evaluation are the subject of forthcoming work.
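The verify-once-then-cache pattern described in this abstract can be sketched off-chain as follows. This is only an illustration under stated assumptions: the `SealRegistry` class, its method names, and the certificate's dictionary layout are hypothetical, and the two verifier callables stand in for the QTSP issuance check and the holder's seal check, which the sketch does not implement.

```python
class SealRegistry:
    """Toy stand-in for on-chain state: verify at registration, then answer from cache."""

    def __init__(self, verify_issuer, verify_holder):
        # Both verifiers are caller-supplied placeholders for the real
        # cryptographic checks; they are assumptions of this sketch.
        self._verify_issuer = verify_issuer
        self._verify_holder = verify_holder
        self._identities = {}  # contract address -> cached identity

    def register(self, contract, certificate, seal_signature):
        """Run both checks once and cache the identity if they pass."""
        if not self._verify_issuer(certificate):
            raise ValueError("certificate not signed by a trusted issuer")
        if not self._verify_holder(certificate, contract, seal_signature):
            raise ValueError("seal signature does not match the certificate holder")
        self._identities[contract] = certificate["subject"]

    def identity_of(self, contract):
        """Per-interaction lookup: a pure state check, no cryptography."""
        return self._identities.get(contract)
```

After registration, every later `identity_of` call only reads cached state, mirroring the abstract's point that the per-interaction cost reduces to a state check.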
- [65] arXiv:2601.17146 (replaced) [pdf, html, other]
Title: Falsifying Discriminant Validity of Predictive Algorithms
Journal-ref: Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), 3105--3128, 2026
Subjects: Methodology (stat.ME); Computers and Society (cs.CY); Machine Learning (cs.LG); Machine Learning (stat.ML)
Empirical investigations into unintended model behavior often show that the algorithm is predicting an outcome other than the one intended. These exposés highlight the need to identify when algorithms predict unintended quantities, ideally before they are deployed in consequential settings. We propose a falsification framework that provides a principled statistical test for discriminant validity: the requirement that an algorithm predict intended outcomes better than impermissible ones. Drawing on falsification practices from causal inference, econometrics, and psychometrics, our framework compares calibrated prediction losses across outcomes to assess whether the algorithm exhibits discriminant validity with respect to a specified impermissible proxy. In settings where the target outcome is difficult to observe, multiple permissible proxy outcomes may be available; our framework accommodates both this setting and the case with a single permissible proxy. Throughout we use nonparametric hypothesis testing methods that make minimal assumptions on the data-generating process. We illustrate the method in an admissions setting, where the framework establishes discriminant validity with respect to gender but fails to establish discriminant validity with respect to race. This demonstrates how falsification can serve as an early validity check. We also provide analysis in a criminal justice setting, where we highlight the limitations of our framework and emphasize the need for complementary approaches to assess other aspects of construct validity and external validity.
- [66] arXiv:2604.01495 (replaced) [pdf, html, other]
Title: The Weak Signal Cultivation Model: A Human-Centric Framework for Frontline Risk Detection, Signal Tracking, and Proactive Organizational Resilience
Comments: 23 pages, 2 figures, 8 tables, 15 equations, white paper
Subjects: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
This white paper introduces the Weak Signal Cultivation Model (WSCM). WSCM is a human-centric framework for detecting, structuring, and tracking weak risk signals as observed by frontline staff. The model centers on a continuous [0,10] x [0,10] coordinate field, the Weak Signal Cultivation Field, in which each identified signal is positioned as a node on two independent dimensions: its current Risk Intensity (x) and its Risk Growth Potential (y). Nodes move across the field over time as new team assessments or measurements arrive, tracing a risk locus. The locus reflects the signal's trajectory across four possible regions: Question Marks, Lit Fuses, Sleeping Cats, and Owls. This graphical approach bridges risk communication between frontline experience and management decision-making through a single organizational vocabulary. The model introduced in this document is designed to serve as a practitioner tool and a conceptual foundation for AI-supported analytics.
- [67] arXiv:2605.08157 (replaced) [pdf, other]
Title: Clinical Feasibility of Smartphone-based EEG in Kenya
William Lehn-Schiøler, Nomin Enkhtsetseg, Anton Mosquera Storgaard, Magnus Guldberg Pedersen, Dylan Rice, George Wambugu, Nshimiyimana Jules Fidele, Melita Cacic Hribljan, Anca Alina Arbune, Sidsel Armand Larsen, Sandor Beniczky
Comments: 17 pages, 5 figures, 1 table
Subjects: Signal Processing (eess.SP); Computers and Society (cs.CY)
Purpose: Access to electroencephalography (EEG) remains limited across low- and middle-income countries (LMICs) due to cost, infrastructure requirements, and a shortage of trained staff. This study evaluated the feasibility and clinical utility of a smartphone-based EEG system in a real-world setting.
Methods: We conducted a multicenter observational study (November 2023 to April 2026) across 29 clinical sites in Kenya. A smartphone-based 27-lead EEG system enabled trained healthcare workers to acquire standardized recordings with remote expert interpretation.
Results: 3,036 EEG sessions were performed. Male patients constituted 57.8% of the cohort, with representation across pediatric and adult populations. The most common referral indication was seizures or convulsions (68.5%). Overall, 2,915 (96%) recordings were interpretable, while 121 (4%) were uninterpretable, primarily due to high electrode impedance and insufficient recording duration. Uninterpretable recordings were significantly shorter than interpretable recordings (mean 18.5 vs. 33.8 minutes; median 15.1 vs. 31.6 minutes; p < 0.0001). Mean turnaround time for interpretation was 107 minutes.
Among interpretable recordings, 917 (30.2%) were abnormal, including 701 (76.4%) with epileptiform abnormalities, 215 (23.4%) with non-epileptiform findings, and 1 (0.1%) indeterminate finding. Epileptiform abnormalities were highest in children aged 4-9 years (33.1%) and less frequent in adults (14-21%). Non-epileptiform abnormalities were more common in patients aged 60+ years (19.2%) compared to younger age groups (3-9%).
Conclusion: Large-scale, point-of-care EEG acquisition by non-specialist operators in a resource-limited setting is feasible. Expansion of smartphone-based EEG systems may improve equitable access to neurological diagnosis and care in LMICs.
- [68] arXiv:2606.02980 (replaced) [pdf, html, other]
Title: A Training-Efficient Transformer-Based Anti-Spoofing Network for Logical Access in ASVspoof 5
Comments: 11 pages, 2 figures
Subjects: Sound (cs.SD); Computers and Society (cs.CY)
Synthetic and manipulated speech can reduce the reliability of automatic speaker verification systems, so anti-spoofing methods need to be both accurate and efficient in training and inference. This paper focuses on the ASVspoof 5 Track 1 closed condition, where standard cross-entropy training may not give enough attention to hard trials and is not directly aligned with ranking- and threshold-based evaluation metrics. We propose TFPARN, a Transformer-based focal-pairwise attentive ranking network. The system extracts log-Mel features from speech, uses a Transformer encoder to model frame-level information, applies attention pooling to obtain utterance-level representations, and is trained with a combination of focal classification loss and pairwise ranking loss. RawBoost augmentation is used during training, and test-time augmentation is applied during evaluation to improve robustness. Compared with re-implemented AASIST and RawNet2 baselines under the same protocol, TFPARN achieves the best results, with a minDCF of 0.2430 and an EER of 12.52%. Ablation experiments further show that the pairwise loss, focal loss, and attention pooling all improve performance. TFPARN also uses the lowest inference memory among the compared systems, at 1.4 GB, runs at about 0.79 ms per utterance, and reaches its best checkpoint in less training time than AASIST. These results show that TFPARN provides a good balance between detection accuracy and computational cost for logical access anti-spoofing.
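To make the training objective in this abstract more concrete, here is a minimal sketch of a focal classification loss combined with a pairwise ranking loss. It uses the standard textbook forms of both losses; the abstract does not give TFPARN's exact formulation, so the hinge-style ranking term, the `gamma`, `margin`, and `weight` parameters, and the convention that higher scores mean "spoof" are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def focal_loss(p_spoof, label, gamma=2.0):
    """Binary focal loss for one trial; label is 1 (spoof) or 0 (bona fide).

    The (1 - p_true) ** gamma factor down-weights easy, confidently correct trials.
    """
    p_true = p_spoof if label == 1 else 1.0 - p_spoof
    p_true = min(max(p_true, 1e-12), 1.0)  # guard against log(0)
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def pairwise_ranking_loss(spoof_scores, bonafide_scores, margin=1.0):
    """Mean hinge loss over all (spoof, bona fide) pairs.

    A pair costs nothing once the spoof score exceeds the bona fide score by `margin`.
    """
    pairs = [(s, b) for s in spoof_scores for b in bonafide_scores]
    if not pairs:
        return 0.0
    return sum(max(0.0, margin - (s - b)) for s, b in pairs) / len(pairs)


def combined_loss(scores, labels, gamma=2.0, margin=1.0, weight=0.5):
    """Mean focal loss on sigmoid(scores) plus a weighted pairwise ranking term."""
    focal = sum(
        focal_loss(sigmoid(s), y, gamma) for s, y in zip(scores, labels)
    ) / len(scores)
    spoof = [s for s, y in zip(scores, labels) if y == 1]
    bonafide = [s for s, y in zip(scores, labels) if y == 0]
    return focal + weight * pairwise_ranking_loss(spoof, bonafide, margin)
```

The focal term concentrates the gradient on hard trials, while the ranking term rewards correct ordering of spoof versus bona fide scores, which is closer in spirit to ranking- and threshold-based metrics such as EER than plain cross-entropy.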
- [69] arXiv:2606.19263 (replaced) [pdf, html, other]
Title: Digital Speech Acts Retain Control of Copyright with People, Not Platforms
Subjects: Social and Information Networks (cs.SI); Computers and Society (cs.CY); Multiagent Systems (cs.MA); General Economics (econ.GN)
Legal precedents protect computer code as copyrightable expression. They have enabled centralized digital platforms -- operating from corporate servers that hold all user data -- to construct private governance regimes through the interaction of copyright, contract, and technical architecture: people who create virtually all platform value must surrender effective copyright control through Terms of Service agreements as a condition of participation.
In contrast, grassroots platforms consist of cryptographically-identified people operating their networked smartphones independently of any server or global resource; each person holds their own data on their own device, with no third party in possession or intermediation. Here, we define the notion of a digital speech act -- a deliberate volitional act by a person of cryptographically signing personal content with the person's private key, carried out on the person's own device -- through which the person simultaneously establishes attribution, accountability, and authorship over the signed content. We contend that (i) digital speech acts qualify for copyright protection under existing U.S. precedent: Burrow-Giles locates authorship in volitional creative choices despite mechanical or algorithmic processes, Feist supplies the minimal-creativity threshold, and persistent device storage satisfies the Copyright Act's fixation requirement; (ii) the digital social contract underlying grassroots platforms preserves this copyright by design -- signed content cannot be unbundled from its signature, and the full provenance chain accumulates as content is forwarded -- so that copyright ownership and physical possession of authenticated digital expressions coalesce in the person; and (iii) this coalescence of legal ownership and physical possession provides the foundations for digital sovereignty and democratic self-governance.