1. Introduction
A city is not a shopping mall scaled up. A mall has dozens of cameras, a single operator room, a known floor plan, and a bounded perimeter. A city has hundreds of thousands of cameras belonging to different agencies, operated by different software stacks, with no shared clock, no shared coordinate system, and no shared identity of the people moving through it. The threat that unfolds across a mall in minutes may unfold across a city in hours.
Prior work on multi-camera threat detection [15] addressed the single-site problem: how to aggregate threat signals across a bounded, topologically static network of uncalibrated cameras. That framework produced a hierarchical pipeline combining SlowFast [1] for local action recognition, VideoMAE v2 [3] for view-invariant re-identification embeddings, and a graph-transformer for trajectory-level anomaly scoring decomposed into three components: local, spatial, and temporal. The architecture was conceptually sound for its scope. But city-scale surveillance breaks every assumption that made the single-site problem tractable.
Four assumptions fail in particular. First, topology is no longer static: a camera network covering a city changes continuously as mobile units — drones, vehicle-mounted cameras, PTZ units with shifting fields of view — enter and leave the observable space. Second, processing cannot be centralised: the bandwidth required to pipe raw video from thousands of cameras to a single inference server is physically infeasible at urban scale, necessitating a distributed, federated architecture. Third, video is no longer the only modality: urban security infrastructure routinely includes acoustic sensors, license plate recognition systems, access-control logs, and cellular positioning data, all of which carry independent threat signals that a video-only architecture cannot exploit. Fourth, and distinctly, cameras do not belong to a single administrative domain: in practice, urban surveillance infrastructure spans municipal police networks, transit authorities, private operators, and federal installations, each governed by different data-sharing regulations that legally prohibit raw video from crossing jurisdictional boundaries. This fourth failure mode — administrative fragmentation — is not a technical limitation but a legal and organisational one, yet it is equally capable of rendering a city-scale architecture inoperable if unaddressed.
This paper makes four contributions. First, it formulates city-scale calibration-free threat trajectory aggregation as a distinct and unsolved problem, explicitly identifying the four failure modes above as the gap between single-site capabilities and urban-scale requirements. Second, it proposes a federated hierarchical framework that addresses each failure mode with a dedicated architectural component: a distributed edge-inference layer, a dynamic topology graph, a cross-modal fusion stage, and a cross-jurisdiction propagation mechanism. Third, it extends the anomaly decomposition of prior work [15] with a fourth component — network-level anomaly — that captures coordinated multi-actor threats whose individual trajectories may each appear innocuous. Fourth, it strengthens the theoretical basis of the network-level anomaly component by distinguishing coordinated threat movement from coincidental convergence in dense urban environments.
The paper is organised as follows. Section 2 reviews the relevant architectural building blocks and their limitations at city scale. Section 3 formalises the problem. Section 4 describes the proposed framework. Section 5 illustrates it on three urban threat scenarios. Section 6 discusses limitations and open questions.
2. Architectural Building Blocks: Capabilities and City-Scale Limits
The core video understanding components carry over from prior single-site work [15] with the same capabilities and the same fundamental limitation: all operate on short clips from a single camera and produce no inter-node representations. SlowFast R50 [1] achieves 77.0 % top-1 on Kinetics-400 and remains the practical choice for real-time per-camera action recognition. Video Swin-B [2] reaches 84.9 % and is preferable in dense scenes requiring precise interaction localisation. VideoMAE v2 ViT-g [3] achieves 90.0 % top-1 and produces view-invariant embeddings by virtue of self-supervised pre-training on masked reconstruction — the property exploited for calibration-free re-identification.
At city scale, four additional components become load-bearing. Federated learning frameworks such as FedAvg [16] demonstrate that model aggregation across geographically distributed nodes can maintain accuracy close to centralised training while eliminating the need to transmit raw data. This property is essential for urban surveillance, where data sovereignty constraints may prevent raw video from leaving the network segment of its originating agency. Federated inference — applying a trained model at edge nodes and transmitting only compressed representations to a central aggregator — is the operational counterpart to federated training.
Dynamic graph neural networks [17] extend static graph transformers to settings where node and edge sets evolve over time. In the urban surveillance context, nodes represent active camera feeds and mobile sensor units; edges represent observed or estimated transition probabilities between coverage zones. The ability to add and remove nodes without retraining the entire aggregation model is a prerequisite for integrating drone feeds and PTZ units whose coverage zones shift continuously.
Cross-modal fusion has been studied extensively in audio-visual settings [18], but the specific problem of fusing heterogeneous urban sensor streams — video, acoustic, license plate, access-control — without metric calibration between modalities has received little attention. The absence of a shared coordinate system means that traditional sensor fusion methods requiring explicit spatial alignment are inapplicable; instead, temporal co-occurrence and semantic compatibility must serve as the alignment signal.
Cross-jurisdiction data propagation is a component with no direct precedent in single-site or multi-camera surveillance literature. The requirement is to propagate threat-relevant embeddings across administrative domain boundaries without transmitting personally identifiable raw data. Federated identity propagation — transmitting anonymised appearance embeddings rather than video — is the mechanism proposed here, and it requires explicit design at both the technical and governance levels. Table 1 summarises the capabilities of the main components with respect to the city-scale threat detection task.
Table 1
Architectural components and their city-scale threat detection capabilities. Dashes indicate components not evaluated on standard benchmarks as standalone models.

| Architecture | Temporal Horizon | Top-1 K-400 | Re-ID Robustness | Multi-Camera Aggregation |
|---|---|---|---|---|
| SlowFast R50 [1] | Seconds | 77.0 % | Low | No |
| Video Swin-B [2] | Seconds–minutes | 84.9 % | Medium | No |
| VideoMAE v2 ViT-g [3] | Up to tens of seconds | 90.0 % | High (by design) | No |
| Federated Node (proposed) | Minutes–hours | — | High (via VideoMAE) | Yes, per node |
| Proposed Full Architecture | Minutes–hours | — | High | Yes, city-scale |
3. Problem Formalisation
3.1. Scope and Definitions
Let an urban surveillance network consist of N sensor nodes S₁, S₂, …, S_N, where each node is either a fixed camera, a PTZ unit, an aerial drone, or an auxiliary non-video sensor. Nodes are heterogeneous: they differ in resolution, frame rate, field of view, and modality. No two nodes share a common calibration; temporal synchronisation is approximate, with clock drift on the order of seconds between nodes belonging to different administrative domains.
Each video node produces a stream of person tracks, and each track is encoded by a local inference node into a compact multi-modal embedding: a concatenation of the action embedding from SlowFast and the appearance embedding from VideoMAE v2, as in prior work [15]. Non-video nodes produce timestamped event records — acoustic detections, license plate reads, door-entry logs — which are independently embedded by modality-specific encoders.
The central problem is: given the stream of embeddings and event records from all N nodes, construct cross-node person trajectories spanning time horizons of one minute to several hours, and assign to each trajectory an anomaly score sufficient to distinguish pre-attack surveillance, coordinated infiltration, and organised criminal activity from the normal movement patterns of an urban population.
3.2. Extended Anomaly Decomposition
Prior work [15] decomposed trajectory anomaly into three components: local anomaly A_local (unusual actions in individual clips), spatial anomaly A_spatial (atypical routing between zones), and temporal anomaly A_temporal (unusual dwell times and trajectory durations). This decomposition is preserved and extended with a fourth component.
Network-level anomaly A_network captures the statistical unusualness of the relationships between multiple trajectories observed simultaneously. The key design challenge is distinguishing coordinated movement — which constitutes a genuine threat signal — from coincidental convergence, which is a routine feature of dense urban environments. In a busy transit hub at rush hour, for example, hundreds of independent trajectories will converge on the same platform within a narrow time window without any coordination whatsoever. A naive implementation of A_network would flag such events continuously, rendering it operationally useless.
The resolution of this challenge lies in the joint distribution of spatial origin diversity and temporal synchronisation. Coincidental convergence in dense environments typically involves trajectories originating from a small number of adjacent source zones — commuters from nearby platforms, shoppers from nearby entrances — because normal crowd flow is directionally structured. Coordinated multi-actor threats, by contrast, tend to exhibit a statistically distinctive pattern: high spatial diversity of origin points combined with anomalously tight temporal synchronisation. Five individuals arriving at a restricted zone from five different districts within a four-minute window is not the signature of coincidental convergence; it is the signature of a planned rendezvous. A_network is therefore computed from two primary sub-components: an origin-diversity score D_origin, which measures the geographic spread of trajectory source zones over a configurable lookback window, and a synchronisation score S_sync, which measures the tightness of the time distribution of arrivals relative to the expected distribution under independent random-walk behaviour. Formally:
A_network = f(D_origin, S_sync, C_cluster)
where C_cluster is a cluster coherence term that penalises trajectory sets whose pairwise appearance embeddings are too similar — a signal of pre-planned uniform appearance — and f is a learned combination function. This formulation suppresses false positives in high-density environments while remaining sensitive to coordinated threats whose signatures emerge precisely from the combination of spatial diversity and temporal precision that coincidental convergence does not produce.
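To make the decomposition concrete, the sub-components could be computed as in the following sketch. The entropy-based form of D_origin, the spread-based form of S_sync, the similarity-bonus combiner standing in for the learned f, and the expected-spread constant are all assumptions for illustration, not choices specified in this paper:

```python
import numpy as np

def network_anomaly(origin_zones, arrival_times, embeddings,
                    expected_spread_s=600.0):
    """Illustrative A_network sketch: high origin diversity combined with
    tight arrival synchronisation yields a high score. All constants and
    functional forms are placeholders for the learned combiner f."""
    # D_origin: normalised entropy of the origin-zone distribution,
    # in [0, 1]; 1.0 when every trajectory starts in a distinct zone.
    zones, counts = np.unique(origin_zones, return_counts=True)
    p = counts / counts.sum()
    if len(zones) == 1:
        d_origin = 0.0
    else:
        d_origin = float(-(p * np.log(p)).sum() / np.log(len(origin_zones)))

    # S_sync: observed arrival spread relative to the spread expected
    # under independent movement; tighter than expected -> closer to 1.
    spread = float(np.std(arrival_times))
    s_sync = 1.0 / (1.0 + spread / expected_spread_s)

    # C_cluster: mean pairwise cosine similarity of appearance
    # embeddings; near-identical appearance raises the score.
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    iu = np.triu_indices(len(e), k=1)
    c_cluster = float((e @ e.T)[iu].mean())

    # Stand-in for the learned f: diversity and synchronisation are
    # required jointly; appearance coherence acts as a bonus factor.
    return d_origin * s_sync * (0.5 + 0.5 * max(c_cluster, 0.0))
```

Running this on the two archetypes of Section 3.2 separates them cleanly: five actors from five districts arriving within minutes score high, while a same-origin crowd with staggered arrivals scores at or near zero.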
The extended anomaly score over the full trajectory set Π is:
A(Π) = w₁ · A_local + w₂ · A_spatial + w₃ · A_temporal + w₄ · A_network, Σwᵢ = 1
Weights are scenario-dependent: in a high-density public event, A_network may dominate; in a restricted-access facility, A_spatial will carry the greater share.
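The composite score is then a plain convex combination of the four components. The scenario weight presets below are illustrative values reflecting the qualitative statement above, not tuned parameters:

```python
def composite_anomaly(a_local, a_spatial, a_temporal, a_network, weights):
    """Extended anomaly score A(Π) of Section 3.2: a convex combination
    of the four components with scenario-dependent weights w₁..w₄."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    components = (a_local, a_spatial, a_temporal, a_network)
    return sum(w * a for w, a in zip(weights, components))

# Illustrative presets (assumed values, not from the source):
EVENT_WEIGHTS = (0.15, 0.15, 0.20, 0.50)     # high-density public event
FACILITY_WEIGHTS = (0.20, 0.45, 0.25, 0.10)  # restricted-access facility
```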
4. Conceptual Architecture
4.1. Overview
The proposed framework is a five-level hierarchy. Levels 1 and 2 replicate the detection, tracking, and local representation components of the single-site architecture [15] and run independently at each edge node. Levels 3 through 5 are new and address the four failure modes identified in Section 1: distributed aggregation (Level 3), dynamic topology management (Level 4), and cross-modal fusion with city-level threat scoring including cross-jurisdiction propagation (Level 5).
4.2. Level 1 — Distributed Detection and Tracking
Each video node runs YOLOv8 [10] and ByteTrack [11] locally, producing per-camera person tracks as in prior work. The critical constraint at city scale is that no raw video leaves the edge node: only track bounding boxes and their timestamps are forwarded to the Level 2 encoder. This constraint satisfies both the bandwidth limitations of urban networks and the data minimisation requirements of privacy regulation, and is a prerequisite for compliance with the administrative boundary constraints discussed in Section 1.
4.3. Level 2 — Federated Local Representation
At each edge node, SlowFast R50 and a lightweight VideoMAE v2 variant (ViT-B for computational feasibility) encode each active track into an action embedding and an appearance embedding respectively. These are concatenated into a local track embedding, which is the unit of data transmitted to the central aggregator. Transmission volume scales with the number of active tracks rather than with video resolution or frame rate, making the approach bandwidth-feasible even at large node counts.
The reconstruction error of VideoMAE v2, used in prior work as a local anomaly signal, is computed locally and transmitted as a scalar alongside the embedding. This preserves the interpretability of the local anomaly component without transmitting the full reconstruction.
Federated model updates allow edge nodes to contribute to improving the shared encoder without sharing raw video. Following the FedAvg protocol [16], each node periodically transmits model weight gradients to the central server, which aggregates them and returns an updated model. This is particularly important for handling the distribution shift between different urban zones — commercial districts, transit hubs, residential areas — each of which has a distinct normal behaviour distribution.
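A single FedAvg aggregation round [16] reduces to a dataset-size-weighted average of client model parameters. The sketch below shows only the server-side aggregation step; client sampling, the local training loop, and secure transport are omitted:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """One FedAvg aggregation round: average each parameter tensor
    across clients, weighted by each client's local dataset size.

    client_weights: list of per-client parameter lists (np arrays).
    client_sizes:   list of local dataset sizes, same order.
    """
    total = sum(client_sizes)
    aggregated = [np.zeros_like(w) for w in client_weights[0]]
    for weights, n in zip(client_weights, client_sizes):
        for acc, w in zip(aggregated, weights):
            acc += (n / total) * w  # weight by data share
    return aggregated
```

In the urban setting, the per-zone distribution shift noted above is exactly what the size weighting moderates: a transit-hub node with heavy traffic contributes proportionally more to the shared encoder than a quiet residential node.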
4.4. Level 3 — Calibration-Free Cross-Node Identity Matching
Cross-node identity matching follows the soft-assignment principle of prior work [15]: for each pair of tracks from different nodes, a match probability is computed as a function of embedding similarity and temporal feasibility. At city scale, the temporal feasibility constraint is replaced by a travel-time distribution derived from historical transition data between coverage zones — a function of both geographic distance and time of day, since urban travel times vary significantly with congestion.
A critical extension at city scale is the integration of auxiliary modality signals into the matching decision. A license plate read at a parking structure node and a person detection at the adjacent building entrance, co-occurring within a feasible time window, constitute corroborating evidence for a shared identity even if their video embeddings are not directly comparable. The cross-modal fusion mechanism described in Level 5 feeds back into identity matching at this level, creating a bidirectional dependency between the two.
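The soft-assignment rule of Level 3 can be sketched as a product of appearance similarity and temporal feasibility. The Gaussian feasibility model and its parameters are assumptions standing in for the historical, time-of-day-dependent travel-time distribution described above:

```python
import math

def match_probability(emb_sim, dt_s, travel_mean_s, travel_std_s):
    """Sketch of the Level 3 soft assignment for a pair of tracks from
    different nodes. emb_sim is cosine similarity of their embeddings;
    dt_s is the observed time gap; (travel_mean_s, travel_std_s)
    parameterise the historical travel-time distribution for the zone
    pair, which in the full framework varies with time of day."""
    z = (dt_s - travel_mean_s) / travel_std_s
    feasibility = math.exp(-0.5 * z * z)  # unnormalised Gaussian likelihood
    return max(0.0, emb_sim) * feasibility
```

A gap close to the historical travel time leaves the appearance similarity essentially untouched, while a physically implausible gap (e.g. arriving far faster than any observed transit) drives the match probability toward zero regardless of how similar the embeddings look.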
4.5. Level 4 — Dynamic Topology Graph
The static topology graph of single-site surveillance — a fixed adjacency matrix representing typical transitions between camera zones — is insufficient at city scale for two reasons. First, mobile nodes such as drones and PTZ units continuously change their coverage zones, altering the graph structure in real time. Second, scheduled events — public gatherings, transportation disruptions, large-scale operations — predictably alter normal transition patterns, making historical statistics temporarily invalid.
The dynamic topology graph maintains two layers. The physical layer encodes geographic proximity and known connectivity between node coverage zones; this layer changes only when infrastructure changes. The statistical layer encodes current transition probabilities estimated from a rolling window of recent observations; this layer updates continuously and is weighted to down-weight historical data during detected anomalous periods, preventing the spatial anomaly component from being calibrated against already-anomalous baselines.
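A minimal sketch of the statistical layer, assuming a fixed-size rolling window and a scalar down-weighting factor for observations made during flagged anomalous periods; both parameters are illustrative:

```python
from collections import defaultdict, deque

class StatisticalLayer:
    """Rolling-window transition-probability estimates for the
    statistical layer of the dynamic topology graph. Observations made
    during flagged anomalous periods are down-weighted so the baseline
    is not calibrated against already-anomalous traffic."""

    def __init__(self, window=10_000, anomaly_weight=0.1):
        self.window = deque(maxlen=window)  # (src, dst, weight) triples
        self.anomaly_weight = anomaly_weight

    def observe(self, src, dst, anomalous_period=False):
        w = self.anomaly_weight if anomalous_period else 1.0
        self.window.append((src, dst, w))

    def transition_prob(self, src, dst):
        outgoing = defaultdict(float)
        for s, d, w in self.window:
            if s == src:
                outgoing[d] += w
        total = sum(outgoing.values())
        return outgoing[dst] / total if total else 0.0
```

The `deque(maxlen=…)` gives the rolling window for free: the oldest observation is evicted automatically, so the estimate tracks current traffic rather than accumulating stale history.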
Dynamic graph neural networks [17] operating over this two-layer structure can propagate threat signals spatially: a detection at one node raises the anomaly prior for adjacent nodes, enabling the system to anticipate trajectory continuations rather than merely react to them.
4.6. Level 5 — Cross-Modal Fusion, Cross-Jurisdiction Propagation, and City-Level Threat Scoring
The top level of the framework aggregates all signals — video track embeddings from Level 2, identity match probabilities from Level 3, graph-propagated anomaly priors from Level 4, and event records from auxiliary sensors — into a unified threat assessment. It additionally manages the propagation of threat context across administrative domain boundaries.
Cross-modal alignment without metric calibration is achieved through temporal and semantic co-occurrence. Two events from different modalities are treated as potentially co-referential if they occur within a configurable time window and are associated with the same coverage zone. A transformer-based fusion encoder [18] takes the concatenated modality-specific embeddings of co-referential events and produces a joint representation. This architecture is preferred over simple late fusion — which treats modality outputs as independent inputs to a final classifier — because it allows inter-modal attention: the transformer can learn that an acoustic detection and a video detection at the same location are mutually reinforcing evidence, rather than treating them as additively independent. It is also preferred over attention bottlenecks [18] in their strict form because the absence of metric calibration between modalities means bottleneck dimensionality cannot be set by a principled alignment criterion; the full cross-attention formulation is more robust to this uncertainty. The attention mechanism naturally weights modalities by their informativeness: in a low-visibility environment, acoustic embeddings will receive higher attention weights than video embeddings of comparable or lower quality.
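The co-occurrence rule that replaces metric calibration can be sketched as simple windowed grouping over timestamped events. The event schema and the 30-second window are assumptions for illustration:

```python
def co_referential_groups(events, window_s=30.0):
    """Group events from heterogeneous modalities as potentially
    co-referential when they share a coverage zone and fall within a
    configurable time window. Each event is (timestamp_s, zone,
    modality). Groups of size >= 2 are candidates for the fusion
    encoder; singletons pass through unfused."""
    events = sorted(events, key=lambda e: (e[1], e[0]))  # by zone, then time
    groups, current = [], []
    for ev in events:
        same_zone = current and ev[1] == current[-1][1]
        in_window = current and ev[0] - current[-1][0] <= window_s
        if same_zone and in_window:
            current.append(ev)
        else:
            if len(current) > 1:
                groups.append(current)
            current = [ev]
    if len(current) > 1:
        groups.append(current)
    return groups
```

Only these candidate groups are passed to the transformer fusion encoder; everything else is handled unimodally, which keeps the quadratic cost of cross-attention confined to events that plausibly describe the same incident.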
The network-level anomaly component A_network is computed at this level over the full set of active trajectory graphs. A graph-transformer with global attention tokens — following the Longformer architecture [12] applied to graph inputs — computes pairwise trajectory correlations and scores joint movement patterns against the baseline described in Section 3.2. The computational complexity of full pairwise attention is managed by restricting cross-trajectory attention to trajectories sharing at least one coverage zone within the recent time window.
Cross-jurisdiction threat propagation operates as follows. When a trajectory in administrative domain A receives an elevated anomaly score, the corresponding VideoMAE v2 appearance embedding — not the underlying video — is transmitted to the central aggregator with a threat-level tag. When a track in domain B subsequently produces an embedding that exceeds the match threshold against the stored threat embedding, the prior anomaly score from domain A is incorporated into the ongoing trajectory score in domain B. No raw video crosses the administrative boundary at any point. The mechanism is the technical implementation of the fourth architectural component identified in Section 1, and its privacy-preserving properties are a design requirement rather than an incidental feature.
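A minimal sketch of the propagation mechanism, assuming cosine similarity over L2-normalised appearance embeddings and an illustrative match threshold; the registry API names are hypothetical:

```python
import numpy as np

class ThreatEmbeddingRegistry:
    """Sketch of cross-jurisdiction propagation: a domain deposits an
    anonymised appearance embedding with its prior anomaly score —
    never raw video — and other domains query new track embeddings
    against the store. The 0.85 cosine threshold is illustrative."""

    def __init__(self, match_threshold=0.85):
        self.match_threshold = match_threshold
        self.entries = []  # (unit embedding, prior_score, source_domain)

    def deposit(self, embedding, prior_score, source_domain):
        e = np.asarray(embedding, dtype=float)
        self.entries.append((e / np.linalg.norm(e), prior_score, source_domain))

    def query(self, embedding):
        """Return the prior anomaly score to fold into the ongoing
        trajectory score, or 0.0 when nothing matches."""
        e = np.asarray(embedding, dtype=float)
        e = e / np.linalg.norm(e)
        best = 0.0
        for stored, score, _ in self.entries:
            if float(stored @ e) >= self.match_threshold:
                best = max(best, score)
        return best
```

Note that the registry stores nothing reversible to imagery: only fixed-length embeddings, a score, and a provenance tag, which is what makes the mechanism compatible with the data-sharing restrictions described in Section 1.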
The final output is the extended anomaly score A(Π) computed per trajectory and per trajectory cluster, with dominant component attribution enabling interpretable operator alerts. The city-level dashboard presents active high-anomaly trajectories on a map overlay of the dynamic topology graph, with colour coding by dominant anomaly type.
5. Illustrative Urban Scenarios
Note: the following scenarios are conceptual illustrations of the expected behaviour of the proposed framework, not results of experiments or simulations. Their purpose is to demonstrate which anomaly components would be activated in each case and why single-site analysis would be insufficient.
5.1. Scenario 1: Pre-Attack Urban Reconnaissance
An individual visits a transit hub three times over two hours, each time spending several minutes observing staff positions and access points before departing. Between visits, the same individual is detected by a license plate reader at a nearby parking structure and briefly appears on a drone feed covering the street approach. No single camera observes more than a three-minute fragment of this activity.
The temporal anomaly component accumulates across the three visits: the total dwell time in the hub far exceeds the statistical norm for the zone. The spatial anomaly component activates when the trajectory reconstructed from the three video fragments and the license plate record forms a circular pattern around the hub — a route type associated with low probability in the statistical topology layer. The cross-modal fusion at Level 5 increases the identity match confidence for the three video fragments by using the license plate read as a temporal anchor, despite the absence of direct camera-to-camera handoffs. A_network remains near baseline in this scenario since only a single actor is involved; the elevated composite score is driven primarily by A_temporal and A_spatial.
5.2. Scenario 2: Coordinated Multi-Actor Infiltration
Five individuals enter a government district through five different access points over a thirty-minute window. Each individual's trajectory is individually unremarkable: the spatial routes are all within normal parameters, and no action recognition system flags unusual behaviour. After forty minutes, all five are detected within a single restricted zone.
Individual anomaly scores remain below alert threshold throughout. The network-level anomaly component is the decisive signal here. The five trajectories exhibit high spatial origin diversity — five different district entry points — combined with a tight arrival time distribution at the restricted zone that falls well below the expected variance under independent random-walk behaviour. The D_origin sub-component scores high because no normal crowd dynamic produces simultaneous convergence from five geographically separated origins onto a single low-traffic node. The S_sync sub-component scores high because the arrival times cluster within four minutes despite origin distances that would, under normal transit patterns, produce a much wider spread. This joint pattern is the signature of coordinated movement, not coincidental convergence. The dynamic topology graph propagates elevated anomaly priors to the restricted zone from the moment the fourth individual's trajectory is linked to the emerging cluster pattern, enabling a pre-convergence alert rather than a post-incident detection.
5.3. Scenario 3: Cross-Jurisdiction Threat Trajectory
A person of interest is observed in an aggressive confrontation on a camera belonging to municipal authority A. The individual then moves into a district covered by cameras belonging to municipal authority B, where no prior context is available. Standard single-site systems operating within authority B's network have no basis for elevated alerting. This is precisely the fourth failure mode identified in Section 1: the administrative boundary between the two authorities prevents context from propagating through conventional means.
The federated architecture propagates the threat embedding of the identified individual across the administrative boundary via the central aggregator, without transmitting raw video between jurisdictions. When the VideoMAE v2 appearance embedding of a track in authority B's network exceeds the match threshold against the stored threat embedding, the prior local anomaly score from authority A's observation is incorporated into the ongoing trajectory score. The cross-jurisdiction handoff is seamless from the operator's perspective while remaining compliant with data-sharing restrictions between administrative domains.
6. Limitations and Open Questions
6.1. Conceptual Status of the Framework
As with the prior single-site work [15], this paper presents an architectural proposal without experimental validation. Hyperparameter choices — fusion window sizes, match probability thresholds, A_network baseline parameters, federated aggregation frequency — are unspecified and will require systematic empirical investigation.
6.2. Absence of City-Scale Multi-Modal Threat Trajectory Benchmarks
No public dataset provides annotated multi-camera threat trajectories at urban scale across heterogeneous sensor modalities. The construction of such a dataset — whether from synthetic city simulators, privacy-anonymised real data, or a combination — is a prerequisite for empirical evaluation and represents a significant community effort in its own right.
6.3. Scalability of the Dynamic Topology Graph
The proposed two-layer graph must update continuously as mobile nodes enter and leave the network. At N = 10,000 nodes — a plausible figure for a medium-sized city — the computational cost of graph neural network inference over the full topology requires careful engineering. Hierarchical graph partitioning, where city districts form super-nodes with internal topology abstracted, is one candidate solution; its impact on anomaly detection accuracy is unknown.
6.4. Re-Identification Across Extreme Appearance Change
The concern raised in prior work [15] — that clothing change defeats appearance-based Re-ID — is amplified at city scale, where trajectories span hours rather than minutes. Biomechanical embeddings based on skeletal pose estimation [13, 14] remain the most promising mitigation, but their effectiveness over multi-hour gaps with significant variation in walking surface, clothing weight, and physical exertion has not been demonstrated.
6.5. Cross-Modal Fusion Reliability
The temporal co-occurrence alignment proposed for heterogeneous sensor fusion is sensitive to clock drift between administrative domains. A GPS-anchored time synchronisation protocol is assumed but not specified. In its absence, a tolerance model that degrades match confidence gracefully with increasing temporal uncertainty would be required.
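Such a tolerance model might, for instance, decay match confidence smoothly with estimated clock drift rather than failing hard at a cutoff; the exponential form and the tolerance constant below are assumptions:

```python
import math

def degraded_confidence(base_confidence, drift_s, tolerance_s=5.0):
    """Graceful-degradation sketch for Section 6.5: cross-modal match
    confidence decays exponentially as the estimated clock drift
    between administrative domains grows. tolerance_s is the drift at
    which confidence falls to 1/e of its base value (assumed)."""
    return base_confidence * math.exp(-max(drift_s, 0.0) / tolerance_s)
```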
6.6. Ethical and Legal Dimensions at Government Scale
City-scale surveillance using AI trajectory aggregation raises concerns that exceed those of single-site deployment. Tracking individuals across an entire city over hours approaches the capability of mass surveillance, with well-documented potential for misuse [19]. Architectural privacy controls — on-device processing of raw video, transmission of embeddings rather than biometric data, mandatory retention limits on trajectory records — are not optional features but design requirements. The cross-jurisdiction propagation mechanism specifically requires bilateral data governance agreements specifying what constitutes permissible transmitted information, retention limits on threat embeddings, and oversight mechanisms for cross-domain matching. More broadly, explicit data governance policies must be defined at each level of the hierarchy before any real-world deployment could be considered responsible.
7. Conclusion
City-scale calibration-free threat trajectory aggregation is a problem of qualitatively greater difficulty than its single-site counterpart. The transition from a bounded camera network to a heterogeneous urban sensor ecosystem invalidates four core assumptions of single-site architectures: static topology, centralised processing, video as the sole modality, and membership within a single administrative domain. This paper has identified these failure modes, proposed a five-level federated framework to address them, and extended the three-component anomaly decomposition of prior work with a fourth network-level component for coordinated multi-actor threat detection.
The primary conceptual contributions are the dynamic topology graph, which replaces static camera adjacency maps with a live two-layer structure capable of incorporating mobile sensing units; the federated inference architecture, which enables city-scale deployment within realistic bandwidth and data-sovereignty constraints; the cross-modal fusion mechanism, which integrates video, acoustic, license plate, and access-control signals without requiring metric inter-modal calibration through a transformer-based architecture preferred over late fusion and strict attention bottlenecks for its robustness to unaligned modalities; the cross-jurisdiction propagation mechanism, which enables threat context to traverse administrative boundaries via embedding transmission rather than raw data sharing; and the strengthened formulation of A_network, which distinguishes coordinated multi-actor threats from coincidental convergence by jointly measuring spatial origin diversity and temporal synchronisation tightness.
Three urban threat scenarios illustrate that the framework would, conceptually, detect threats invisible to single-site analysis: multi-hour reconnaissance trajectories assembled from cross-jurisdictional fragments, coordinated infiltrations detectable only through network-level anomaly scoring, and cross-domain handoffs enabled by federated embedding propagation. Experimental validation of these claims, construction of a city-scale multi-modal threat trajectory benchmark, and the development of appropriate data governance frameworks are identified as the primary directions for future work.
References:
- Feichtenhofer C., Fan H., Malik J., He K. SlowFast networks for video recognition // ICCV. — 2019. — P. 6202–6211.
- Liu Z., Ning J., Cao Y. et al. Video Swin Transformer // CVPR. — 2022. — P. 3202–3211.
- Wang L., Huang B., Zhao Z. et al. VideoMAE V2: Scaling video masked autoencoders with dual masking // CVPR. — 2023. — P. 14549–14560.
- Wojke N., Bewley A., Paulus D. Simple online and realtime tracking with a deep association metric // ICIP. — 2017. — P. 3645–3649.
- Ye M., Shen J., Lin G. et al. Deep learning for person re-identification: A survey and outlook // IEEE TPAMI. — 2022. — Vol. 44(6). — P. 2872–2893.
- He L., Wang Y., Liu W. et al. Foreground-aware Pyramid Reconstruction for alignment-free occluded person re-identification // ICCV. — 2019. — P. 8450–8459.
- Ristani E., Solera F., Zou R. et al. Performance measures and a data set for multi-target, multi-camera tracking // ECCV. — 2016. — P. 17–35.
- Sultani W., Chen C., Shah M. Real-world anomaly detection in surveillance videos // CVPR. — 2018. — P. 6479–6488.
- Tian Y., Pang G., Chen Y. et al. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning // ICCV. — 2021. — P. 4975–4984.
- Jocher G. et al. Ultralytics YOLOv8. — 2023. — URL: https://github.com/ultralytics/ultralytics.
- Zhang Y., Sun P., Jiang Y. et al. ByteTrack: Multi-object tracking by associating every detection box // ECCV. — 2022. — P. 1–21.
- Beltagy I., Peters M. E., Cohan A. Longformer: The long-document transformer // arXiv:2004.05150. — 2020.
- Xu Y., Zhang J., Zhang Q., Tao D. ViTPose: Simple vision transformer baselines for human pose estimation // NeurIPS. — 2022. — Vol. 35.
- Starodubtsev I. S. Models, algorithms and software complex for building natural human-computer interaction based on gestures: PhD thesis. — 2024.
- Adda-Abbou A.-R. Calibration-free multi-camera fusion for threat detection in video surveillance systems: a deep-learning-based conceptual architecture // Молодой ученый. — 2026. — No. 14 (617). — P. 3–8. — URL: https://moluch.ru/archive/617/134970.
- McMahan B., Moore E., Ramage D. et al. Communication-efficient learning of deep networks from decentralized data // AISTATS. — 2017. — P. 1273–1282.
- Kazemi S. M., Goel R., Jain K. et al. Representation learning for dynamic graphs: A survey // JMLR. — 2020. — Vol. 21(70). — P. 1–73.
- Nagrani A., Yang S., Arnab A. et al. Attention bottlenecks for multimodal fusion // NeurIPS. — 2021. — Vol. 34.
- Zuboff S. The Age of Surveillance Capitalism. — New York: PublicAffairs, 2019.

