Autonomic & Grid Computing Group (AGC)
Research Prototypes
All the following systems are readily available at the group’s laboratories. Most have also been demonstrated at public events, such as:
- COMDEX Greece, Athens, November, 18-20, 2005.
- CHIL Technology Day, Germany, Berlin, April 27th, 2006.
- EC IST Event, Finland, Helsinki, November, 2006.
- Athens Digital Week, Greece, Athens, October 16-20, 2008.
Face Detection
The face detector used in this paper is of the boosted cascade of simple classifiers type. Its OpenCV implementation is chosen, as it is publicly available. We train a frontal detector using 9,000 positive samples (images of faces cropped from the development and evaluation sets of the CLEAR2006 dataset) and 18,000 negative samples (images with no human or animal face present), all of them scaled to 12 pixels wide and 16 high (aspect ratio of 3/4), with minimum feature size 0, 99.9% hit rate and 50% false alarm rate per cascade stage, horizontal and 45-degree tilted Haar-like features, non-symmetric faces, four splits and gentle AdaBoost learning.
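As a rough worked example of what the per-stage figures imply: if the stages are assumed to act independently, the overall hit and false alarm rates are the per-stage rates raised to the number of stages. The function name and the stage count below are hypothetical; the number of stages is not stated in the text.

```python
def cascade_rates(hit_per_stage: float, fa_per_stage: float, stages: int) -> tuple[float, float]:
    """Overall (hit rate, false alarm rate) of a cascade, assuming independent stages."""
    return hit_per_stage ** stages, fa_per_stage ** stages


# 99.9% hit rate and 50% false alarm rate per stage come from the text;
# 20 stages is a hypothetical value chosen only for illustration.
hit, fa = cascade_rates(0.999, 0.5, 20)
print(f"overall hit rate {hit:.4f}, overall false alarm rate {fa:.2e}")
```

Adding stages lowers both numbers, which is the trade-off between false detections and hit rate.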
Since the face detectors have to cope with a wide range of poses, expressions and illuminations while still providing an acceptable detection rate, they suffer from false detections. False detections can be reduced at the expense of hit rate by increasing the number of stages in the detector cascades, as shown in the Receiver Operating Characteristic (ROC) curves of the following Figure. Processing the images at larger scales allows smaller faces to be detected but also increases the false positives. Note that increasing either the number of stages or the scale of the images requires more processing time per image. We use a more efficient approach: false positives are constrained by validating the detections as actually being faces. Any successful validation scheme should produce points on the hit vs. false positive rate plane to the upper-left of the intersection of the stage and scale ROC curves. For face validation we use color and 3D position, as the latter can be estimated from a single calibrated camera.
2D Body/Face Tracking
The system tracks the bodies and/or the faces of people being monitored by a single camera. The system attempts to overcome difficulties imposed by the environment:
- Clutter generated by multiple people and complex and moving background
- Lighting
- Pose relative to the cameras
The system operates on far field recordings, with a VGA (or higher) resolution camera mounted on a corner of a room, on the ceiling, or in a surveillance setup.
Body Tracking
The goal of the body tracker is to provide the frame regions occupied by human bodies. Any subsequent face detection and tracking is performed within these body regions. The tracker is based on a dynamic foreground segmentation algorithm that utilizes adaptive background modeling with learning rates spatiotemporally controlled by the states of a Kalman filter. It comprises three modules in a feedback configuration. Adaptive background modeling based on Stauffer’s algorithm provides the pixels that are considered foreground to the evidence formation module. The latter combines the pixels into body evidence blobs, used for the measurement update stage of the Kalman filter module. The states of the Kalman filter are used to obtain an indication of the mobility of each target, as a combination of its translational motion and its size change. The position and size of the targets are also contained in the states of the Kalman filter. This information is fed back to the adaptive background modeling module to adapt the learning rate in the vicinity of the targets: frame regions that at a specific time contain a slow-moving target have smaller learning rates.
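The feedback idea can be sketched in a few lines. This is a deliberately simplified illustration, not the group's implementation: Stauffer's per-pixel Gaussian mixture is replaced by a single running average, and the function name, the box format and the `mobility` value in [0, 1] are assumptions standing in for quantities that would be derived from the Kalman states.

```python
def update_background(background, frame, targets, base_rate=0.05):
    """One update of a running-average background with per-pixel learning rates.

    background, frame: 2D lists of gray levels with the same shape.
    targets: list of (x0, y0, x1, y1, mobility) boxes; mobility is in [0, 1],
             where 0 means a stationary target (hypothetical representation).
    """
    height, width = len(frame), len(frame[0])
    # Start from the base learning rate everywhere.
    rates = [[base_rate] * width for _ in range(height)]
    for x0, y0, x1, y1, mobility in targets:
        for y in range(max(0, y0), min(height, y1 + 1)):
            for x in range(max(0, x0), min(width, x1 + 1)):
                # Slow targets get a smaller rate, so they are not learnt
                # into the background while they stand still.
                rates[y][x] = min(rates[y][x], base_rate * mobility)
    return [
        [(1 - rates[y][x]) * background[y][x] + rates[y][x] * frame[y][x]
         for x in range(width)]
        for y in range(height)
    ]
```

With `mobility` equal to 0 the background under the target is frozen, while pixels outside every target box keep adapting at the base rate.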
The proposed spatiotemporal adaptation of the learning rate of the adaptive background modeling module solves the problem Stauffer’s algorithm has when foreground objects stop moving. Without it, targets that have stopped moving are learnt into the background. With the proposed feedback configuration this process is halted long enough for the intended application, i.e. tracking people in a meeting. The block diagram of the system is shown in the following Figure:
Face Tracking
Robust tracking of multiple interacting people in indoor settings is of paramount importance for surveillance, human-machine interfaces and assisted living applications. The current approaches of extracting and modeling foreground body blobs fall short in resolving people in the camera views. We work around this problem by utilizing faces found in the body blobs. We propose a synergistic system of detectors and trackers. A robust foreground segmentation system gives body blobs, inside which detectors locate faces. The misses are filled in by trackers operating between detections. We use three different trackers (CAM-Shift, Kalman filters and particle filters). The block diagram of the 2D face tracker is shown in the following Figure:
3D Audio-Visual Tracking
Given people moving and interacting in a multi-sensor room, the system tracks them using audio/visual information. The system attempts to overcome difficulties imposed by the environment:
- Audiovisual clutter (complex and moving background, audible noises)
- Lighting
- Pose relative to the cameras and microphones
- Reverberation
The system operates on far field recordings, with VGA (or higher) resolution cameras mounted on room corners (calibrated) and microphone clusters mounted on room walls, all synchronized.
Visual Tracking
Our approach for 3D tracking is a data-driven one, utilizing the 2D face tracks obtained from multiple cameras. We address the association of the views of the face of the same person from the different cameras using a 3D space to 2D image planes mapping. The space is spanned by a 3D grid. Each point of the grid is projected onto the different image planes. Faces whose centers are close to the projected points are associated to the particular 3D point. 3D points that have more than one face associated to them are used to form possible associations of views of the face of the same person from the different cameras.
Figure - Spanning the 3D space with cubes, projecting their centers into the camera views and collecting faces near the projections for 3D association.
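The grid-based association can be sketched as follows. This is a minimal illustration under stated assumptions: each calibrated camera is represented by a 3x4 projection matrix, faces are given only by their image centers, and the function names and the pixel threshold `max_dist` are hypothetical.

```python
from math import hypot


def project(P, X):
    """Project 3D point X = (x, y, z) with a 3x4 camera matrix P; return (u, v) or None."""
    x, y, z = X
    u, v, w = (row[0] * x + row[1] * y + row[2] * z + row[3] for row in P)
    if w <= 0:
        return None  # point is behind the camera
    return u / w, v / w


def associate(grid_points, cameras, faces, max_dist=20.0):
    """Map each 3D grid point to the faces lying near its projections.

    cameras: list of 3x4 projection matrices.
    faces: faces[c] is a list of (u, v) face centers detected in camera c.
    Returns {grid_point: [(camera_index, face_index), ...]} for grid points
    that have more than one face associated with them.
    """
    result = {}
    for X in grid_points:
        hits = []
        for c, P in enumerate(cameras):
            uv = project(P, X)
            if uv is None:
                continue
            for f, (fu, fv) in enumerate(faces[c]):
                if hypot(fu - uv[0], fv - uv[1]) <= max_dist:
                    hits.append((c, f))
        if len(hits) > 1:
            result[X] = hits
    return result
```

Grid points are passed as tuples so that they can serve as dictionary keys. The returned candidates would still need the duplicate elimination and validation steps described next.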
Since the same face in a camera view cannot be a member of different valid associations, some of the associations are mutually exclusive. After eliminating duplicate associations, the remaining ones are grouped into possible sets of mutually exclusive associations and are sorted according to a weight that depends on the distance of each association from the face center and on the number of other associations that contradict it. All the mutually exclusive sets of possible associations are validated using a Kalman filter in the 3D space. For each new frame, all possible associations are compared to the 3D state established on the previous frame, penalizing solutions which fail to detect previously existing targets, or in which there are detections of new targets in the scene. While this strategy reduces the misses and false positives, it does not prevent new targets from appearing, as in the case of new people entering the room, all solution pairs will include that new target and thus will be equally penalized. The following Figure depicts typical screenshots of the 3D tracker.
Audio Tracking
Estimating the Direction Of Arrival (DOA) of acoustic signals relies on the successful estimation of the relative delay between pairs of microphone signals, a process known as Time Delay Estimation (TDE). Performance of the system then becomes a function of the effective employment of microphone arrays for the collection of data in frames, so that the current TDE estimate can be provided. When the recordings are performed in environments with strong multi-path reflections, algorithms often fail to distinguish between the true DOA and that of a dominant reflection. The problem of finding the correct relative delay between the two signals is equivalent to finding the delay that maximizes the Mutual Information (MI) between them. The MI calculation can be modified appropriately to contain enough information about the presence of reverberation and thus provide more accurate TDEs. The last step in the acoustic source localization (ASL) process is the combination of several DOA estimates in order to get the actual 3D coordinates of the speaker. This is performed by calculating the crossing points between the lines defined by the estimated DOAs and by filtering out the spurious ones.
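To make the delay-to-direction step concrete, here is a minimal sketch. Note the substitution: it picks the delay that maximizes plain cross-correlation, a simpler stand-in for the MI criterion described above, and it converts the delay to an angle with the standard far-field formula for a two-microphone pair. Function names and parameter values are hypothetical.

```python
from math import asin, degrees


def estimate_delay(x, y, max_lag):
    """Return the number of samples by which y lags x, maximizing cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(x[i] * y[i + lag] for i in range(len(x)) if 0 <= i + lag < len(y))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def doa_degrees(lag, sample_rate, mic_distance, speed_of_sound=343.0):
    """Far-field DOA in degrees from broadside for a pair of microphones."""
    s = speed_of_sound * lag / (sample_rate * mic_distance)
    # Clamp to the valid range of asin to guard against noisy delay estimates.
    return degrees(asin(max(-1.0, min(1.0, s))))
```

In a reverberant room the cross-correlation peak may correspond to a strong reflection rather than the direct path, which is the failure mode the MI-based criterion is meant to address.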
Audio/Visual Tracking
The location estimates provided by the audio and video tracking modules are recursively combined by a decentralized Kalman filter. The fusion system comprises two linear local Kalman Filters and a two-input global one. The local Kalman Filters operate on the outputs of the modules for the two standalone modalities. The estimated audio and video states are then weighted according to the trust level assigned to each modality and fed to the global Kalman Filter. The block diagram of the system is shown in the next Figure:
Audio-Visual Person Identification
Given people moving and interacting in a multi-sensor room, images of their faces and speech signal can be captured. These enable audio/visual person identification that is robust to environmental difficulties (lighting, pose relative to the cameras and microphones, reverberation) and to the availability of the necessary signals (i.e. frontal faces and speech).
The system operates on far field recordings, with VGA-resolution cameras mounted on room corners and microphone clusters mounted on room walls. In such conditions the resolution of the recordings is not very high:
1. The images of the faces have typical eye distances of less than 10 pixels
2. The least significant bits of audio are corrupted by noise
To overcome these difficulties, a fusion scheme across classifiers (multiple algorithms), sensors (multiple cameras and microphones) and time is adopted. The input of the sensors is collected over a period of time and different classifiers operate on the relevant portions, returning an estimated identity and an estimation confidence per time instant. These are fused for the given time interval to yield a single ID estimate per classifier. The performance of different options for face recognition is shown in the next Figure:
Similarly, the audio files are split into short segments (frames) and each such frame is scored against all possible speaker models, yielding a log-likelihood score and a confidence (used in the multi-modal fusion process to indicate preference toward the audio or visual component). Obviously, the speaker whose model obtains the highest score across a recording is declared the winner. By using more than one microphone for both training and evaluation of classifiers, it is possible to apply different fusion schemes across time and further improve performance when compared to a single-microphone system. The following figure shows the effect of fusing input from six different microphones that are symmetrically spaced on a linear microphone array, when compared to using each microphone individually.
The individual identities and confidences of the mono-modal classifiers involved are then fused into the single multi-modal identity.
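A minimal sketch of the two fusion levels follows. The exact fusion rules are not specified in the text, so both rules used here are assumptions: per-frame log-likelihoods are summed across time (and microphones) for each identity, and the multi-modal decision adds up the confidences of the mono-modal classifiers. Function names are hypothetical.

```python
from collections import defaultdict


def fuse_scores(frame_scores):
    """Sum per-frame log-likelihoods per identity; return (best_identity, totals).

    frame_scores: iterable of dicts mapping identity -> log-likelihood for one
    frame. Frames may come from any microphone.
    """
    totals = defaultdict(float)
    for scores in frame_scores:
        for identity, loglik in scores.items():
            totals[identity] += loglik
    best = max(totals, key=totals.get)
    return best, dict(totals)


def fuse_modalities(estimates):
    """Return the identity with the largest summed confidence.

    estimates: list of (identity, confidence) pairs, one per mono-modal classifier.
    """
    votes = defaultdict(float)
    for identity, confidence in estimates:
        votes[identity] += confidence
    return max(votes, key=votes.get)
```

For example, `fuse_modalities([("anna", 0.6), ("bob", 0.9), ("anna", 0.5)])` selects `"anna"`, because two moderately confident classifiers outweigh a single more confident one under this rule.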
Involved Classifiers
The involved classifiers are:
- Principal Components Analysis (PCA), Linear Discriminant Analysis (LDA), sub-class LDA and Gaussian modeling of intrapersonal differences for faces.
- Gaussian Mixture Modeling (GMM) of the Mel Frequency Cepstral Coefficients (MFCC) for speech, as they are pre-processed by speaker-specific PCA.
Voice Activity Detection
The objective of Voice Activity Detection (VAD) systems is to determine whether the captured audio signals contain human speech or not. The employed microphones can be either close-talking or far-field. In the first case, comparing the sound pressure level of the observed signal to an energy threshold suffices to classify the audio signal into speech and non-speech segments. In the second case, where speech might be masked by ambient background noise, more sophisticated techniques are required.
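The close-talking case can be sketched as a plain energy threshold over overlapping frames. The frame length, hop size and threshold below are hypothetical values, and samples are assumed to be normalized to the range [-1, 1].

```python
from math import log10


def frame_energy_db(frame):
    """Mean energy of a frame in dB relative to full scale."""
    energy = sum(s * s for s in frame) / len(frame)
    return 10 * log10(energy + 1e-12)  # small offset avoids log10(0)


def energy_vad(samples, frame_len=160, hop=80, threshold_db=-30.0):
    """Label overlapping frames as speech (True) or non-speech (False) by energy."""
    return [
        frame_energy_db(samples[start:start + frame_len]) > threshold_db
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]
```

Such a fixed threshold breaks down in the far-field case, where the noise floor can be comparable to the speech level.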
The developed system operates on audio signals captured from
- far-field microphones mounted on the walls
- a microphone array
It separates the audio signals into overlapping segments and classifies them as speech or non-speech accordingly.
To suppress the detrimental reverberation effects, these signals are added in order to perform a sort of spatial averaging. The inherent delay between the recorded signals is compensated prior to the spatial averaging process.
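A minimal sketch of this delay-compensated spatial averaging is given below. It assumes the inter-channel delays are already known and are non-negative whole numbers of samples; the function name is hypothetical.

```python
def delay_and_sum(signals, delays):
    """Average microphone signals after removing each channel's sample delay.

    signals: list of equally sampled channels (lists of floats).
    delays: non-negative integer delay of each channel, in samples.
    """
    length = min(len(s) - d for s, d in zip(signals, delays))
    return [
        sum(s[n + d] for s, d in zip(signals, delays)) / len(signals)
        for n in range(length)
    ]
```

After alignment the direct-path speech adds coherently, while reverberation and noise, which differ between microphones, tend to average out.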
Segmentation Techniques:
Several techniques have been applied to Voice Activity Detection:
- Linear discriminant analysis (LDA) is applied to the Mel frequency cepstra of the captured audio signals. An Energy Based Adaptive algorithm is utilized as a preprocessing step; LDA is then applied to a subset of the audio data. Training of the system is required in both cases.
- Hidden Markov Modeling (HMM) is used to model the statistical characteristics of silence and voice audio signals. A special dichotomizer has been built based on two left-right model structures (each one modeling silence or speech characteristics), the decisions of which are based on an adaptive threshold. The whole system design shows increased robustness when operating under dynamically changing environments.
Data Smoothing
To smooth out the derived estimates, and thus prevent very short segments from being characterized as speech and silent intervals within speech from being characterized as non-speech, the following techniques have been employed:
- An automaton consisting of five states: silence, speech presumption, speech, plosive or silence, possible speech continuation
- Median filtering of the decisions
- Hangover scheme
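The last two smoothing techniques can be sketched as follows. The window width and hold length are hypothetical parameters, and the five-state automaton is not reproduced here.

```python
def median_filter(decisions, width=5):
    """Smooth binary decisions with a sliding median (majority vote) of odd width."""
    half = width // 2
    out = []
    for i in range(len(decisions)):
        window = decisions[max(0, i - half): i + half + 1]
        out.append(sum(window) * 2 > len(window))
    return out


def hangover(decisions, hold=3):
    """Keep reporting speech for `hold` frames after the last speech frame."""
    out, remaining = [], 0
    for d in decisions:
        if d:
            remaining = hold
            out.append(True)
        else:
            out.append(remaining > 0)
            remaining = max(0, remaining - 1)
    return out
```

Median filtering removes isolated speech decisions, while the hangover bridges short pauses and protects weak word endings from being cut off.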
Multi-Source Localization and Separation System
For the purposes of the group’s audio research areas, microphone arrays are used to control both audio and video sub-systems. Use of a multi microphone array system can significantly improve source localization and beamforming algorithms for speech recognition (ASR) and blind source separation (BSS). This is actually performed using both the NIST Mark III prototype array (seen in images) and a series of well established recording hardware.
Prototype systems are able to track up to two simultaneous speakers with great accuracy enabling the accurate steering of the PTZ camera to the current speaker. The face recognition process can then be triggered to produce the most accurate results. The system is also able to separate recordings of 4 to 5 simultaneous speakers. This improves the fidelity of the archived waveforms while enabling speech commands (using ASR) that would otherwise be impossible to recognize.
Autonomous Distributed Agents for Context-Awareness
In the scope of the CHIL project, we are building a non-obtrusive service assisting humans in indoor activities (e.g. meetings, lectures, conferences). To this end, we have developed a multi-agent architecture for fusing sensor information, detecting situations and ultimately implementing non-intrusive service logic. The implementation is based on the JADE agent platform (FIPA Agents), IBM’s Situation Composer (Situation Modelling for Context-Awareness), and NIST SmartFlow (Distributed Transfer of Data/Sensor Streams) according to the following architecture:
The system includes autonomous service oriented agents providing room wide services including: Text-To-Speech (TTS), Access to Database (Storage Service). Moreover, a Targeted Audio Service has also been implemented.
Logical Sensors API
Our smart room comprises a rich set of sensors. In the scope of context-aware application development these sensors have to be controlled by software/middleware. As a result, we have produced sensor control middleware for all the sensors available in our smart room. Furthermore, in order to tackle heterogeneity and as a further step in abstraction, we have designed and implemented a virtual sensor API for controlling multi-vendor sensors through a uniform interface. In the case of the various cameras available in our smart room, the concept is depicted in the following figure:
Resource Management System based on Dynamic Predictions
This system is part of our Grid computing research, and demonstrates resource balancing across a distributed heterogeneous virtualized infrastructure. The system is based on predictor modules running on each of the hosts. Predictors estimate execution times for jobs submitted by clients/users, based on time series forecasting models as well as on a reference task denoting the machine’s capacity. The scheduling algorithm is fully integrated in the latest standards-compliant version of the de facto middleware for Grid infrastructures, namely the Globus toolkit (version 3.2). The system, developed in the Grid programming testbed, is illustrated in the figure.
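The prediction-driven scheduling idea can be sketched as below. The forecasting model is not specified in the text, so simple exponential smoothing is used purely as an assumed example; the function names, the smoothing factor and the data layout are hypothetical, and no Globus API is involved.

```python
def forecast(history, alpha=0.5):
    """Exponential-smoothing forecast of the next execution time on a host."""
    estimate = history[0]
    for value in history[1:]:
        estimate = alpha * value + (1 - alpha) * estimate
    return estimate


def pick_host(histories):
    """Return the host whose predicted execution time is lowest.

    histories: dict mapping host name -> list of past execution times (seconds).
    """
    return min(histories, key=lambda host: forecast(histories[host]))
```

For instance, `pick_host({"a": [10.0, 20.0], "b": [12.0, 12.0]})` returns `"b"`, since host `a` is predicted to need 15 seconds and host `b` 12 seconds.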
Mixed reality surface
A mixed reality system where users interact with real and virtual 3D objects in a 3D world controlled by a physics engine has been implemented. To build the system we propose a hand detection and tracking system that maintains information about the way multiple users move their palms and fingers in 3D space using the feed from two fixed uncalibrated cameras. The complete 3D system processes and renders more than 30 frames per second.
Multi-touch Surface
Interfaces that require users to familiarize themselves with several devices (e.g. the combination of a keyboard, mouse and computer monitor) can result in confusion and a demanding learning curve. An interactive surface can integrate such a design on the same physical device. This multi-touch surface interface has been designed within the framework of the HERMES project in order to address the usability and simplicity of human-machine interfaces when used by elders. Based on hand gestures that humans are already familiar with, the multi-touch surface enhances interaction simplicity and makes cognitive training games more appealing to elders.
The hardware is based on a modified TFT computer monitor to operate both as system input and output. Monitor layers have been separated so that we take advantage of the transparency of TFT panels when subjected to infrared (IR) illumination. An acrylic panel is placed on top of the TFT panel, the edges of which are illuminated by four IR-Light Emitting Diode (LED) arrays.
Due to the FTIR effect, a finger touch on the surface of the acrylic panel generates light blobs. The position of the blobs that manage to penetrate the TFT panel is captured by a USB camera through an Ultraviolet/Visual (UV/VIS) cut optical filter.
The interactive surface yields very good quality images of the moving fingertips, hence a simple contact-based tracker is utilized for propagating the location of the fingertips across time. According to this approach, the evidence (frames from the NIR camera) is processed to extract the objects to track. We initiate a Kalman filter per detected fingertip, hence handling each fingertip independently and avoiding the complexity of joint target tracking.
The objects are the pool of two types of contacts: those used for track initialization, and those used at the measurement update stage of the Kalman tracker(s). At every frame, the fingertip tracking system reports the IDs and the positions of all active tracks (fingertips).
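The per-frame bookkeeping of such a tracker can be sketched as follows. This is a simplification under stated assumptions: the Kalman prediction and measurement update are replaced by directly adopting the nearest contact's position, contacts are matched greedily within a hypothetical pixel gate, and tracks with no contact in a frame are dropped. The class name and parameters are hypothetical.

```python
from math import hypot


class FingertipTracker:
    """Greedy contact-to-track association with independent tracks per fingertip."""

    def __init__(self, gate=30.0):
        self.gate = gate      # max pixel distance for a contact to update a track
        self.tracks = {}      # track id -> (x, y)
        self._next_id = 1

    def step(self, contacts):
        """Process one frame of contacts [(x, y), ...]; return {track id: position}."""
        unused = list(contacts)
        updated = {}
        for track_id, (tx, ty) in self.tracks.items():
            if not unused:
                break
            nearest = min(unused, key=lambda c: hypot(c[0] - tx, c[1] - ty))
            if hypot(nearest[0] - tx, nearest[1] - ty) <= self.gate:
                # This contact is used for the track's measurement update.
                updated[track_id] = nearest
                unused.remove(nearest)
        for contact in unused:
            # Remaining contacts are used for track initialization.
            updated[self._next_id] = contact
            self._next_id += 1
        self.tracks = updated
        return dict(updated)
```

Calling `step` once per frame yields the IDs and positions of all active fingertips, and an ID stays stable as long as the fingertip keeps producing a contact within the gate.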
Integrated Development tools for RFID Application Development
A number of editing and management tools are implemented for enabling RFID consultants to easily build and deploy RFID solutions. The purpose of these tools is twofold:
- To minimize the programming and configuration effort required to implement and fully leverage an RFID solution.
- To demonstrate the programmability of the RFID middleware platform, by showing that an end-to-end RFID solution can essentially be built and deployed using these tools.
These editing tools deal with specification and configuration of middleware functionalities. The tools will be integrated in a single integrated development environment (IDE) for RFID applications.
The RFID IDE components provide the means to configure the underlying ASPIRE infrastructure. The user describes the requirements to the IDE, which offers all the configuration options and “translates” them into configuration messages that it supplies to the appropriate underlying middleware modules.
The RFID IDE is an Eclipse RCP (Rich Client Platform) application running on the Equinox OSGi server. Every tool is an Eclipse plugin/bundle that can be installed or removed as needed. This way, many editions of the IDE can be released, as simple or as complicated as the functionality required of the RFID middleware to be implemented demands.
The RFID IDE supports the tools reported below in tiles:
- Management console
- RFID readers
- Reader Core proxy
- C Server
- BEG engine
- EPCIS repository
- Connector application
- Physical Reader Configuration
- Logical Reader Configuration
- LLRP readers
- RP readers
- HAL readers
- Simulator readers
- Filtering Specifications Editor
- F&C Commands Execution
- Master Data Editor
- Business dispositions
- Business steps
- Business transactions
- Transactions type
- Business locations
- Read Points
- Connector Operations
- Workflow Management Editor