Skip to content.
Sections

 

1.  Introduction

Content processing of speech, non-speech audio and video data is one of the central issues of recent research in information management. During the last years new methods for the classification of text, audio, video and voice information have been developed, but Multimodal analysis and retrieval algorithms especially towards exploiting the synergy between the various media is still considered as one of the major challenges of future research in multimedia information retrieval [ LSDJ06 ]. The combination of features from different modalities should lead to an improvement of results. We present an approach to supervised multimedia classification that allows to benefit from the joint exploitation of speech, video and non-speech audio.

We use low-level features such as color correlograms, spectral flatness, and syllable sequences for the integrated classification of audio-visual documents. The novelty of our approach is to process non-speech information in such a way that it can be represented jointly with linguistic information in a generalised term-frequency vector. This allows for subsequent processing by usual text-mining techniques including text classification, semantic spaces, and topic-maps.

Support Vector Machines (SVM) have been applied successfully to text classification tasks [ Joa98, DPHS98, DWV99, LK02 ]. We adapt common SVM text classification techniques to audio-visual documents which contain speech, video, and non-speech audio data. To represent these documents we apply the bag-of-words approach which is common to text classification. We generate word analogues for the three modalities: sequences of phonemes or syllables for speech, “video-words” based on low level color features for video, and “audio-words” based on low-level spectral features for general audio.

We assume that there is a hidden code of audio-visual communication. This code cannot be made explicit, it consists of a tacit knowledge that is shared and used by the individuals of a communicating society. Furthermore we assume that, for the purpose of subsequent classification, the unknown hidden code can be substituted by an arbitrary partition of the feature space. Our approach is inspired by the fenone recognition technique which is an alternative to standard speech recognition for classification purposes. Fenone recognition has been done successfully by Harbeck [ Har01 ] for the speech domain. Instead of using a standard speech recognizer, which recognises phonemes — i.e. areas in the feature space that are defined by linguistic tradition — a cluster analysis is performed, which segments the feature space in a data driven fashion. The recognized fenones serve as analogues to phonemes and are forwarded to a subsequent classification procedure. The advantage of fenones over phonemes is that they can be calculated even if there is no a priory knowledge of the language and consequently of the code under consideration. Their disadvantage is that they do not accommodate human interpretation.

Thus as proposed by [ Leo02 ] each element of a partition,i.e. a disjoint segmentation, of the feature space can be considered as a unit — sign — of the audio-visual code whereas the partition itself is the respective vocabulary of audio or video-signs. The visual vocabularies that we create are not sets of elementary symbols, but abstract subsets of the feature space, which is defined by the attribute values of visual low-level features. They have nothing in common with visual vocabularies that have been used elsewhere [ FGJ95 ] for visual programming purposes. Instead they resemble to what is called a mood chart in the area of visual design [ RPW06 ].

The reason for using an automatically generated vocabulary inducing a set of artificial (and therefore inexplicable) concepts is that detecting a wider set of concepts other than human faces in images or video scenes turned out to be fairly difficult. Lew [ Lew00 ] showed a system for detecting sky, trees, mountains, grass, and faces in images with complex backgrounds. Fan, et al. [ FGL04 ] used multi-level annotation of natural scenes using dominant image components and semantic concepts. Li and Wang [ LW03 ] used a statistical modeling approach in order to convert images into keywords. Rautiainen, et al. [ RSP03 ] used temporal gradients and audio analysis in video to detect semantic concepts.

The non-speech audio vocabularies that we create are not a set of elementary motifs of harmonic stereotypes, but abstract subsets of the feature space, which is defined by the attribute values of acoustic low-level features. Our corpus of audio-visual documents consists of news recordings which contain very few segments that could be called musical in a narrower sense. Therefore automatic music transcription as for example in [ CLL99 ] is not appropriate for our material.

One further reason for the application of non-speech audio and video signs is that we want to take into account the entire document while preserving the essential information present in small temporal segments. Thus mapping short time intervals to video or audio signs as prototypical representations seems to be a promising approach for the representation of audio-visual scenes. Finally the application of audio and video signs is an integrated and successful approach to the fusion of speech, video and non-speech audio. We thus present a theoretical framework for the combination of audio and video information. This is currently considered as a challenging task [ LSDJ06 ].

In the next chapter we describe the corpus of audio-visual documents that was utilized for content classification. Feature extraction and the representation of audio-visual documents in the classifier′s input space is described in section 3. In section 4 we specify the classifier and its parameters. Results are presented in section 5 and in section 6 we conclude.

Table 1.  Size of IPTC-classes in terms of number of documents. Only those classes, which contain more than 45 documents (left column) were considered.

Categories and number of documents

politics

200

human interest

40

justice

120

disaster

38

advertisement

119

culture

22

sports

91

jingle

22

conflicts

85

health

19

economy

68

environmental issue

17

labour

49

leisure

15

science

13

education

10

weather

8

social issue

6

religion

4

2.  The Audio-Visual Corpus

Figure 1. Rank distribution of document frequency of video and audio signs. The left panel shows the document frequency of the visual signs generated from the thee low level features described in section 3.2 . The right panel shows the document frequency for two audio features described in section 3.3 . Sizes of the visual and acoustic vocabulary are 400 and 200 signs respectively.

Rank distribution of document frequency of video and audio signs. The left panel shows the document frequency of the visual signs generated from the thee low level features described in section 3.2 . The right panel shows the document frequency for two audio features described in section 3.3 . Sizes of the visual and acoustic vocabulary are 400 and 200 signs respectively. Rank distribution of document frequency of video and audio signs. The left panel shows the document frequency of the visual signs generated from the thee low level features described in section 3.2 . The right panel shows the document frequency for two audio features described in section 3.3 . Sizes of the visual and acoustic vocabulary are 400 and 200 signs respectively.

The data for the audio-visual corpus was obtained from two different German news broadcast stations: N24 and n-tv. The audio-visual stream was segmented manually into news items. This resulted in a corpus that consists of 693 audio-visual documents. Document length ranges between 30 sec. and 3 minutes. The semantic labeling of the news stories was done manually according to the categorization scheme of the International Press Telecommunications Council (IPTC) (see http://www.iptc.org). The material from N24 consists of 353 audio-visual documents and covers the period between May 15 and June 13, 2002 (including reports from the World Cup soccer tournament in Korea and Japan. This event can be considered as semantically unique. It does not appear in the training corpus for generating the audio-visual “vocabularies”, which were obtained from tv recordings of October 2002). The data from n-tv comprises 340 documents and covers the last seven days of April 2002. Table 1 shows the distribution of topic classes in the corpus. For convenience we added two classes “advertisement” and “jingle” to the 17 top level classes of the IPTC-categorization. The number of documents in the classes total more than 693. Some documents were attributed to two or three classes because of the ambiguity of their content. For example, audio-visual documents on the Israel-Palestine conflict often were categorized as belonging to both “politics” and “conflicts”.

The size of the classes in the audio-visual corpus varies considerably: “politics” comprises 200 audio-visual documents whereas “religion” contains just 4. We only used those seven categories with more than 45 documents (shown in the left column of table 1 ) for classification experiments. As will be described in section 4 we trained a separate binary classifier for each of these seven classes. All documents of the small categories (less than 45 documents) were always put in the set of counter-examples.

In Figure 1 the document frequency, i.e. the number of documents in which a given sign occurs, is calculated for different audio and video signs. Document frequency is a term weight that is commonly used in text mining applications in order to quantify how good a given term serves as an indicator of a document′s content. A term that occurs in all documents of a corpus is not considered as a useful indicator, whereas a term that occurs in few classes is supposed to be specific to the document′s content. It can be seen that most of the visual signs have a medium document frequency, which makes them useful for content classification. The non-speech audio signs show a less favourable pattern, rendering it inferior for content classification: most of the audio signs occur in more than a quarter of the documents. Some audio signs occur in nearly every document. As mentioned before the whole corpus comprises 693 documents.

In the same fashion as a speech recognizer has to be trained in order to acquire a model of the speech to be recognized, the vocabularies for video and non-speech audio have to be generated from a training corpus. In order to obtain significant results the training data has to be different from the data to be classified. Therefore the audio-visual corpus described above was not used for the generation of the visual and non-speech vocabulary, with the exception of some control experiments marked as “corpus” that are presented in figure 3 and 4 .

3.  Feature Extraction

The feature extraction procedures for each of the three modalities speech, video and non-speech audio are independent from each other. As in [ PLL02 ] syllables are stringed together to form terms, that do not necessarily correspond to linguistic word boundaries. Video signs and non-speech audio signs are defined as subsets of the respective feature space. Sequences of video signs are referred to as video words, and sequences of non-speech audio signs are called audio words.

3.1.  Speech

The automatic speech recognition system (ASR) was built using the ISIP (Institute for Signal and Information Processing) public domain speech recognition tool kit from Mississippi State University. We implemented a standard ASR system based on a Hidden Markov Model. The ASR used cross-word tri-phone models trained on seven hours of data recorded from radio documentary programs. These included both commentator speech and spontaneous speech in interviews, and were thus similar to speech occurring in TV news of our corpus.

The audio track of each audio-visual document was speaker segmented using the BIC algorithm [ TG99 ]. Breaks in the speech flow were located with a silence detector and the segments were cut at these points in order to insure that no segment be longer than 20 seconds. However, the acoustic signal was not separated into speech and non-speech segments. Therefore the output of the speech recognizer also consists of nonsense syllable sequences generated by the recognizer during music and other non-speech audio.

Table 2.  Number of syllable-n-grams in the audio-visual corpus. The number of syllable-n-grams in the running text (tokens) as well as of different syllable-n-grams (types) are displayed.

Number of types (2nd line) and tokens (last line) of syllable n-grams

n = 1

n = 2

n = 3

n = 4

n = 5

n = 6

3766

71505

153789

177478

182492

183707

189894

189212

188532

187853

187175

186497

The language model was trained on texts which were decomposed into syllables using the syllabification module of the BOSSII speech synthesis system [ SWH00 ]. Exploratory investigation allowed us to determine that 5000 syllables give good recognition performance. The syllable language model was a syllable tri-gram model and was trained on 64 million words from the German dpa newswire. The advantage of using a syllable-based language model instead of a word-based model is that words can be generated from syllables on-the-fly which leads to reduction of vocabulary size, less domain dependency and therefore less out-of-vocabulary errors. A syllable base language model is especially useful when the ASR is applied to a language which is highly productive at the morpho-syntactic level, like German in our case [ LEP02 ].

From the recognized syllables, n-grams (1 ≤ n ≤ 6) were constructed in order to reach a level of semantic specificity comparable to that of words. The use of n-grams also makes it possible to adjust the linguistic units appropriate to the trade-off between semantic specificity and low probability of occurrence, which is especially important when document classes are small. From previous experiments on textual data we have observed that unit size is among the most important determinants of the classification accuracy of support vector machines [ PLL02 ].

As the number of syllable n-grams in the audio-visual documents is large, a statistical test is used to eliminate unimportant ones. First it is required that each term must occur at least twice in the corpus. In addition, the hypothesis that there is a statistical relation between the document class under consideration and the occurrence of a term is investigated by a χ²-statistic. A term is rejected when its χ² statistic is below a threshold θ. The values of θ used in the experiments are θ = 0.1 and θ = 1.

3.2.  Video

In the same way as a speech recognizer has to be trained in order to learn a vocabulary of phonemes (acoustic model) and how they are combined to form syllables or words (language model), a visual vocabulary has to be learned from training data. The generation of such a visual vocabulary was done on a training corpus of video data from recorded TV news broadcasts. This training corpus is different from the corpus described in the preceding section. It contains 11 hours of video sampled in October 2002.

First the video data was split into individual frames. To reduce the huge amount of frames, only one frame per second of video material was selected. This could be done because the similarity between neighbouring frames is usually high. In this manner the frame count was reduced by approximately a factor 3. After that, the three low-level features were extracted from the reduced set of frames and the buoy generation method [ Vol02 ] was applied to their respective feature spaces, generating a disjoint segmentation. In this way sets of prototypical images were obtained for each of the three features. These sets are the visual vocabularies and their elements are the video signs. Visual vocabularies of the size of 100, 200, 400 and 800 video signs were created.

Table 3.  The number of running words (tokens) and of different video words (types) is displayed for a vocabulary size of 400 video signs based on different features, A: moments of 29 colors, B: correlogram of 9 colors, C: wavelet of 9 colors

Number video sign n-grams in the AV-corpus

n = 1

n = 2

n = 3

n = 4

n = 5

n = 6

A

types

363

5305

6947

6749

6380

5967

tokens

9060

8378

7696

7128

6617

6128

B

types

363

5495

7005

6793

6421

6005

tokens

9060

8378

7696

7128

6617

6128

C

types

298

5050

6909

6789

6416

5997

tokens

9060

8378

7696

7128

6617

6128

To represent the video scenes of the audio-visual corpus the video stream is first segmented into coherent units (shots). Then for each shot a representative image is selected, which is called the key frame of the shot. The segmentation is done by algorithms monitoring the change of image over time. Two adjacent frames are compared and their difference is calculated. The differences are summed, and when the sum exceeds a given threshold a shot-boundary is detected, and the key frame of the shot is calculated. For each shot there are three low-level features, which are extracted from its key frame: first and second moments of 29 colors, a correlogram calculated on the basis of 9 major colors, and a wavelet that was also based on 9 major colors. These features where chosen because they combine aspects of color and texture. Each of the three visual features is mapped to the nearest video-sign in the respective visual vocabulary. From the video signs, n-grams (1 ≤ n ≤ 6) were generated by stringing video signs together according to their sequence in the audio-visual corpus. These n-grams are also referred to as “video words”.

As mentioned above, the results presented in the paper were obtained by using visual vocabularies that were generated from a training corpus of 11 hours of video in October 2002. Note that the audio-visual corpus to be classified was sampled three to six months before the training corpus, in the period between April and June 2002 (see section 2 ). In order to get an insight into the temporal variation of the visual vocabulary of the communicating society, we generated two additional visual vocabularies. One was drawn from the test corpus itself (before October 2002). The other was created from January 2003 (three months after October 2002). Interestingly, the comparison of results based on the different vocabularies shows little difference. (see table 4 ).

3.3.  Non-Speech Audio

The low-level audio features that we used were audio spectrum flatness and audio spectrum envelope as described in MPEG-7-Audio. Audio spectrum flatness was measured for 16 frequency bands ranging from 250 Hz to 4 kHz for every audio frame of 30 msec. The audio spectrum envelope was calculated for 16 frequency bands ranging from 250 Hz to 4 kHz plus additional bands for the low-frequency (below 250 Hz) and high-frequency (above 4 kHz) signals.

Exploratory investigation allowed us to conclude that sensible sizes of the acoustic vocabulary vary between 50 and 200 audio signs. Mean and variance were calculated for the features of 4, 8 and 16 consecutive audio frames. We suspect that units of 16 audio frames (=480msec) enable us to capture (non-linguistic) meaning-related properties of the audio signal. This duration corresponds to a quarter-note in alegretto tempo (mm = 120). Shorter units of 8 audioframes (240msec) correspond to the typical length of a syllable — in conversational English nearly 80% of the syllables have a duration of 250 msec. or less [ WKMG98 ] — as well as to the duration of the echoic memory, which can store 180 to 200 msec [ Hug75 ]. Units of the length of 4 audio features were also considered. They roughly correspond to the average length of a phoneme.

Table 4.  Number of audio words in the audio-visual corpus. The number of running words (tokens) and of different words (types) are displayed for a vocabulary size of 50 audio signs of 4-frame segments based on two different features, A: spectral envelope, B: spectral flatness.

Number of audio-sign n-grams in the AV-corpus

n = 1

n = 2

n = 3

n = 4

n = 5

n = 6

A

types

50

2377

54k

175k

246k

280k

tokens

364k

364k

363k

363k

362k

361k

B

types

50

2500

79k

249k

321k

343k

tokens

365k

364k

363k

363k

362k

361k

Generation of the non-speech acoustic vocabulary was done on the same training corpus that was also used for video sign creation (October 2002). For both audio features, mean and variance of spectral flatness and spectral envelope were calculated and the buoy generation method [ Vol02 ] was applied to their feature spaces, generating a disjoint segmentation. In this way sets of prototypical non-speech audio patterns were obtained for each feature. These sets are the acoustic vocabularies and their elements are the respective non-speech audio signs. Vocabularies of 50, 100 and 200 audio signs were created for audio signs of 4, 8 and 16 frame segments respectively. We construct n-gram sequences (n ≤ 6) from these acoustic units by stringing consecutive audio signs together. This leads to non-speech audio words of up to roughly 3 sec., which corresponds to the psychological integration time [ Poe85 ] or the typical length of a musical motif.

4.  The Classification Procedure

4.1.  Preprocessing

The feature extraction described in the preceding section resulted in sequences of syllable-n-grams, video words and non-speech audio words for each audio-visual document. The notion “term” is used here for any of the three units. For each document a vector of counts of terms is created to form a term-frequency vector. The term-frequency vector contains the number of occurrences for each n-gram in a document. Therefore each audio-visual document di is represented by its term-frequency vector

(1)
where rj is an importance weight as described below, wj is the j-th term, and f(wj, di) indicates how often wj occurs in the video scene di . Term-frequency vectors are normalized to unit length with respect to L1 . In the subsequent tables the use of these normalized term-frequencies is indicated by “rel”. The vector of logarithmic term-frequencies of a video scene di is defined as

(2)

Logarithmic frequencies are normalized to unit length with respect to L2 . Other combinations of norm and frequency transformation were omitted because they appeared to yield worse results. In the tables below the use of logarithmic term-frequencies is indicated by “log”.

Importance weights like the well-known inverse document frequency (see figure 1 ) are often used in text classification in order to quantify how specific a given term is to the documents of a collection. Here however another importance weight, namely redundancy, is used. In information theory the usual definition of redundancy is maximum entropy (log N) minus actual entropy. So redundancy is calculated as follows: consider the empirical distribution of a term over the documents in the collection and define the importance weight of term wk by

(3),
where f(wk, di) is the frequency of occurrence of term wk in document ti and N is the number of documents in the collection. The advantage of redundancy over inverse document frequency is that it does not simply count the documents that a type occurs in, but takes into account the frequencies of occurrence in each of the documents. Since it was observed in previous work [ LK02 ] that redundancy is more effective than inverse document frequency, two experimental settings are considered in this paper: term frequencies f(wk,di) are multiplied by rk as defined in equation ( 3 ) (denoted by “+” at column “red” in subsequent tables); or term frequencies are left as they are: rk ≡ 1 (denoted by “-” ). For subsequent classification an audio-visual document di is represented by fi or li according to the parameter settings.

4.2.  Support Vector Machines

A Support Vector Machine (SVM) is a supervised learning algorithm that has been successful in proving itself an efficient and accurate text classification technique [ Joa98, DPHS98, DWV99, LK02 ]. Like other supervised machine learning algorithms, an SVM works in two steps. In the first step — the training step — it learns a decision boundary in input space from preclassified training data. In the second step — the classification step — it classifies input vectors according to the previously learned decision boundary. A single support vector machine can only separate two classes — a positive class (y = +1) and a negative class (y = -1).

Figure 2. Operating mode of a Support Vector Machine. The SVM algorithm seeks to maximise the margin around a hyperplane that separates a positive class (marked by circles) from a negative class (marked by squares).

Operating mode of a Support Vector Machine. The SVM algorithm seeks to maximise the margin around a hyperplane that separates a positive class (marked by circles) from a negative class (marked by squares).

In the training step the following problem is solved: Given is a set of training examples Sl = {(x1,y1),(x2,y2),...,(xl,yl)} of size l from a fixed but unknown distribution p(x,y) describing the learning task. The term-frequency vectors xi represent documents and yi ∈ {-1,+1} indicates whether a document has been labeled with the positive class or not. The SVM aims to find a decision rule h : x → {-1,+1} that classifies documents as accurately as possible based on the training set Sl .

The hypothesis space is given by the functions f(x) = sgn(wx + b) where w and b are parameters that are learned in the training step and which determine the class separating hyperplane. Computing this hyperplane is equivalent to solving the following optimization problem [ Vap98 ], [ Joa02 ]:

minimize:
subject to:

The constraints require that all training examples are classified correctly, allowing for some outliers symbolized by the slack variables ξ i . If a training example lies on the wrong side of the hyperplane, the corresponding ξ i is greater than 0. The factor C is a parameter that allows for trading off training error against model complexity. In the limit C → ∞ no training error is allowed. This setting is called hard margin SVM. A classifier with finite C is also called a soft margin Support Vector Machine. Instead of solving the above optimization problem directly, it is easier to solve the following dual optimisation problem [ Vap98, Joa02 ]:

minimize:
subject to:
(4)

All training examples with α i > 0 at the solution are called support vectors. The Support vectors are situated right at the margin (see the solid circle and squares in figure 2 ) and define the hyperplane. The definition of a hyperplane by the support vectors is especially advantageous in high dimensional feature spaces because a comparatively small number of parameters — the αs in the sum of equation ( 4 ) — is required.

In the classification step an unlabeled term-frequency vector is estimated to belong to the class
ŷ = sgn (wx + b)
(5)

Heuristically the estimated class membership ŷ corresponds to whether x belongs on the lower or upper side of the decision hyperplane. Thus estimating the class membership by equation ( 5 ) consists of a loss of information since only the algebraic sign of the right-hand term is evaluated. However the value of v = wx + b is a real number and can be used for voting agents, i.e. a separate SVM is trained for each modality resulting in three values vspeech ,vvideo and vaudio . Instead of calculating equation ( 5 ) we calculate ŷ = sgn(g(vspeech ,vvideo ,vaudio )) where g(∙) is the sum or the maximum or another monotone function of its arguments. We have experimented with different settings of this kind but with little success.

It is well known that the choice of the kernel function is crucial to the efficiency of support vector machines. Therefore the data transformations described above were combined with the following different kernel functions:

  • Linear kernel (L):
    K(xi,xj) = xi∙xj
  • 2nd and 3rd order polynomial kernel (P(d)):
    K(xi,xj) = (xi∙xj)d  d=2, 3
  • Gaussian rbf-kernel (R(&gamma)):
    K(xi,xj)=e-γ||xi-xj||  γ=0.2, 1, 5
  • Sigmoidal kernel (S):
    K(xi,xj) = tanh(xi∙xj)

(6)

In some of the experiments these kernel functions were combined to form composite kernels, which use different kernel functions for each modality (for example L for speech, R(1) for video and P(3) for audio). Formally a composite kernel is defined as follows: Let the input space consist of Ls speech attributes, Lv video attributes, and La audio attributes, which are ordered in such a way, that dimension 1 to Ls correspond to speech attributes, dimensionsLs +1 correspond to Ls + Lv video attributes, and dimensionsLs + Lv +1 to Ls + Lv + La correspond to audio attributes. Let be the projection from the input space to its subspace spanned by dimensions k to l. A composite kernel that uses kernel K1 for speech, K2 for video and K3 for audio is defined as

The idea behind the construction of composite kernels is that the different semiotic and cognitive conditions for speech, video and audio imply different geometries in the respective factor spaces. I.e. we treat audio, video, and speech differently although we represent them in the same input space. A kernel is called a homogeneous kernel if K1 = K2 = K3 . The results of experiments with composite kernels were, however, also disappointing and suggested that homogeneous kernels are the best solution to the integration of modalities.

We think that this negative result is interesting, because the fact that the different modalities speech, video and non-speech audio do not require different treatment suggests that the respective semiotic systems are not as independent as it is often supposed.

4.3.  Settings for Classification of Audio-Visual Documents

We use a soft margin Support Vector Machine with assymetric classification cost in a 1-vs-n setting, i.e. for each class an SVM was trained that separates this class against all other classes in the corpus. The cost factor by which the training errors on positive examples outweigh errors on negative examples is set to , where #pos and #neg are the number of positive and negative examples respectively. This means that the weight of false positive training errors is larger for smaller classes, and in the case #neg = #pos positive examples on the wrong side of the margin are given twice the weight of negative examples. The trade-off between training error and margin was set to which is the default in the SVM implementation that we used.

It is well known that the choice of kernel functions is crucial to the efficiency of support vector machines. Therefore the data transformations described above were combined with the homogeneous kernel functions defined in equation ( 6 ).

5.  Results

The following tables show the classification results on the basis of the different modalities. A “+” in the column “red.” indicates that the importance-weight redundancy is used, and “-” indicates that no importance weight is used. The values of the significance threshold θ (used exclusively for syllables in speech experiments) are θ = 0.1 and θ = 1. The column “transf.” indicates the frequency transformation that was used, “log” stands for logarithmic frequencies with L2 -normalization and “rel” means relative frequencies (i. e. frequencies with L1 -normalization). The next column “kernel” indicates the kernel function: L is the linear kernel, S is a sigmoidal kernel, and P(d) and R(γ) denote the polynomial kernel and the rbf-kernel respectively. The last column shows the classification result in terms of the F-score, which is calculated as where rec and prec are the usual definitions of recall and precision [ MS99 ]. Since a 1-to-n scheme was used for classification the results of classifying each class against all other classes are presented in individual rows. All classification results presented in this section were obtained by tenfold crossvalidation, where the vocabulary is held constant. This makes the results statistically reliable. Crossvalidation involving vocabulary generation is unnecessary because the data set used for the generation of the vocabulary is separate from the multimedia corpus.

Note that a correlation matrix between features from different modalities cannot be presented in a meaningful way. Each of the three modalities is represented by more than 1000 features (see table 2 , 3 and 4 ) and this is would lead to a correlation matrix with more than 109 entries.

5.1.  Classification Based on Speech

The results on speech-based classification for the optimal combinations of parameters are presented in in table 5 . From the speech recognizer output syllable-n-grams were constructed for n = 1 to n = 6. Most classes were best classified with rbf-kernels (cf. table 5 ).

Table 5.  Results of the classification on the basis of syllable sequences.

Results based on speech

category

n

red.

θ

transf.

kernel

F-score

justice

1

+

1.00

rel

R(1)

65.0

economy

2

+

1.00

rel

P(2)

59.3

labour

1

+

1.00

rel

R(1)

85.3

politics

2

+

1.00

rel

R(0.2)

74.7

sport

1

+

1.00

rel

R(5)

80.3

conflicts

2

+

1.00

rel

R(0.2)

73.5

advertis.

1

-

1.00

log

R(0.2)

85.0

Note that only sequences of one or two syllables were used for classification. This replicates an earlier result [ PLL02 ]: The optimal unit-size for spoken document classification is often smaller than a word (in German the average word length is ~ 2.8 syllables) especially under noisy conditions.

5.2.  Classification Based on Video

Table 6 shows results based on a vocabulary of 100 video signs, table 7 those for 400 video signs. The length of the video words (i.e. the length of the n-gram) is given in column 2 and the accuracy is presented in column 6. In the case of a visual lexicon of 100 video signs the units used for classification are n-grams with a size varying from n = 1 to n = 5. This means that these units are built using one to five video-shots. Those categories that are classified on the basis of shot-unigrams show relatively poor accuracy.

Table 6.  Classification results based on a visual vocabulary of 100 video signs

F-scores obtained using small visual vocabulary

category

n

red.

transf.

kernel

F-score

justice

4

-

log

S

46.4

economy

3

-

rel

R(5)

29.1

labour

2

-

log

S

41.6

politics

4

-

rel

R(1.0)

48.5

sport

5

+

rel

R(0.2)

51.9

conflicts

2

-

log

R(0.2)

35.0

advertis.

1

-

log

R(1)

85.7

We therefore believe that we have detected regularities in the succession of video-units, which reveal a kind of temporal (as opposed to spatial) video-syntax. Rbf-kernels seem to be the most appropriate for classification on the basis of video-words when a small set of video signs is considered.

Table 7.  Classification results based on a visual vocabulary of 400 video signs

F-scores obtained using large visual vocabulary

category

n

red.

transf.

kernel

F-score

justice

1

-

log

S

42.0

economy

1

+

log

L

31.2

labour

3

-

log

S

31.2

politics

2

-

rel

S

53.4

sport

4

+

rel

R(5)

53.8

conflicts

1

-

rel

L

39.8

advertis.

5

+

rel

R(1)

91.2

With more video signs to choose from, the performance increases significantly (with the exception of the categories “labour” and “justice”), and the n-gram length decreases.

We attribute this to the fact that the semantic specificity ofn-grams increases with n. As units from a larger vocabulary are on average semantically more specific than units from a smaller one, the specificity of video-words obtained from the larger vocabulary is compensated by a decrease of n-gram degree. Results for optimal parameter settings and different sizes of the visual vocabulary are presented in figure 3 . Vocabularies of 400 video signs yielded the best results on average. This vocabulary size is also used for integration of modalities described in section 5.4 .

Figure 3. Classification performance vs. vocabulary size. Different classes show different behaviour when the vocabulary size is changed. A vocabulary size of 400 video signs seems to be optimal. Note that the visual vocabulary which was obtained from the test corpus itself (labeled as “corpus”) does not yield better classification results compared to the others.

Classification performance vs. vocabulary size. Different classes show different behaviour when the vocabulary size is changed. A vocabulary size of 400 video signs seems to be optimal. Note that the visual vocabulary which was obtained from the test corpus itself (labeled as “corpus”) does not yield better classification results compared to the others.

One might argue that the collection of a visual vocabulary at a point in time different from the test corpus is flawed because typical images cannot be present in both corpora. However our principal assumption was that the video words reveal a kind of implicit code, which is known to the individuals of a given society. The assumption of the existence of such a code implies that it is shared by the members of the society and functions as a means to convey (non-linguistic) information. To fulfil this communicative function a code may not vary too quickly and should apply to past and future alike. As can be seen in figure 4 , experimental results with visual lexicons created at different times (summer 2002 and January 2003) did not show a consistent change in performance and support the assumption that the vocabulary is independent from the acquisition date. For the practical application this means that once a visual vocabulary is generated it can be used for a long period of time.

From figure 4 one can see that the effect of the change of the visual semiotic system is limited. This is reflected in the results of the classification. Categories “justice” and “sports” and to a lesser extent “politics” show a decrease of performance when the lexicon was drawn from the October material instead of the corpus itself. This can be attributed to the fact that there were salient news in these categories at the time when the corpus was sampled, namely the soccer world championship (sports) and a massacre at a German high school (justice).

The results for the categories “economy” and “conflicts” are nearly independent from the creation date of the visual lexicon. These categories are communicated by visual signs that seem to be temporarily invariant.

The relatively good results for video classification suggest that the task of supervised content classification of audio-visual news stories is different form the recognition of objects on images for retrieval purposes. News stories are not pictures of reality. They are man-made messages intended to be received by the news observers. Thus the regularities between content and visual expression follow aesthetical rules rather than reality itself.

In his semiotic analysis of images Roland Barthes distinguishes between the denoted message and a connoted message of a picture. In his view all imitative arts (drawings, paintings, cinema, theater) comprise two messages: a denoted message, which is the analogon itself, and a connoted message, which is the manner in which the society to a certain extent communicates what it thinks of it. The denoted message of an image is an analogical representation (a ′copy′) of what is represented. For instance the denoted message of an image which shows a person is the person itself. Therefore the denoted message of an image is not based on a true system of signs. It can be considered as a message without a code. The connotive code of a picture in contrast results from the historical or cultural experience of a communicating society. The code of the connoted system is constituted by a universal symbolic order and by a stock of stereotypes (schemes, colors, graphisms, gestures. expressions, arrangements of elements). [ Bar88 ]

The rationale behind the use of low-level video features is not to discover the denoted message of a video-artefact (whether it shows for instance a person or a car) but to reveal the implicit code which underlies its connoted message. We suppose that some of the aspects of the connoted code postulated by Barthes are reflected in the video words.

Figure 4. Classification performance vs. date of vocabulary acquisition. The visual vocabularies were generated from different training corpora which were sampled at different instances of time: January 2003, October 2002 and April to June 2002, denoted as “corpus”. The figure shows that the variation between classes is larger than the variation between different lexicon creation times.

Classification performance vs. date of vocabulary acquisition. The visual vocabularies were generated from different training corpora which were sampled at different instances of time: January 2003, October 2002 and April to June 2002, denoted as “corpus”. The figure shows that the variation between classes is larger than the variation between different lexicon creation times.

5.3.  Classification Based on Non-speech Audio

Classification on the basis of non-speech audio words is shown in table 8 . The error rates are worse than those of the other two modalities (speech and video). The only category which is accurately classified is “advertisement”. Most classifications are based on audio sign unigrams, i.e. audio words that consist of only one audio sign. Building sequences of audio signs generally does not improve the performance. This means that, in contrast to video words, the audio words do not represent a kind of temporal syntax.

Table 8.  Comparison of results for non-speech audio with different audio signs and different sizes of non-speech audio vocabularies.

F-scores obtained using different sizes of non-speech audio vocabulary

50 signs

100 signs

200 signs

4 frames

8 frames

16 frames

4 frames

8 frames

16 frames

4 frames

8 frames

16 frames

justice

35.4

34.6

35.2

36.0

35.4

37.3

36.9

36.1

35.5

economy

21.1

22.5

22.8

23.9

23.5

20.6

20.1

24.0

24.5

labour

18.9

17.0

20.6

21.8

16.8

18.6

19.6

18.2

19.6

politics

47.0

46.9

46.1

48.6

47.3

46.6

45.7

46.7

44.9