
Citation and metadata

Recommended citation

Edda Leopold and Jörg Kindermann, Content Classification of Multimedia Documents using Partitions of Low-Level Features. JVRB - Journal of Virtual Reality and Broadcasting, 3(2006), no. 6. (urn:nbn:de:0009-6-7607)


Endnote

%0 Journal Article
%T Content Classification of Multimedia Documents using Partitions of Low-Level Features
%A Leopold, Edda
%A Kindermann, Jörg
%J JVRB - Journal of Virtual Reality and Broadcasting
%D 2007
%V 3(2006)
%N 6
%@ 1860-2037
%F leopold2007
%X Audio-visual documents obtained from German TV news are classified according to the IPTC topic categorization scheme. To this end, conventional text classification techniques are adapted to speech, video, and non-speech audio. For each of the three modalities, word analogues are generated: sequences of syllables for speech, “video words” based on low-level color features (color moments, color correlogram and color wavelet), and “audio words” based on low-level spectral features (spectral envelope and spectral flatness) for non-speech audio. Such audio and video words provide a means to represent the different modalities in a uniform way. The frequencies of the word analogues represent audio-visual documents: the standard bag-of-words approach. Support vector machines are used for supervised classification in a 1 vs. n setting. Classification based on speech outperforms all other single modalities. Combining speech with non-speech audio improves classification. Classification is further improved by supplementing speech and non-speech audio with video words. Optimal F-scores range between 62% and 94%, corresponding to 50% to 84% above chance. The optimal combination of modalities depends on the category to be recognized. The construction of audio and video words from low-level features provides a good basis for the integration of speech, non-speech audio, and video.
%L 004
%K Audio-visual content classification
%K integration of modalities
%K speech recognition
%K support vector machines
%R 10.20385/1860-2037/3.2006.6
%U http://nbn-resolving.de/urn:nbn:de:0009-6-7607
%U http://dx.doi.org/10.20385/1860-2037/3.2006.6


Bibtex

@Article{leopold2007,
  author = 	"Leopold, Edda
		and Kindermann, J{\"o}rg",
  title = 	"Content Classification of Multimedia Documents using Partitions of Low-Level Features",
  journal = 	"JVRB - Journal of Virtual Reality and Broadcasting",
  year = 	"2007",
  volume = 	"3(2006)",
  number = 	"6",
  keywords = 	"Audio-visual content classification; integration of modalities; speech recognition; support vector machines",
  abstract = 	"Audio-visual documents obtained from German TV news are classified according to the IPTC topic categorization scheme. To this end, conventional text classification techniques are adapted to speech, video, and non-speech audio. For each of the three modalities, word analogues are generated: sequences of syllables for speech, ``video words'' based on low-level color features (color moments, color correlogram and color wavelet), and ``audio words'' based on low-level spectral features (spectral envelope and spectral flatness) for non-speech audio. Such audio and video words provide a means to represent the different modalities in a uniform way. The frequencies of the word analogues represent audio-visual documents: the standard bag-of-words approach. Support vector machines are used for supervised classification in a 1 vs. n setting. Classification based on speech outperforms all other single modalities. Combining speech with non-speech audio improves classification. Classification is further improved by supplementing speech and non-speech audio with video words. Optimal F-scores range between 62{\%} and 94{\%}, corresponding to 50{\%} to 84{\%} above chance. The optimal combination of modalities depends on the category to be recognized. The construction of audio and video words from low-level features provides a good basis for the integration of speech, non-speech audio, and video.",
  issn = 	"1860-2037",
  doi = 	"10.20385/1860-2037/3.2006.6",
  url = 	"http://nbn-resolving.de/urn:nbn:de:0009-6-7607"
}


RIS

TY  - JOUR
AU  - Leopold, Edda
AU  - Kindermann, Jörg
PY  - 2007
DA  - 2007//
TI  - Content Classification of Multimedia Documents using Partitions of Low-Level Features
JO  - JVRB - Journal of Virtual Reality and Broadcasting
VL  - 3(2006)
IS  - 6
KW  - Audio-visual content classification
KW  - integration of modalities
KW  - speech recognition
KW  - support vector machines
AB  - Audio-visual documents obtained from German TV news are classified according to the IPTC topic categorization scheme. To this end, conventional text classification techniques are adapted to speech, video, and non-speech audio. For each of the three modalities, word analogues are generated: sequences of syllables for speech, “video words” based on low-level color features (color moments, color correlogram and color wavelet), and “audio words” based on low-level spectral features (spectral envelope and spectral flatness) for non-speech audio. Such audio and video words provide a means to represent the different modalities in a uniform way. The frequencies of the word analogues represent audio-visual documents: the standard bag-of-words approach. Support vector machines are used for supervised classification in a 1 vs. n setting. Classification based on speech outperforms all other single modalities. Combining speech with non-speech audio improves classification. Classification is further improved by supplementing speech and non-speech audio with video words. Optimal F-scores range between 62% and 94%, corresponding to 50% to 84% above chance. The optimal combination of modalities depends on the category to be recognized. The construction of audio and video words from low-level features provides a good basis for the integration of speech, non-speech audio, and video.
SN  - 1860-2037
UR  - http://nbn-resolving.de/urn:nbn:de:0009-6-7607
DO  - 10.20385/1860-2037/3.2006.6
ID  - leopold2007
ER  - 

Wordbib

<?xml version="1.0" encoding="UTF-8"?>
<b:Sources SelectedStyle="" xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography"  xmlns="http://schemas.openxmlformats.org/officeDocument/2006/bibliography" >
<b:Source>
<b:Tag>leopold2007</b:Tag>
<b:SourceType>ArticleInAPeriodical</b:SourceType>
<b:Year>2007</b:Year>
<b:PeriodicalTitle>JVRB - Journal of Virtual Reality and Broadcasting</b:PeriodicalTitle>
<b:Volume>3(2006)</b:Volume>
<b:Issue>6</b:Issue>
<b:Url>http://nbn-resolving.de/urn:nbn:de:0009-6-7607</b:Url>
<b:Url>http://dx.doi.org/10.20385/1860-2037/3.2006.6</b:Url>
<b:Author>
<b:Author><b:NameList>
<b:Person><b:Last>Leopold</b:Last><b:First>Edda</b:First></b:Person>
<b:Person><b:Last>Kindermann</b:Last><b:First>Jörg</b:First></b:Person>
</b:NameList></b:Author>
</b:Author>
<b:Title>Content Classification of Multimedia Documents using Partitions of Low-Level Features</b:Title>
<b:Comments>Audio-visual documents obtained from German TV news are classified according to the IPTC topic categorization scheme. To this end, conventional text classification techniques are adapted to speech, video, and non-speech audio. For each of the three modalities, word analogues are generated: sequences of syllables for speech, “video words” based on low-level color features (color moments, color correlogram and color wavelet), and “audio words” based on low-level spectral features (spectral envelope and spectral flatness) for non-speech audio. Such audio and video words provide a means to represent the different modalities in a uniform way. The frequencies of the word analogues represent audio-visual documents: the standard bag-of-words approach. Support vector machines are used for supervised classification in a 1 vs. n setting. Classification based on speech outperforms all other single modalities. Combining speech with non-speech audio improves classification. Classification is further improved by supplementing speech and non-speech audio with video words. Optimal F-scores range between 62% and 94%, corresponding to 50% to 84% above chance. The optimal combination of modalities depends on the category to be recognized. The construction of audio and video words from low-level features provides a good basis for the integration of speech, non-speech audio, and video.</b:Comments>
</b:Source>
</b:Sources>

ISI

PT Journal
AU Leopold, E
   Kindermann, J
TI Content Classification of Multimedia Documents using Partitions of Low-Level Features
SO JVRB - Journal of Virtual Reality and Broadcasting
PY 2007
VL 3(2006)
IS 6
DI 10.20385/1860-2037/3.2006.6
DE Audio-visual content classification; integration of modalities; speech recognition; support vector machines
AB Audio-visual documents obtained from German TV news are classified according to the IPTC topic categorization scheme. To this end, conventional text classification techniques are adapted to speech, video, and non-speech audio. For each of the three modalities, word analogues are generated: sequences of syllables for speech, “video words” based on low-level color features (color moments, color correlogram and color wavelet), and “audio words” based on low-level spectral features (spectral envelope and spectral flatness) for non-speech audio. Such audio and video words provide a means to represent the different modalities in a uniform way. The frequencies of the word analogues represent audio-visual documents: the standard bag-of-words approach. Support vector machines are used for supervised classification in a 1 vs. n setting. Classification based on speech outperforms all other single modalities. Combining speech with non-speech audio improves classification. Classification is further improved by supplementing speech and non-speech audio with video words. Optimal F-scores range between 62% and 94%, corresponding to 50% to 84% above chance. The optimal combination of modalities depends on the category to be recognized. The construction of audio and video words from low-level features provides a good basis for the integration of speech, non-speech audio, and video.
ER


Mods

<mods>
  <titleInfo>
    <title>Content Classification of Multimedia Documents using Partitions of Low-Level Features</title>
  </titleInfo>
  <name type="personal">
    <namePart type="family">Leopold</namePart>
    <namePart type="given">Edda</namePart>
  </name>
  <name type="personal">
    <namePart type="family">Kindermann</namePart>
    <namePart type="given">Jörg</namePart>
  </name>
  <abstract>Audio-visual documents obtained from German TV news are classified according to the IPTC topic categorization scheme. To this end, conventional text classification techniques are adapted to speech, video, and non-speech audio. For each of the three modalities, word analogues are generated: sequences of syllables for speech, “video words” based on low-level color features (color moments, color correlogram and color wavelet), and “audio words” based on low-level spectral features (spectral envelope and spectral flatness) for non-speech audio. Such audio and video words provide a means to represent the different modalities in a uniform way. The frequencies of the word analogues represent audio-visual documents: the standard bag-of-words approach. Support vector machines are used for supervised classification in a 1 vs. n setting. Classification based on speech outperforms all other single modalities. Combining speech with non-speech audio improves classification. Classification is further improved by supplementing speech and non-speech audio with video words. Optimal F-scores range between 62% and 94%, corresponding to 50% to 84% above chance. The optimal combination of modalities depends on the category to be recognized. The construction of audio and video words from low-level features provides a good basis for the integration of speech, non-speech audio, and video.</abstract>
  <subject>
    <topic>Audio-visual content classification</topic>
    <topic>integration of modalities</topic>
    <topic>speech recognition</topic>
    <topic>support vector machines</topic>
  </subject>
  <classification authority="ddc">004</classification>
  <relatedItem type="host">
    <genre authority="marcgt">periodical</genre>
    <genre>academic journal</genre>
    <titleInfo>
      <title>JVRB - Journal of Virtual Reality and Broadcasting</title>
    </titleInfo>
    <part>
      <detail type="volume">
        <number>3(2006)</number>
      </detail>
      <detail type="issue">
        <number>6</number>
      </detail>
      <date>2007</date>
    </part>
  </relatedItem>
  <identifier type="issn">1860-2037</identifier>
  <identifier type="urn">urn:nbn:de:0009-6-7607</identifier>
  <identifier type="doi">10.20385/1860-2037/3.2006.6</identifier>
  <identifier type="uri">http://nbn-resolving.de/urn:nbn:de:0009-6-7607</identifier>
  <identifier type="citekey">leopold2007</identifier>
</mods>