Andrew McCallum
Selected Publications by Topic (since 2001)
Shortcuts:
- Social Network Analysis and Clustering
- Coreference and Object Correspondence
- Efficient Inference and Learning in Graphical Models
- Joint Inference for NLP
- Information Extraction
- Semi-supervised Learning, Active Learning, Interactive Learning
- Bioinformatics
- Computer Vision, Networking, etc.
- Text Classification
Social Network Analysis and Clustering
- Group and Topic Discovery from Relations and Text. Xuerui Wang, Natasha Mohanty and Andrew McCallum. KDD Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD) 2005. (Social network analysis that simultaneously discovers groups of entities and also clusters attributes of their relations, such that clustering in each dimension informs the other. Applied to the voting records and corresponding text of resolutions from the U.S. Senate and the U.N., showing that incorporating the votes results in more salient topic clusters, and that different groupings of legislators emerge from different topics.)
- Topic and Role Discovery in Social Networks. Andrew McCallum, Andres Corrada-Emmanuel and Xuerui Wang. IJCAI, 2005. (Conference paper version of tech report by same authors in 2004 below. Also includes new results with Role-Author-Recipient-Topic model. Discover roles by social network analysis with a Bayesian network that models both links and text messages exchanged on those links. Experiments with Enron email and academic email.)
- The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang. Technical Report UM-CS-2004-096, 2004. (Also presented at the NIPS'04 Workshop on "Structured Data and Representations in Probabilistic Models for Categorization.") (Social network analysis that models not only the links between people, but also the word content of the messages exchanged between them. Discovers salient topics guided by the sender-recipient structure in the data, and provides an improved ability to measure role-similarity between people. A generative model in the style of Latent Dirichlet Allocation.)
- Disambiguating Web Appearances of People in a Social Network. Ron Bekkerman and Andrew McCallum. WWW Conference, 2005. (Find homepages and other Web pages mentioning particular people. Do a better job by leveraging a collection of related people.)
- Multi-Way Distributional Clustering via Pairwise Interactions. Ron Bekkerman, Ran El-Yaniv and Andrew McCallum. ICML 2005. (Distributional clustering in multiple feature dimensions or modalities at once--made efficient by a factored representation as used in graphical models, and by a combination of top-down and bottom-up clustering. Results on email clustering, and new best results on 20 Newsgroups.)
- Extracting Social Networks and Contact Information from Email and the Web. Aron Culotta, Ron Bekkerman and Andrew McCallum. Conference on Email and Spam (CEAS) 2004. (Describes an early version of an end-to-end system that automatically populates your email address book with a large social network, including "friends-of-friends," and information about people's expertise.)
- An Exploration of Entity Models, Collective Classification and Relation Description. Hema Raghavan, James Allan and Andrew McCallum. KDD Workshop on Link Analysis and Group Detection, August 2004. (Part of a student synthesis project: includes an application of RMNs to classifying people in newswire.)
Coreference and Object Correspondence
- Practical Markov Logic Containing First-order Quantifiers with Application to Identity Uncertainty. Aron Culotta and Andrew McCallum. Technical Report IR-430, University of Massachusetts, September 2005. (Use existential and universal quantifiers in Markov Logic, doing so practically and efficiently by incrementally instantiating these terms as needed. Applied to object correspondence, this model combines the expressivity of BLOG with the predictive accuracy advantages of conditional probability training. Experiments on citation matching and author disambiguation.)
- Joint Deduplication of Multiple Record Types in Relational Data. Aron Culotta and Andrew McCallum. Fourteenth Conference on Information and Knowledge Management (CIKM), 2005. (Longer technical report version: A Conditional Model of Deduplication for Multi-type Relational Data. Technical Report IR-443, University of Massachusetts, September 2005.) (Leverage relations among multiple entity types to perform coreference collectively among all types. Uses CRF-style graph partitioning with a learned distance metric. Experimental results on joint coreference of both citations and their venues, showing that accuracy on both improves.)
- A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance. Andrew McCallum, Kedar Bellare and Fernando Pereira. Conference on Uncertainty in AI (UAI), 2005. (Train a string edit distance function from both positive and negative examples of string pairs (matching and mismatching). Significantly, the model designer is free to use arbitrary, fancy features of both strings, and also very flexible edit operations. This model is an example of an increasingly popular interesting class---conditionally-trained models with latent variables. Positive results on citations, addresses and names.)
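As a rough illustration of the lattice underlying models like this one (not the paper's own code), the classic unweighted Levenshtein dynamic program is the skeleton that a learned, feature-based edit model generalizes: the fixed unit costs below are exactly what the discriminative training replaces with learned, feature-dependent scores.

```python
def edit_distance(s, t, sub=1, ins=1, dele=1):
    """Classic Levenshtein DP over a (len(s)+1) x (len(t)+1) lattice.

    A trained edit-distance model keeps this lattice structure but
    learns the operation costs from matching/mismatching string pairs.
    """
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele          # delete all of s[:i]
    for j in range(1, n + 1):
        d[0][j] = j * ins           # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else sub
            d[i][j] = min(d[i - 1][j - 1] + cost,   # substitute/copy
                          d[i - 1][j] + dele,       # delete from s
                          d[i][j - 1] + ins)        # insert from t
    return d[m][n]
```

For example, `edit_distance("kitten", "sitting")` is 3 (two substitutions and one insertion).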
- Disambiguating Web Appearances of People in a Social Network. Ron Bekkerman and Andrew McCallum. WWW Conference, 2005. (Find homepages and other Web pages mentioning particular people. Do a better job by leveraging a collection of related people.)
- Conditional Models of Identity Uncertainty with Application to Noun Coreference. Andrew McCallum and Ben Wellner. Neural Information Processing Systems (NIPS), 2004. (A model of object consolidation, based on graph partitioning with learned edge weights. Conference paper version of 2003 work in KDD Workshop on Data Cleaning.)
- An Integrated, Conditional Model of Information Extraction and Coreference with Application to Citation Matching. Ben Wellner, Andrew McCallum, Fuchun Peng, Michael Hay. Conference on Uncertainty in Artificial Intelligence (UAI), 2004. (A conditionally-trained graphical model for identity uncertainty in relational domains, representing mentions, entities and their attributes. Also a first example of joint inference for extraction and identity uncertainty--coreference decisions actually integrate out uncertainty about information extraction.)
- Object Consolidation by Graph Partitioning with a Conditionally-trained Distance Metric. Andrew McCallum and Ben Wellner. KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation, 2003. (Later, improved version of workshop paper immediately below.)
- Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference. Andrew McCallum and Ben Wellner. IJCAI Workshop on Information Integration on the Web, 2003. (A conditionally-trained model of object consolidation, based on graph partitioning with learned edge weights.)
Efficient Inference and Learning in Graphical Models
- Fast, Piecewise Training for Discriminative Finite-state and Parsing Models. Charles Sutton and Andrew McCallum. Center for Intelligent Information Retrieval Technical Report IR-403. 2005. (Further results with "piecewise training", a method also described in a UAI'05 paper below.)
- Piecewise Training for Undirected Models. Charles Sutton and Andrew McCallum. UAI, 2005. (Efficiently train a large graphical model in separately normalized pieces, and amazingly often obtain higher accuracy than without this approximation. This paper also shows that this piecewise objective is a lower bound on the exact likelihood, and gives results with three different graphical model structures.)
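A sketch of the lower-bound argument mentioned here, as I understand it (see the paper for the precise statement): writing the model as a product of nonnegative factors, the product of local partition functions upper-bounds the true one, because expanding the right-hand product yields every consistent joint assignment plus extra cross terms,

```latex
Z(\theta) \;=\; \sum_{y} \prod_{a} \psi_a(y_a;\theta)
\;\le\; \prod_{a} \sum_{y_a} \psi_a(y_a;\theta)
\;=\; \prod_{a} Z_a(\theta),
```

so the piecewise objective \( \ell_{\mathrm{PW}}(\theta) = \sum_a \big[\log \psi_a(y_a;\theta) - \log Z_a(\theta)\big] \) lower-bounds the exact log-likelihood \( \log p(y \mid x;\theta) \), while requiring only local normalization to compute.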
- Constrained Kronecker Deltas for Fast Approximate Inference and Estimation. Chris Pal, Charles Sutton, Andrew McCallum. Submitted to UAI, 2005. (Sometimes the graph of the graphical model is not large and complex, but the cardinality of the variables is large. This paper describes a new and generalized method for beam search on graphical models, showing positive experimental results for both inference and training. Experiments on NetTalk.)
- Piecewise Training with Parameter Independence Diagrams: Comparing Globally- and Locally-trained Linear-chain CRFs. Andrew McCallum and Charles Sutton. Center for Intelligent Information Retrieval, University of Massachusetts Technical Report IR-383. 2004. (Also presented at NIPS 2004 Workshop on Learning with Structured Outputs.) (Large undirected graphical models are expensive to train because they require global inference to calculate the gradient of the parameters. We describe a new method for fast training in locally-normalized pieces. Amazingly the resulting models also give higher accuracy than their globally-trained counterparts.)
Joint Inference for NLP
- Joint Parsing and Semantic Role Labeling. Charles Sutton and Andrew McCallum. CoNLL (Shared Task), 2005. (Attempt to improve accuracy by performing joint inference over parsing and semantic role labeling---preserving uncertainty and multiple hypotheses in Dan Bikel's parser. Unfortunately the effort yielded negative results, most likely because the components would have needed to produce better-calibrated probabilities.)
- Collective Segmentation and Labeling of Distant Entities in Information Extraction. Charles Sutton and Andrew McCallum. ICML workshop on Statistical Relational Learning, 2004. (Makes the boundaries and types of distant segments inter-dependent by augmenting a linear-chain CRF with additional long, arching edges. Approximate inference by Tree-Reparameterization.)
- Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. Charles Sutton, Khashayar Rohanimanesh and Andrew McCallum. ICML 2004. (Joint inference over two traditionally-separate layers of NLP processing: POS-tagging and NP-chunking. Introduces the CRF analogue of Factorial HMMs. Compares several approximate inference procedures.)
- Dynamic Conditional Random Fields for Jointly Labeling Multiple Sequences. Andrew McCallum, Khashayar Rohanimanesh and Charles Sutton. NIPS*2003 Workshop on Syntax, Semantics, Statistics, 2003. (Workshop version of ICML 2004 paper.)
- A Note on the Unification of Information Extraction and Data Mining using Conditional-Probability, Relational Models. Andrew McCallum and David Jensen. IJCAI'03 Workshop on Learning Statistical Models from Relational Data, 2003. (Describes big-picture motivation and approach for research that performs information extraction and data mining in an integrated fashion, rather than in two separate serial steps. Lays out a major thrust of my current research over a multi-year span.)
Information Extraction
- Feature Bagging: Preventing Weight Undertraining in Structured Discriminative Learning. Charles Sutton, Michael Sindelar, and Andrew McCallum. Center for Intelligent Information Retrieval, University of Massachusetts Technical Report IR-402. 2005. (Avoid a common under-appreciated problem: overly heavy reliance on a few discriminative features which may not be as reliably present in the testing data. Discusses four methods of separate training and combination, and presents statistically-significant improvements---including new best results on CoNLL-2000 NP Chunking.)
- Composition of Conditional Random Fields for Transfer Learning. Charles Sutton and Andrew McCallum. Proceedings of Human Language Technologies / Empirical Methods in Natural Language Processing (HLT/EMNLP) 2005. (Improve information extraction from email data by using the output of another extractor that was trained on large quantities of newswire. Improve accuracy further by using joint inference between the two tasks---so that the final target task can actually affect the output of the intermediate task.)
- Reducing Labeling Effort for Structured Prediction Tasks. Aron Culotta and Andrew McCallum. AAAI, 2005. (A step toward bringing trainable information extraction to the masses! Make it easier for end-users to train IE by providing multiple-choice labeling options, and propagating any constraints their labels provide on portions of the record-labeling task.)
- Extracting Social Networks and Contact Information from Email and the Web. Aron Culotta, Ron Bekkerman and Andrew McCallum. Conference on Email and Spam (CEAS) 2004. (Describes an early version of an end-to-end system that automatically populates your email address book with a large social network, including "friends-of-friends," and information about people's expertise.)
- Accurate Information Extraction from Research Papers using Conditional Random Fields. Fuchun Peng and Andrew McCallum. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004. (Applies CRFs to extraction from research paper headers and reference sections, to obtain current best-in-the-world accuracy. Also compares some simple regularization methods.)
- Chinese Segmentation and New Word Detection using Conditional Random Fields. Fuchun Peng, Fangfang Feng, and Andrew McCallum. Proceedings of The 20th International Conference on Computational Linguistics (COLING 2004), August 23-27, 2004, Geneva, Switzerland. (State-of-the-art Chinese word segmentation with CRFs, with rich features and many lexicons; also using confidence estimation to add new words to the lexicon.)
- Confidence Estimation for Information Extraction. Aron Culotta and Andrew McCallum. Proceedings of Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2004, short paper. (How to provide not only an answer, but a formally-justified confidence in that answer--using constrained forward-backward.)
- Rapid Development of Hindi Named Entity Recognition Using Conditional Random Fields and Feature Induction. Wei Li and Andrew McCallum. ACM Transactions on Asian Language Information Processing, 2003. (How we developed a named entity recognition system for Hindi in just a few weeks.)
- Efficiently Inducing Features of Conditional Random Fields. Andrew McCallum. Conference on Uncertainty in Artificial Intelligence (UAI), 2003. (CRFs give you the great power to include the kitchen sink's worth of features. How do you decide which ones to include, to avoid over-fitting and running out of memory? A formal, information-theoretic approach, with carefully-chosen approximations to make it efficient with millions of candidate features. This technique was key to the success on Hindi above, as well as to work by Pereira's group at UPenn.)
- Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. Andrew McCallum and Wei Li. Seventh Conference on Natural Language Learning (CoNLL), 2003. (This is the first publication about named entity extraction with CRFs.)
- Table Extraction Using Conditional Random Fields. David Pinto, Andrew McCallum, Xing Wei and W. Bruce Croft. Proceedings of the ACM SIGIR, 2003. (Application of CRFs to finding tables in government reports. Uses both language and layout features.)
- Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. John Lafferty, Andrew McCallum and Fernando Pereira. ICML-2001. (A conditionally-trained model for sequences and other structured data, with global normalization. The original CRF paper. Don't bother reading the section on parameter estimation---use BFGS instead of Iterative Scaling; e.g. see [McCallum UAI 2003].)
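To make the linear-chain case concrete, here is a minimal Viterbi decoder over hypothetical per-position and transition scores in log-space (an illustrative sketch, not code from the paper; in a trained CRF these scores would come from learned feature weights):

```python
import numpy as np

def viterbi(emis, trans):
    """Best label sequence for a linear-chain model.

    emis:  (T, K) array of per-position label scores (log-space)
    trans: (K, K) array of label-to-label transition scores (log-space)
    Returns the highest-scoring label sequence as a list of ints.
    """
    T, K = emis.shape
    score = np.zeros((T, K))          # best score ending in each label
    back = np.zeros((T, K), dtype=int)  # backpointers to previous label
    score[0] = emis[0]
    for t in range(1, T):
        # cand[i, j] = score of being in label i at t-1, then label j at t
        cand = score[t - 1][:, None] + trans + emis[t][None, :]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    # Trace the best path backwards from the final position
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With strong emission scores and zero transition scores, e.g. `emis = [[2,0],[0,2],[2,0]]`, the decoder simply follows the emissions and returns `[0, 1, 0]`; nonzero transition scores let neighboring labels influence each other, which is the point of the chain structure.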
Semi-supervised Learning, Active Learning, Interactive Learning
- Semi-Supervised Sequence Modeling with Syntactic Topic Models. Wei Li and Andrew McCallum. AAAI, 2005. (Learn a low-dimensional manifold from large quantities of unlabeled text data, then use components of the manifold as additional features when training a linear-chain CRF with limited labeled data. The manifold is learned using HMM-LDA [Griffiths, Steyvers, Blei, Tenenbaum 2004], an unsupervised model with special structure suitable for sequences and topics. Experiments with English part-of-speech tagging and Chinese word segmentation.)
- Reducing Labeling Effort for Structured Prediction Tasks. Aron Culotta and Andrew McCallum. AAAI, 2005. (A step toward bringing trainable information extraction to the masses! Make it easier for end-users to train IE by providing multiple-choice labeling options, and propagating any constraints their labels provide on portions of the record-labeling task.)
- Interactive Information Extraction with Constrained Conditional Random Fields. Trausti Kristjannson, Aron Culotta, Paul Viola and Andrew McCallum. Nineteenth National Conference on Artificial Intelligence (AAAI 2004). San Jose, CA. (Winner of Honorable Mention Award.) (Help a user interactively correct the results of extraction by providing uncertainty cues in the UI, and by using constrained Viterbi to automatically make additional corrections after the first human correction.)
- A Note on Semi-supervised Learning using Markov Random Fields. Wei Li and Andrew McCallum. Technical Note, February 3, 2004. (A general framework for semi-supervised learning in Conditional Random Fields, with a focus on learning the distance metric between instances. Experimental results with collective classification of documents.)
- Learning with Scope, with Application to Information Extraction and Classification. David Blei, Drew Bagnell and Andrew McCallum. Conference on Uncertainty in Artificial Intelligence (UAI), 2002. (Learn highly reliable formatting-based extractors on the fly at test time, using graphical models and variational inference. Describes both generative and conditional versions of the model.)
- Toward Optimal Active Learning through Sampling Estimation of Error Reduction. Nick Roy and Andrew McCallum. ICML-2001. (A leave-one-out approach to active learning.)
Bioinformatics
- Gene Prediction with Conditional Random Fields. Aron Culotta, David Kulp, and Andrew McCallum. Technical Report UM-CS-2005-028, University of Massachusetts, Amherst, April 2005. (Use finite-state CRFs to locate introns and exons in DNA sequences. Shows the advantages of CRFs' ability to straightforwardly incorporate homology evidence from protein databases.)
Computer Vision, Networking, etc.
Text Classification