Tamara Broderick: Stats Seminar, Feature allocation

Feature allocation, probability functions, and paintboxes

Intro

  • unsupervised learning
  • canonical example, clustering
  • what if objects are a part of multiple groups?
    • feature allocation
    • each group is now called a feature not a cluster
  • Assumptions
    • exchangeable
    • finite number of features per datapoint (can’t have more animals than pixels in photo)
  • Definitions:
    • exchangeable partition probability function (EPPF): a function of the group frequencies; exchangeable since the order doesn’t matter (cat/dog/mouse vs. mouse/cat/dog).
    • feature case: choose features in proportion to their occurrence frequency.
    • function of the number of data points as well as the frequencies of the features
  • does every feature allocation have an exchangeable probability function? No
    • Counterexample: not all feature sets with same number of data points and same number of features per data point have same probability.

Kingman Paintbox

  • start with unit interval (1D), partition at random (countably infinite elements).
  • draw randomly from this interval. If they are in the same partition, ID them in the same cluster
  • so this is clearly an exchangeable partition — can reorganize the roles.
  • The reverse is also true: if you have an exchangeable partition, there is a corresponding partition of the interval (paintbox).
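
A minimal sketch of the paintbox construction above, assuming the interval partition is given as a list of block lengths that sum to 1 (the function name and the example lengths are made up for illustration, not from the talk). Draws that land in the same block get the same cluster label.

```python
from bisect import bisect_right
from itertools import accumulate


def paintbox_clusters(block_lengths, draws):
    """Label each draw in [0, 1) by the block of the partition it lands in."""
    # Right edges of the blocks, e.g. lengths [0.5, 0.3, 0.2] -> [0.5, 0.8, 1.0].
    edges = list(accumulate(block_lengths))
    # Clamp so floating-point round-off in the last edge cannot push a draw
    # just below 1 past the final block.
    return [min(bisect_right(edges, u), len(edges) - 1) for u in draws]


# Hypothetical example: three blocks, four data points.
labels = paintbox_clusters([0.5, 0.3, 0.2], [0.1, 0.6, 0.4, 0.9])
print(labels)  # data points 0 and 2 share a block, so they share a cluster
```

Shuffling the draws only permutes the labels, which is the exchangeability noted in the list. With uniform random draws (e.g. `random.random()`) this gives a random exchangeable partition.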

Add a twist: (feature paintbox)

  • what if groups aren’t necessarily mutually exclusive? subintervals can overlap.
  • so a uniform random draw can intersect multiple features.
  • changing the order of the data points doesn’t matter (they came from a uniform random draw)
  • The reverse we prove to also be true — for every feature map, there is a corresponding feature paintbox
  • draw K from a Poisson distribution. For each of the K features, draw a Beta-distributed frequency q, which sets the size of that feature’s sub-box.
  • How to allocate features for the Indian Buffet problem?
    • sizes of sub-boxes are determined by their frequency
    • what about their overlap? For every combination of feature 1, there is a dependent and independent fraction. For every combination of feature 1 and 2, there is a feature 3 overlap and non-overlap fraction.
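
A rough sketch of one way to read the overlap rule above, assuming the features are independent, so that feature k covers a fraction q_k of every cell carved out by features 1..k-1. The function name and the example frequencies are illustrative only. Given a uniform draw u, it walks down the nested cells and reads off which feature sub-boxes contain u.

```python
def features_from_paintbox(u, freqs):
    """Return the set of feature indices whose sub-box contains u in [0, 1)."""
    lo, hi = 0.0, 1.0  # current cell of the refined partition containing u
    features = set()
    for k, q in enumerate(freqs):
        # Feature k occupies the leftmost fraction q of the current cell.
        cut = lo + q * (hi - lo)
        if u < cut:
            features.add(k)
            hi = cut
        else:
            lo = cut
    return features


# Hypothetical example: two features, each with frequency 0.5.
for u in (0.1, 0.3, 0.6, 0.9):
    print(u, features_from_paintbox(u, [0.5, 0.5]))
```

Each data point is one uniform draw, so reordering the data points does not change the distribution of the allocation.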

Are feature frequency models and EFPFs the same space of distributions?

  • Yes, all EFPFs can be represented by a feature frequency model and vice versa.

Recap

  • feature paintbox is a characterization of exchangeable feature models
  • Discussion of different classes and overlap of clustering/feature models exchangeable clusters

How do we learn a structure

  • most popular unsupervised approach: K-means (easy, fast, parallelizable)
  • Disadvantages: only good for a specific, pre-chosen number of clusters K.
  • Alternative: Nonparametric Bayes
    • Modular,
    • flexible (K can grow as data grows),
    • coherent treatment of uncertainty.
    • not efficient on large data

MAD-Bayes perspective

  • Inspiration
    • finite gaussian mixture model.
    • start with a nonparametric Bayes model, take a limit to get to a k-means-like objective.
    • k-means: assign data points to cluster centers, minimize distance to cluster centers.
    • definitions: the k-means objective is to minimize the sum of squared Euclidean distances from each data point to its assigned cluster center.
    • Approach: assign all data points (in parallel) to one of K clusters, measure distances. Iterate and minimize distance.
  • Our model
    • not just learn mean, but learn full probability distribution
    • Our objective: maximum a posteriori (MAP) estimate: maximize the probability of the parameters given the data.
  • analogies:
    • Mixture of gaussians, k- means
    • beta process – learn feature maps (?)
  • example: each feature is a sum of Gaussian bumps.
  • if each data point belonged to one and only one cluster this returns k-means
  • Algorithm
    • assign each data point to features
    • create new feature if it lowers the objective
    • update feature means
  • example problem: pictures of objects on tables. Want to find features (ID items on tables).
    • MAD-Bayes faster
    • get closer annotation, can get perfect match all objects correct and correct number of features.
  • what are we giving up?
    • don’t enforce size distribution type
    • don’t learn full posterior model. Don’t get systematic treatment of uncertainty.
  • parallelizing
    • challenge: the feature choice for each data point depends on the current options of features.
    • choose from the current list, update in the next step with all newly created features.
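
To make the "create a new one if it lowers the objective" step concrete, here is a small sketch of the clustering special case (one group per data point), often called DP-means, rather than the feature algorithm from the talk. It assumes the objective is the k-means sum of squared distances plus a penalty lam**2 per cluster; the function names, `lam`, and the toy data are illustrative.

```python
def sq_dist(p, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def dp_means(points, lam, n_iter=10):
    """k-means-like clustering where K grows when a point is farther than lam."""
    centers = [mean(points)]          # start with one cluster at the global mean
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center, or a new cluster if that is cheaper.
        for i, p in enumerate(points):
            d2 = [sq_dist(p, c) for c in centers]
            k = min(range(len(centers)), key=d2.__getitem__)
            if d2[k] > lam ** 2:      # paying lam**2 beats paying d2[k]
                centers.append(p)
                k = len(centers) - 1
            assign[i] = k
        # Update step: recompute means and drop clusters that ended up empty.
        members = {}
        for i, k in enumerate(assign):
            members.setdefault(k, []).append(points[i])
        keys = sorted(members)
        centers = [mean(members[k]) for k in keys]
        remap = {k: j for j, k in enumerate(keys)}
        assign = [remap[k] for k in assign]
    return centers, assign


def objective(points, centers, assign, lam):
    """Sum of squared distances plus lam**2 per cluster."""
    return sum(sq_dist(p, centers[k]) for p, k in zip(points, assign)) + lam ** 2 * len(centers)


# Toy data: two well separated pairs of points.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, assign = dp_means(pts, lam=3.0)
print(centers, assign, objective(pts, centers, assign, 3.0))
```

The number of clusters is not fixed in advance; `lam` sets how far a point must be from every center before it is worth opening a new cluster.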

Questions

  • why do we care about getting the right number of means?
    • alternative: just choose something bigger than what we expect and not worry about the ‘dust’ of small things that get assigned their own clusters.
    • reply: sometimes more aesthetically appealing not to cap. species discovery — get common ones first, have to look a lot to find infrequent ones.


Sunday 02/02/14

10:00a – 7:30p, 9:50p – 1:40a

Literature

Ph project

Data analysis

  • start analysis of PhM data on Cajal.
  • getting images
  • There’s got to be a better way than this
    • A.dax + B.ini/B.xml -> Alist.bin/Aalist.bin + Apars.txt
  • To recover the ROI we do
    • Alist.bin -> Apars.txt -> B.ini/B.xml. Read ROI.

Working on manuscript revisions

Thoughts on manuscript organization

[Some thoughts on the side: Maybe this could be figure 1 by itself as something simple and short. It’s a less complicated experiment, I think, than some of the later stuff, but I think it’s also a clear and solid message, and one that really takes on a direct debate in the field.
Maybe we could combine figures 5 and 7? They show the mutation affects chromatin binding (at some loci) and gene expression (at some genes). These are absolutely critical experiments, and I think the data is fully convincing. But it’s not really blow-you-away data: the reader anticipates these two effects, and the effects we find are convincing, but small (especially with Fig 7). So maybe we don’t want to focus so much attention, especially on the expression data, as its own figure; as a final figure I think it’s not a strong punch. I think this is in part a consequence of the cell variability: a reasonable fraction of the total cells in several of the mutant transfections are expressing pretty low levels of the protein, and we systematically don’t pick these guys to image. But we have no way to systematically exclude them in transcriptome profiling (unless you can stain with flag and sort with FACS or something crazy, which I don’t recommend).
Or maybe make the S2 STORM fig 1, combine the multi-color STORM and the Ph-FLAG in fig2? I kinda want to try the DNA FISH immuno doubles (not sure our antibodies will work well, DNA FISH doesn’t play well with a number of antibodies/antigens), but I’m hoping the anti-FLAG at least will still work after FISH. I bet BXC and ANTC colocalize with small clusters of Ph when they are outside of the obvious PcG bodies.

Data collection

  • start STORM imaging of WT-Ph-Flag + Psc-647 stain
  • running O/N
  • still need to image calibration beads

Chromatin Project

  • Imaging new stains of ANTC in fresh cells, to correct for partial breakdown of sample.

Journal club 12/16/13: Light activated K channel

Journal club


Saturday 02/01/14

6:30p

Goals

  • break down STORM4 run DONE
  • Modify STORMrender savedata to save in color images, matching display (currently doing bw)
    • also ensure axis square (axis image);

Mentoring

  • helping Guipeng realign STORM4 for new fast camera

Friday 01/31/14

10:50a – 11:50p

Goals Today

  • Order primers for sequencing
  • work on Ph manuscript
  • finish hybes to test sample age
  • image F03F04 samples

Deep sequencing 2

  • probes to order with sequencing indices
    • D12 (375kb) GTGTCGCGTCGGCCAGAAAC
    • F11 (180kb) AGGACATTCGCGGCTTTCAG
    • G09 (76kb) CGTCGCGTTGGATTCAAGAG
    • F12 (17kb) GCGAACGGGCGAACTGTTAC
  • probes from Hao’s library to order with sequencing indices
    • Forward primers: add NEB-universal
    • reverse primers: add NEB-index
  • Jeff’s new dual indexing primers
  • added to primer table with NEB sequencing adapters

Some interesting literature

Ph Project

Manuscript

  • working on figures
  • Just stats:
    • just clusters (no weighting)
    • Ph-Flag: 552,497 unique clusters from 17 cells
    • PhM-Flag: 872,640 unique clusters from 34 cells
    • S2 wt cntrl: 474,832 unique clusters from 42 cells
    • KS test of PhWT-flag to S2 1.2E-64.
    • KS test of S2 to WT is 0.
  • sent revised Fig 1 to Ajaz (has group meeting again. Kingston lab seems to have group meetings quite a lot more frequently).
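
For reference, a bare-bones sketch of the two-sample KS statistic behind the comparisons above (the largest gap between the two empirical CDFs). This is not the code used for the numbers in the list; the function name and toy inputs are illustrative. With hundreds of thousands of clusters per condition, even a modest gap gives a tiny p-value, so a reported "0" is presumably numerical underflow.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The supremum is attained at one of the observed values.
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d


# Toy cluster-size samples (made-up numbers).
print(ks_statistic([1, 2, 2, 3, 5], [2, 4, 6, 8, 9]))
```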

Data analysis

  • configured new RAID drive (12 TB available).
  • transferred double stain data from PhWt and PhM to new drive.

Chromatin project

Cell staining

  • finish hybes with fresh cells of F03+F04 G01+G02

STORM

  • STORM of F03+F04. Calibration spots very bright large conventional, still much tighter than last time. Further evidence that we had a cell degradation issue.
  • I think this partially also affected the ANTC that were imaged a week before the F03+F04 last time. Let’s repeat that too.
  • New stains: ANTC (1.1 uL of primary each, .5 uL of secondary). Also G05 alone, and G09 repeat (though I’m pretty sure this one was fine. Can always use repeats, and I actually made a very good batch of this probe).

Thursday 01/30/14

9:55a – 10:00p

Chromatin Project

cell fixing and staining

  • fix new Kc cells
  • prep for in situ
  • treat with RNase
  • stain new kc cells with F03F04 and G01G02

Ph

Data analysis

  • wt data has half the number of frames (54,000 vs 102,000) and half the average number of localizations
  • this creates a problem:
    • Fewer localizations means that increasing the binSize in the clustering algorithm causes visibly apparent clusters to be split apart.
    • More total localizations increases the fraction of small clusters weakly connected by chance localizations, artificially increasing the cluster size.
  • potential solution
    • for a fixed number of clusters, if we double the total number of localizations, we should be able to sample the area at twice the coverage (bins of half the area) and still maintain the same number of localizations per bin. This should also avoid spurious new localizations linking existing clusters (which would happen if we didn’t change the sampling density). To halve the bin area we shrink the bin dimension by a factor of sqrt(2). So let’s rescale the binSize by the square root of the number of localizations (or localizations per cell).
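
A small sketch of the rescaling rule, assuming it means the bin size shrinks as the inverse square root of the localization count relative to a reference data set. The function name, the reference bin size, and the counts in the example are hypothetical.

```python
import math


def rescale_bin_size(ref_bin_size, ref_localizations, localizations):
    """Scale the bin edge so the expected localizations per bin stay constant.

    Bin area scales as 1 / localizations, so the edge scales as
    1 / sqrt(localizations).
    """
    return ref_bin_size * math.sqrt(ref_localizations / localizations)


# Hypothetical numbers: doubling the localizations shrinks the bin edge by sqrt(2).
print(rescale_bin_size(20.0, 500_000, 1_000_000))
```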

Wednesday 01/29/14

9:10a – 8:00p, 9:50p – 2:30a

Ph project

revising manuscript draft

  • working on abstract
  • refocusing discussion of PcG clusters. Need to distinguish between the sparse micron-scale PcG bodies other people have studied and the numerous, nano-scale bodies we focus on.

Notes on Ph project organization / motivation

previous work has divided PcG organization into two structural categories: PcG bodies, and dispersed nuclear signal.

It is not entirely evident the large bodies visible by conventional microscopy are relevant to gene regulation:
1. For example, many (most?) Pc target genes are found outside of these bodies in a large fraction of cells.
2. These bodies disappear in hypertonic solution, but DAPI dense and H3K27me3 dense domains remain intact. (and presumably so does silencing?)

Our work demonstrates that much of this distinction is a consequence of technical limitations of conventional microscopy, rather than a fundamental difference in organization: PcG proteins are organized in bodies throughout the nucleus. This cluster size is power-law distributed, and only the largest clusters are distinguishable to conventional imaging methods — the small ones just blur together.

We propose that clustering is an important part of PcG repressive activity, and that it is mediated in part through the SAM domain of Ph. This clustering is important not just for the formation of large PcG “bodies”, but for clusters of PcG proteins at a whole range of scales, which function in efficient silencing of target genes.
Additionally we find clustering independent loci … (are these also silenced? e.g. does expression of Ph clustering independent genes change in PcG knockdown or Ph mutant backgrounds?)

Ph data analysis

  • just need more data on size distribution of bodies. Can get some decent data from the 647 Ph imaging in the respective Ph and PhWt backgrounds (even though the 750 didn’t work well). Imaging of 647 flag indicates high transfection rate (just variable levels). So we should just be able to use nuclei from the slide straight. Be
  • Started writing ClusterStats function for analysis of Ph data. Each localization gets assigned a cluster and we report that cluster’s area.
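
A rough Python sketch of the idea, not the ClusterStats code itself. It assumes localizations are binned on a square grid of edge `bin_size`, that occupied bins touching each other (8-connected) form one cluster, and that a cluster's area is its number of occupied bins times the bin area. All names here are illustrative.

```python
from collections import deque


def cluster_areas(points, bin_size):
    """Return, for each (x, y) localization, the area of the cluster it falls in."""
    # Bin each localization onto the grid.
    bins = [(int(x // bin_size), int(y // bin_size)) for x, y in points]
    occupied = set(bins)
    label = {}   # bin -> cluster id
    sizes = []   # cluster id -> number of occupied bins
    for start in occupied:
        if start in label:
            continue
        cid = len(sizes)
        label[start] = cid
        queue = deque([start])
        n_bins = 0
        # Breadth-first flood fill over neighboring occupied bins.
        while queue:
            i, j = queue.popleft()
            n_bins += 1
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    nb = (i + di, j + dj)
                    if nb in occupied and nb not in label:
                        label[nb] = cid
                        queue.append(nb)
        sizes.append(n_bins)
    return [sizes[label[b]] * bin_size ** 2 for b in bins]


# Toy example: two adjacent localizations and one isolated one.
print(cluster_areas([(0.5, 0.5), (1.5, 0.5), (10.5, 10.5)], bin_size=1.0))
```

Reporting the area per localization rather than per cluster weights large clusters more heavily, which is one way to read the distinction between the weighted and unweighted statistics.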

Team project / mentoring

  • reviewing slides for Hao’s group meeting, 11a-2:30p

Chromatin Project

Cell staining

  • Finish hybridizations for F03+F04 and G01+G02 cells.

STORM

  • start imaging F03 + F04 as a contiguous ~200 kb region of BX-C
  • dots seem spread out. These cells were also fixed a while ago (~1/6), I think we shouldn’t keep them this long.
  • plan to repeat staining with fresh fixed cells tomorrow

Tuesday 01/28/14

9:15a – 10:50a, 12:00p – 10:15p

Chromatin Project

Chromatin Analysis

Data Analysis

  • Continued analyzing D12 data from 10-26-13 up through image 26 (still plenty more dots to image).
  • F06 analyzed up to but not including image 0_2.

Developing analysis pipeline

  • updated CC to record locus name
  • mI is not computed correctly — this value changes depending on my zoom.
  • found origin of bug in mI computation: multiplied xy by pixelsize before subtracting centroid, so now the two distance measures were in different units. Will need to recompute for current data.
  • fixed bug in mI computation in ChromatinCropper.
  • running script to fix mI computation in all currently analyzed data.
  • if this looks better, let’s rerun it and record the new values into saved data structures.
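
A sketch of the unit fix in Python (the real code lives in ChromatinCropper, in MATLAB). It assumes mI is the mean squared distance of the localizations from their centroid; the function name and the pixel size in the example are illustrative. The point is to subtract the centroid while both are still in pixels and only then convert to nm, so the value cannot depend on where the crop origin sits.

```python
def moment_of_inertia(xy_pixels, pixel_size_nm):
    """Mean squared distance (nm^2) of localizations from their centroid."""
    n = len(xy_pixels)
    # Centroid in pixels, the same units as the coordinates.
    cx = sum(x for x, _ in xy_pixels) / n
    cy = sum(y for _, y in xy_pixels) / n
    # Subtract first, convert to nm second.
    return sum(
        ((x - cx) * pixel_size_nm) ** 2 + ((y - cy) * pixel_size_nm) ** 2
        for x, y in xy_pixels
    ) / n


# The same two points at two different crop offsets give the same value.
print(moment_of_inertia([(0, 0), (2, 0)], 160))
print(moment_of_inertia([(5, 5), (7, 5)], 160))
```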

Working on updated drift correction

  • changed method of calculating drift error
  • changed method of determining the guide bead
  • added
  • renamed the function to match our function naming conventions (now called FeducialDriftCorrection.m). Moved the original function to a Depreciated folder; need to phase out function calls to the old form, feducialDriftCorrection.m

New cell staining

  • F03 + F04 (both P1 / A647)
  • G01 + G02 (both P1 / A647)

project2

  • computing the number of Hamming codewords with 4 ones for 32 hybes (1240).
  • trick is to just use initial codewords with 4 or fewer ones before assigning parity bits
  • scaling up to 5 still yields memory errors, because MATLAB won’t matrix-multiply logicals.
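
An independent cross-check of the 1240 count in Python, assuming the code in question is the extended (SECDED) Hamming code of length 32. This is a different route from the parity-bit construction described above: for that code, a set of bit positions (labeled 0..31) is a codeword exactly when it has even size and the position labels XOR to zero, so the weight-4 codewords can be counted directly without building any matrices.

```python
from itertools import combinations


def count_weight4_codewords(m=5):
    """Count weight-4 codewords of the extended Hamming code of length 2**m."""
    n = 2 ** m
    # Four positions form a codeword iff their binary labels XOR to zero.
    return sum(1 for a, b, c, d in combinations(range(n), 4) if a ^ b ^ c ^ d == 0)


print(count_weight4_codewords(5))
```

This agrees with the closed form n(n-1)(n-2)/24 for n = 32. Looping over position tuples keeps memory flat, though the number of tuples grows quickly with the weight.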
