Tamara Broderick: Stats Seminar, Feature allocation

Posted on February 3, 2014 by admin

Feature allocation, probability functions, and paintboxes

Intro

unsupervised learning
canonical example, clustering
what if objects are a part of multiple groups?
- feature allocation
- each group is now called a feature not a cluster
Assumptions
- exchangeable
- finite number of features per datapoint (can’t have more animals than pixels in photo)
Definitions:
- exchangeable partition probability function (EPPF): a function of the group frequencies, exchangeable since the order doesn’t matter (cat/dog/mouse, mouse/cat/dog).
- feature case: choose features in proportion to their occurrence frequency.
- function of the number of data points as well as the frequencies of the features
does every exchangeable feature allocation have an exchangeable feature probability function (EFPF)? No
- Counterexample: not all feature sets with same number of data points and same number of features per data point have same probability.

Kingman Paintbox

start with the unit interval (1D), partition it at random into subintervals (countably infinitely many).
draw points uniformly at random from this interval. If two points land in the same subinterval, put them in the same cluster
so this is clearly an exchangeable partition: the draws are i.i.d., so we can reorder them.
The reverse is also true: if you have an exchangeable partition, there is a corresponding partition of the interval (a paintbox).
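A minimal sketch of the paintbox construction above, assuming a finite list of subinterval lengths stands in for the countable partition. The function name and the "dust" handling (leftover mass gives singleton clusters) are illustrative choices, not from the talk.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def paintbox_partition(block_sizes, n, rng):
    """Draw n uniform points on [0, 1) and group their indices by subinterval.

    block_sizes: lengths of consecutive subintervals starting at 0 (assumed
    finite truncation). A draw beyond the last subinterval becomes its own
    singleton cluster.
    """
    edges = list(accumulate(block_sizes))  # right endpoints of the subintervals
    clusters = {}
    for i in range(n):
        u = rng.random()
        k = bisect_right(edges, u)  # index of the subinterval containing u
        key = k if k < len(edges) else ("dust", i)
        clusters.setdefault(key, []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    print(paintbox_partition([0.5, 0.25, 0.25], 10, random.Random(0)))
```

Because the draws are i.i.d. uniform, relabeling the data points does not change the distribution of the resulting partition, which is the exchangeability claimed above.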

Add a twist: (feature paintbox)

what if groups aren’t necessarily mutually exclusive? subintervals can overlap.
so a uniform random draw can intersect multiple features.
changing the order of the data points doesn’t matter (they came from a uniform random draw)
The reverse we prove to also be true: for every exchangeable feature allocation, there is a corresponding feature paintbox
draw K from a Poisson. For each of the K features, draw a Beta-distributed frequency q.
How to allocate features for the Indian Buffet problem?
- sizes of sub-boxes are determined by their frequency
- what about their overlap? For every combination of feature 1, there is a dependent and independent fraction. For every combination of features 1 and 2, there is a feature 3 overlap and non-overlap fraction.
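The overlap rule above can be sketched as follows, assuming independent features: each new feature takes the same fraction q of every existing cell, so it overlaps each earlier combination in proportion. Both function names are hypothetical, and the frequencies are passed in directly rather than drawn from a Beta.

```python
def independent_feature_regions(freqs):
    """Build one list of (lo, hi) subintervals per feature on [0, 1).

    Feature k occupies a fraction freqs[k] of every cell created by the
    earlier features (assumption: features are independent).
    """
    cells = [(0.0, 1.0)]
    regions = []
    for q in freqs:
        spans, next_cells = [], []
        for lo, hi in cells:
            mid = lo + q * (hi - lo)
            spans.append((lo, mid))                     # part inside the new feature
            next_cells.extend([(lo, mid), (mid, hi)])   # inside / outside split
        regions.append(spans)
        cells = next_cells
    return regions


def features_at(u, regions):
    """Return the set of feature indices whose region contains the draw u."""
    return {k for k, spans in enumerate(regions)
            if any(lo <= u < hi for lo, hi in spans)}


if __name__ == "__main__":
    regions = independent_feature_regions([0.5, 0.4])
    for u in (0.1, 0.3, 0.6, 0.9):
        print(u, features_at(u, regions))
```

A uniform draw can land in several features at once or in none. The number of cells doubles with each feature, so this is only practical for a handful of features.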

are feature frequency models and EFPFs the same space of distributions?

Yes, all EFPFs can be represented by a feature frequency model and vice versa.

Recap

feature paintbox is a characterization of exchangeable feature models
Discussion of the different classes of exchangeable clustering/feature models and how they overlap

How do we learn a structure

most popular unsupervised approach: K-means (easy, fast, parallelizable).
Disadvantages: only good for a fixed, pre-specified K.
Alternative: Nonparametric Bayes
- Modular,
- flexible (K can grow as data grows),
- coherent treatment of uncertainty.
- not efficient on large data

MAD-Bayes perspective

Inspiration
- finite gaussian mixture model.
- start with a nonparametric Bayes model, take a limit to get to a K-means-like objective.
- K-means: assign data points to cluster centers, minimize distance to cluster centers.
- definitions: the K-means objective is to minimize the sum of squared Euclidean distances from each point to its cluster center.
- Approach: assign all data points (in parallel) to one of K clusters, measure distances. Iterate and minimize distance.
Our model
- not just learn mean, but learn full probability distribution
- Our objective: the maximum a posteriori (MAP) estimate: maximize the probability of the parameters given the data.
analogies:
- Mixture of Gaussians, K-means
- beta process – learn feature maps (?)
example: each feature is a sum of Gaussian bumps.
if each data point belonged to one and only one cluster this returns k-means
Algorithm
- assign each data point to features
- create new feature if it lowers the objective
- update feature means
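A minimal sketch of the assign / create / update loop above, restricted to the clustering special case where each point belongs to exactly one cluster (often called DP-means). This is not the feature version from the talk. The objective assumed here is the sum of squared distances plus `penalty` times the number of clusters, and all names are hypothetical.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(pts):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))


def dp_means(points, penalty, max_iter=100):
    """Cluster points, opening a new cluster when that lowers the objective."""
    centers = [mean(points)]          # start with one cluster at the global mean
    labels = [0] * len(points)
    for _ in range(max_iter):
        changed = False
        for i, p in enumerate(points):
            dists = [sqdist(p, c) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > penalty:    # a new cluster costs `penalty`, saves dists[k]
                centers.append(p)
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # update the means and drop clusters that lost all their points
        groups = {}
        for lab, p in zip(labels, points):
            groups.setdefault(lab, []).append(p)
        used = sorted(groups)
        centers = [mean(groups[k]) for k in used]
        relabel = {k: j for j, k in enumerate(used)}
        labels = [relabel[lab] for lab in labels]
        if not changed:
            break
    return centers, labels


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    print(dp_means(pts, penalty=9.0))
```

The penalty plays the role that K plays in K-means: a large penalty collapses everything into one cluster, a small one lets the number of clusters grow with the data.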
example problem: pictures of objects on tables. Want to find features (ID items on tables).
- MAD-Bayes faster
- gets closer to the annotation; can get a perfect match, with all objects correct and the correct number of features.
what are we giving up?
- don’t enforce size distribution type
- don’t learn full posterior model. Don’t get systematic treatment of uncertainty.
parallelizing
- challenge: the feature choice for each data point depends on the current options of features.
- choose from the current list, update in the next step from all newly created features.

Questions

why do we care about getting the right number of means?
- alternative: just choose something bigger than what we expect and not worry about the ‘dust’ of small things that get assigned their own clusters.
- reply: sometimes more aesthetically appealing not to cap. species discovery — get common ones first, have to look a lot to find infrequent ones.

Posted in Seminars | Comments Off

Protected: Hao practice talk

Posted on February 3, 2014 by admin

Posted in Project Meeting | Comments Off

Sunday 02/02/14

Posted on February 2, 2014 by admin

10:00a – 7:30p, 9:50p – 1:40a

Literature

need an article for journal club in 2 weeks.
eLife article on transcription imaging from Tjian lab.
Feng Zhang lab optical control of transcription.
- Optical control of transcription. Nature, this was published 7 months ago. Why on earth is the pdf STILL labeled ‘Not Final Version’?
These might be a better bet:
- HoxD switch paper
- HoxA regulation paper: Clustering of tissue-specific subTADs

Ph project

Data analysis

start analysis of PhM data on Cajal.
getting images
There’s got to be a better way than this
- A.dax + B.ini/B.xml -> Alist.bin/Aalist.bin + Apars.txt
To recover the ROI we do
- Alist.bin -> Apars.txt -> B.ini/B.xml. Read ROI.

Working on manuscript revisions

Thoughts on manuscript organization

[Some thoughts on the side: Maybe this could be figure 1 by itself as something simple and short. It’s a less complicated experiment I think than some of the later stuff, but I think it’s also a clear and solid message, and one that really takes on a direct debate in the field.
Maybe we could combine figures 5 and 7? They show the mutation affects chromatin binding (at some loci) and gene expression (at some genes). These are absolutely critical experiments, and I think the data is fully convincing. But it’s not really blow-you-away data, the reader anticipates these two effects, and the effects we find are convincing, but small (especially with the Fig 7). So maybe we don’t want to focus so much attention, especially on the expression data, as its own figure; as a final figure I think it’s not a strong punch. I think this is in part a consequence of the cell variability: a reasonable fraction of the total cells in several of the mutant transfections are expressing pretty low levels of the protein, and we systematically don’t pick these guys to image. But we have no way to systematically exclude them in transcriptome profiling (unless you can stain with flag and sort with FACS or something crazy, which I don’t recommend).
Or maybe make the S2 STORM fig 1, combine the multi-color STORM and the Ph-FLAG in fig2? I kinda want to try the DNA FISH immuno doubles (not sure our antibodies will work well, DNA FISH doesn’t play well with a number of antibodies/antigens), but I’m hoping the anti-FLAG at least will still work after FISH. I bet BXC and ANTC colocalize with small clusters of Ph when they are outside of the obvious PcG bodies.

Data collection

start STORM imaging of WT-Ph-Flag + Psc-647 stain
running O/N
still need to image calibration beads

Chromatin Project

Imaging new stains of ANTC in fresh cells, to correct for partial breakdown of sample.

Posted in Summaries | Tagged ANTC, chromatin, Ph | Comments Off

Journal club 12/16/13: Light activated K channel

Posted on February 2, 2014 by admin

Journal club

presented by Guisheng
In Vivo Expression of a Light-Activatable Potassium Channel Using Unnatural Amino Acids
using unnatural amino acids to make irreversibly inducible potassium ion channels.
not clear how this is really an improvement or has advantages over existing approaches.

Posted in Journal Club | Comments Off

Saturday 02/01/14

Posted on February 1, 2014 by admin

6:30p

Goals

break down STORM4 run DONE
Modify STORMrender savedata to save in color images, matching display (currently doing bw)
- also ensure axis square (axis image);

Mentoring

helping Guipeng realign STORM4 for new fast camera

Posted in Summaries | Comments Off

Friday 01/31/14

Posted on January 31, 2014 by admin

10:50a – 11:50p

Goals Today

Order primers for sequencing
work on Ph manuscript
finish hybes to test sample age
image F03F04 samples

Deep sequencing 2

probes to order with sequencing indices
- D12 (375kb) GTGTCGCGTCGGCCAGAAAC
- F11 (180kb) AGGACATTCGCGGCTTTCAG
- G09 (76kb) CGTCGCGTTGGATTCAAGAG
- F12 (17kb) GCGAACGGGCGAACTGTTAC
probes from Hao’s library to order with sequencing indices
- Forward primers: add NEB-universal
- reverse primers: add NEB-index
Jeff’s new dual indexing primers
added to primer table with NEB sequencing adapters

Some interesting literature

Ph Project

Manuscript

working on figures
Just stats:
- just clusters (no weighting)
- Ph-Flag: 552,497 unique clusters from 17 cells
- PhM-Flag: 872,640 unique clusters from 34 cells
- S2 wt cntrl: 474,832 unique clusters from 42 cells
- KS test of PhWT-flag to S2 1.2E-64.
- KS test of S2 to WT is 0.
sent revised Fig 1 to Ajaz (has group meeting again. Kingston lab seems to have group meetings quite a lot more frequently).

Data analysis

configured new RAID drive (12 TB available).
transferred double stain data from PhWt and PhM to new drive.

Chromatin project

Cell staining

finish hybes with fresh cells of F03+F04 G01+G02

STORM

STORM of F03+F04. Calibration spots very bright large conventional, still much tighter than last time. Further evidence that we had a cell degradation issue.
I think this partially also affected the ANTC that were imaged a week before the F03+F04 last time. Let’s repeat that too.
New stains: ANTC (1.1 uL of primary each, .5 uL of secondary). Also G05 alone, and G09 repeat (though I’m pretty sure this one was fine. Can always use repeats, and I actually made a very good batch of this probe).

Posted in Summaries | Tagged chromatin, figures, literature, Ph | Comments Off

Thursday 01/30/14

Posted on January 30, 2014 by admin

9:55a – 10:00p

Chromatin Project

cell fixing and staining

fix new Kc cells
prep for in situ
treat with RNase
stain new Kc cells with F03F04 and G01G02

Ph

Data analysis

wt data has half the number of frames (54,000 vs 102,000) and half the average number of localizations
this creates a problem:
- Fewer localizations means that increasing the binSize in the clustering algorithm causes visibly apparent clusters to be split apart.
- More total localizations increases the fraction of small clusters weakly connected by chance localizations, artificially increasing the cluster size.
potential solution
- for a fixed number of clusters, if we double the number of total localizations, we should be able to sample the area at twice the coverage and still maintain the same localization density. This should also avoid spurious new localizations linking existing clusters (which would happen if we didn’t change the sampling density). To halve the area (twice the coverage) we change the sampling dimension by sqrt(2). So let’s rescale the binsize by the square root of the number of localizations (or localizations per cell).
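The rescaling can be sketched as below, assuming binsize is the side length of a bin (e.g. in nm), so that more localizations means smaller bins. If binsize in the clustering code is instead bins per unit length, the ratio would flip. The function name and the reference numbers are hypothetical.

```python
import math


def rescaled_bin_size(ref_bin_size, ref_localizations, localizations):
    """Scale the bin side so the expected localizations per bin stay constant.

    Bin area scales as ref_localizations / localizations, so the side length
    scales as the square root of that ratio.
    """
    return ref_bin_size * math.sqrt(ref_localizations / localizations)


if __name__ == "__main__":
    # Doubling the localizations halves the bin area: side shrinks by sqrt(2).
    print(rescaled_bin_size(10.0, 1000, 2000))
```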

Posted in Summaries | Tagged chromatin, Ph | Comments Off

Wednesday 01/29/14

Posted on January 29, 2014 by admin

9:10a – 8:00p, 9:50p – 2:30a

Ph project

revising manuscript draft

working on abstract
refocusing discussion of PcG clusters. Need to distinguish between the sparse micron-scale PcG bodies other people have studied and the numerous, nano-scale bodies we focus on.

Notes on Ph project organization / motivation

previous work has divided PcG organization into two structural categories: PcG bodies, and dispersed nuclear signal.

It is not entirely evident that the large bodies visible by conventional microscopy are relevant to gene regulation:
1. For example, many (most?) Pc target genes are found outside of these bodies in a large fraction of cells.
2. These bodies disappear in hypertonic solution, but DAPI dense and H3K27me3 dense domains remain intact. (and presumably so does silencing?)

Our work demonstrates that much of this distinction is a consequence of technical limitations of conventional microscopy, rather than a fundamental difference in organization: PcG proteins are organized in bodies throughout the nucleus. This cluster size is power-law distributed, and only the largest clusters are distinguishable to conventional imaging methods — the small ones just blur together.

We propose that clustering is an important part of PcG repressive activity, and that it is mediated in part through the SAM domain of Ph. This clustering is important not just for the formation of large PcG “bodies”, but for clusters of PcG proteins at a whole range of scales, which function in efficient silencing of target genes.
Additionally we find clustering independent loci … (are these also silenced? e.g. does expression of Ph clustering independent genes change in PcG knockdown or Ph mutant backgrounds?)

Ph data analysis

just need more data on size distribution of bodies. Can get some decent data from the 647 Ph imaging in the respective Ph and PhWt backgrounds (even though the 750 didn’t work well). Imaging of 647 flag indicates high transfection rate (just variable levels). So we should just be able to use nuclei from the slide straight. Be
Started writing ClusterStats function for analysis of Ph data. Each localization gets assigned a cluster and we report that cluster’s area.
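One way such a function could work, as a rough sketch only: bin the localizations on a grid, join occupied bins that touch (8-connectivity) into clusters, and report each localization’s cluster id and cluster area. The actual ClusterStats implementation is not shown in these notes, so the name `cluster_stats`, the connectivity rule and the area definition (occupied bins times bin area) are all assumptions.

```python
from collections import deque


def cluster_stats(locs, bin_size):
    """Return (cluster id, cluster area) for each (x, y) localization."""
    bins = [(int(x // bin_size), int(y // bin_size)) for x, y in locs]
    occupied = set(bins)
    label = {}       # bin -> cluster id
    n_bins = []      # cluster id -> number of occupied bins
    for start in sorted(occupied):
        if start in label:
            continue
        cid = len(n_bins)
        label[start] = cid
        queue = deque([start])
        count = 0
        while queue:                      # flood fill over touching bins
            bx, by = queue.popleft()
            count += 1
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (bx + dx, by + dy)
                    if nb in occupied and nb not in label:
                        label[nb] = cid
                        queue.append(nb)
        n_bins.append(count)
    ids = [label[b] for b in bins]
    areas = [n_bins[i] * bin_size ** 2 for i in ids]
    return ids, areas


if __name__ == "__main__":
    print(cluster_stats([(1, 1), (12, 3), (95, 95)], bin_size=10))
```

With this definition the reported area depends directly on the bin size, which is why comparing datasets with different localization counts needs a consistent choice of bins.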

Team project / mentoring

reviewing slides for Hao’s group meeting, 11a-2:30p

Chromatin Project

Cell staining

Finish hybridizations for F03+F04 and G01+G02 cells.

STORM

start imaging F03 + F04 as a contiguous ~200 kb region of BX-C
dots seem spread out. These cells were also fixed a while ago (~1/6), I think we shouldn’t keep them this long.
plan to repeat staining with fresh fixed cells tomorrow

Posted in Summaries | Tagged chromatin, Library2, Ph | Comments Off

Tuesday 01/28/14

Posted on January 28, 2014 by admin

9:15a – 10:50a, 12:00p – 10:15p

Chromatin Project

Chromatin Analysis

Data Analysis

Continued analyzing D12 data from 10-26-13 up through image 26 (still plenty more dots to image).
F06 analyzed up to but not including image 0_2.

Developing analysis pipeline

updated CC to record locus name
mI is not computed correctly — this value changes depending on my zoom.
found origin of bug in mI computation: multiplied xy by pixelsize before subtracting centroid, so now the two distance measures were in different units. Will need to recompute for current data.
fixed bug in mI computation in ChromatinCropper.
running script to fix mI computation in all currently analyzed data.
if this looks better, let’s rerun it and record the new values into saved data structures.
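A sketch of the unit-consistent computation, assuming mI stands for a moment of inertia taken as the mean squared distance of the localizations from their centroid (the notes do not define it). The point is that coordinates and centroid are both converted to the same units before subtracting. The function name is hypothetical.

```python
def moment_of_inertia(xy_pixels, pixel_size_nm):
    """Mean squared distance (nm^2) of the points from their centroid."""
    # Convert to nm first, then compute the centroid in the same units.
    pts = [(x * pixel_size_nm, y * pixel_size_nm) for x, y in xy_pixels]
    n = len(pts)
    cx = sum(p[0] for p in pts) / n
    cy = sum(p[1] for p in pts) / n
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in pts) / n


if __name__ == "__main__":
    print(moment_of_inertia([(0, 0), (2, 0)], pixel_size_nm=100))
```

Computed this way the value does not change when the whole cluster is shifted, which gives a quick check that a mixed-units bug of the kind described above is gone.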

Working on updated drift correction

changed method of calculating drift error
changed method of determining the guide bead
added
renamed function to follow function naming conventions (now called FeducialDriftCorrection.m). Added original function to a Depreciated folder; need to phase out function calls to the old form, feducialDriftCorrection.m

New cell staining

F03 + F04 (both P1 / A647)
G01 + G02 (both P1 / A647)

project2

computing number of Hamming codewords with 4 ones for 32 hybes (1240).
trick is to just use initial codewords with 4 or fewer ones before assigning parity bits
scaling up to 5 still yields memory errors, because MATLAB won’t matrix multiply logicals.
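The count of 1240 can be reproduced without large matrices, assuming the code in question is the extended Hamming code of length 32 (26 data bits, 6 parity bits). This sketch uses the parity-check view rather than the generate-then-filter approach described above: a word with 4 ones at positions a, b, c, d (numbered 0 to 31) is a codeword exactly when a XOR b XOR c XOR d = 0. The function name is hypothetical.

```python
from itertools import combinations


def weight4_codewords(n_bits=32):
    """List the positions of the ones in every weight-4 codeword.

    Assumes n_bits is a power of two (extended Hamming code of that length).
    """
    words = []
    for a, b, c in combinations(range(n_bits), 3):
        d = a ^ b ^ c        # the fourth position is forced by the parity checks
        if d > c:            # keep each set of four positions once
            words.append((a, b, c, d))
    return words


if __name__ == "__main__":
    print(len(weight4_codewords(32)))
```

Any three positions determine the fourth, so the count is C(32,3)/4 = 1240, matching the number above.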

Posted in Summaries | Tagged chromatin, Library2 | Comments Off

Protected: chromatin analysis in progress

Posted on January 28, 2014 by admin

Posted in Chromatin | Tagged chromatin, Library2 | Comments Off
