Good afternoon, everyone, and thank you for joining us today.
My name is Wayne Matten, and I am joined today by Yumi Jin; we call him Jimmy.
It is important that Jimmy is here because he is the developer of the GRAF software.
He will be answering questions in the questions pod.
If you have questions that you want to ask me or Jimmy and they are about the GRAF
software, I recommend emailing Jimmy.
Our emails are there.
matten@ncbi.nlm.nih.gov jinyu@ncbi.nlm.nih.gov
This is what the plan is for today.
I will introduce GRAF briefly.
Talk about why it was created and how it works at a superficial level.
And mostly do some demonstrations of how to use GRAF.
The two main functions of GRAF are to identify duplicates and closely related samples in your data.
And a relatively new feature is the ability to look at populations and estimate
subject ancestries within those data sets.
That is the basic plan today.
Everything will be in slides.
At the end I may go to the webpages and show you how some of this works, based on a CGI
that Jimmy and folks at dbGaP have developed.
Why was GRAF created?
I mentioned there are two main components but the original one was essentially created
to help deal with the dbGaP data in the sense that dbGaP is a huge database.
There are hundreds of studies now, incorporating well over 1 million individuals.
As you know, most GWAS, genome-wide association studies, make the assumption that individuals
in the study are not closely related.
One function of GRAF is to act as a curation tool to pick out duplicates and very close
relatives.
You can identify those and deal with them as you see fit for your analyses.
As part of that process, and I put these terms here because they may come up in later slides,
there are the subject sample mapping files and pedigree files, which are pretty commonly known files.
Those get used in that first step of looking at relatedness.
The second function we call GRAF-pop, or population, was added fairly recently.
This function helps estimate subject ancestries.
Basically, GRAF deals with the problem of the assumption that individuals are
not related: you can pick out the ones that are.
And it also deals with the issue of using different genotyping platforms.
It normalizes that problem as well.
As part of the population addition, we have started adding links on dbGaP pages.
This is a screenshot of the dbGaP Advanced search page.
There is a link there called population graph.
If you were to open this first ARED study, there would be a link within that first page
of the study that is called ancestry components.
Those will both take you to the same graph.
Which would look something like this.
This is a very simplified data set here with mostly Europeans.
But this is just a teaser of what we will talk about later when we start talking about
the GRAF population function.
Those graphs are appearing on more of the studies.
Very briefly, how does GRAF work? This will be at a superficial level.
If you have detailed questions about this, there is a publication in PLOS ONE that came out
last year.
The link for that is on many of the pages I will show you.
Including the software page where you can download the software.
I highly recommend that paper.
To give you some fundamental background.
Mostly to introduce terms that will be on some slides.
A key component here in identifying the duplicates and closely related subjects is a set of preselected
fingerprint SNPs.
Right now there are about 10,000.
I think there may be plans to add to that.
Jimmy is nodding his head, yes.
Those are very carefully selected.
They represent many of the common genotyping platforms.
They were selected to be bi-allelic and have relatively high allele frequencies.
They were selected to avoid linkage disequilibrium as much as possible.
They are well separated.
They cover, and well represent, the 22 autosomes.
Also, there is no strand information needed in the data set.
None of the SNPs have complementary alleles.
No A/T or G/C pairs.
You don't have to worry about strandedness in your data.
That is a key component.
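To make the point about complementary alleles concrete, here is a minimal Python sketch. It is not GRAF code; the function name and the example SNPs are hypothetical. It shows why an A/T or C/G SNP is strand-ambiguous: reading it from the opposite strand gives back the same pair of alleles, so you cannot tell which strand was reported.

```python
# Complementary base pairs: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1: str, allele2: str) -> bool:
    """Return True when the two alleles are complements of each other (A/T or C/G)."""
    return COMPLEMENT[allele1.upper()] == allele2.upper()


# Hypothetical example SNPs, each given as a pair of alleles.
snps = [("A", "G"), ("A", "T"), ("C", "G"), ("C", "T")]

# Keep only SNPs whose alleles are not complementary, so strand does not matter.
unambiguous = [snp for snp in snps if not is_strand_ambiguous(*snp)]
print(unambiguous)
```

An A/G SNP read from the other strand becomes T/C, which shares no allele with A/G, so the flip is detectable. An A/T SNP read from the other strand is still A/T.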
There are two statistical metrics.
The all genotype mismatch rate, AGMR.
And the homozygous genotype mismatch rate, HGMR.
Those definitions are fairly self-explanatory there.
You will be seeing those on some of the graphs we look at.
That is for the relatedness aspect of GRAF.
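As a rough illustration of the two metrics, here is a minimal Python sketch. It assumes a reading of the names that should be checked against the paper: AGMR is the fraction of mismatched genotypes among SNPs called in both samples, and HGMR is the fraction of mismatches among SNPs where both samples are homozygous. The 0/1/2 genotype coding and the sample data are hypothetical, and this is not GRAF's own code.

```python
from typing import Optional, Sequence, Tuple

# Assumed coding: 0 or 2 = homozygous, 1 = heterozygous, None = missing call.
Genotype = Optional[int]


def mismatch_rates(g1: Sequence[Genotype], g2: Sequence[Genotype]) -> Tuple[float, float]:
    """Return (AGMR, HGMR) for two samples genotyped on the same ordered SNP list."""
    both_called = any_mismatch = 0
    both_homozygous = homozygous_mismatch = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip SNPs missing in either sample
        both_called += 1
        if a != b:
            any_mismatch += 1
        if a != 1 and b != 1:  # both samples homozygous at this SNP
            both_homozygous += 1
            if a != b:
                homozygous_mismatch += 1
    agmr = any_mismatch / both_called if both_called else float("nan")
    hgmr = homozygous_mismatch / both_homozygous if both_homozygous else float("nan")
    return agmr, hgmr


# Hypothetical genotypes for two samples at six fingerprint SNPs.
sample1 = [0, 1, 2, 2, None, 0]
sample2 = [0, 2, 0, 2, 1, 1]
agmr, hgmr = mismatch_rates(sample1, sample2)
print(f"AGMR={agmr:.3f} HGMR={hgmr:.3f}")
```

In this toy example five SNPs are called in both samples and three of them differ, and three SNPs are homozygous in both samples and one of those differs.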
The population component is a bit more complicated.
And very cool.
I will not be able to go into the math here because I don't understand it that well.
At a superficial level, these subjects get clustered using genetic distances from some
reference populations.
And those clusters get projected onto a plane and based on their barycentric coordinates
you can look at the position of those clusters and get some idea as to where they fall relative
to the reference populations.
Also, part of the process is to compute these statistical metrics, genetic distances one
through four.
We will look at how plotting those in different ways will give you different views of these
planes and almost rotate the planes.
We will be looking at those plots in a little bit of detail as we go along.
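For anyone unfamiliar with barycentric coordinates, here is a minimal Python sketch of the idea. It is only an illustration of the geometry, not GRAF-pop's actual calculation, and the triangle vertex positions are hypothetical. A point inside a triangle is expressed as three weights that sum to one, one weight per vertex, so a point near a vertex has a weight close to one for that reference population.

```python
from typing import Tuple

Point = Tuple[float, float]


def barycentric(p: Point, a: Point, b: Point, c: Point) -> Tuple[float, float, float]:
    """Return the weights (wa, wb, wc) of point p relative to triangle a, b, c."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("the three vertices are collinear")
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb


# Hypothetical vertex positions for the three reference populations.
african = (0.0, 0.0)      # lower left
east_asian = (1.0, 0.0)   # lower right
european = (0.5, 1.0)     # top

# A subject plotted at the centre of the triangle gets equal weights.
weights = barycentric((0.5, 1.0 / 3.0), african, east_asian, european)
print(weights)
```

A subject plotted exactly on a vertex gets weight one for that population and zero for the other two.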
Okay, one of the main questions is how will it help you?
You can download the GRAF-pop software.
And run that locally on either your own data sets or the dbGaP data or other data.
One possible use case, in addition to trying to pick out some errors in your data, is that you
can also answer questions like: have the subjects that I want to add to a new project
already been represented in other studies?
You want to be aware of that.
And then another good use is if you are submitting data to dbGaP, the curators can now send you
a URL based on the CGI that Jimmy and others at dbGaP developed.
And these URLs will show you the plots themselves.
So you can get some checks on the data you are submitting before it is finalized and
that is very helpful in the process.
Those links will be similar to some of the ones that are showing up on the web pages.
Where can you get the GRAF package?
If you do an Internet search for dbGaP software, one of the first pages will be this page of
software.
You can download the zip file there.
It is compiled only for GNU/Linux, and there is a link to the paper I was talking about, to give you more
details on how GRAF-pop works.
Are there any plans to compile this on other platforms or make the source available?
No immediate plans to do that.
We will see what happens in the future.
I will give you a quick look at what is in that package.
I will not run anything on the command line today.
This will give you an idea.
There are two very good documents.
The readme on top is enough to get you going.
Running this on the command line is very simple.
The pop documentation was added later; it describes the population graphing function.
Then you have four programs in there: two Perl scripts to do the plotting,
and two types of GRAF programs.
If you want to run GRAF-pop, -pop is an option on the GRAF command line.
There is the set of fingerprinting SNPs.
An example command line could be as simple as this:
graf -plink affy_hapmap -out aff_hapmap_rels.txt
With the option -plink, most of you are probably familiar with what a PLINK set is.
It includes these three file types: .bed, .bim, .fam.
You specify the PLINK set and simply output a table of those relationships.
There are other options if you need to have a sample subject mapping file,
because your sample and subject IDs may differ.
You can add that, or you can add a pedigree file, whatever your data set requires.
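If you want to script that run, here is a minimal Python sketch. It uses only the two options spoken in the example above, -plink and -out; please check the readme in the package for the exact syntax before relying on it. The helper names are hypothetical. It checks that the three PLINK files exist for a given prefix and builds the command as a list.

```python
from pathlib import Path
from typing import List

# A PLINK binary set is three files that share one prefix.
PLINK_EXTENSIONS = (".bed", ".bim", ".fam")


def missing_plink_files(prefix: str) -> List[str]:
    """Return the PLINK files for this prefix that are not present on disk."""
    return [prefix + ext for ext in PLINK_EXTENSIONS if not Path(prefix + ext).is_file()]


def build_graf_command(plink_prefix: str, out_file: str) -> List[str]:
    """Build the command shown in the talk; the flags are taken from that example only."""
    return ["graf", "-plink", plink_prefix, "-out", out_file]


cmd = build_graf_command("affy_hapmap", "aff_hapmap_rels.txt")
print(" ".join(cmd))

missing = missing_plink_files("affy_hapmap")
if missing:
    print("Missing input files:", ", ".join(missing))
# On a GNU/Linux machine with graf installed and the inputs present, you could run:
# import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than one string avoids shell quoting problems if a file prefix contains spaces.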
After you run that first GRAF run to get that output file, you can plot that using one of
the Perl scripts.
That is the GRAF plotting Perl script.
You give it the output file from the previous step, and it will create a PNG file.
Here's an example of what the PNG output might look like.
That is enough for the command line.
I want to jump into some results to explain what some of these plots mean.
The top row represents the relatedness functions.
And after looking at a few of those we will look at some of the population scatterplots
as well.
In the simplest case here, we are plotting on the X axis the all genotype mismatch rate
against some number of pairs on the Y axis.
In this case we are looking at duplicate samples and monozygotic twins.
In other words, duplicates.
You can adjust the number of pairs.
If we set that number of pairs to 50, we basically zoom in on the graph.
And so you can start to pick out some things that would be good to know about.
For example, the gray at the top of the identical pairs shows you that there are some that are
not reported in this data set.
I should also point out that the vertical lines are rough estimates of where identical
pairs would fall out,
and where full siblings would fall out.
PO stands for parent offspring.
D2 D3 are 2nd degree and 3rd degree relatives.
And Un is unrelated.
In this data set it looks like there are some reported duplicate pairs
that might actually be first-degree relatives, based on this all genotype mismatch rate.
You should definitely be well below 25% if you want to identify something as identical
or monozygotic twins.
We also see out here a couple that were reported as monozygotic twins.
But the rate is very high.
Around 55%.
Which would indicate in fact they are probably unrelated.
That is the sort of checks this plot can give you.
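That kind of check can be sketched in a few lines of Python. This is only an illustration: the 25% cutoff is the rough upper bound mentioned above, not a GRAF parameter, and the pair records and field names are hypothetical. GRAF itself uses its own statistics and cutoffs to call relationships.

```python
from typing import Dict, List

# Rough upper bound from the talk: identical pairs should be well below 25% AGMR.
MAX_AGMR_FOR_IDENTICAL = 0.25


def questionable_identical_pairs(pairs: List[Dict], max_agmr: float = MAX_AGMR_FOR_IDENTICAL) -> List[Dict]:
    """Return pairs reported as duplicates or MZ twins whose AGMR is too high for that."""
    reported_identical = {"duplicate", "MZ twin"}
    return [p for p in pairs if p["reported"] in reported_identical and p["agmr"] >= max_agmr]


# Hypothetical pair records: an ID, the reported relationship, and the AGMR.
pairs = [
    {"pair": "P1", "reported": "duplicate", "agmr": 0.01},
    {"pair": "P2", "reported": "duplicate", "agmr": 0.35},
    {"pair": "P3", "reported": "MZ twin", "agmr": 0.55},
    {"pair": "P4", "reported": "unrelated", "agmr": 0.56},
]

for p in questionable_identical_pairs(pairs):
    print(p["pair"], p["reported"], p["agmr"])
```

Here P2 and P3 are flagged for review; P1 is consistent with a duplicate, and P4 was never reported as identical.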
Again, that was the AGMR against the number of pairs.
You can also plot the HGMR on the X axis.
And the number of pairs on the other axis.
Here we are looking at full sibling, parent offspring, second-degree, and third-degree pairs; the degree of relatedness
is decreasing as you go to the right.
And if we look at that in some detail, we see that some of these parent offspring pairs
are not reported.
Same for the full sibling pairs.
So the gray are the ones not reported.
It looks like there are some second-degree relatives not reported.
And also in the unrelated, it looks like they thought these were monozygotic twins or full
sibling pairs, but they are actually probably unrelated.
We can zoom in on that to see it better.
So you see the color coding for the monozygotic twins.
Probably unrelated based on the GRAF calculations.
If you plot the HGMR versus the AGMR, with AGMR on the Y axis, you get this scatterplot.
Here we have parent offspring,
full sibling, 2nd degree and 3rd degree relatives.
These ellipses here represent where 95% of the calculated values would occur for those particular types.
You see the full sibling is pretty good.
But one good reason to do this is that you can start to pick out outliers more easily in this
type of plot.
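One way to think about a 95% ellipse is sketched below in Python. This is an assumption for illustration, not a description of how GRAF draws its ellipses: it treats the (HGMR, AGMR) values for one relationship type as roughly bivariate normal, and the center and covariance values are hypothetical. A point is inside the ellipse when its squared Mahalanobis distance from the center is below the 95% quantile of a chi-square distribution with two degrees of freedom.

```python
import math
from typing import Sequence, Tuple

# 95% quantile of a chi-square distribution with 2 degrees of freedom (about 5.99).
CHI2_2DF_95 = -2.0 * math.log(0.05)


def mahalanobis_sq(point: Tuple[float, float], center: Tuple[float, float],
                   cov: Sequence[Sequence[float]]) -> float:
    """Squared Mahalanobis distance of a 2-D point for a symmetric 2x2 covariance."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    sxx, sxy, syy = cov[0][0], cov[0][1], cov[1][1]
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def inside_95_ellipse(point, center, cov) -> bool:
    """True when the point falls inside the 95% ellipse for this center and covariance."""
    return mahalanobis_sq(point, center, cov) <= CHI2_2DF_95


# Hypothetical center and covariance for one relationship type, as (HGMR, AGMR).
center = (0.10, 0.40)
cov = [[0.0004, 0.0], [0.0, 0.0009]]

print(inside_95_ellipse((0.11, 0.41), center, cov))  # close to the center
print(inside_95_ellipse((0.18, 0.40), center, cov))  # far out along the HGMR axis
```

Points that fall outside every ellipse are the outliers worth a closer look.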
Let's move on to the GRAF population examples.
This is how you can infer subject ancestry.
As an example we will look at some real data here.
By default, this plot has genetic distance one, GD one, on the X axis, and
GD two on the Y axis.
Here is the plane I was talking about.
On this vertex is European.
African in lower left.
East Asian in lower right.
These are color-coded by the reported ancestry.
And then the positions are what are calculated by GRAF.
You can see there are some mixtures in here.
For example, in this region here, self-reported Hispanics and Asian Indians are overlapped
in this area.
And on this axis, self-reported Hispanics overlap with African backgrounds.
There is a pretty clear cluster of Asian Indians here.
This is how it looks in that particular plane.
But if you then change the plot, I will show you how to do this on the webpages, you put
GD4 on the X axis, then you begin to separate out some of these populations.
For example, Hispanics are nicely separated now from the South Asians.
Also, it separates these two Asian populations a little better.
That is the real advantage of going to that particular type of plot.
One more example of that.
Here, we have a mixture of South Asians with Hispanics.
This is the GD one versus GD two.
And if we plot that against GD four, you begin to get a nice separation of those.
Again, another good example of plotting and using these different metrics and how they
help you tease apart the ancestries.
Let's go over to the webpage and see if I can demonstrate some of this.
I'll go over to the advanced search page.
Here is an example of the population graph link.
Click on that.
Then you see the plot we were looking at before.
You also have some options over here to change what you are looking at.
Plot GD 4 and that type of thing.
What I wanted to show was that here you have options to change the plotting.
Right now this is the HGMR plot against the number of pairs we saw earlier.
All you have to do to get this scatterplot is click on the HGMR plus AGMR button,
and that will replot it.
In terms of the population graphs, here is one where we are just looking at the plane.
We have some mixtures here.
Maybe you can tease that out better by plotting with GD 4 on the Y axis.
Check that and it redraws that.
How many options you have depends on your access to the data.
On the URLs that the curator might send you,
you have access to that data, so you will get more options.
If you don't have the access, you are more limited in options.
I think that gives you a pretty good idea of what you can do with the plotting functions.
And that was most of what I wanted to show you.
Were there any questions, Jimmy, that came in that you think would be appropriate for
everybody to see?
So the question is, could GRAF be downloaded onto Android?
So it is possible, but it has not been tested.
That is a good thing for us to test out.
Any other questions?
Remember, there will be a document in the materials directory.
And there will be a link to the materials on the courses and webinar pages as well.
Feel free to email me or Jimmy at those email addresses on the second slide.
For general questions you can send emails to the info address, info@ncbi.nlm.nih.gov.
If we have no other questions, I will wrap this up.
Thanks again for showing up.
Stay tuned for upcoming webinars.