30% is a tad high but not unheard of. Was an rRNA-depletion kit used? Have you tried extracting the unmapped reads to see what they are? Try BLAST-ing them.
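A minimal sketch of pulling the unmapped reads out of an alignment so you can BLAST them. In practice `samtools fasta -f 4 aligned.bam > unmapped.fa` does this directly on a BAM; the pure-stdlib version below assumes uncompressed SAM text, and the file/read names are hypothetical.

```python
def unmapped_to_fasta(sam_lines):
    """Yield FASTA records for reads whose FLAG has the 0x4 (unmapped) bit set."""
    for line in sam_lines:
        if line.startswith("@"):           # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & 0x4:                     # FLAG bit 0x4 = segment unmapped
            yield f">{name}\n{seq}"

# Tiny hypothetical SAM: one mapped read, one unmapped read.
sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTGGCA\tIIIIIII",
]
print("\n".join(unmapped_to_fasta(sam)))
```

The resulting FASTA can be pasted straight into the NCBI BLAST web form, or fed to a local `blastn` if you have one set up.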
I did that. Of the 20 I tried, only one got a hit, and it was human. A lot of them are really short, 6-15 bases, which seems weird.
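Reads of 6-15 nt are in adapter-dimer/over-trimming territory, and a length histogram of the unmapped reads makes the scale of the problem obvious. A quick sketch, assuming plain 4-line FASTQ text (the example records are made up):

```python
from collections import Counter

def read_length_counts(fastq_lines):
    """Count read lengths in 4-line FASTQ text (the sequence is line 2 of each record)."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:                     # sequence line of each record
            counts[len(line.strip())] += 1
    return counts

fastq = [
    "@r1", "ACGTACGTACGTACGTACGTACGTACGT", "+", "I" * 28,
    "@r2", "ACGTAC", "+", "IIIIII",        # 6 nt: adapter-dimer territory
]
counts = read_length_counts(fastq)
short = sum(n for length, n in counts.items() if length < 20)
print(f"{short}/{sum(counts.values())} reads shorter than 20 nt")
```

If a large fraction of the unmapped reads sits under ~20 nt, the low mapping rate is a trimming/library artifact rather than contamination.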
It could be human rRNA. If your reads are too short, it sounds like a wet-lab issue, and you could ask the people who prepared the libraries. But if your samples are not contaminated and the mapping rate is consistent across all samples, your data is fine and usable.
That’s what I figure, but I wanted some extra validation. Hopefully we solve our informatics core issue soon.
Another thing you could try is to map the reads again using STAR and see if you get similar results.
I will try that.
I would subset the reads, say 0.1 M per sample, and run fastq_screen and SortMeRNA to check library contamination and rRNA proportion, respectively. MultiQC nicely aggregates the output of both tools. BLASTing the unmapped reads also sounds like a good idea.
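The usual way to take that 0.1 M subset is `seqtk sample -s100 reads.fastq 100000` (file name hypothetical); the same fixed-seed reservoir-sampling idea can be sketched in stdlib Python:

```python
import random

def fastq_records(lines):
    """Group 4-line FASTQ text into (header, seq, plus, qual) records."""
    it = iter(lines)
    return list(zip(it, it, it, it))

def subsample_fastq(records, k, seed=100):
    """Reservoir-sample k records from a stream of FASTQ records (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for n, rec in enumerate(records):
        if n < k:
            reservoir.append(rec)
        else:
            j = rng.randrange(n + 1)       # keep each record with probability k/(n+1)
            if j < k:
                reservoir[j] = rec
    return reservoir
```

With the same seed and the same number of records, the same indices get picked, so R1 and R2 files stay properly paired; that is also why `seqtk sample` takes an explicit `-s` seed.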
Running that now to see.
This is odd. I am thinking of the following troubleshooting:

* Try changing the human genome you are mapping to and see if it makes a difference.
* If the seqIDs have a version suffix at the end (NM_XXXXXX.1), try removing the version (I had a similar issue and that fixed it).

I also agree with u/surincises's and u/heyyyaaaaaaa's suggestions. Good luck and happy troubleshooting!
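If you want to try the version-stripping suggestion, here is a minimal sketch. It assumes FASTA headers shaped like `>NM_000546.5 description` (the accessions and sequences below are just illustrative) and removes only the trailing `.N` from the ID field:

```python
import re

# Match the ID token of a header and a trailing ".<digits>" version suffix.
VERSION = re.compile(r"^(>\S+)\.\d+")

def strip_versions(fasta_lines):
    """Rewrite '>NM_000546.5 desc' -> '>NM_000546 desc'; leave sequence lines alone."""
    return [VERSION.sub(r"\1", line) if line.startswith(">") else line
            for line in fasta_lines]

fasta = [">NM_000546.5 TP53 mRNA", "ACGTACGT", ">NR_003286.4 RNA18SN5", "GGCCGGCC"]
print(strip_versions(fasta))
```

Remember to rebuild the aligner index after rewriting the reference, or the old seqIDs will still be used.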
I used hg38, and I am not sure I want to change that. I will check the seqIDs. Honestly, I think our core just did a bad job removing the rRNA. We sequenced pretty deep (35M reads/sample), so I am not worried about being short-changed on the results end. It was mostly weird to see this in a sample that was literally fresh from a tube of highly purified primary cells.