It's really hard to decide what data is acceptable, especially when user has permission to insert HTML content through form. [...] However, problem can be solved, and quite easily. Almost a year ago I was reading some random blog when I find out about HTML Purifier. Basically, it's library which can filter and fix any HTML.
He gives an example - running a web scraping tool against a site with malformed HTML. By running it through the HTML_Purifier package first, the errors were corrected and the "more correct" HTML source could be parsed easily. The package also helps to protect from XSS attacks via a whole set of filters included by default.