Blog of (former?) MySQL Entomologist: 2014

Sunday, September 28, 2014

Fun with Bugs #34 - Who has fun to verify your bugs (based on 5.6.21)

This time I decided not just to review all community bugs fixed in recent MySQL 5,6 GA release, 5.6.21, but also to mention who reported them (Morgan does this recently) and, most important for this post - who verified them.

As I've explained long time ago, verification is an important part of a bugs life cycle in MySQL. We need some MySQL engineer to check the bug and make sure there is a repeatable test case for it or it is at least clear what is the real problem behind the report. Bug must be "Verified" (confirmed internally) before developers start to work on the fix.

Usually this engineer is somebody from the bugs verification team where I worked for 7 years. But we later agreed that if a bug is reported by MySQL Support Engineer or Developer, it's accepted to set it to "Verified" immediately, without waiting on somebody else from bugs verification team to pay attention and double check.

So, while reporting and fixing are, surely, important stages, verification is also important as it helps developers to concentrate on real bugs only and isolate them from most false positives and user complains. I've noted (at was surprised) that it seems more than 45% of all bugs recently fixed were actually verified by one my former colleague in Oracle since 2011 (who used to work as a regular support engineer during EMEA hours mostly), Umesh Shastry (that you may know as "Umesh Umesh" in the bugs database for the reason only him and Oracle probably knows about).

This is what we have in MySQL 5.6.21 with regards to bugs from http://bugs.mysql.com:

Bug #70641 - 5.6 partitions use much more memory than 5.1. Reported by famous Yoshinori Matsunobu and verified by Umesh (no details shared in public). As a side note: I think it was me who introduced a habit not just to write "verified" in a comment, but also to share all the details and steps, just copy/pasting outputs from mysql command line client or terminal. (I've got this habit while following famous Oracle guru Tom Kyte.) So, I am always sad when details about the exact steps performed and versions used are no shared...
Bug #73650 is still private and we can see no details. We can only rely on public description in the release notes: "An ALTER TABLE ... ADD FOREIGN KEY operation could cause a serious error." I know what the bug is about as I've reported a duplicate of it based on Percona Server bug report (that is public as far as I remember). many community users know all the details about the crash (let's use proper words, it was a crash). Bug is fixed. But we see nothing, neither about test case, nor about the fix (even though it's visible when one checks what changed since last version in bzr/source code repository). If you ask me, this Oracle policy is applied in a wrong way to open source software, to put it into polite words. One day I'll write long text about this approach and historical reasons for it...
Bug #72453 - InnoDB Memcached Plugin "gets" command. Reported by qixiu xiao and verified by Umesh.
Bug #72586 - memcached daemon crashes and restarts on second ADD operation. Reported by David Schwartz and verified by Umesh.
Bug #72594 - InnoDB in-place ALTER failures block future ALTERs. Reported by famous code contributor, InnoDB guru and great conference speaker, Jeremy Cole. Recently we see his name in all release notes. This bug was verified by Miguel Solorzano, who was my team manager and trainer since day one in MySQL. Nice to see that he still not only manages a successful (and key!) team in Oracle MySQL Support, but also finds time to do real work.
Bug #72137 - inaccurate handling of srv_activity_count. Reported by famous InnoDB guru Inaam Rana and verified by Sveta Smirnova. There is no much need to explain who is Sveta and what she does - she speaks, writes and works for herself, and her work is recognized and awarded in Oracle. Check her sessions at MySQL Connect 2013, if you wonder, or read her book. I am just happy to find out that she still works on bugs processing.
Bug #67718 - InnoDB drastically under-fills pages in certain conditions. Reported by Jeremy Cole and verified by Sinisa Milivojevic. I am happy to call Sinisa my teacher, in diplomatic skills if not in bugs processing or reading the code. If any of my former colleagues from MySQL asked what I am doing in Percona, I'd reply that I play a role of Sinisa here, "ideal" Sinisa as I wanted him to be, not the "real" one. I am not good enough in playing the role of real one, as I lack diplomatic skills, to begin with.
Bug #71695 - Concatenation of mysqlbinlog output does not work with GTID. Reported by Sadao Hiratsuka and verified by Umesh.
Bug #70327 - Assertion error when setting future binlog file/pos with semisync. Reported by Yoshinori Matsunobu and verified by yet another my former colleague since MySQL AB times, Mikiya Okuno. Nice to see many MySQL engineers contributing to bugs processing, not just members of bugs verification team.
Bug #69873 - Replication stop with "Error in Xid_log_event: Commit could not be completed". Reported by Alexander Du and verified by Umesh.
Bug #71047 - shutdown hang if semisync is enabled and slave abort. Reported by a great community contributor to MySQL, zhai weixiang. Verified by Umesh.
Bug #72788 - HASH_SCAN seems broken: Can't find record in 't1' Error_code: 1032. Reported by Shane Bester, who probably works more on community bugs reports than all other engineers outside bugs verification team together. As I mentioned more than once, Shane is a QA department for MySQL by himself, and it had always been the case, no matter what company he worked at. Bug was verified, surprise surprise, by Umesh.
Bug #72313 - stop sql_thread, start sql_thread causes a trx to log with a different GTID. Reported by Santosh Praneeth Banda. Not sure who verified it, but probably the reporter himself.
Bug #72578 - Duplicate slave server_uuid in replication forum gives no specific error message. Reported by Hartmut Holzgraefe, former my colleague at MySQL AB and Sun. He helped me a lot during my first days in MySQL and he still reports a lot of bugs. Bug was verified by Umesh.
Bug #72901 - Keep CLIENT_REMEMBER_OPTIONS for compressed connections. Bug was reported by Ramil Kalimullin, whom I miss a lot since MySQL developers conference in Riga, 2008 (as many others, it was last team meeting or conference for me till I quite from Oracle). He verified the bug himself it seems.
Bug #73324 - Client flags are getting discarded due to incorrect assignment operator. Reported by Sujatha Sivakumar. She then verified and fixed the bug herself it seems. I am always happy to see Oracle employees and customers using public bugs database. You know why, I hope...
Bug #72610 - Out of memory replicating using RBR and slave having different data types. Reported by Jean-Francois Gagn and verified by Umesh.
Bug #70429 - list handling incorrect in mysql_prune_stmt_list(). Reported and verified by yet another former colleague from MySQL Support, Andrew Dalgleish.
Bug #47641 - LIKE doesn't match when cp932_bin/sjis_bin collations are used. Reported and verified by Mikiya Okuno.
Bug #73507 - On EL7, installation of MySQL from RPM packages could fail if postfix had previously been installed using yum. It is still private, even though I fail to see what "security implications" it may have, based on that description, comparing to any other public bug report about upgrade being screwed up. There are many, public ones...
Bug #72066 - mysql_upgrade duplicate key error for mysql.user for 5.5.35+, 5.6.15+, 5.7.3+. Reported and verified by Jesper wisborg Krogh, again, former colleague who still works in Oracle.
Bug #67088 - When the general_log is a socket and the reader goes away, mysql quits working. Reported by Rolf Martin-Hoster and verified by Sveta Smirnova.
Bug #71433 - recreate+analyze OPTIMIZE TABLE T and online ALTER TABLE T may deadlock. Reported by my colleagues from Percona and great code contributor, Laurynas Biveinis. Recently he reached 100 as a number of public MySQL bugs reported. Time to review and update my list of famous bug reporters, it seems... Bug was verified by Umesh.
Bug #72547 - Query cache not invalidated on cascade delete if db name has special symbols. Reported by Elena Stepanova, whom I met first in Riga, 2008. He contribution to QA in both MariaDB and upstream MySQL is outstanding - we can see her name in every issue of community release notes by Morgan. Bug was verified by Umesh.

So, out of 24 community bug reports 11 were surely verified by Umesh Shastry. This is almost 46% of them all. Now you know who seems to play the most important role in community bugs processing. I hope I am not the only one who noted that I am now successfully replaced in my former main job role, by a single person (whom I trained to some extent while we still worked together in Oracle).

One can probably make more conclusions based on above. For example, it looks like that most "community" bugs fixed are actually from former or current MySQL support engineers and developers, not just "users"... But let me stop on highlighting names at the moment. I am sure that some of these names were rarely mentioned in public before.

Saturday, May 31, 2014

Fun with Bugs #33 - bugs fixed in MySQL 5.6.19

It seems Oracle released MySQL 5.6.19 yesterday. So, it's time to check what community bug reports are fixed there.

Let's start with InnoDB. We have the following bugs fixed:

Bug #72079, "Upgrade from 5.6.10 to 5.6.16 crashes and leaves unusable DB". Honestly I had not noted this bug, even though I upgrade several instances on Windows to each and every release since 5.6.8. Probably because I had no need to use FTS indexes till recently...
Bug #71014, about two extra memset calls that are now removed. I hope the patch bug report provided was just used.
Bug #69641, "Crash when enabling innodb_table_monitor (assertion: ib_table->stat_initialized)". Good to see one more crash fixed.
Bug #62534, "off by one error in innodb_max_dirty_pages_pct logic". Reported by Domas and fixed by Facebook long time ago. But it seems current fix from Oracle is somewhat more complex than just replacing ">" with ">="... Maybe my comment about Informix having decimal (instead of integer) as a data type for similar variable helped :)

The following replication bugs are fixed in 5.6.19:

Bug #71070, "kill of truncate table will lead to binary log written while rows remains". I am really happy to this it fixed finally!
Bug #71376, "restart of slave instance fail in GTID replication if we use replicate-ignore-db". Also great to see this fixed. Note that to add more confusion, Bug #71326 is mentioned instead as replication bug that is fixed, but this is actually a MySQL Workbench bug. I tried to guess what other community bug it could really be, and ended up with this one. Let's home release notes will be fixed soon accordingly.
Bug #71179, "CREATE TABLE .. SELECT produces invalid structure, breaks RBR". Replication is just one of the areas affected though. Let me quote details of the fix (added by Laurynas Biveinis, he does this for each and every community bug fixed. I wonder why Oracle does NOT care to do that themselves?):
```
  This fix changes the DIV operation implementation to correctly
  evaluate the display width of its result. We check if DIV's
  results estimated width crosses maximum width for integer
  value (21) and if yes set it to this maximum value.
```

Bug #70891 is also listed as fixed, but it is still private, so we can only guess what exactly it was about or just trust release notes...

On top of that, few old bugs from other categories were also fixed:

Bug #66803, "Allow compilation using Clang in C++11 mode". Clang is becoming popular. Still not sure why it took so much time to fix the problem when original bug report had a patch.
Bug #71089, "CMake warning when generating Makefile". No more warning for new CMake versions.
Bug #51840, "infinite hang with 100% cpu usage with load data local and import errors". Reported by Shane more than 4 years ago, it was fixed only in 5.6.19. This fix may be the only really good reason to upgrade to 5.6.19, by the way!
Bug #69683, "Optimize after a delete returns error 0 can't get stat of MYD file". Actually, it was a problem for MyISAM tables with .MYD files larger than 4GB on Windows. If you care about Windows and MyISAM (really?), then you have one more good reason to upgrade.
Bug #69684, "The update and select behave differently under the same conditions". Probably the only fix from the Optimizer team in this release.

That's all, folks! Just 13 community bugs fixed over 2 months, according to the official Release Notes (that has a bug...). I am not impressed and I see a reason to upgrade only if you care about top INSERT performance for InnoDB, or if you use GTID and/or row-based replication setup, or may get LOAD DATA LOCAL statements for files with uncontrolled content. Other fixed are for some rare/corner cases or compilation problems, and I'd happily wait for 5.6.20 to fix some more bugs I so much care about...

Wednesday, April 2, 2014

Fun with Bugs #32 - some bugs I've reported in March

Comparing to the previous month I was not really productive bug reporter in March 2014 (partially because I spent few days at a nice FLOSS UK conference where I tried to give a session on PERFORMANCE_SCHEMA). Just 12 reports, of them 5 documentation requests are already closed. There are some interesting reports among other 7 to write about though.

But let me start with good (or not entirely good) news about my older report, Bug #71858 (easy way to crash MySQL with single SELECT statement, at least on Windows and maybe everywhere). The bug seems to be fixed by Oracle, but the fix will appear in 5.5.38, 5.6.18, and 5.7.5, not in recent official releases.

Off all the documentation requests I'd like to highlight this one, Bug #71916, "Manual does NOT explain locks set for UPDATE ... WHERE PK='const' properly". I was so much surprised to see even basic details about InnoDB locks ad monitoring them not entirely or correctly documented thatI even decided to submit a talk about this of next Oracle's MySQL Connect event (and I tried once, but then probably clicked at some wrong place and my submission disappeared).... Do you think this makes any sense, or should we better just wait for MySQL Documentation team to explain it all eventually?

Another important cases of the "missing manual"are presented by my Bug #72047, "Manual does not explain names for indexes generated for FOREIGN keys properly", that I've reported while trying to study what happens when MySQL creates index for the foreign key implicitly and then InnoDB "switches" to a new suitable index that is explicitly created later (more on this to be written one day...), and by the latest my report, Bug #72133, "Manual does not provide enough details about InnoDB FTS implementation".

While working on a (usual) customer issue about ibdata1 growing even with innodb_file_per_table=1 in MySQL 5.6 I've noted more serious missing detail about FTS implementation represented by a sever bug report, Bug #72132, "Auxiliary tables for InnoDB FTS indexes are always created in shared tablespace". Had you even noted tables like these?

mysql> select name, space from information_schema.innodb_sys_tables where name like 'test/FTS%';
+----------------------------------------------------+-------+
| name                                               | space |
+----------------------------------------------------+-------+
| test/FTS_000000000000002e_0000000000000059_INDEX_1 |     0 |
| test/FTS_000000000000002e_0000000000000059_INDEX_2 |     0 |
| test/FTS_000000000000002e_0000000000000059_INDEX_3 |     0 |
| test/FTS_000000000000002e_0000000000000059_INDEX_4 |     0 |
| test/FTS_000000000000002e_0000000000000059_INDEX_5 |     0 |
| test/FTS_000000000000002e_0000000000000059_INDEX_6 |     0 |
| test/FTS_000000000000002e_BEING_DELETED            |    35 |
| test/FTS_000000000000002e_BEING_DELETED_CACHE      |    36 |
| test/FTS_000000000000002e_CONFIG                   |    37 |
| test/FTS_000000000000002e_DELETED                  |    33 |
| test/FTS_000000000000002e_DELETED_CACHE            |    34 |

...

Not only they are not documented anywhere outside this old article, but those *_INDEX_* ones are always created in shared tablespace! I consider this a server bug to fix, not just something to document.

Probably I am not the first one, but I consider this mysql client problem, Bug #72108, related to libedit now used by Oracle mysql client instead of readline really annoying... Some users even claim that other versions/forks of MySQl are already good enough just because they still use readline :)

Finally, one of my reports ended up as "Not a bug": Bug #71978, "Server silently allows to set PERFORMANCE_SCHEMA as default_storage_engine". I've got a kind and polite offer like this:

"Please double-check the documentation available at http://dev.mysql.com/doc/ and the instructions on how to report a bug at http://bugs.mysql.com/how-to-report.php"

that I probably has to be insulted by, as a person who had not only reported many dozens of good "Verified" valid bugs, but also the one who had a chance to use the same boilerplate standard reply for 7+ years and hardly used it more than maybe 10 times over these years... What's even more interesting, for a bug that was "Verified" by Oracle engineer (and my former manager in bugs verification team) originally I've got explanations (that I am thankful for) and kind offer:

"So your request is at best a feature request (and feel free to re-open as such). But I see little value in implementing this, specially since you're getting a meaningful error message."

So, I did that. I still wonder why the decision to make feature request out of it was not made without original reporter involved...

Sunday, March 30, 2014

Fun with Bugs #31 - what's new in MySQL 5.6.17

MySQL 5.6.17 will probably be announced loudly at or immediately before Percona Live MySQL Conference & Expo next week. But official release announcement via email was made on March 28, release notes and binaries to download are already available, so why not to check them carefully to find out what to expect from this 8th minor release of MySQL 5.6 GA...

First of all, it seems Oracle still does not hesitate to introduce new features and behavior in the process. Just check these major changes:

Starting with 5.6.17, MySQL now supports rebuilding regular and partitioned InnoDB tables using online DDL (ALGORITHM=INPLACE) for OPTIMIZE TABLES, ALTER TABLE ... FORCE and ALTER TABLE ... ENGINE=InnoDB. Rebuilding tables (to defragment them and free some disk space mostly) is probably as popular kind of ALTER as adding new columns or changing indexes, so it's great to see a built-in way to do these ALTERs without blocking DML (well, assuming the table does not have foreign keys at least). Previously one had to rely on third party tools like pt-online-schema-change for these cases.
"The AES_ENCRYPT() and AES_DECRYPT() functions now permit control of the block encryption mode and take an optional initialization vector argument ... These changes make statements that use AES_ENCRYPT() or AES_DECRYPT() unsafe for statement-based replication and they cannot be stored in the query cache." So, we should expect more warnings now for STATEMENT-based replication setups, but has more secure implementation now. Note that having the details of call to these functions visible in the logs make them less secure, as pointed out in recent Bug #72158.

Oracle also continues to push deprecation of features actively:

"The deprecated ERROR_FOR_DIVISION_BY_ZERO, NO_ZERO_DATE, and NO_ZERO_IN_DATE SQL modes now do nothing. Instead, their previous effects are included in the effects of strict SQL mode (STRICT_ALL_TABLES or STRICT_TRANS_TABLES)." - it seems it's time to check all the code and remove these modes, as they may just disappear in 5.6 even before MySQL 5.7 GA with this kind of activity.
"The IGNORE clause for ALTER TABLE is now deprecated and will be removed in a future version of MySQL." - as this deprecation happened in 5.6.x minor release probably the clause will be removed well before 5.8. So it's not like your code is safe for next 2+ years.

Speaking about fixes for public bugs (these I mostly care about), I consider the following as most important:

Bug #71197. "mysql slave sql thread memory leak". I've pointed out this bug this week (and not for the first time) as a possible explanation for the OOM conditions in big replication cluster. It's really nice to see it fixed!
Bug #70669. "Slave can't continue replication after master's crash recovery", even with sync_binlog=1. This is now fixed.
One more fix in replication: Bug #68429 - "When running the slave with --slave-parallel-workers at 1 or greater, setting --slave-skip-errors=all caused the error log to be filled with with instances of the warning Slave SQL: Could not execute Query event. Detailed error: ;, Error_code: 0."
Bug #69680. "Auto_inc value not properly generated with RBR and auto_inc column only on slave". It was reported by my colleague Ovais Tariq long time ago and is now fixed.
Bug #71436. "Compressed Tables Regression from 5.6.13". Finally this performance regression when large InnoDB compressed tables are used is fixed. It was reported by my colleague Jervin Real.
Bug #70430. "Race condition between purge coordinator and InnoDB monitor crash on shutdown". It was reported by Laurynas Biveinis from Percona.
Bug #70063. "Create table is slower in 5.6". It was reported by my former colleague Arnaud Adant from Oracle and the problem was tracked down to inserts into the mysql.innodb_index_stats table (one of two to store InnoDB persistent statistics). Maybe this fix will help to run CREATE or ALTER for partitioned tables faster as well - the problem reported recently as a regression comparing to MySQL 5.5 as Bug #72115 - "Very slow create/alter table with partitions and InnoDB".
Bug #65225. "InnoDB miscalculates auto-increment after changing auto_increment_increment" - this one was reported by Elena Stepanova almost 2 years ago and was considered a regression comparing to... MySQL 5.0. Nice to see that we have one less case when old MySQL 5.0 can be considered "better" that recent greatest MySQL 5.6 :)

There are more important fixed in replication and InnoDB categories. Fixes in optimizer are less visible in release notes, but these are really important optimizer bugs that are fixed finally:

Bug #70657. "SELECT DISTINCT...GROUP BY returns wrong results in some cases".
Bug #69969. "Failing assertion: prebuilt->trx->conc_state == 1 from subselect". It was reported by my colleague Raghavendra Prabhu.
Bug #71244. "Wrong result computation using ALL() and GROUP BY".

The last but not the least, for a second release in a row Oracle engineers had fixed something I've reported (even though it's minor enough problem):

Bug #71554. Release notes says: " innodb_ft_result_cache_limit now has a hardcoded maximum value of 4294967295 bytes or (2**32 -1). The maximum value was previously defined as the maximum value of ulong." One of those cases when fixing the server code happens as a result of a simple documentation request.

To summarize, MySQL 5.6.17 seems to be a very good release for those who use new InnoDB and replication features in MySQL 5.6. It fixed several regressions and (assuming no new regression bugs introduced) should be implemented everywhere as soon as possible.

Sunday, March 9, 2014

What's wrong with MySQL Manual

I think that MySQL Manual is one of the reasons why MySQL became and still remains popular. I find a lot of useful information (at least references to share with customers, if not real insights) there even after 9+ years of working with MySQL every day and with all numerous articles and blog posts on most important topics available now. I still have MySQL manual page open at every browser instance on every laptop I use on a regular basis.

It's simply great, well indexed by Google and has meaningful human-readable URLs, so one can even guess them for the topics he need. I have http://dev.mysql.com/doc/refman/5.6/en/merge-storage-engine.html open right now and looking at it I clearly understand without any search that if I need a reference for SELECT syntax in 5.5 I have just replace "5.6" with "5.5" and "merge-storage-engine" with "select" to get there. It works that way and many users even had created some nice marcos or shortcuts for their browser to end up at the proper manual page after typing just the topic they are interested it.

These being said, today I'd like to write about problems with MySQL manual that I care about. Some of them are not new and are hard to fix/avoid, others I consider recent enough. Today I'd like to discuss 4 of them:

MySQL Manual does NOT have enough careful readers who are ready to report problems or missing details.
Old documentation bug reports seem to be ignored without any obvious good reason.
Some documentation pages are useless with their current content.
"How to" kind of official MySQL documentation is missing, and this seems intentional.

The most important problem for MySQL Manual is the fact that it seems it has not enough careful readers who are interested in making it better. Great MySQL Documentation team works hard on improving documentation, but even now, in Oracle, with all the funds theoretically available and almost unlimited resources in general, it seems the team is hardly can keep up with documenting new features and releases in time, so they do not have either time or free pairs of eyes to do real "QA" for the manual other than proofreading maybe.

Also, it's easy to understand that authors who had written the text are not the best people to read it later with a hope to find something useful or some mistake there. They deal with their "ideas" and may have too many things assumed in mind comparing to average user/reader of the manual. They are also very experienced MySQL users and even developers in the past. I had written a book myself long time ago and translated dozens of technical books into Russian, so I know the feeling - for you, as a writer, your text is clear, useful and may be just great. While it is NOT, especially for readers with different technical or cultural background.

As a result we have manual pages that are less than clear and has a lot of useful things missing or improperly documented. When I started to read the manual carefully some time ago (after I stopped writing about MySQL bugs at Facebook I needed some other thing to do to keep my mind in peace) I had found out that I can easily report a bug (if not 5 or more) for any page I open. I see missing details or wrong statements everywhere. I had even "invented" separate "missing manual" tag to classify some of the documentation bugs I've reported. It's so easy to find these problems and report them as bugs in "Server: Documentation" category...

So, MySQL manual has bugs, but not so many it seems (I see 86 active ones at the moment). This is NOT because it's ideal, but more because too few people care to read the manual carefully and report anything they consider wrong or missing. Actually, besides MySQL Support engineers and few other Oracle employees, few of my colleagues at Percona and myself, I know only 3 Community members who report documentation bugs recently:

Daniel van Eeden. He is a well known Community bug reporter and used to be great Oracle customer (at least this is how I remember him from the days when I worked in Oracle MySQL Support). Out of his 78 active bug reports I see 11 for MySQL Server manual, but he also reports bugs in documentation for Oracle's "commercial" MySQL software, like MySQL Enterprise Backup.
Peter Laursen. He is also a well known and productive MySQL bugs reporter for a decade probably. Out of his 83 active bug reports 5 are related to MySQL Manual. They are not very recent though.
Federico Razzoli. He had started to report himself only recently because of privacy-related issues with information requested by Oracle to obtain SSO account for bugs reporting. Before that he asked me to report bugs sometimes, and I did. So, now of his 5 active bugs 2 are related to MySQL Manual.

So, just few reporters who care and, at the same time, problems and deficiencies easy to find everywhere. This is a main problem, IMHO, and the only way to get a hope to solve it is to start reading MySQL Manual and reporting bugs/problems noted! Surely, it's just the first step, somebody has to write/provide real missing content or fixes. If you can suggest some better text, please, do it. If you do not want to wait there are probably alternative ways: you have to document MySQL related things elsewhere. For example, this is what Federico did when he noted that no way to deal with materialized views is documented in usual sources of MySQL-related information. I plan to discuss alternatives to MySQL Manual later. Let's concentrate on what to do to make it better for now.

Other important problem of the manual related to bugs is the fact that some of the bugs reported are NOT fixed in time. Let me share my favorite example here, my Bug #68097. It was reported almost 14 months ago and asked to actually properly document PERFORMANCE_SCHEMA behavior that was intended and explained long time ago - the only way to enable instrumentation for (InnoDB) rwlocks and mutexes is upon server startup. If these instruments where not enabled at startup they are just not available, otherwise you can disable them and enable them back dynamically. I fail to understand what's so hard to write exactly this, as we can find statements about this from Oracle engineers in many places.

As a result, MySQL Manual stays misleading for years! By the way, the oldest documentation bug I see in "Verified" status, Bug #30538, remains in this state without further comments since October 19, 2010. Based on recent comments it may be even somehow fixed, but nobody cares to change status then. And this is for a bug reported back in 2007 by famous customer (probably Google at that time), Mark Callaghan. Obviously we see not enough care about older documentation requests. My recent ones are usually fixed fast though.

Some pages of MySQL Manual exist for years, but their content is almost useless without looking into the code or checking with developers. Great examples can be found in documents devoted to NDB Cluster internals, for example, this page (it's one of many): https://dev.mysql.com/doc/ndbapi/en/ndb-internals-dump-command-2401.html. Nobody even tried to explain the output here, interpretation of the fields, use cases for this DUMP command, nothing. Just few words and example of output that I can probably get myself. How this page is useful to anybody?

Unfortunately, the problem is NOT limited to NDB Cluster internals (that I am happy to honestly not care about at all at the moment). Same happens even to InnoDB. Check http://dev.mysql.com/doc/refman/5.6/en/innodb-monitors.html and tell me where it explains the details about any lock waits related information you can find out in TRANSACTIONS section of INNODB STATUS output. I see only this text:

"If this section reports lock waits, your applications might have lock contention. The output can also help to trace the reasons for transaction deadlocks."

That's all! Really. It does not even explain how "next-key lock" looks in the output vs "record lock"... As a result, one can not explain the way to troubleshoot even simple deadlock based on the manual or, if she tries, she can NOT prove the points with anything but experiments and quotes from the source code. And all that happens when InnoDB lock model does NOT change for decade(s) and source code seems commented in details.

The fourth problem, the lack of official "How to" or "FAQ" documents, is closely related to the above. Not only MySQL Manual sometimes misses a lot of details, you can not get them from http://dev.mysql.com/doc/ (the place to look for official MySQL guides) at all! How to read the output of SHOW ENGINE INNODB STATUS? Now to use PERFORMANCE_SCHEMA to troubleshoot different types of performance problems? You can find answers (of different quality, level of details and age) in great blog posts, books and presentations, maybe in Oracle's knowledge base articles (if you are lucky to be their customer and good in search there) but NOT in official MySQL documentation.

I do not consider this normal or acceptable, especially for the open source software. Check FreeBSD Documentation Project, for example, or Linux Documentation Project to get some ideas of how other open source software approaches the problem.

Surely, adding a collection of "How to" guides probably requires input from Community, but why at least not to try to collect links to all the good existing documents of this kind somewhere at http://dev.mysql.com? Having third party documents of this kind with "officially approved" status after a review by Oracle's MySQL Documentation and/or Support teams would help a lot, I think.

That's all I can complain about at the moment. One day I'll try to continue...

Sunday, March 2, 2014

Fun with Bugs #30 - quick review of my reports in February, 2014

I've got only one comment to my previous post about deadlock, and it was more like a hint based on a different use case, not a real explanation. So far there is nobody who wants to get free beer... Maybe this is even good, as I do not go to the conference and BOF I've submitted will be supervised by my colleague Przemysław Malkowski. But you still have entire month till the conference to get a chance for a beer from him (we'll arrange this somehow).

In the meantime I'd like to review bug reports for MySQL server (few) and fine manual (many) that I've submitted in February, 2014. 22 in total, one was just plain wrong and I closed it as "Not a bug" almost immediately. So, 21 to consider.

Let's start with serious problem. Based on our previous discussion on responsible bug reporting I had not shared the test case with the entire world immediately, but you should know that you probably can crash MySQL 5.5, 5.6 and 5.7 (at least on Windows, but I can not exclude older 5.5 on Linux also) with a single SELECT statement. This was reported as Bug #71858 (security bug from the very beginning, so you will not see it directly).

While working on a real customer problem I had to report a bug related to replication, Bug #71732, "Garbage value in output when MASTER_LOG_FILE='' is set". Having this kind of messages in the error log does not really help to find out fast what was the reason of the problem.

I've considered Bug #71818, "Type column in SHOW PERFORMANCE_SCHEMA STATUS is redundant" annoying at least, but it ended up as "Not a bug" for a reason that SHOW ENGINE INNODB STATUS also has "Type" column for its single row of results, so it's by design, "The column might not be very useful, but it works as intended".

I've spent some time trying to build recent MySQL 5.5+ on Linux using Solaris Studio 12.3 compiler, and mostly failed. In the meantime Bug #71847 was reported, "Predefined compiler flags for SunPro/Solaris Studio 12.x are missing in cmake". Using section from Solaris OS part helped a bit in the process.

I was annoying enough to complain about warnings two times. See Bug #71640, "Somewhat misleading deprecation warning for innodb_lock_monitor" and Bug #71817, "Warnings text is somewhat misleading (mentions max_open_files and table_cache)" for the details.

Surely I don't forget PERFORMANCE_SCHEMA and added a feature request, Bug #71755, "Provide per partition summary information in PERFORMANCE_SCHEMA".

The rest were actually documentation requests. I plan to write a long post about MySQL documentation in general eventually, so let me point out only bug reports I've created while working on explanation for that deadlock:

Bug #71637 - "Manual mentions IS_GAP and IX_GAP locks, but never explains them"
Bug #71638 - "Manual does NOT explain "insert intention" lock mode properly"
Bug #71735 - "Manual does not explain locks set by SELECT ... FOR UPDATE properly"
Bug #71736 - "Manual does not explain locks set by UPDATE properly"

They are all "Verified" for now, so it seems my concerns are valid.

February was a really hot and sad month here in Ukraine, so I was not productive enough with MySQL bugs reporting. Sorry.

Saturday, February 15, 2014

Magic deadlock: what locks are really set by InnoDB?

Megabytes of text had been written already on InnoDB locking and deadlocks. Still, even very simple cases of deadlocks while working with a table having only one row sometimes make people wonder what happened and why.

Today I want to check if this topic is explained well in the manual and existing blog posts and understood properly. So, it's an exercise for my dear readers and those who like to report bugs as much as I do.

Let's consider a very simple example. In session #1 with default transaction isolation level execute the following:

CREATE TABLE `tt` (
`id` int(11) NOT NULL,
`c1` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `c1` (`c1`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
insert into tt values(1,1); -- insert a row there
select * from tt; -- check that we have row (1,1)
begin work;
select * from tt where c1=1 for update; -- we see a row (1,1) and now it's locked for us
update tt set id=id+1 where c1=1; -- and we safely increment the id

So, just a table with one row and two columns. One of columns is a PRIMARY KEY, other is UNIQUE. Why one may need a table like this? Some people (not me) may think this is a good way to create sequences and get globally unique ids generated for a set of rows, with sequence identifier being stored in column I've named c1. This happens, that's why...

Now, in session #2 execute the following:

select * from tt; -- we have row (1,1) there
select * from tt where c1=1 for update; -- let's try to use the sequence...

At this moment session #2 hangs waiting for a lock. It's expected - we actually updated the same row in an yet uncommitted transaction in session #1. This is clear.

Now, in session #1, try to do the following:

select * from tt where c1=1 for update; -- you'll see a row (2,1), as you've already incremented counter

But interesting thing happens in session #2 in the meantime:

ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

If you do not trust me, just try this with the following isolation level for both sessions:

mysql> select @@tx_isolation;
+-----------------+
| @@tx_isolation |
+-----------------+
| REPEATABLE-READ |
+-----------------+
1 row in set (0.00 sec)

At this moment you can take a break in reading, and check the output of SHOW ENGINE INNODB STATUS from session #2. On MySQL 5.6.16 here I see the following (details of report may depend on server version and settings):

------------------------
LATEST DETECTED DEADLOCK
------------------------
2014-02-15 18:09:20 16a8
*** (1) TRANSACTION:
TRANSACTION 36657, ACTIVE 4 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 360, 1 row lock(s)
MySQL thread id 3, OS thread handle 0x1884, query id 55 localhost ::1 root statistics
select * from tt where c1=1 for update
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 242 page no 4 n bits 72 index `c1` of table `test`.`tt` trx id 36657 lock_mode X waiting
Record lock, heap no 4 PHYSICAL RECORD: n_fields 2; compact format; info bits 32

0: len 4; hex 80000001; asc     ;;
1: len 4; hex 80000001; asc     ;;

*** (2) TRANSACTION:
TRANSACTION 36656, ACTIVE 10 sec starting index read, thread declared inside InnoDB 5000
mysql tables in use 1, locked 1
6 lock struct(s), heap size 1184, 6 row lock(s), undo log entries 2
MySQL thread id 5, OS thread handle 0x16a8, query id 56 localhost ::1 root statistics
select * from tt where c1=1 for update
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 242 page no 4 n bits 72 index `c1` of table `test`.`tt` trx id 36656 lock_mode X locks rec but not gap
Record lock, heap no 4 PHYSICAL RECORD: n_fields 2; compact format; info bits 32

0: len 4; hex 80000001; asc     ;;
1: len 4; hex 80000001; asc     ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 242 page no 4 n bits 72 index `c1` of table `test`.`tt` trx id 36656 lock_mode X waiting
Record lock, heap no 4 PHYSICAL RECORD: n_fields 2; compact format; info bits 32

0: len 4; hex 80000001; asc     ;;
1: len 4; hex 80000001; asc     ;;

*** WE ROLL BACK TRANSACTION (1)

This is your problem statement: can you explain the deadlock above based on the output of INNODB STATUS or some other sources of information? Do you know what exact locks are really set by each session?

Take your time to write down your exact explanation, in a comment here or in separate blog post. I wonder if your explanation mentions "intention" locks of any kind, "gap" locks, "next key" locks and so on. I've read very creative texts on similar topic, and would not mind to read some more in the comments... I also wonder to know how you are going to prove your points.

We see one record lock that session holds and two lock waits in the above. All of them are obviously related to poor record with values (1,1) in the unique index named c1... You surely know that primary key value is a part of any secondary key in InnoDB, so this is all clear. But why session that already got X lock on this record has to wait to get X lock on the same(?) record? Note also 6 row locks for one of transactions in the report above...

You can surely try to rely on the manual in explaining this. But there I fail to see even clear explanation of what exact locks are set by SELECT ... FOR UPDATE in cases like this. All it says is the following:

"For SELECT ... FOR UPDATE or SELECT ... LOCK IN SHARE MODE, locks are acquired for scanned rows, and expected to be released for rows that do not qualify for inclusion in the result set (for example, if they do not meet the criteria given in the WHERE clause). However, in some cases, rows might not be unlocked immediately because the relationship between a result row and its original source is lost during query execution."

and later:

"For index records the search encounters, SELECT ... FROM ... FOR UPDATE blocks other sessions from doing SELECT ... FROM ... LOCK IN SHARE MODE or from reading in certain transaction isolation levels."

I've already reported some documentation requests while trying to explain this deadlock based on the manual. I'll list them all eventually. I hope you'll report several more bugs while trying to solve and discussing this exercise.

Now, why should you care? Because I am going to buy a beer (or any drink you prefer) for everybody who will post good (even if not entirely correct) explanation of this deadlock and make me aware of it, and/or report new valid bug based on this test case! I may be able to do this during PLMCE 2014 (still not sure if I go there, waiting for the decisions on 2 BOF sessions I've submitted) or during some conference later.

So, let's fun begin! You can expect summary post about this exercise (with a list of winners, my own explanation of this case, if still needed, and some ways to study the details) before April 1, 2014 :)

Sunday, February 9, 2014

On responsible bugs reporting

Let me start with questions related to responsible MySQL bugs reporting that I'd like to be discussed and then present a history behind them.

Assuming that you, my dear reader from MySQL Community, noted or found some simple sequence of SQL statements that, when executed by authenticated MySQL user explicitly having all the privileges needed to execute these statements, crashes some version of your favorite MySQL fork, please, answer the following questions:

Do you consider this kind of a bug a "security vulnerability"?
Should you share complete test case at any public site (MySQL bugs database, Facebook, your personal blog, any)?
Should you share just a description of possible "attack vector", as Oracle does when they publish security bug fixes?
Should you share just a stack trace or failed assertion information, without any details on how to get it?
Should you inform vendor of your MySQL fork about it in a private email officially (secalert_us@oracle.com in case of Oracle)?
Should you try to have it added to CVE database and, if you should, what is the best way to do this?
Should you try to use any informal, but still private ways share this information with MySQL vendors and other interested parties?
What is the best way to do "responsible bugs reporting" in cases like this?

Now, the history behind these questions. Usually when I have to deal with any MySQL issue not obviously caused by a user mistake or misconfiguration, I suspect that the problem may be related to some (known or yet unknown, fixed or still active) MySQL bug. I search MySQL bugs database for error messages and codes, keywords, stack trace entries or assertions user noted, at the earliest stage of investigation. This approach works well enough for me for more than 8 years, as even if the problem was not related to any bug, similar reports may give explantions, workarounds or other hints on possible reasons.

To be able to search bugs faster and have more things readily available "from the top of my head" I try to read new public bug reports whenever they appear. Sometimes I send comments or hints. Often enough I see simple test cases there to check. And I do try to run these test cases in a hope to see the problem more clearly or find out that it is not repeatable for me (to try to understand why is that so later). I do this during my free time and while I work, I do this this for fun and when colleagues ask me for some assistance or, specifically, for any bug that may be related to their problem at hand.

I used to share my finding immediately on Facebook by posting URLs to bugs with my comments, but this activity is now stopped at least till PLMCE 2014 in a hope to find some better way of cooperation on bug fixing and as a part of one simple experiment for my talk proposed for that conference (that was not selected by the committee anyway). So, since the end of November, 2013 I share (rarely) only some questions or test cases from time to time, and ask colleagues to try them on their favorite MySQL, MariaDB or Percona Server versions, if they are also curious.

This week during my usual review of new bug reports in the morning I've noted an "Open" bug with a simple test case, just a couple of UPDATEs for some InnoDB table having a single row of data, that, according to the bug reporter, crashed recently released MySQL 5.6.16 (that I blogged about just a day before) and 5.5.36. Surely I've immediately tried this test case, that crashed my test instances because of assertion failure. I've waited for some time to see that the bug is still open, informed few of my colleagues hanging around at IRC chat about new crashing bug found, waited for some more time and... posted test case in a status update on Facebook. This post got some comments that suspected crash as a result, but not so many. I have maybe 200 friends there and most of them are actually current or former colleagues, MySQL engineers or software developers who work on MySQL and other databases in Oracle, HP, SkySQL, Percona and few other companies. My idea was to make them all immediately aware about potential problem. I've noted that very soon after this post the bug report had "disappeared" - it was marked as Private probably and, I hope, verified and got proper attention from Oracle engineers.

Next morning while checking recent emails quickly I've noted message from my former colleague, who now works in Oracle, asking to remove that Facebook post as explicit publishing of a way to crash MySQL is not a "responsible reporting" - I had to give Oracle a chance and time to fix this, as a potential "security vulnerability".

I've immediately deleted the post based just on respect to the person who asked, but internally I had doubts on what was actually a proper thing to do in this case, and I even felt insulted for a while by this (yet another) attempt to shut me up and stop me from discussing MySQL bugs in public (not that this really happens, but I have feelings like that from time to time). So, I tried to discuss this with my colleagues and found out that some of them do think the bug is actually a potential security vulnerability and thus had to be private. Others (like me) do not believe in any "security through obscurity" practice, especially for open source software, where any interested person can just read the code, find assertion there and then just think when and why it can be hit, based on code analysis.

Formal Oracle policy in this case is well known: "If you are not a customer or partner, please email secalert_us@oracle.com with your discovery". Customers should just open a Support Request.

I was explicitly against applying this policy to MySQL (I've suggested to create public bug reports in my New Year 2013 wishes to Oracle customers back in 2012) among other reasons because "security bugs" and "security issues" have more chances internally in Oracle to become forgotten as they get less visibility (at the same time as they get higher priority) and very few people in Oracle can access or escalate or otherwise monitor them.

Surely, this policy can not be enforced on community members and individuals like me. In the worst case if somebody discloses details of the security vulnerability, she will not get a credit, as: "In order to receive credit, security researchers must follow responsible disclosure practices".

That's the whole story so far. In my case it was not me who found the bug, so why should I care even about credit. My action caused immediate attention from Oracle side and thus I can be sure this specific bug is not forgotten and fixed as soon as possible (by the way, MySQL 5.5.35 is also affected by the same assertion failure, so this is NOT a recent regression and those who use older versions in production are potentially affected). I am sure my post was also noted by "security officers" in MariaDB Foundation and Percona, so in case these forks are also affected we now have chances to get fixes there as well, no matter when and how Oracle fixes the problem.

Still, this case and related private discussions led me to the questions at the beginning of this blog post. I'd like to get your opinions on them.

Sunday, February 2, 2014

Fun with Bugs #29 - important bug fixes in MySQL 5.6.16

As you should already know, Oracle had released MySQL 5.6.16 officially on January 31, 2014. You can find all the details in official release notes. In this post I decided to concentrate on important fixes for bugs reported by community in 4 most important categories: InnoDB, replication, partitioning and optimizer.

Let's start with Bug #70768, "Persistent optimizer statistics often causes LOCK_open stalls". It's well known that persistent statistics for InnoDB tables stored in two (again, InnoDB) tables in mysql database may cause various problems, not only bad plans based on outdated statistics there. One of these problems seems solved in 5.6.16, but note that the bug report itself is closed without any specific comment on what exactly was fixed.

Hardly you are affected by bugs like Bug #70087, "InnoDB can not use the doublewrite buffer properly", every day. But it's great that yet another case of corruption when recovery is possible is now covered. Another problem in case of InnoDB recovery on a server where crash-safe (with information stored in tables) replication is configured was reported as Bug #69907. Here instead of a real fix (if it's at all possible) we've just got more clear error message explaining the inability to store replication related information in tables while innodb_force_recovery > 0.

Having ALTER TABLE statement on InnoDB to work "in place", without making copy of the entire table, is a great feature. Unfortunately for some cases it had not performed the change expected. One of this cases, resetting AUTO_INCREMENT with ALTER TABLE, now works again without explicitly mentioned ALGORITHM thanks to the fix for Bug #69882. Good to see this kind of a regression comparing to older MySQL versions fixed. Another fix of a regression bug, Bug #69424, finally allowed to use raw devices for InnoDB shared tablespace in MySQL 5.6. Not that it's common these days, but still could cause problems during upgrade.

Yet another bug in ALTER TABLE, Bug #69325, noted and reported by my colleague Justin Swanhart, could badly affect users with big partitioned InnoDB tables. Attempts to add indexes to such a table caused huge and unexpected use of memory. Now the amount of memory used is somewhat reduced it seems.

The last but not the least, Oracle fixed my Bug #68079! A year of work, detailed analysis, workarounds suggested and now a real fix with totally new implementation! Well done, really! I should test the fix tomorrow. It's not the first time I write about this bug (I've created entire presentation around it), and not the last...

For partitioning itself (besides InnoDB-related cases mentioned above) MySQL 5.6.16 had not added many fixes. Still, fix for Bug #70588 should be mentioned, as now we have one less case of wrong results against partitioned table. I am surprised badly that, for example, fix for my Bug #71095 (that is public since December 6, 2013, and really helped to solve a problem) is NOT included.

In replication category Oracle engineers were mostly concentrated on fixing bugs reported internally. I see only this (looking minor enough) semi-sync replication public bug fixed, Bug #70410.

I had not expected many fixes in optimizer, but there are fixes to mention. Check Bug #69005, for example, if you order by columns with utf8_bin collation. Now this regression bug is fixed and order should be correct. Another case of ORDER BY failing, this time with views, is fixed in frames of Bug #69678.

Some other optimizer bug fixes to check are the following:

Bug #70025, "Complex updates of Performance Schema tables involving joins or subqueries failed to update every row."
Bug #71028, "COUNT(DISTINCT) sometimes produced an incorrect result when the last read row contained a NULL value."
Bug #70608, "Incorrect result could be returned for a query with an IF() predicate in the WHERE clause combined with OUTER JOIN in a subquery that is transformed to a semi-join.". It was a regression comparing to 5.5, so more good news to those who plan to upgrade.

There are some other fixes I can not comment on right now, as bug report still remains private. So, we can only trust release notes in cases like this:

"A full-text search combined with derived tables (subqueries in the FROM clause) caused a server exit.

Now if a full-text operation depends on a derived table, the server produces an error indicating that a full-text search cannot be done on a materialized table. (Bug #68751, Bug #16539903)"

To summarize, 5.6.16 seems useful release, with many fixes in InnoDB and optimizer, including fixes to regression bugs and performance improvements. I had not tested it in production yet, but at least upgrade from 5.6.15 on Windows was seamless and I have it up and running for further checks. Stay tuned!