
Newsfeeds: Planet MySQL (https://planet.mysql.com)

  • Surprise with innodb_doublewrite_pages in MySQL 8.0.20+
    In a recent post, The Quirks of Index Maintenance in Open Source Databases, I compared the IO load generated by open source databases while inserting rows in a table with many secondary indexes. Because of its change buffer, InnoDB was the most efficient solution. However, that’s not the end of the story. Evolution of the […]
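    As a quick, hedged aside (not from the post itself): on MySQL 8.0.20 and later you can inspect the doublewrite-related settings, including innodb_doublewrite_pages, with standard SQL.

        -- Inspect the doublewrite buffer configuration (MySQL 8.0.20+)
        SHOW GLOBAL VARIABLES LIKE 'innodb_doublewrite%';
        -- innodb_doublewrite_pages controls the number of doublewrite pages
        -- per flushing thread; it is typically set at server startup.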

  • On Database Query Performance in HeatWave and MySQL. Interview with Kaan Kara 
    “Of course, in practice, no query optimizer is perfect and there will be edge cases where the way a query is written will impact its performance.”

    Q1. What are your current responsibilities as Principal Member of Technical Staff?

    Kaan Kara: I am contributing as the tech lead for query execution in HeatWave. My main responsibilities are implementing new features in HeatWave, maintaining its stability, and supporting our customers with their HeatWave-related use cases.

    Q2. Let's talk about improving database query execution time. The way a query is written has a massive impact on its performance, and developers often face hurdles in structuring queries optimally. What is your take on this?

    Kaan Kara: SQL is a declarative language. That means, ideally, the database optimizer should produce the best possible query plan to answer the query, no matter how it is written, so there should be no need to optimize queries at the SQL level. This is what we strive for when designing optimizers. Of course, in practice, no query optimizer is perfect and there will be edge cases where the way a query is written will impact its performance. I believe there are two practical ways a database service can help address this. The first is providing insights into the query plan and its execution. Our goal is to offer detailed and understandable insights about the query plan to our customers, so that they can see where the bottlenecks are (for more information, see the linked resources). Once they see the bottleneck, they can think about how the query can be rewritten, whether certain optimizer hints could help, and so on. The second is for the database itself to provide alternative execution schemes or user-guided optimization methods. For instance, we recently introduced materialized temporary tables in HeatWave. Once the user sees that a certain query subtree is taking a long time, they can decide to create a materialized view on it, substantially accelerating their queries.

    Q3. Indexing is the most common and effective way to speed up queries. What are the major sources of the challenges developers face?

    Kaan Kara: Indexes come with a maintenance cost, and they are often used without proper analysis of the trade-offs between that cost and the performance benefit they provide. HeatWave, with its in-memory columnar data architecture, helps eliminate the need for most indexing in analytical workloads. However, there are certain use cases where indexes provide value. One example is vector embedding-based nearest neighbor search, where index-based lookup is needed to ensure low response times. After introducing the native VECTOR type last year, we recently introduced VECTOR-based indexing in HeatWave, enabling our customers to run approximate nearest neighbor search queries up to two orders of magnitude faster. One interesting direction we took was that we did not want to sacrifice result fidelity: we employ a novel method that utilizes the index only when we believe the results it produces will be accurate.

    Q4. Sometimes, the problem isn't the query itself but the foundation it's built on. Can you share your experience with this?

    Kaan Kara: That is a very good point. Schema design plays a critical role in performance optimization. In some use cases, we see queries with predicates based on complex string operations or regular expressions, which make the query much slower than if the same predicate were applied to numeric columns.
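    To make that last point concrete, here is a hedged sketch (hypothetical tables and columns, not from the interview): a regular-expression predicate over a string column forces per-row string processing, while the equivalent predicate on a numeric column is a cheap comparison.

        -- Slower: order status encoded as free-form text, matched by regex
        SELECT COUNT(*) FROM orders
        WHERE status_text REGEXP '^(shipped|delivered)$';

        -- Faster: the same logical predicate on a numeric column
        -- (2 = shipped, 3 = delivered; illustrative encoding)
        SELECT COUNT(*) FROM orders
        WHERE status_code IN (2, 3);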
    But this ties back to the ease of use and declarative nature of interacting with databases. Ideally, the user should not have to worry about these things and should be able to do the most convenient thing, with the database taking care of optimizations behind the scenes. In HeatWave, we strive to achieve this goal guided by real-world use cases from our customers. For example, we often observe read-heavy workloads repeatedly running the same expensive query subtree. To address this, we are developing an automated result cache that can materialize this subtree result within HeatWave and use it later when it is needed. We believe this feature will significantly improve query performance in many scenarios.

    Q5. In a real-world application, a query doesn't run in isolation. The performance of MySQL is heavily dependent on its configuration. What are your recommendations here?

    Kaan Kara: That is true. Thankfully, we have a set of features in our Autopilot suite which eliminate much of the configuration guesswork. For instance, depending on the user's data and sample queries, Autopilot suggests the correct cluster size, data placement key, appropriate column encodings, and much more. But configuration is usually not a one-and-done exercise: users' data and queries change over time. So it is also crucial to consistently provide detailed insights into the system, so that adjustments can be made. An example is the need for efficient compute up- and down-scaling. Some customers require more compute in their peak operating hours for faster queries. In HeatWave, we provide zero-downtime compute elasticity (YouTube video), thanks to our partitioning-based data architecture, to cater to that need.

    Q6. Beyond query-level tuning, what are the most significant architectural challenges that impede query performance, such as handling I/O bottlenecks from large table scans, managing inefficient data access patterns caused by normalization choices, or addressing network latency in distributed database environments?

    Kaan Kara: This is a great question and one of the core things we deal with daily when optimizing the HeatWave query execution engine. For an efficient distributed analytics engine, optimizing for I/O bottlenecks (for HeatWave, this means primarily memory and network) is at the top of the priority list. HeatWave has many optimizations to reduce these bottlenecks. For instance, we utilize an efficient vectorized bloom filter to reduce the amount of probe-side data we need to shuffle around the cluster when performing a distributed join. Driven by our customer workloads, we have recently worked on a late-materialization feature. Our customers frequently work with wide string columns. To reduce frequent access to these, we perform a transformation in our logical plan: any wide columns that are not needed are removed from leaf table scan nodes; instead, we project the primary keys. Later in the plan, we introduce additional joins utilizing these primary keys to gather the wide columns that the query needs to produce the result. This feature will significantly improve performance for certain production queries that project many wide columns.
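    To illustrate the late-materialization idea, here is a hedged sketch of the rewrite expressed as plain SQL (hypothetical tables; HeatWave applies this transformation internally in its logical plan rather than as user-visible SQL).

        -- Before: the wide column flows through the join and aggregation
        SELECT d.id, d.wide_text, SUM(f.amount) AS total
        FROM docs d JOIN facts f ON f.doc_id = d.id
        GROUP BY d.id, d.wide_text;

        -- After: aggregate over primary keys only, then join back once
        -- at the end to fetch the wide column for the final result
        SELECT agg.id, d2.wide_text, agg.total
        FROM (SELECT d.id, SUM(f.amount) AS total
              FROM docs d JOIN facts f ON f.doc_id = d.id
              GROUP BY d.id) AS agg
        JOIN docs d2 ON d2.id = agg.id;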
    Q7. Specifically, as of MySQL 9.3.0, it is possible to create temporary tables that are stored in the MySQL HeatWave Cluster. What are these tables used for?

    Kaan Kara: Yes, our customers can now create temporary tables directly within HeatWave, as in-memory materialized tables. Previously, the only way to load a table into HeatWave was by loading an InnoDB table or an external table from object storage. But sometimes users want to store the result of a query as a temporary materialization without going through the load path, which can be a bottleneck.

    Q8. Are these tables similar to conventional database views?

    Kaan Kara: They are very similar to materialized views, but temporary tables are static: changes in the base tables will not be propagated, and temporary tables themselves cannot be changed. If the customer use case requires change propagation from base tables, then materialized views, which will be supported soon in HeatWave, are the right approach.

    Q9. Can you please explain how these MySQL HeatWave temporary tables help reduce query execution time?

    Kaan Kara: Let me give an example. Consider an analyst investigating the transactions on a certain publicly traded stock. The queries will need to perform a join between "stocks" and "transactions" tables on some stock id, followed by further aggregations (getting volume by date) or perhaps further joins and ordering (sorting by largest buyers in each timeframe). In this example, the initial join between "stocks" and "transactions" needs to be performed repeatedly and can be an expensive part of the queries. The analyst can now create a materialized temporary table based on the result of this join directly within HeatWave, and it can then be reused as much as needed by other operations.
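    A hedged sketch of that workflow (table and column names are hypothetical, and the exact HeatWave placement syntax may differ from plain CREATE TEMPORARY TABLE):

        -- Materialize the expensive join once
        CREATE TEMPORARY TABLE stock_txns AS
        SELECT t.txn_id, t.txn_date, t.buyer_id, t.volume
        FROM transactions t
        JOIN stocks s ON s.stock_id = t.stock_id
        WHERE s.ticker = 'XYZ';

        -- Reuse the materialized result for follow-up analyses
        SELECT txn_date, SUM(volume) AS daily_volume
        FROM stock_txns
        GROUP BY txn_date
        ORDER BY txn_date;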
    Q10. Is the load factor, i.e., a measure of how full a hash table is, really a good metric for estimating query execution times? Or are there other metrics that need to be taken into consideration?

    Kaan Kara: By itself, it is a narrow metric, relevant only for determining a single join's or aggregation's cost. During our physical compilation, this metric contributes to our cost estimation indirectly: depending on a join's build-side cardinality or a group-by's output cardinality, we choose an appropriate hash table size. This size then dictates the runtime and memory cost of each operation. To estimate the query cost holistically, all relational operators, along with how much data will be moved around, are then considered.

    Q11. What is the next project you wish to work on?

    Kaan Kara: My next projects are around automatic maintenance of materialized views within HeatWave. This entails automatic substitution and creation of materialized views. We are excited to share more soon.

    ………………………………………………………

    Kaan Kara is a principal member of technical staff at Oracle, working as a lead developer mainly responsible for query execution in HeatWave MySQL. As part of the HeatWave team, he has led multiple projects that substantially improved the performance and memory efficiency of the query execution engine. These projects include pipelined relational operator execution, bloom-filter-enhanced distributed joins, base relation compression, and late decompression optimizations. Collectively, these improvements reduced geometric-mean runtimes on analytical benchmarks such as TPC-H and TPC-DS by significant factors, while reducing the memory requirements of the in-memory execution engine, enabling a single HeatWave node with 512 GB of memory to run the full 1 TB TPC-H benchmark. More recently, he was the lead developer introducing the new VECTOR type to MySQL, along with highly optimized vector processing functions within HeatWave, laying the data layer foundation for highly anticipated vector store features within HeatWave, such as semantic search and retrieval-augmented generation. Prior to joining Oracle, Kaan received his doctoral degree in 2020 from the Systems Group in the Computer Science Department at ETH Zurich. His research focused on using reconfigurable hardware devices (FPGAs) to accelerate data analytics. He has published papers in top database venues such as VLDB and SIGMOD, showcasing the potential benefit of FPGA-based implementations for data partitioning and in-database machine learning tasks.

    Resources: On HeatWave MySQL: Query Execution, Performance, Benchmarks, and Vector Type. Q&A with Kaan Kara. ODBMS.ORG, March 4, 2025.

  • MySQL Basics: Safe and Sound—User Management and Database Security
    Keep your MySQL database protected with beginner-friendly advice on user management and security! Learn how to create users, grant and revoke privileges, and safeguard your data using library-themed examples and practical tips. Essential reading for every data explorer eager to lock up their digital collection.
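    As a hedged taste of what the post covers (standard MySQL statements; the librarian account is an invented example, not from the post):

        -- Create a user and grant only the privileges they need
        CREATE USER 'librarian'@'localhost' IDENTIFIED BY 'a-strong-password';
        GRANT SELECT, INSERT, UPDATE ON library.* TO 'librarian'@'localhost';

        -- Take a privilege back, then review what remains
        REVOKE INSERT ON library.* FROM 'librarian'@'localhost';
        SHOW GRANTS FOR 'librarian'@'localhost';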

  • Announcing Vitess 23.0.0
    We're excited to release Vitess 23.0.0, the latest major version of Vitess, bringing new defaults, better operational tooling, and refined metrics. This release builds on the strong foundation of version 22 and is designed to make deployment and observability smoother, while continuing to scale MySQL workloads horizontally with confidence.

    ✅ Why This Release Matters: For production users of Vitess, this release is meaningful in several ways:

  • Scoped Vector Search with the MyVector Plugin for MySQL – Part II
    Subtitle: Schema design, embedding workflows, hybrid search, and performance tradeoffs explained.

    Quick Recap from Part 1

    In Part 1, we introduced the MyVector plugin — a native extension that brings vector embeddings and HNSW-based approximate nearest neighbor (ANN) search into MySQL. We covered how MyVector supports scoped queries (e.g., WHERE user_id = X) to ensure that semantic search remains relevant, performant, and secure in real-world multi-tenant applications.

    Now in Part 2, we move from concept to implementation:

    • How to store and index embeddings
    • How to design embedding workflows
    • How hybrid (vector + keyword) search works
    • How HNSW compares to brute-force search
    • How to tune for performance at scale

    1. Schema Design for Vector Search

    The first step is designing tables that support both structured and semantic data. A typical schema looks like:

        CREATE TABLE documents (
            id BIGINT PRIMARY KEY,
            user_id INT NOT NULL,
            title TEXT,
            body TEXT,
            embedding VECTOR(384),
            INDEX(embedding) VECTOR
        );

    Design tips:

    • Use VECTOR(n) to store dense embeddings (e.g., 384-dim for MiniLM).
    • Always combine vector queries with SQL filtering (WHERE user_id = …, category = …) to scope the search space.
    • Use TEXT or JSON fields for hybrid or metadata-driven filtering.
    • Consider separating raw text from embedding storage for cleaner pipelines.

    2. Embedding Pipelines: Where and When to Embed

    MyVector doesn't generate embeddings — it stores and indexes them. You'll need to decide how embeddings are generated and updated:

    a. Offline (batch) embedding

    • Run scheduled jobs (e.g., nightly) to embed new rows.
    • Suitable for static content (documents, articles).
    • Can be run using Python + HuggingFace, OpenAI, etc.

        # Python example
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")
        vectors = model.encode(["Your text goes here"])

    b. Write-time embedding

    • Embed text when it is inserted via your application.
    • Ensures embeddings are available immediately.
    • Good for chat apps, support tickets, and notes.

    c. Query-time embedding

    • Used for user search input only.
    • Transforms search terms into vectors (not stored).
    • Passed into queries like:

        ORDER BY L2_DISTANCE(embedding, '[query_vector]') ASC

    3. Hybrid Search: Combine Text and Semantics

    Most real-world search stacks benefit from combining keyword and vector search. MyVector enables this inside a single query:

        SELECT id, title
        FROM documents
        WHERE MATCH(title, body) AGAINST('project deadline')
          AND user_id = 42
        ORDER BY L2_DISTANCE(embedding, EMBED('deadline next week')) ASC
        LIMIT 5;

    This lets you:

    • Narrow results using lexical filters
    • Re-rank them semantically
    • Do it all in MySQL — no sync to external vector DBs

    This hybrid model is ideal for support systems, chatbots, documentation search, and QA systems.
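    Putting sections 1–3 together, a minimal hedged sketch (the vector literals are truncated for display; in practice the application computes the full 384-dim embeddings at write time or query time, as described above):

        -- Write time: store the application-computed embedding with the row
        INSERT INTO documents (id, user_id, title, body, embedding)
        VALUES (1001, 42, 'Q3 planning notes', 'Project deadline is next week...',
                '[0.012, -0.087, 0.143, ...]');  -- truncated, illustrative

        -- Query time: scope by user first, then rank semantically
        SELECT id, title
        FROM documents
        WHERE user_id = 42
        ORDER BY L2_DISTANCE(embedding, '[0.034, -0.051, 0.122, ...]') ASC
        LIMIT 5;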
    4. Brute-Force vs. HNSW Indexing in MyVector

    When it comes to similarity search, how you search impacts how fast you scale.

    Brute-force search

    • Compares the query against every row
    • Guarantees exact results (100% recall)
    • Simple, but slow for >10K rows

        SELECT id
        FROM documents
        ORDER BY COSINE_DISTANCE(embedding, '[query_vector]') ASC
        LIMIT 5;

    HNSW: Hierarchical Navigable Small World

    • Graph-based ANN algorithm used by MyVector
    • Fast and memory-efficient
    • High recall (~90–99%) with tunable parameters (ef_search, M)

        CREATE INDEX idx_vec ON documents(embedding) VECTOR
        COMMENT='{"HNSW_M": 32, "HNSW_EF_CONSTRUCTION": 200}';

    Comparison:

        Feature             Brute Force       HNSW (MyVector)
        Recall              100%              ~90–99%
        Latency (1M rows)   100–800ms+        ~5–20ms
        Indexing            None              Required
        Filtering support   Yes               Yes
        Ideal use case      Small datasets    Production search

    5. Scoped Search as a Security Boundary

    Because MyVector supports native SQL filtering, you can enforce access boundaries without separate vector security layers. Patterns:

    • WHERE user_id = ? → personal search
    • WHERE org_id = ? → tenant isolation
    • Use views or stored procedures to enforce access policies

    You don't need to bolt access control onto your search engine — MySQL already knows your users.

    6. HNSW Tuning for Performance

    MyVector lets you tune index behavior at build or runtime:

        Param            Purpose                            Effect
        M                Graph connectivity                 Higher = more accuracy + RAM
        ef_search        Traversal breadth during queries   Higher = better recall, more latency
        ef_construction  Index quality at build time        Affects accuracy and build cost

    Example:

        ALTER INDEX idx_vec SET HNSW_M = 32, HNSW_EF_SEARCH = 100;

    Per-session and per-query control of ef_search is planned as a future feature.

    TL;DR: Production Patterns with MyVector

    • Use VECTOR(n) columns and HNSW indexing for fast ANN search
    • Embed externally using HuggingFace, OpenAI, Cohere, etc.
    • Combine text filtering + vector ranking for hybrid search
    • Use SQL filtering to scope vector search for performance and privacy
    • Tune ef_search and M to control latency vs. accuracy

    Coming Up in Part 3

    In Part 3, we'll explore real-world implementations:

    • Semantic search
    • Real-time document recall
    • Chat message memory + re-ranking
    • Integrating MyVector into RAG and AI workflows

    We'll also show query plans and explain fallbacks when HNSW is disabled or brute-force search is needed.