Jun 14, 2019

Bloom Indexes in PostgreSQL

There is a wide variety of indexes available in PostgreSQL. While most are common to almost all databases, some index types are more specific to PostgreSQL. For example, GIN indexes are helpful for speeding up searches for element values within documents. GIN and GiST indexes can both be used to make full-text searches faster, whereas BRIN indexes are more useful when dealing with large tables, as they only store summary information for ranges of pages. We will look at these indexes in more detail in future blog posts. For now, I would like to talk about another special index type that can speed up searches on a table that has a huge number of columns and is massive in size: the bloom index.

In order to understand the bloom index better, let’s first understand the bloom filter data structure. I will keep the description as short as I can, so that we can move on to how to create this index and when it will be useful.

Most readers will know that an array in computer science is a data structure consisting of a collection of elements, while a bit, or binary digit, is the smallest unit of data, represented as either 0 or 1. A bloom filter is a bit array of m bits that are all initially set to 0.

A bit array is simply an array that stores a certain number of bits (0s and 1s). Built on a bit array, a bloom filter is one of the most space-efficient data structures for testing whether an element is in a set or not.

Why use bloom filters?

Let’s consider some alternatives, such as a list data structure or a hash table. With a list, we need to iterate through every element to search for a specific one. We could instead maintain a hash table in which each element of the list is hashed, and then check whether the hash of the element we are searching for is present. But checking through all of those hashes can be more expensive than expected, and when there is a hash collision the table falls back to linear probing, which may be time-consuming. When hash tables are kept on disk, they also require additional IO and storage. For a more space-efficient solution, we can look to bloom filters, which are similar in spirit to hash tables.

Type I and Type II errors

While using bloom filters, we may see a result that falls into a type I error, but never a type II error. A nice example of a type I error is a result saying that a person with the last name “vallarapu” exists in the relation foo.bar whereas it does not exist in reality (a false positive conclusion). An example of a type II error is a result saying that a person with the last name “vallarapu” does not exist in the relation foo.bar, but in reality it does exist (a false negative conclusion). A bloom filter is 100% accurate when it says an element is not present. But when it says the element is present, it may be 90% accurate or less. That is why it is usually called a probabilistic data structure.

The bloom filter algorithm

Let’s now understand the algorithm behind bloom filters better. As discussed earlier, a bloom filter is a bit array of m bits combined with k hash functions. In order to record whether an element exists, and later give away the item pointer of the element, the element (the data in the indexed columns) is passed to the hash functions. Let’s say that there are only two hash functions and we want to store the presence of the first element, “avi”, in the bit array. When the word “avi” is passed to the first hash function, it may generate the output 4, and the second may give the output 5.

All the bits are initially set to 0. Once we store the existence of the element “avi” in the bloom filter, the 4th and 5th bits are set to 1. Let’s now store the existence of the word “percona”. This word is again passed to both hash functions; assume the first hash function generates the value 5 and the second generates 6. Since the 5th bit was already set to 1 earlier, only one modification is made: the bit array now has bits 4, 5, and 6 set to 1.

Now, consider a query searching for a predicate with the name “avi”. The input “avi” is passed to the same hash functions that were used earlier. The first hash function returns 4 and the second returns 5. When we look at positions 4 and 5 of the bloom filter (bit array), we see that both bits are set to 1. This means that the element is present.
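
To make this concrete, here is a toy simulation you can run in psql. It is only an illustration of the concept, not how the bloom extension hashes values internally: two makeshift “hash functions” are derived from different slices of md5(), and each one maps a word onto a position in a 10-bit array.

-- Two makeshift "hash functions" built from md5() slices map each word to a
-- bit position between 0 and 9; set_bit() then flips those bits in a 10-bit
-- array that starts out as all zeros.
WITH words(word) AS (VALUES ('avi'), ('percona')),
     hashed AS (
         SELECT word,
                ('x' || substr(md5(word), 1, 4))::bit(16)::int % 10 AS pos1,
                ('x' || substr(md5(word), 5, 4))::bit(16)::int % 10 AS pos2
         FROM words
     )
SELECT word, pos1, pos2,
       set_bit(set_bit(B'0000000000', pos1, 1), pos2, 1) AS bit_array
FROM hashed;

Running the same two expressions on the word being searched for reproduces the same two positions, so membership is tested simply by checking whether those bits are set to 1 in the stored array.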

Collisions with bloom filters

Consider a query that is fetching the records of the table where the name is “don”. When the word “don” is passed to both hash functions, the first hash function returns the value 6 (let’s say) and the second returns 4. As the bits at positions 6 and 4 are both set to 1, membership is confirmed, and the result says that a record with the name “don” is present, even though in reality it is not. This is how collisions can occur. However, this is not a serious problem.

A point to remember is: the fewer the hash functions, the higher the chance of collisions; the more the hash functions, the lower the chance of collisions. But if we have k hash functions, the time it takes to validate membership is on the order of k.

Bloom Indexes in PostgreSQL

Now that we understand bloom filters, it is easy to see that a bloom index uses them. When you have a table with many columns, and queries use many different combinations of those columns as predicates, you could need many indexes. Maintaining so many indexes is not only costly for the database but is also a performance killer when dealing with larger data sets.

So, if you create a bloom index on all these columns, a hash is calculated for each column and merged into a single index entry of the specified length for each row/record. When you specify the list of columns on which you need a bloom filter, you can also choose how many bits are set per column. The following is an example of the syntax, showing the length of each index entry and the number of bits for specific columns.

CREATE INDEX bloom_idx_bar ON foo.bar USING bloom (id,dept_id,zipcode)
WITH (length=80, col1=4, col2=2, col3=4);

The length is rounded up to the nearest multiple of 16. The default is 80 and the maximum is 4096. The default number of bits per column is 2, and a maximum of 4095 bits can be specified.
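
If you want to confirm the settings an existing bloom index was created with, the length and per-column bit counts are stored as relation options in the catalog. A quick way to look them up, using the index name from the example above:

-- Show the WITH (...) options recorded for the bloom index
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'bloom_idx_bar';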

Bits per column

Here is what it means in theory when we have specified length = 80 and col1=2, col2=2, col3=4. A bit array of 80 bits is created for each row or record. The data inside col1 (column 1) is passed to two hash functions because col1 was set to 2 bits. Let’s say these two hash functions generate the values 20 and 40. The bits at the 20th and 40th positions are set to 1 within the 80 bits (m), since the length is specified as 80 bits. The data in col3 is passed to four hash functions; let’s say the values generated are 2, 4, 9, and 10, so the four bits 2, 4, 9, and 10 are set to 1 within the 80 bits.

There may be many empty bits, but this allows for more randomness across the bit arrays of the individual rows. Using a signature function, a signature is stored in the index data page for each record, along with the row pointer that points to the actual row in the table. Now, when a query uses an equality operator on a column that has been indexed with bloom, the number of hash functions already set for that column is used to generate the appropriate number of hash values – let’s say four for col3, giving 2, 4, 9, and 10. The index data is then scanned entry by entry, checking whether those bit positions (the ones generated by the hash functions) are set to 1.

Finally, the scan reports the rows whose signatures have all of these bits set to 1. The greater the length and the bits per column, the more the randomness and the fewer the false positives. But the greater the length, the greater the size of the index.

Bloom Extension

The bloom index is shipped in the contrib module as an extension, so you must create the bloom extension in order to take advantage of this index type, using the following command:

CREATE EXTENSION bloom;
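
If you are not sure whether the contrib packages are installed on your server, you can check that the extension is available and, after running the command above, that it shows up as installed:

-- Lists the bloom extension if the contrib package is installed;
-- installed_version becomes non-null once CREATE EXTENSION has been run.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'bloom';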

Example

Let’s start with an example. I am going to create a table with multiple columns and insert 100 million records.

percona=# CREATE TABLE foo.bar (id int, dept int, id2 int, id3 int, id4 int, id5 int,id6 int,id7 int,details text, zipcode int);
CREATE TABLE
percona=# INSERT INTO foo.bar SELECT (random() * 1000000)::int, (random() * 1000000)::int,
(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,
(random() * 1000000)::int,(random() * 1000000)::int,md5(g::text), floor(random()* (20000-9999 + 1) + 9999)
from generate_series(1,100*1e6) g;
INSERT 0 100000000

The size of the table is now 9647 MB as you can see below.

percona=# \dt+ foo.bar
                    List of relations
 Schema | Name | Type  |  Owner   |  Size   | Description
--------+------+-------+----------+---------+-------------
 foo    | bar  | table | postgres | 9647 MB |
(1 row)

Let’s say that the columns id, dept, id2, id3, id4, id5, id6, and zipcode of the table foo.bar are all used by several queries, in random combinations, for different reporting purposes. If we create individual indexes on each column, each index is going to take almost 2 GB of disk space.
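
For comparison, the per-column approach we are trying to avoid would look something like the following (the index names are only illustrative):

-- One btree index per reporting column; at roughly 2 GB each, these eight
-- indexes would add on the order of 16 GB of disk space plus write overhead.
CREATE INDEX idx_bar_id      ON foo.bar (id);
CREATE INDEX idx_bar_dept    ON foo.bar (dept);
CREATE INDEX idx_bar_id2     ON foo.bar (id2);
CREATE INDEX idx_bar_id3     ON foo.bar (id3);
CREATE INDEX idx_bar_id4     ON foo.bar (id4);
CREATE INDEX idx_bar_id5     ON foo.bar (id5);
CREATE INDEX idx_bar_id6     ON foo.bar (id6);
CREATE INDEX idx_bar_zipcode ON foo.bar (zipcode);

And even then, a query filtering on two columns could only combine these indexes through a bitmap AND of two separate scans rather than using a single index.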

Testing with btree indexes

We’ll try creating a single btree index on all the columns that are most used by the queries hitting this table. As you can see in the following log, it took 91115.397 ms to create this index and the size of the index is 4743 MB.

postgres=# CREATE INDEX idx_btree_bar ON foo.bar (id, dept, id2,id3,id4,id5,id6,zipcode);
CREATE INDEX
Time: 91115.397 ms (01:31.115)
postgres=# \di+ foo.idx_btree_bar
                             List of relations
 Schema |     Name      | Type  |  Owner   | Table |  Size   | Description
--------+---------------+-------+----------+-------+---------+-------------
 foo    | idx_btree_bar | index | postgres | bar   | 4743 MB |
(1 row)

Now, let’s try some of the queries with a random selection of columns. You can see from the execution plans that the execution times are 2440.374 ms and 2406.498 ms for query 1 and query 2 respectively. To avoid skew from disk IO, I made sure that the execution plans were captured after the index had been cached in memory.

Query 1
-------
postgres=# EXPLAIN ANALYZE select * from foo.bar where id4 = 295294 and zipcode = 13266;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Index Scan using idx_btree_bar on bar  (cost=0.57..1607120.58 rows=1 width=69) (actual time=1832.389..2440.334 rows=1 loops=1)
   Index Cond: ((id4 = 295294) AND (zipcode = 13266))
 Planning Time: 0.079 ms
 Execution Time: 2440.374 ms
(4 rows)
Query 2
-------
postgres=# EXPLAIN ANALYZE select * from foo.bar where id5 = 281326 and id6 = 894198;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_btree_bar on bar  (cost=0.57..1607120.58 rows=1 width=69) (actual time=1806.237..2406.475 rows=1 loops=1)
   Index Cond: ((id5 = 281326) AND (id6 = 894198))
 Planning Time: 0.096 ms
 Execution Time: 2406.498 ms
(4 rows)

Testing with Bloom Indexes

Let’s now create a bloom index on the same columns. As you can see from the following log, there is a huge size difference between the bloom (1342 MB) and the btree index (4743 MB). This is the first win. It took almost the same time to create the btree and the bloom index.

postgres=# CREATE INDEX idx_bloom_bar ON foo.bar USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=64, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4, col8=4);
CREATE INDEX
Time: 94833.801 ms (01:34.834)
postgres=# \di+ foo.idx_bloom_bar
                             List of relations
 Schema |     Name      | Type  |  Owner   | Table |  Size   | Description
--------+---------------+-------+----------+-------+---------+-------------
 foo    | idx_bloom_bar | index | postgres | bar   | 1342 MB |
(1 row)

Let’s run the same queries, check the execution time, and observe the difference.

Query 1
-------
postgres=# EXPLAIN ANALYZE select * from foo.bar where id4 = 295294 and zipcode = 13266;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on bar  (cost=1171823.08..1171824.10 rows=1 width=69) (actual time=1265.269..1265.550 rows=1 loops=1)
   Recheck Cond: ((id4 = 295294) AND (zipcode = 13266))
   Rows Removed by Index Recheck: 2984788
   Heap Blocks: exact=59099 lossy=36090
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..1171823.08 rows=1 width=0) (actual time=653.865..653.865 rows=99046 loops=1)
         Index Cond: ((id4 = 295294) AND (zipcode = 13266))
 Planning Time: 0.073 ms
 Execution Time: 1265.576 ms
(8 rows)
Query 2
-------
postgres=# EXPLAIN ANALYZE select * from foo.bar where id5 = 281326 and id6 = 894198;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on bar  (cost=1171823.08..1171824.10 rows=1 width=69) (actual time=950.561..950.799 rows=1 loops=1)
   Recheck Cond: ((id5 = 281326) AND (id6 = 894198))
   Rows Removed by Index Recheck: 2983893
   Heap Blocks: exact=58739 lossy=36084
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..1171823.08 rows=1 width=0) (actual time=401.588..401.588 rows=98631 loops=1)
         Index Cond: ((id5 = 281326) AND (id6 = 894198))
 Planning Time: 0.072 ms
 Execution Time: 950.827 ms
(8 rows)

From the above tests, it is evident that the bloom index performed better. Query 1 took 1265.576 ms with the bloom index and 2440.374 ms with the btree index. And query 2 took 950.827 ms with bloom and 2406.498 ms with btree. However, the same test would show a better result for a btree index if we had created the btree index on just those two columns (instead of on many columns).

Reducing False Positives

If you look at the execution plans generated with the bloom index in place (consider Query 2), the bitmap index scan returned 98631 candidate rows. However, the query output contains only one row, so the remaining 98630 rows are false positives. A btree index would not return any false positives.

In order to reduce false positives, you may have to increase both the signature length and the bits per column, guided by some of the formulas mentioned in this interesting blog post, together with experimentation and testing. As you increase the signature length and bits, the bloom index grows in size, but this may reduce false positives. If the time spent on a query is dominated by the number of false positives returned by the bloom index, you can increase the length. If increasing the length does not make much difference to performance, then you can leave the length as it is.
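
As a sketch of what one tuning iteration could look like (the length and per-column bit counts below are arbitrary starting points for experimentation, not recommendations), you could rebuild the index with a larger signature and more bits per column, then re-run the queries and compare the “Rows Removed by Index Recheck” figures:

DROP INDEX foo.idx_bloom_bar;
CREATE INDEX idx_bloom_bar ON foo.bar USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=128, col1=8, col2=8, col3=8, col4=8, col5=8, col6=8, col7=8, col8=8);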

Points to be carefully noted

  1. In the above tests, we have seen how a bloom index has performed better than a btree index. But, in reality, if we had created a btree index just on top of the two columns being used as predicates, the query would have performed much faster with a btree index than with a bloom index. This index does not replace a btree index unless we wish to replace a chunk of the indexes with a single bloom index.
  2. Just like hash indexes, a bloom index is applicable for equality operators only (see the sketch after this list).
  3. Some formulas on how to calculate the appropriate length of a bloom filter and the bits per column can be read on Wikipedia or in this blog post.
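
To see the second point for yourself: the bloom operator classes support only the equality operator, so a range predicate on an indexed column cannot use the bloom index at all. A quick way to confirm this is with EXPLAIN (a sketch; the plan chosen for the second statement will simply not reference idx_bloom_bar):

-- An equality predicate can use the bloom index...
EXPLAIN SELECT * FROM foo.bar WHERE id4 = 295294 AND zipcode = 13266;
-- ...but a range predicate on the same column cannot, so this plan
-- will not use idx_bloom_bar.
EXPLAIN SELECT * FROM foo.bar WHERE id4 > 295294;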

Conclusion

Bloom indexes are very helpful when we have a table that stores huge amounts of data and a lot of columns, where we find it difficult to create a large number of indexes, especially in OLAP environments where data is loaded from several sources and maintained for reporting. You could consider testing a single bloom index to see if you can avoid implementing a huge number of individual or composite indexes that could take additional disk space without much performance gain.

Jun 14, 2019

On the Road: Russia, UK and France Summer MeetUps

Not long after this year’s SouthEast LinuxFest 2019 in mid-June, I’ll be speaking at a series of open source database meetups in Russia and the UK, plus a big data MeetUp in France. At Percona we’re well aware of the growing number of Percona software users around the world, and whenever we have a chance to meet users at relevant local events we like to take advantage of the opportunity. The meetups listed below are free to attend. As well as the technical topics on the agendas, I’ll be sharing Percona’s plans and vision, and I’ll be ready to answer your questions. If you can make it, please come and say hi.

On behalf of team Percona I express gratitude to Mail.Ru Group in Moscow, Selectel in St.Petersburg and Elonsoft & IT61 Community in Rostov, Federico Razzoli in London, and Anastasia Lieva of the Big Data/Data Science MeetUp, Montpellier for hosting and organizing these events.

Saint Petersburg, June 26

Venue: Selectel office, Tsvetochnaya street, 21, Saint Petersburg
Date and time: Wednesday, June 26, 2019 meet at 6.30pm for a 7pm start
Registration: https://percona-events.timepad.ru/event/999696/

Programme

I’ll be sharing the floor with Sergei Petrunia of MariaDB Corporation.

  • Ten Things a Developer Should Know About Databases – Peter Zaitsev, CEO, Percona
  • MariaDB 10.4 – What’s New? – Sergei Petrunia, Software Developer, MariaDB Corporation

Rostov-on-Don, June 27

Venue: Rubin co-working space, Teatralniy avenue, 85, 4th floor, Rostov-on-Don
Date and time: Thursday, June 27, 2019 meet at 6.30pm for a 7pm start
Registration: https://percona-events.timepad.ru/event/999741/

Programme

In Rostov-on-Don, I’ll be presenting two talks:

  • Ten Things a Developer Should Know About Databases
  • MySQL: Scaling & High Availability

Moscow, July 1

Venue: Mail.Ru Group office, “Cinema hall” room,  Leningradsky avenue, 39, build. 79, Moscow
Date and time: Monday, July 1, 2019 meet at 6.00pm for a 6.30pm start
Registration: https://corp.mail.ru/ru/press/events/601/

Programme

In Moscow, I’ll be joined by Vlad Fedorkov of ProxySQL and Kirill Yukhin of Mail.Ru.

  • Ten Things a Developer Should Know About Databases – Peter Zaitsev, CEO, Percona
  • ProxySQL 2.0: How to Help MySQL Cope with Heavy Workload – Vlad Fedorkov, Lead Consultant, ProxySQL
  • Tarantool: Now with SQL – Kirill Yukhin, Engineering Team Lead, Mail.Ru

London, July 4

Venue: Innovation Warehouse, 1 East Poultry Avenue, London EC1A 9PT
Date and time: Thursday, July 4, 2019 meet at 6.00pm for a 6.30pm start
Registration: London Open Source Databases MeetUp

Programme

It would be great to see some new faces and interesting topics at the lightning talks. This is a great idea from Federico, and I hope that people take him up on it.

  • London OSDB community lightning talks, if you’d like to present please contact Federico!
  • Ten Things a Developer Should Know About Databases
  • Ask Me Anything!

Montpellier, July

At the time of writing this blog we still have a few things to iron out, but save the date and check out the MeetUp page… I’ll update this blog as soon as I have firm information.

Venue: tbc
Date and time: Tuesday, July 9 at 7:00pm 
MeetUp group: Big Data/Data Science MeetUp, Montpellier

Programme

I’ll be presenting two talks:

  • Performance Analyses and Troubleshooting Technologies for Databases
  • Data Visualization with Grafana

It’d be good to see you…

Whether you’re a customer, a technology user, a data enthusiast, thinking about applying for a role at a great company (Percona!), or perhaps just passionate about open source software, you’re invited to come and say “Hi” and ask questions of me or of my fellow speakers.

I look forward to seeing you! If you have any questions about the events, please feel free to contact Percona’s community team.

 


Map Photo by Ian on Unsplash

Jun 13, 2019

VMware announces intent to buy Avi Networks, startup that raised $115M

VMware has been trying to reinvent itself from a company that helps you build and manage virtual machines in your data center to one that helps you manage your virtual machines wherever they live, whether that’s on prem or in the public cloud. Today, the company announced it was buying Avi Networks, a six-year-old startup that helps companies balance application delivery in the cloud or on prem, in an acquisition that sounds like a pretty good match. The companies did not reveal the purchase price.

Avi claims to be the modern alternative to load balancing appliances designed for another age, when applications didn’t change much and lived on prem in the company data center. As companies move more workloads to public clouds like AWS, Azure and Google Cloud Platform, Avi is providing a more modern load balancing tool that not only balances software resource requirements based on location or need, but also tracks the data behind these requirements.

Diagram: Avi Networks

VMware has been trying to find ways to help companies manage their infrastructure, whether it is in the cloud or on prem, in a consistent way, and Avi is another step in helping them do that on the monitoring and load balancing side of things, at least.

Tom Gillis, senior vice president and general manager for the networking and security business unit at VMware sees this acquisition as fitting nicely into that vision. “This acquisition will further advance our Virtual Cloud Network vision, where a software-defined distributed network architecture spans all infrastructure and ties all pieces together with the automation and programmability found in the public cloud. Combining Avi Networks with VMware NSX will further enable organizations to respond to new opportunities and threats, create new business models, and deliver services to all applications and data, wherever they are located,” Gillis explained in a statement.

In a blog post,  Avi’s co-founders expressed a similar sentiment, seeing a company where it would fit well moving forward. “The decision to join forces with VMware represents a perfect alignment of vision, products, technology, go-to-market, and culture. We will continue to deliver on our mission to help our customers modernize application services by accelerating multi-cloud deployments with automation and self-service,” they wrote. Whether that’s the case, time will tell.

Avi’s customers, who will now become VMware customers, include Deutsche Bank, Telegraph Media Group, Hulu and Cisco. The company was founded in 2012 and raised $115 million, according to Crunchbase data. Investors included Greylock, Lightspeed Venture Partners and Menlo Ventures, among others.

Jun 13, 2019

Percona Server for MongoDB 4.0.10-5 Now Available


Percona announces the release of Percona Server for MongoDB 4.0.10-5 on June 13, 2019. Download the latest version from the Percona website or the Percona software repositories.

Percona Server for MongoDB is an enhanced, open source, and highly-scalable database that is a fully-compatible, drop-in replacement for MongoDB 4.0 Community Edition. It supports MongoDB 4.0 protocols and drivers.

Percona Server for MongoDB extends the functionality of the MongoDB 4.0 Community Edition by including the Percona Memory Engine storage engine, encrypted WiredTiger storage engine, audit logging, SASL authentication, hot backups, and enhanced query profiling. Percona Server for MongoDB requires no changes to MongoDB applications or code.

Percona Server for MongoDB 4.0.10-5 introduces the support of HashiCorp Vault key management service. For more information, see Data at Rest Encryption in the documentation of Percona Server for MongoDB.

This release includes all features of MongoDB 4.0 Community Edition.

Note that the MMAPv1 storage engine is deprecated in MongoDB 4.0 Community Edition.

Percona Server for MongoDB 4.0.10-5 is based on MongoDB 4.0.10.

New Features

The Percona Server for MongoDB 4.0.10-5 release notes are available in the official documentation.

Jun 13, 2019

IBM, KPMG, Merck, Walmart team up for drug supply chain blockchain pilot

IBM announced its latest blockchain initiative today. This one is in partnership with KPMG, Merck and Walmart to build a drug supply chain blockchain pilot.

These four companies are coming together to help come up with a solution to track certain drugs as they move through a supply chain. IBM is acting as the technology partner, KPMG brings a deep understanding of the compliance issues, Merck is of course a drug company, and Walmart would be a drug distributor through its pharmacies and care clinics.

The idea is to give each drug package a unique identifier that you can track through the supply chain from manufacturer to pharmacy to consumer. Seems simple enough, but the fact is that companies are loath to share any data with one another. The blockchain would provide an irrefutable record of each transaction as the drug moves along the supply chain, giving authorities and participants an easy audit trail.

The pilot is part of a set of programs being conducted by various stakeholders at the request of the FDA. The end goal is to find solutions to help comply with the U.S. Drug Supply Chain Security Act. According to the FDA Pilot Program website, “FDA’s DSCSA Pilot Project Program is intended to assist drug supply chain stakeholders, including FDA, in developing the electronic, interoperable system that will identify and trace certain prescription drugs as they are distributed within the United States.”

IBM hopes that this blockchain pilot will show it can build a blockchain platform or network on top of which other companies can build applications. “The network in this case, would have the ability to exchange information about these pharmaceutical shipments in a way that ensures privacy, but that is validated,” Mark Treshock, global blockchain solutions leader for healthcare and life sciences at IBM told TechCrunch.

He believes that this would help bring companies on board that might be concerned about the privacy of their information in a public system like this, something that drug companies in particular worry about. Trying to build an interoperable system is a challenge, but Treshock sees the blockchain as a tidy solution for this issue.

Some people have said that blockchain is a solution looking for a problem, but IBM has been looking at it more practically, with several real-world projects in production, including one to track leafy greens from field to store with Walmart and a shipping supply chain with Maersk to track shipping containers as they move throughout the world.

Treshock believes the Walmart food blockchain is particularly applicable here and could be used as a template of sorts to build the drug supply blockchain. “It’s very similar, tracking food to tracking drugs, and we are leveraging or adopting the assets that we built for food trust to this problem. We’re taking that platform and adapting it to track pharmaceuticals,” he explained.

Jun 12, 2019

RealityEngines.AI raises $5.25M seed round to make ML easier for enterprises

RealityEngines.AI, a research startup that wants to help enterprises make better use of AI, even when they only have incomplete data, today announced that it has raised a $5.25 million seed funding round. The round was led by former Google CEO and Chairman Eric Schmidt and Google founding board member Ram Shriram. Khosla Ventures, Paul Buchheit, Deepchand Nishar, Elad Gil, Keval Desai, Don Burnette and others also participated in this round.

The fact that the service was able to raise from this rather prominent group of investors clearly shows that its overall thesis resonates. The company, which doesn’t have a product yet, tells me that it specifically wants to help enterprises make better use of the smaller and noisier data sets they have and provide them with state-of-the-art machine learning and AI systems that they can quickly take into production. It also aims to provide its customers with systems that can explain their predictions and are free of various forms of bias, something that’s hard to do when the system is essentially a black box.

As RealityEngines CEO Bindu Reddy, who was previously the head of products for Google Apps, told me, the company plans to use the funding to build out its research and development team. The company, after all, is tackling some of the most fundamental and hardest problems in machine learning right now, and that costs money. Some of these problems, like working with smaller data sets, already have partial solutions, such as generative adversarial networks that can augment existing data sets, and RealityEngines expects to innovate on these.

Reddy is also betting on reinforcement learning as one of the core machine learning techniques for the platform.

Once it has its product in place, the plan is to make it available as a pay-as-you-go managed service that will make machine learning more accessible to large enterprises, but also to small and medium businesses, which increasingly need access to these tools to remain competitive.

Jun 12, 2019

Helium launches $51M-funded ‘LongFi’ IoT alternative to cellular

With 200X the range of Wi-Fi at 1/1000th of the cost of a cellular modem, Helium’s “LongFi” wireless network debuts today. Its transmitters can help track stolen scooters, find missing dogs via IoT collars and collect data from infrastructure sensors. The catch is that Helium’s tiny, extremely low-power, low-data transmission chips rely on connecting to P2P Helium Hotspots people can now buy for $495. Operating those hotspots earns owners a cryptocurrency token Helium promises will be valuable in the future…

The potential of a new wireless standard has allowed Helium to raise $51 million over the past few years from GV, Khosla Ventures and Marc Benioff, including a new $15 million Series C round co-led by Union Square Ventures and Multicoin Capital. That’s in part because one of Helium’s co-founders is Napster inventor Shawn Fanning. Investors are betting that he can change the tech world again, this time with a wireless protocol that like Wi-Fi and Bluetooth before it could unlock unique business opportunities.

Helium already has some big partners lined up, including Lime, which will test it for tracking its lost and stolen scooters and bikes when they’re brought indoors, obscuring other connectivity, or their battery is pulled out, deactivating GPS. “It’s an ultra low-cost version of a LoJack,” Helium CEO Amir Haleem says.

InvisiLeash will partner with it to build more trackable pet collars. Agulus will pull data from irrigation valves and pumps for its agriculture tech business. Nestle will track when it’s time to refill water in its ReadyRefresh coolers at offices, and Stay Alfred will use it to track occupancy status and air quality in buildings. Haleem also imagines the tech being useful for tracking wildfires or radiation.

Haleem met Fanning playing video games in the 2000s. He teamed up with Fanning and Sproutling baby monitor (sold to Mattel) founder Chris Bruce in 2013 to start work on Helium. They foresaw a version of Tile’s trackers that could function anywhere while replacing expensive cell connections for devices that don’t need high bandwidth. Helium’s 5 kilobit per second connections will compete with SigFox, another lower-power IoT protocol, though Haleem claims its more centralized infrastructure costs are prohibitive. It’s also facing off against Nodle, which piggybacks on devices’ Bluetooth hardware. Lucky for Helium, on-demand rental bikes and scooters that are perfect for its network have reached mainstream popularity just as Helium launches, six years after its start.

Helium says it has already pre-sold 80% of its Helium Hotspots for its first market in Austin, Texas. People connect them to their Wi-Fi and put them in their windows so the devices can pull in data from Helium’s IoT sensors over its open-source LongFi protocol. The hotspots then encrypt and send the data to the company’s cloud, which clients can plug into to track and collect info from their devices. The Helium Hotspots only require as much energy as a 12-watt LED light bulb to run, but that $495 price tag is steep. The lack of a concrete return on investment could deter later adopters from buying the expensive device.

Only 150-200 hotspots are necessary to blanket a city in connectivity, Haleem tells me. But because they need to be distributed across the landscape (a client can’t just fill their warehouse with the hotspots), and because the upfront price is expensive for individuals, Helium might need to sign up some retail chains as partners for deployment. As Haleem admits, “The hard part is the education.” Making hotspot buyers understand the potential (and risks) while demonstrating the opportunities for clients will require a ton of outreach and slick marketing.

Without enough Helium Hotspots, the Helium network won’t function. That means this startup will have to simultaneously win at telecom technology, enterprise sales and cryptocurrency for the network to pan out. As if one of those wasn’t hard enough.

Jun 12, 2019

Apollo raises $22M for its GraphQL platform

Apollo, a San Francisco-based startup that provides a number of developer and operator tools and services around the GraphQL query language, today announced that it has raised a $22 million growth funding round co-led by Andreessen Horowitz and Matrix Partners. Existing investors Trinity Ventures and Webb Investment Network also participated in this round.

Today, Apollo is probably the biggest player in the GraphQL ecosystem. At its core, the company’s services allow businesses to use the Facebook-incubated GraphQL technology to shield their developers from the patchwork of legacy APIs and databases as they look to modernize their technology stacks. The team argues that while REST APIs that talked directly to other services and databases still made sense a few years ago, they don’t anymore now that the number of API endpoints keeps increasing rapidly.

Apollo replaces this with what it calls the Data Graph. “There is basically a missing piece where we think about how people build apps today, which is the piece that connects the billions of devices out there,” Apollo co-founder and CEO Geoff Schmidt told me. “You probably don’t just have one app anymore, you probably have three, for the web, iOS and Android. Or maybe six. And if you’re a two-sided marketplace you’ve got one for buyers, one for sellers and another for your ops team.”

Managing the interfaces between all of these apps quickly becomes complicated and means you have to write a lot of custom code for every new feature. The promise of the Data Graph is that developers can use GraphQL to query the data in the graph and move on, all without having to write the boilerplate code that typically slows them down. At the same time, the ops teams can use the Graph to enforce access policies and implement other security features.

“If you think about it, there’s a lot of analogies to what happened with relational databases in the ’80s,” Schmidt said. “There is a need for a new layer in the stack. Previously, your query planner was a human being, not a piece of software, and a relational database is a piece of software that would just give you a database. And you needed a way to query that database, and that syntax was called SQL.”

Geoff Schmidt, Apollo CEO, and Matt DeBergalis, CTO

GraphQL itself, of course, is open source. Apollo is now building a lot of the proprietary tools around this idea of the Data Graph that make it useful for businesses. There’s a cloud-hosted graph manager, for example, that lets you track your schema, a dashboard to track performance, and integrations with continuous integration services. “It’s basically a set of services that keep track of the metadata about your graph and help you manage the configuration of your graph and all the workflows and processes around it,” Schmidt said.

The development of Apollo didn’t come out of nowhere. The founders previously launched Meteor, a framework and set of hosted services that allowed developers to write their apps in JavaScript, both on the front-end and back-end. Meteor was tightly coupled to MongoDB, though, which worked well for some use cases but also held the platform back in the long run. With Apollo, the team decided to go in the opposite direction and instead build a platform that makes being database agnostic the core of its value proposition.

The company also recently launched Apollo Federation, which makes it easier for businesses to work with a distributed graph. Sometimes, after all, your data lives in lots of different places. Federation allows for a distributed architecture that combines all of the different data sources into a single schema that developers can then query.

Schmidt tells me the company started to get some serious traction last year and by December, it was getting calls from VCs that heard from their portfolio companies that they were using Apollo.

The company plans to use the new funding to build out its technology and to scale its field team to support the enterprises that bet on its technology, including the open-source technologies that power its services.

“I see the Data Graph as a core new layer of the stack, just like we as an industry invested in the relational database for decades, making it better and better,” Schmidt said. “We’re still finding new uses for SQL and that relational database model. I think the Data Graph is going to be the same way.”

Jun 11, 2019

How to Report Bugs, Improvements, New Feature Requests for Percona Products

Clear and structured bug, improvement, and new feature request reports are always helpful for validating and fixing issues. In a few cases, we have received reports with incomplete information, which can delay the verification of an issue. The most effective way to avoid this situation is to include complete information about the issue when filing a report.

In this post we will discuss:

  • The best ways to report an issue for Percona products
  • Including a “how to reproduce” test case to verify the issue
  • The purpose of bug/improvement/new feature verification

https://jira.percona.com is the central place to report a bug/improvement/new feature request for all Percona products.

Let’s first discuss a few important entries which you should update when reporting an issue.

Project: The product name for which you wish to report an issue.

Issue Type: Provides options for the type of request. Example: Bug/Improvement/New Feature Request.

Summary: Summary of the issue which will serve as the title. It should be a one-line summary.

Affects Version/s: The version number of the Percona software for which you are reporting an issue.

Description: This field is to describe your issue in detail. Issue description should be clear and concise.

Bug report:

  • Describe the actual issue.
  • Add test case/steps to reproduce the issue if possible.
  • If it is a crash bug, provide my.cnf and the error log as additional information, along with the details mentioned in this blog post.
  • In some cases, the supporting file for bug reports such as a coredump, sqldump, or error log is prohibitively large. For these cases, we have an SFTP server where these files can be uploaded.

Documentation bug:

  • Provide the documentation link, describe what is wrong, and suggest how the documentation could be improved.

Improvement/New Feature Request:

  • For new features, describe the need and use case. Include answers to questions such as “What problem will it solve?” and “How will it benefit the users?”
  • In the case of improvements, mention what is problematic with the current behavior. What is your expectation as an improvement in a particular product feature?

Note: When reporting an issue, be sure to remove/replace sensitive information such as IP addresses, usernames, passwords, etc. from the report description and attached files.

Upstream Bugs:

Percona Server for MySQL and Percona Server for MongoDB are patched versions of their upstream codebases. It is possible that the particular issue originated from the upstream version. For these products, it would be helpful if the reporter also checks upstream for the same issue. If issues exist upstream, use the following URLs to report an issue for these respective products.

MySQL: https://bugs.mysql.com
MongoDB: https://jira.mongodb.org

If you are a Percona customer, please file a support request to let us know how the bug affects you.

Purpose of Bug/Improvement/New Feature Request verification

  • Gather the required information from the reporter and identify whether the reported issue is a valid bug/improvement/new feature.
  • For bugs, create a reproducible test case. To be addressed effectively, a bug must be repeatable on demand.

Any incorrect assumptions can break other parts of the code while fixing a particular bug; this is why the verification process is important to identify the exact problem. Another benefit of having a reproducible test case is that it can then be used to verify the fix.

While feature requests and improvements are about ideas on how to improve Percona products, they still need to be verified. We need to ensure that behavior reported as a new feature or improvement:

  • Is not a bug
  • Is not implemented yet
  • For new feature verification, we also check whether there is an existing, different way to achieve the same result.

Once bugs, improvements, and new feature requests are validated, the issue status will be set to “Open” and the issue will move forward for implementation.

Jun 11, 2019

WhatsApp is finally going after outside firms that are abusing its platform

WhatsApp has so far relied on past dealings with bad players within its platform to ramp up its efforts to curtail spam and other automated behavior. The Facebook-owned giant has now announced an additional step it plans to take, beginning later this year, to improve the health of its messaging service: going after those whose mischievous activities can’t be traced within its platform.

The messaging platform, used by more than 1.5 billion users, confirmed on Tuesday that starting December 7 it will start considering signals off its platform to pursue legal actions against those who are abusing its system. The company will also go after individuals who — or firms that — falsely claim to have found ways to cause havoc on the service.

The move comes as WhatsApp grapples with challenges such as spam behavior to push agendas or spread of false information on its messaging service in some markets. “This serves as notice that we will take legal action against companies for which we only have off-platform evidence of abuse if that abuse continues beyond December 7, 2019, or if those companies are linked to on-platform evidence of abuse before that date,” it said in an FAQ post on its site.

A WhatsApp spokesperson confirmed the change to TechCrunch, adding, “WhatsApp was designed for private messaging, so we’ve taken action globally to prevent bulk messaging and enforce limits on how WhatsApp accounts that misuse WhatsApp can be used. We’ve also stepped up our ability to identify abuse, which helps us ban 2 million accounts globally per month.”

Earlier this year, WhatsApp said (PDF) it had built a machine learning system to detect and weed out users who engage in inappropriate behavior, such as sending bulk messages or creating multiple accounts with the intention to harm the service. The platform said it was able to assess past dealings with problematic behaviors to ban 20% of bad accounts at the time of registration itself.

But the platform is still struggling to contain abusive behavior, a Reuters report claimed last month. The news agency reported on tools readily being sold in India for under $15 that claimed to bypass some of the restrictions that WhatsApp introduced in recent months.

TechCrunch understands that with today’s changes, WhatsApp is going after those same set of bad players. It has already started to send cease and desist letters to marketing companies that claim to abuse WhatsApp in recent months, a person familiar with the matter said.
