Jan 15, 2019

Microsoft continues to build government security credentials ahead of JEDI decision

While the DoD is in the process of reviewing the $10 billion JEDI cloud contract RFPs (assuming the work continues during the government shutdown), Microsoft continues to build up its federal government security bona fides, regardless.

Today the company announced it has achieved the highest level of federal government clearance for the Outlook mobile app, allowing US Government Community Cloud (GCC) High and Department of Defense employees to use the mobile app. This is on top of FedRAMP compliance, which the company achieved last year.

“To meet the high level of government security and compliance requirements, we updated the Outlook mobile architecture so that it establishes a direct connection between the Outlook mobile app and the compliant Exchange Online backend services using a native Microsoft sync technology and removes middle tier services,” the company wrote in a blog post announcing the update.

The update will allow these highly security-conscious employees to access some of the more recent updates to Outlook Mobile, such as the ability to add a comment when canceling an event.

This is in line with government security updates the company made last year. While none of these changes are specifically designed to help win the $10 billion JEDI cloud contract, they certainly help make a case for Microsoft from a technology standpoint.

As Microsoft corporate vice president for Azure Julia White wrote in a blog post last year, which we covered, “Moving forward, we are simplifying our approach to regulatory compliance for federal agencies, so that our government customers can gain access to innovation more rapidly.” The Outlook Mobile release is clearly in line with that.

Today’s announcement comes after the Pentagon announced just last week that it had awarded Microsoft a separate large contract worth $1.7 billion. This involves providing Microsoft Enterprise Services for the Department of Defense (DoD), Coast Guard and the intelligence community, according to a statement from DoD.

All of this comes ahead of a decision on the massive $10 billion, winner-take-all cloud contract. Final RFPs were submitted in October and the DoD is expected to make a decision in April. The process has not been without controversy, with Oracle and IBM submitting formal protests even before the RFP deadline — and more recently, Oracle filing a lawsuit alleging the contract terms violate federal procurement laws. Oracle has been particularly concerned that the contract was designed to favor Amazon, a point the DoD has repeatedly denied.

Jan 15, 2019

Campaign Monitor acquires enterprise email services Sailthru and Liveclicker

CM Group, the organization behind email-centric services like Campaign Monitor and Emma, today announced that it has acquired marketing automation firm Sailthru and the email personalization service Liveclicker. The group did not disclose the acquisition prices but noted that the two deals would bring in about $60 million in additional revenue and 540 new customers, including Bloomberg and Samsung. Both acquisitions quietly closed in 2018.

Compared to Sailthru, which had raised a total of about $250 million in venture funding before the acquisition, Liveclicker is a relatively small company that was bootstrapped and never raised any outside funding. Still, Liveclicker managed to attract customers like AT&T, Quicken Loans and TJX Companies by offering them the ability to personalize their email messages and tailor them to their customers.

Sailthru’s product portfolio is also quite a bit broader and includes similar email marketing tools, but also services to personalize mobile and web experiences, as well as tools to predict churn and make other retail-focused predictions.

“Sailthru and Liveclicker are extraordinary technologies capable of solving important marketing problems, and we will be making additional investments in the businesses to further accelerate their growth,” writes Wellford Dillard, CEO of CM Group. “Bringing these brands together makes it possible for us to provide marketers with the ideal solution for their needs as they navigate the complex and rapidly changing environments in which they operate.”

With this acquisition, the CM Group now has 500 employees and 300,000 customers.

Jan 15, 2019

Customizing Per-Process Metrics in PMM

Process Memory Usage - a filtered graph in PMM

If you have set up per-process metrics in Percona Monitoring and Management, you may have found yourself in need of tuning it further, to not only group processes together but also to display some of them in isolation. In this blog post we will explore how to modify the rules for grouping processes, so that you can make the most out of this awesome PMM integration.

Let’s say you have followed the link above on how to set up the per-process metrics integration on PMM, and you have imported the dashboard to show these metrics. You will see something like the following:

PMM database and system monitoring and management software

This is an internal testing server we use, on which you can see a high number of VBoxHeadless (29) and mysqld (99) processes running. All the metrics in the dashboard will be grouped by the name of the command used. But what if we want to see metrics for only one of these processes in isolation? As things stand, we will not be able to do so. It may not make sense in a testing environment, but if you are running multiple mysqld processes (or mongos, postgres, etc.) bound to different ports, you may want to see metrics for each of them separately.

Modifying the configuration file

Enter all.yaml!

In the process-exporter documentation on using a configuration file, we can see the following:

The general format of the -config.path YAML file is a top-level process_names section, containing a list of name matchers. […] A process may only belong to one group: even if multiple items would match, the first one listed in the file wins.

This means that even if we have two rules that would match a process, only the first one will be taken into account. This will allow us to both list processes by themselves, and not miss any non-grouped process. How? Let’s imagine we have the following processes running:

mysqld --port=1
mysqld --port=2
mysqld --port=3
mysqld --port=4

If we wanted to be able to tell apart the instances running on ports 1 and 2 from the other ones, we could use the following rules:

- name: "mysqld_port_1"
  cmdline:
  - '.*mysqld.*port=1.*'
- name: "mysqld_port_2"
  cmdline:
  - '.*mysqld.*port=2.*'
- name: "{{.Comm}}"
  cmdline:
  - '.+'

In cmdline we specify the regular expression against which the running process command is matched. In this case, we made use of the fact that the instances were using different ports, but any difference in the command strings can be used. The last rule is the one that will default to “anything else” (with a regular expression that matches anything).

The default rule at the end will make sure you don’t miss any other process, so unless you want only some processes’ metrics collected, you should always have a rule for it.
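Conversely, if you only wanted metrics for a couple of specific processes, you could simply omit the catch-all rule. Here is a minimal sketch (the process names and regular expressions are purely illustrative, following the same format as above):

process_names:
  - name: "mysqld"
    cmdline:
    - '.*mysqld.*'
  - name: "postgres"
    cmdline:
    - '.*postgres.*'

With a configuration like this, any process that does not match one of the listed rules is simply not collected.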

A real-life working example of configuring per-process metrics

In case all this general information didn’t make much sense, we will present a concrete example, hoping that it will make everything fit together nicely.

In this example we want to have the mysqld instance using the mysql_sandbox16679.sock socket isolated from all the others, and the VM with the ID ending in 97eafa2795da listed on its own. All other processes are to be grouped together by using the basename of the executable.

You can check the output from ps aux to see the full command used. For instance:

shell> ps aux | grep 97eafa2795da
agustin+ 27785  0.7 0.2 5619280 542536 ?      Sl Nov28 228:24 /usr/lib/virtualbox/VBoxHeadless --comment centos_node1_1543443575974_22181 --startvm a0151e29-35dd-4c14-8e37-97eafa2795da --vrde config

So, we can use the following regular expression for it (we use .* to match any string):

.*VBoxHeadless.*97eafa2795da.*

The same applies to the regular expression for the mysqld process.

The configuration file will end up looking like:

shell>  cat /etc/process-exporter/all.yaml
process_names:
 - name: "Custom VBox"
   cmdline:
   - '.*VBoxHeadless.*97eafa2795da.*'
 - name: "Custom MySQL"
   cmdline:
   - '.*mysqld.*mysql_sandbox16679.sock.*'
 - name: "{{.Comm}}"
   cmdline:
   - '.+'

Let’s restart the service so that the changes apply, and check the graphs after five minutes to see the results. Note that you may have to reload the page for the changes to show up.

shell> systemctl restart process-exporter

After refreshing, we will see the new list of processes in the drop-down list:

A new list of processes in PMM after filtering

And after we select them, we will be able to see data for those processes in particular:

Thanks to the default configuration at the end, we are still capturing data from all the other mysqld processes. However, they will have their own group, as mentioned before:

System Processes Metrics graph in PMM

 

Jan 15, 2019

Data management startup Rubrik gets $261M at a $3.3B valuation as it moves into security and compliance

There is a growing demand for stronger security at every point in the IT ecosystem, and today, one of the more successful enterprise startups to emerge in the last several years is announcing a big round of funding to provide that.

Rubrik, which provides enterprise data management and backup services across on-premise, cloud and hybrid networks, has raised $261 million in funding at a $3.3 billion valuation from Bain Capital Ventures and previous investors Lightspeed Venture Partners, Greylock Partners, Khosla Ventures and IVP. It intends to use the funding to build (and buy) tech to expand deeper into security and compliance services alongside its existing data management products.

“As we have demonstrated leadership in data recovery, our customers have been demanding new products and services from us,” CEO and co-founder Bipul Sinha said in an interview, “so we’ve raised capital to double down on that.”

This Series E brings the total raised by Rubrik to $553 million, and it is a big leap for the company: its last raise of $180 million, in 2017, valued Rubrik at $1.3 billion.

Rubrik is not disclosing any other specific financial numbers with the news — Sinha’s response to the question was that he thinks the valuation jump speaks for itself. He also confirmed the company is not profitable, but intentionally so.

“Our goal is to build a long-term, iconic company, and so we want to become profitable but not at the cost of growth,” he said. “We are leading this market transformation while it continues to grow.”

That market transformation is to provide services — and up to now, specifically data back-up services — for enterprises that operate their networks across a hybrid environment, with data used and stored on premises, in the cloud, and sometimes in multiple clouds.

There are a number of other companies that compete with it in backup including biggies like Druva, CommVault and EMC, but Rubrik was an early mover in identifying a need to backup and provide data recovery across a mix of locations.

Moving into security and compliance is a natural progression for the company.

There has always been a synergy between Rubrik’s core business and security/compliance. Often the need for backup and recovery arises specifically as a result of security breaches or other glitches that result from people accessing data when they are not supposed to, and that issue gets compounded when you have data stored and used across multiple locations.

“The fragmentation across cloud and on-prem services creates issues around security and data management,” Sinha said. “The more fragmentation you have, the more important Rubrik [or other data management services] get.”

Similarly, moving into security and compliance together goes hand-in-hand because both address similar needs at companies to be handling information responsibly. “Security and compliance are joined at the hip from a regulatory perspective,” Sinha said.

Up to now, Rubrik has mostly built its service from the ground up. One notable exception has been that it made an acquisition — its first — last year when it acquired NoSQL data backup specialist Datos IO, which helped Rubrik further expand from appliance-based management to cloud-based.

In the case of adding on more security and compliance offerings, it’s not clear yet whether that will be built organically or via acquisition (and there are indeed a number of security startups out there that could be candidates if it’s the latter).

“Rubrik is fundamentally an innovation driven company,” Sinha said. “We like coherent and consistent architecture. Having said that, as a responsible and ambitious company, we are always looking at the marketplace, at where there are the teams that we can acquire.”

Notably, the company has started to signal its growing interest in this area in recent months. The latest build of its flagship Andes data management platform placed security features center stage, and now we can expect to see more of that.

Existing customer loyalty has always attracted investors to the company, and that’s been the case here, too.

At a time when many tech observers are wondering if we are gearing up for a “winter” in the startup ecosystem — where, in a buoyant climate, investors have gone all-in with perhaps too much exuberance that will not bear out in terms of startups’ actual performance — the thinking is that Rubrik’s track record will help it continue to win business both on its legacy services, and as it ventures into newer areas.

“Rubrik has won the trust and loyalty of large enterprise customers around the globe by offering a simple and reliable solution that solves the challenge of protecting and managing data in a hybrid cloud world,” said Enrique Salem, former CEO at Symantec and Partner at Bain Capital Ventures, in a statement. “Given my experience leading the largest enterprise data protection company, we are confident that Rubrik is positioned to win and be the market leader in enterprise cloud data management.”

Jan 15, 2019

Smartsheet acquires Slope to help creatives collaborate

Smartsheet, the project management and collaboration tool that went public last April, announced the acquisition of Seattle-based TernPro, Inc., makers of Slope, a collaboration tool designed for sharing creative assets.

The companies did not share the acquisition price.

Bringing Slope into the fold will enable Smartsheet users to share assets like video and photos natively inside the application, and also brings the ability to annotate, comment or approve these assets. Smartsheet sees this native integration through a broad enterprise lens. It might be HR sharing training videos, marketing sharing product photos or construction company employees inspecting a site and sharing photos of a code violation, complete with annotations to point out the problem.

Alan Lepofsky, an analyst at Constellation Research who specializes in collaboration tools in the enterprise, sees this as a significant enhancement to the product. “Smartsheet’s focus is on being more than just project management, but instead helping coordinate end-to-end business processes. Slope is going to allow content to become more of a native part of those processes, rather than people having to switch context to another tool,” he explained.

That last point is particularly important, as today’s collaboration tools, whether Slack or Microsoft Teams or any other similar tool, have been working hard to provide that kind of integration to keep people focused on the task at hand without having to switch applications.

Mike Gotta, a longtime analyst at Gartner, says collaboration that happens within the flow of work can help make employees more productive, but being able to build specific use cases is even more critical. “The collaboration space remains open for innovation and new ways to addressing old challenges. For organizations though, the trick is how to create a collaboration portfolio that balances broad-based foundational investments with the more domain-specific or situational scenarios they might have where this type of use-case driven collaboration can make more sense,” Gotta told TechCrunch.

That is precisely what Smartsheet is trying to achieve with this purchase, giving them the ability to incorporate workflows involving creative assets, whether that’s including all of the documents required to onboard a new employee or a training workflow that includes learning objectives, lesson plans, photos, videos and so forth.

Smartsheet, which launched in 2005, raised more than $113 million before going public last April. The company’s stock price has held up, gaining ground in a volatile stock market. It sits above its launch price of $19.50, closing at $25.24 yesterday.

Slope was founded in 2014 and has raised $1.4 million, according to Crunchbase data. Customers include Microsoft, CBS Sports and the Oakland Athletics baseball team. The company’s employees, including co-founders Dan Bloom and Brian Boschè, have already joined Smartsheet.

Jan 14, 2019

Upcoming Webinar Thurs 1/17: How to Rock with MyRocks

How to Rock with MyRocks

Please join Percona’s Chief Technology Officer, Vadim Tkachenko, as he presents How to Rock with MyRocks on Thursday, January 17th at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

Register Now

MyRocks is a new storage engine from Facebook and is available in Percona Server for MySQL. In what cases might you want to use it? We will look at different workloads and discuss when MyRocks is most suitable for you. Also, as with any new engine, it’s important to set it up and tune it properly, so we will review the most important settings to pay attention to.

Register for this webinar to learn How to Rock with MyRocks.

Jan 14, 2019

Should You Use ClickHouse as a Main Operational Database?

ClickHouse as a main operational database

First of all, this post is not a recommendation but more like a “what if” story. What if we use ClickHouse (which is a columnar analytical database) as our main datastore? Well, typically, an analytical database is not a replacement for a transactional or key/value datastore. However, ClickHouse is super efficient for timeseries and provides “sharding” out of the box (scalability beyond one node). So can we use it as our main datastore?

Let’s imagine we are running a webservice and provide a public API. Public API as a service has become a good business model: examples include social networks like Facebook/Twitter, messaging as a service like Twilio, and even credit card authorization platforms like Marqeta. Let’s also imagine we need to store all messages (SMS messages, email messages, etc.) we are sending and allow our customers to get various information about a message. This information can be a mix of analytical (OLAP) queries (i.e. how many messages were sent during some time period and how much it cost) and typical key/value queries like: “return 1 message by the message id”.

Using a columnar analytical database can be a big challenge here. Although such databases can be very efficient with counts and averages, some queries will be slow or simply non-existent. Analytical databases are optimized for a small number of heavy, long-running queries rather than many fast point lookups. The most important limitations of analytical databases are:

  1. Deletes and updates are non-existent or slow
  2. Inserts are efficient for bulk inserts only
  3. No secondary indexes means that point selects (select by ID) tend to be very slow

This is all true for ClickHouse, however, we may be able to live with it for our task.

To simulate text messages I have used ~3 billion reddit comments (10 years, from 2007 to 2017), downloaded from pushshift.io. Vadim published a blog post about analyzing reddit comments with ClickHouse. In my case, I’m using this data as a simulation of text messages, and will show how we can use ClickHouse as a backend for an API.

Loading the JSON data to ClickHouse

I used the following table in ClickHouse to load all data:

CREATE TABLE reddit.rc(
body String,
score_hidden Nullable(UInt8),
archived Nullable(UInt8),
name String,
author String,
author_flair_text Nullable(String),
downs Nullable(Int32),
created_utc UInt32,
subreddit_id String,
link_id Nullable(String),
parent_id Nullable(String),
score Nullable(Int16),
retrieved_on Nullable(UInt32),
controversiality Nullable(Int8),
gilded Nullable(Int8),
id String,
subreddit String,
ups Nullable(Int16),
distinguished Nullable(String),
author_flair_css_class Nullable(String),
stickied Nullable(UInt8),
edited Nullable(UInt8)
) ENGINE = MergeTree() PARTITION BY toYYYYMM(toDate(created_utc)) ORDER BY created_utc ;

Then I used the following command to load the JSON data (downloaded from pushshift.io) to ClickHouse:

$ bzip2 -d -c RC_20*.bz2 | clickhouse-client --input_format_skip_unknown_fields 1 --input_format_allow_errors_num 1000000 -d reddit -n --query="INSERT INTO rc FORMAT JSONEachRow"

The data on disk in ClickHouse is not significantly larger than compressed files, which is great:

#  du -sh /data/clickhouse/data/reddit/rc/
638G    /data/clickhouse/data/reddit/rc/
# du -sh /data/reddit/
404G    /data/reddit/

We have ~4 billion rows:

SELECT
    toDate(min(created_utc)),
    toDate(max(created_utc)),
    count(*)
FROM rc
┌─toDate(min(created_utc))─┬─toDate(max(created_utc))─┬────count()─┐
│               2006-01-01 │               2018-05-31 │ 4148248585 │
└──────────────────────────┴──────────────────────────┴────────────┘
1 rows in set. Elapsed: 11.554 sec. Processed 4.15 billion rows, 16.59 GB (359.02 million rows/s., 1.44 GB/s.)

The data is partitioned and sorted by created_utc, so queries which include created_utc will be able to use partition pruning and therefore skip the partitions that are not needed. However, let’s say our API needs to support the following features, which are not common for analytical databases:

  1. Selecting a single comment/message by ID
  2. Retrieving the last 10 or 100 of the messages/comments
  3. Updating a single message in the past (e.g. in the case of messages, we may need to update the final price; in the case of comments, we may need to upvote or downvote a comment)
  4. Deleting messages
  5. Text search

With the latest ClickHouse version, all of these features are available, but some of them may not perform fast enough.

Retrieving a single row in ClickHouse

Again, this is not a typical operation in any analytical database; those databases are simply not optimized for it. ClickHouse does not have secondary indexes, and we are using created_utc as the primary key (sort key). So, selecting a message by ID alone will require a full table scan:

SELECT
    id,
    created_utc
FROM rc
WHERE id = 'dbumnpz'
┌─id──────┬─created_utc─┐
│ dbumnpz │  1483228800 │
└─────────┴─────────────┘
1 rows in set. Elapsed: 18.070 sec. Processed 4.15 billion rows, 66.37 GB (229.57 million rows/s., 3.67 GB/s.)

If, however, we know the timestamp (created_utc), the query will be lightning fast, as ClickHouse will use the primary key:

SELECT *
FROM rc
WHERE (id = 'dbumnpz') AND (created_utc = 1483228800)
...
1 rows in set. Elapsed: 0.010 sec. Processed 8.19 thousand rows, 131.32 KB (840.27 thousand rows/s., 13.47 MB/s.)

Actually, we can simulate an additional index by creating a materialized view in ClickHouse:

create materialized view rc_id_v
ENGINE MergeTree() PARTITION BY toYYYYMM(toDate(created_utc)) ORDER BY (id)
POPULATE AS SELECT id, created_utc from rc;

Here I’m creating a materialized view and populating it initially from the main (rc) table. The view will be updated automatically when there are any inserts into table reddit.rc. The view is actually another MergeTree table sorted by id. Now we can use this query:

SELECT *
FROM rc
WHERE (id = 'dbumnpz') AND (created_utc =
(
    SELECT created_utc
    FROM rc_id_v
    WHERE id = 'dbumnpz'
))
...
1 rows in set. Elapsed: 0.053 sec. Processed 8.19 thousand rows, 131.32 KB (153.41 thousand rows/s., 2.46 MB/s.)

This is a single query which uses our materialized view in a subquery to pass the created_utc (timestamp) to the original table. It is a little bit slower, but still under 100ms response time.

Using this trick (materialized views) we can potentially simulate other indexes.
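For example, here is a hypothetical sketch of a second “index”, this time for looking up comments by author (the view name rc_author_v and the user name are illustrative; the pattern simply mirrors rc_id_v above):

create materialized view rc_author_v
ENGINE MergeTree() PARTITION BY toYYYYMM(toDate(created_utc)) ORDER BY (author)
POPULATE AS SELECT author, created_utc, id from rc;

-- the view is sorted by author, so this lookup is fast; the timestamps it
-- returns then let the main table use its primary key instead of a full scan
SELECT *
FROM rc
WHERE (author = 'some_user') AND (created_utc IN
(
    SELECT created_utc
    FROM rc_author_v
    WHERE author = 'some_user'
))

Keep in mind that POPULATE on a table of this size will itself take a while, and that every extra materialized view adds write amplification on inserts.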

Retrieving the last 10 messages

This is where ClickHouse is not very efficient. Let’s say we want to retrieve the last 10 comments:

SELECT
    id,
    created_utc
FROM rc
ORDER BY created_utc DESC
LIMIT 10
┌─id──────┬─created_utc─┐
│ dzwso7l │  1527811199 │
│ dzwso7j │  1527811199 │
│ dzwso7k │  1527811199 │
│ dzwso7m │  1527811199 │
│ dzwso7h │  1527811199 │
│ dzwso7n │  1527811199 │
│ dzwso7o │  1527811199 │
│ dzwso7p │  1527811199 │
│ dzwso7i │  1527811199 │
│ dzwso7g │  1527811199 │
└─────────┴─────────────┘
10 rows in set. Elapsed: 24.281 sec. Processed 4.15 billion rows, 82.96 GB (170.84 million rows/s., 3.42 GB/s.)

In a conventional relational database (like MySQL) this can be done by reading a B-tree index sequentially from the end, as the index is sorted (like the “tail” command on Linux). In a partitioned, massively parallel database system, the storage format and sorting algorithm may not be optimized for that operation, as we are reading multiple partitions in parallel. Currently, an issue is open to make “tailing” based on the primary key much faster: slow order by primary key with small limit on big data. As a temporary workaround we can do something like this:

SELECT count()
FROM rc
WHERE (created_utc > (
(
    SELECT max(created_utc)
    FROM rc
) - ((60 * 60) * 24))) AND (subreddit = 'programming')
┌─count()─┐
│    1248 │
└─────────┘
1 rows in set. Elapsed: 4.510 sec. Processed 3.05 million rows, 56.83 MB (675.38 thousand rows/s., 12.60 MB/s.)

It is still a five-second query. Hopefully, this type of query will become faster in ClickHouse.
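Along the same lines, one possible sketch for the “last 10” query itself (assuming new comments arrive continuously, so the last hour of data is non-empty) is to narrow the scan to a recent time window first, so that the ORDER BY ... LIMIT only has to sort a small slice of the table:

SELECT
    id,
    created_utc
FROM rc
WHERE created_utc > (
(
    SELECT max(created_utc)
    FROM rc
) - 3600)
ORDER BY created_utc DESC
LIMIT 10

The max(created_utc) subquery still has the same cost as in the count() workaround above, but the expensive sort is now limited to the rows from the last hour.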

Updating / deleting data in ClickHouse

The latest ClickHouse version allows running update/delete in the form of “ALTER TABLE .. UPDATE / DELETE” (it is called mutations in ClickHouse terms). For example, we may want to upvote a specific comment.

SELECT score
FROM rc_2017
WHERE (id = 'dbumnpz') AND (created_utc =
(
    SELECT created_utc
    FROM rc_id_v
    WHERE id = 'dbumnpz'
))
┌─score─┐
│     2 │
└───────┘
1 rows in set. Elapsed: 0.048 sec. Processed 8.19 thousand rows, 131.08 KB (168.93 thousand rows/s., 2.70 MB/s.)
:) alter table rc_2017 update score = score +1 where id =  'dbumnpz' and created_utc = (select created_utc from rc_id_v where id =  'dbumnpz');
ALTER TABLE rc_2017
    UPDATE score = score + 1 WHERE (id = 'dbumnpz') AND (created_utc =
    (
        SELECT created_utc
        FROM rc_id_v
        WHERE id = 'dbumnpz'
    ))
Ok.
0 rows in set. Elapsed: 0.052 sec.

“Mutation” queries will return immediately and will be executed asynchronously. We can see the progress by reading from the system.mutations table:

select * from system.mutations\G
SELECT *
FROM system.mutations
Row 1:
──────
database:                   reddit
table:                      rc_2017
mutation_id:                mutation_857.txt
command:                    UPDATE score = score + 1 WHERE (id = 'dbumnpz') AND (created_utc = (SELECT created_utc FROM reddit.rc_id_v  WHERE id = 'dbumnpz'))
create_time:                2018-12-27 22:22:05
block_numbers.partition_id: ['']
block_numbers.number:       [857]
parts_to_do:                0
is_done:                    1
1 rows in set. Elapsed: 0.002 sec.

Now we can try deleting comments that have been marked for deletion (body showing “[deleted]”):

ALTER TABLE rc_2017
    DELETE WHERE body = '[deleted]'
Ok.
0 rows in set. Elapsed: 0.002 sec.
:) select * from system.mutations\G
SELECT *
FROM system.mutations
...
Row 2:
──────
database:                   reddit
table:                      rc_2017
mutation_id:                mutation_858.txt
command:                    DELETE WHERE body = '[deleted]'
create_time:                2018-12-27 22:41:01
block_numbers.partition_id: ['']
block_numbers.number:       [858]
parts_to_do:                64
is_done:                    0
2 rows in set. Elapsed: 0.017 sec.

After a while, we can check system.mutations again:

:) select * from system.mutations\G
SELECT *
FROM system.mutations
...
Row 2:
──────
database:                   reddit
table:                      rc_2017
mutation_id:                mutation_858.txt
command:                    DELETE WHERE body = '[deleted]'
create_time:                2018-12-27 22:41:01
block_numbers.partition_id: ['']
block_numbers.number:       [858]
parts_to_do:                0
is_done:                    1

As we can see our “mutation” is done.

Text analysis

ClickHouse does not offer full text search, however we can use some text functions. In my previous blog post about ClickHouse I used them to find the most popular Wikipedia page of the month. This time I’m trying to find the news keywords of the year using all reddit comments: basically, I’m calculating the most frequently used new words for a specific year (the algorithm is based on an article about finding trending topics using Google Books n-grams data). To do that I’m using the ClickHouse function alphaTokens(body), which splits the “body” field into words. From there, I can count the words or use arrayJoin to create a list of rows (similar to MySQL’s group_concat function). A quick illustration of these two functions follows, and then the full example.
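As an illustrative sketch (the input string is made up), alphaTokens splits a string on non-alphabetic characters, and arrayJoin unrolls the resulting array into one row per element:

SELECT lower(arrayJoin(alphaTokens('ClickHouse is fast, really fast!'))) AS w

This should return one lowercased row per word: clickhouse, is, fast, really, fast.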

First I created a table word_by_year_news:

create table word_by_year_news ENGINE MergeTree() PARTITION BY y ORDER BY (y) as
select a.w as w, b.y as y, sum(a.occurrences)/b.total as ratio from
(
select
 lower(arrayJoin(alphaTokens(body))) as w,
 toYear(toDate(created_utc)) as y,
 count() as occurrences
from rc
where body <> '[deleted]'
and created_utc < toUnixTimestamp('2018-01-01 00:00:00')
and created_utc >= toUnixTimestamp('2007-01-01 00:00:00')
and subreddit in ('news', 'politics', 'worldnews')
group by w, y
having length(w) > 4
) as a
ANY INNER JOIN
(
select
 toYear(toDate(created_utc)) as y,
 sum(length(alphaTokens(body))) as total
from rc
where body <> '[deleted]'
and subreddit in ('news', 'politics', 'worldnews')
and created_utc < toUnixTimestamp('2018-01-01 00:00:00')
and created_utc >= toUnixTimestamp('2007-01-01 00:00:00')
group by y
) AS b
ON a.y = b.y
group by
  a.w,
  b.y,
  b.total;
0 rows in set. Elapsed: 787.032 sec. Processed 7.35 billion rows, 194.32 GB (9.34 million rows/s., 246.90 MB/s.)

This will store all frequent words (I’m filtering by subreddit; the examples are “news”, “politics” and “worldnews”, or “programming”) together with their occurrences per year. Actually, I want to store the “relative” occurrence, which is called “ratio” above: for each word I divide its occurrences by the total number of words that year (this is needed because the number of comments grows significantly year over year).

Now we can actually calculate the words of the year:

SELECT
    groupArray(w) as words,
    y + 1 as year
FROM
(
    SELECT
        w,
        CAST((y - 1) AS UInt16) AS y,
        ratio AS a_ratio
    FROM word_by_year_news
    WHERE ratio > 0.00001
) AS a
ALL INNER JOIN
(
    SELECT
        w,
        y,
        ratio AS b_ratio
    FROM word_by_year_news
    WHERE ratio > 0.00001
) AS b USING (w, y)
WHERE (y > 0) AND (a_ratio / b_ratio > 3)
GROUP BY y
ORDER BY
    y
LIMIT 100;
10 rows in set. Elapsed: 0.232 sec. Processed 14.61 million rows, 118.82 MB (63.01 million rows/s., 512.29 MB/s.)

And the results are (here I’m grouping words for each year):

For the “programming” subreddit:

┌─year─┬─words──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 2007 │ ['audio','patents','swing','phones','gmail','opera','devices','phone','adobe','vista','backup','mercurial','mobile','passwords','scala','license','copyright','licenses','photoshop'] │
│ 2008 │ ['webkit','twitter','teacher','android','itunes']                                                                                                                                     │
│ 2009 │ ['downvotes','upvote','drupal','android','upvoted']                                                                                                                                   │
│ 2010 │ ['codecs','imgur','floppy','codec','adobe','android']                                                                                                                                 │
│ 2011 │ ['scala','currency','println']                                                                                                                                                        │
│ 2013 │ ['voting','maven']                                                                                                                                                                    │
│ 2014 │ ['compose','xamarin','markdown','scrum','comic']                                                                                                                                      │
│ 2015 │ ['china','sourceforge','subscription','chinese','kotlin']                                                                                                                             │
│ 2016 │ ['systemd','gitlab','autotldr']                                                                                                                                                       │
│ 2017 │ ['offices','electron','vscode','blockchain','flash','collision']                                                                                                                      │
└──────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

For the news subreddits:

┌─year─┬─words──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 2008 │ ['michigan','delegates','obama','alaska','georgia','russians','hamas','biden','hussein','barack','elitist','mccain']                                                                  │
│ 2009 │ ['stimulus','reform','medicare','franken','healthcare','payer','insurance','downvotes','hospitals','patients','option','health']                                                      │
│ 2010 │ ['blockade','arizona']                                                                                                                                                                │
│ 2011 │ ['protests','occupy','romney','weiner','protesters']                                                                                                                                  │
│ 2012 │ ['santorum','returns','martin','obamacare','romney']                                                                                                                                  │
│ 2013 │ ['boston','chemical','surveillance']                                                                                                                                                  │
│ 2014 │ ['plane','poland','radar','subreddits','palestinians','putin','submission','russia','automoderator','compose','rockets','palestinian','hamas','virus','removal','russians','russian'] │
│ 2015 │ ['refugees','refugee','sanders','debates','hillary','removal','participating','removed','greece','clinton']                                                                           │
│ 2016 │ ['morons','emails','opponent','establishment','trump','reply','speeches','presidency','clintons','electoral','donald','trumps','downvote','november','subreddit','shill','domain','johnson','classified','bernie','nominee','users','returns','primaries','foundation','voters','autotldr','clinton','email','supporter','election','feedback','clever','leaks','accuse','candidate','upvote','rulesandregs','convention','conduct','uncommon','server','trolls','supporters','hillary'] │
│ 2017 │ ['impeached','downvotes','monitored','accusations','alabama','violation','treason','nazis','index','submit','impeachment','troll','collusion','bannon','neutrality','permanent','insults','violations'] │
└──────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Conclusion

ClickHouse is a great massively parallel analytical system. It is extremely efficient and can potentially (with some hacks) be used as a main backend database powering a public API gateway serving both realtime and analytical queries. At the same time, it was not originally designed that way. Let me know in the comments if you are using ClickHouse for this or similar projects.


Photo by John Baker on Unsplash

 

 

Jan 14, 2019

Salesforce Commerce Cloud updates keep us shopping with AI-fueled APIs

As people increasingly use their mobile phones and other devices to shop, it has become imperative for vendors to improve the shopping experience, making it as simple as possible, given the small footprint. One way to do that is using artificial intelligence. Today, Salesforce announced some AI-enhanced APIs designed to keep us engaged as shoppers.

For starters, the company wants to keep you shopping. That means providing an intelligent recommendation engine. If you searched for a particular jacket, you might like these similar styles, or this scarf and gloves. That’s fairly basic as shopping experiences go, but Salesforce didn’t stop there. It’s letting developers embed this ability to recommend products in any app, whether that’s maps, social or mobile.

That means shopping recommendations could pop up anywhere developers think it makes sense, like on your maps app. Whether or not consumers see this as a positive, Salesforce says that adding intelligence to the shopping experience increases sales anywhere from 7 to 16 percent, so however you feel about it, it seems to be working.

The company also wants to make it simple to shop. Instead of entering a multi-faceted search, as has been the traditional way of shopping in the past — footwear, men’s, sneakers, red — you can take a picture of a sneaker (or anything you like) and the visual search algorithm should recognize it and make recommendations based on that picture. It reduces data entry for users, which is typically a pain on the mobile device, even if it has been simplified by checkboxes.

Salesforce is also offering inventory availability as a service, allowing shoppers to know exactly where the item they want is available in the world. If they want to pick it up in-store that day, it shows where the store is on a map and could even embed that into your ridesharing app to indicate exactly where you want to go. The idea is to create a seamless experience between consumer desire and purchase.

Finally, Salesforce has added some goodies to make developers happy, too, including the ability to browse the Salesforce API library and find the ones that make the most sense for what they are creating. This includes code snippets to get started. It may not seem like a big deal, but as companies the size of Salesforce increase their API capabilities (especially with the MuleSoft acquisition), it’s harder to know what’s available. The company has also created a sandboxing capability to let developers experiment and build capabilities with these APIs in a safe way.

The basis of Commerce Cloud is Demandware, the company Salesforce acquired two years ago for $2.8 billion. Salesforce’s intelligence platform is called Einstein. In spite of its attempt to personify the technology, it’s really about bringing artificial intelligence across the Salesforce platform of products, as it has with today’s API announcements.

Jan 14, 2019

Talking Drupal #193 – Maintaining a Drupal Website

In episode #193 we talk about tasks for maintaining a Drupal website. We also get an update from the Drupal Association.  www.talkingdrupal.com/193

Topics 

  • Update from the Drupal Association with Tim Lennen
  • What do we mean by maintaining? Technology & Functionality
  • Hourly, Daily, Weekly, Monthly, and Quarterly
  • Functionality

Resources

Log Noise

Link Checker

Hosts

Stephen Cross – www.ParallaxInfoTech.com @stephencross

John Picozzi – www.oomphinc.com @johnpicozzi

Nic Laflin – www.nLighteneddevelopment.com @nicxvan

Jan 11, 2019

AWS Aurora MySQL – HA, DR, and Durability Explained in Simple Terms

It’s a few weeks after AWS re:Invent 2018 and my head is still spinning from all of the information released at this year’s conference. This year I was able to enjoy a few sessions focused on Aurora deep dives. In fact, I walked away from the conference realizing that my own understanding of High Availability (HA), Disaster Recovery (DR), and Durability in Aurora had been off for quite a while. Consequently, I decided to put this blog out there, both to collect the ideas in one place for myself, and to share them in general. Unlike some of our previous blogs, I’m not focused on analyzing Aurora performance or examining the architecture behind Aurora. Instead, I want to focus on how HA, DR, and Durability are defined and implemented within the Aurora ecosystem.  We’ll get just deep enough into the weeds to be able to examine these capabilities alone.

Introducing the Aurora storage engine

Aurora MySQL – What is it?

We’ll start with a simplified discussion of what Aurora is from a very high level.  In its simplest description, Aurora MySQL is made up of a MySQL-compatible compute layer and a multi-AZ (multi availability zone) storage layer. In the context of an HA discussion, it is important to start at this level, so we understand the redundancy that is built into the platform versus what is optional, or configurable.

Aurora Storage

The Aurora Storage layer presents a volume to the compute layer. This volume is built out in 10GB increments called protection groups.  Each protection group is built from six storage nodes, two from each of three availability zones (AZs).  These are represented in the diagram above in green.  When the compute layer—represented in blue—sends a write I/O to the storage layer, the data gets replicated six times across three AZs.

Durable by Default

In addition to the six-way replication, Aurora employs a 4-of-6 quorum for all write operations. This means that for each commit that happens at the database compute layer, the database node waits until it receives write acknowledgment from at least four out of six storage nodes. By receiving acknowledgment from four storage nodes, we know that the write has been saved in at least two AZs.  The storage layer itself has intelligence built-in to ensure that each of the six storage nodes has a copy of the data. This does not require any interaction with the compute tier. By ensuring that there are always at least four copies of data, across at least two datacenters (AZs), and ensuring that the storage nodes are self-healing and always maintain six copies, it can be said that the Aurora Storage platform has the characteristic of Durable by Default.  The Aurora storage architecture is the same no matter how large or small your Aurora compute architecture is.

One might think that waiting to receive four acknowledgments represents a lot of I/O time and is therefore an expensive write operation.  However, Aurora database nodes do not behave the way a typical MySQL database instance would. Some of the round-trip execution time is mitigated by the way in which Aurora MySQL nodes write transactions to disk. For more information on exactly how this works, check out Amazon Senior Engineering Manager, Kamal Gupta’s deep-dive into Aurora MySQL from AWS re:Invent 2018.

HA and DR Options

While durability can be said to be a default characteristic to the platform, HA and DR are configurable capabilities. Let’s take a look at some of the HA and DR options available. Aurora databases are deployed as members of an Aurora DB Cluster. The cluster configuration is fairly flexible. Database nodes are given the roles of either Writer or Reader. In most cases, there will only be one Writer node. The Reader nodes are known as Aurora Replicas. A single Aurora Cluster may contain up to 15 Aurora Replicas. We’ll discuss a few common configurations and the associated levels of HA and DR which they provide. This is only a sample of possible configurations: it is not meant to represent an exhaustive list of the possible configuration options available on the Aurora platform.

Single-AZ, Single Instance Deployment

Great durability with Aurora, but HA and DR less so

The most basic implementation of Aurora is a single compute instance in a single availability zone. The compute instance is monitored by the Aurora Cluster service and will be restarted if the database instance or compute VM has a failure. In this architecture, there is no redundancy at the compute level. Therefore, there is no database level HA or DR. The storage tier provides the same high level of durability described in the sections above. The image below is a view of what this configuration looks like in the AWS Console.

Single-AZ, Multi-Instance

Introducing HA into an Amazon Aurora solution

HA can be added to a basic Aurora implementation by adding an Aurora Replica. We increase our HA level by adding Aurora Replicas within the same AZ. If desired, the Aurora Replicas can be used to also service some of the read traffic for the Aurora Cluster. This configuration cannot be said to provide DR because there are no database nodes outside the single datacenter or AZ. If that datacenter were to fail, then database availability would be lost until it was manually restored in another datacenter (AZ). It’s important to note that while Aurora has a lot of built-in automation, you will only benefit from that automation if your base configuration facilitates a path for the automation to follow. If you have a single-AZ base deployment, then you will not have the benefit of automated Multi-AZ availability. However, as in the previous case, durability remains the same. Again, durability is a characteristic of the storage layer. The image below is a view of what this configuration looks like in the AWS Console. Note that the Writer and Reader are in the same AZ.

Multi-AZ Options

Partial disaster recovery with Amazon Aurora

Building on our previous example, we can increase our level of HA and add partial DR capabilities to the configuration by adding more Aurora Replicas. At this point we will add one additional replica in the same AZ, bringing the local AZ replica count to three database instances. We will also add one replica in each of the two remaining regional AZs. Aurora provides the option to configure automated failover priority for the Aurora Replicas. Choosing your failover priority is best defined by the individual business needs. That said, one way to define the priority might be to set the first failover to the local-AZ replicas, and subsequent failover priority to the replicas in the other AZs. It is important to remember that AZs within a region are physical datacenters located within the same metro area. This configuration will provide protection for a disaster localized to the datacenter. It will not, however, provide protection for a city-wide disaster. The image below is a view of what this configuration looks like in the AWS Console. Note that we now have two Readers in the same AZ as the Writer and two Readers in two other AZs.
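For readers who script their infrastructure, here is a rough sketch of how Aurora Replicas might be added across AZs with a failover priority using the AWS CLI. The cluster name, instance identifiers, instance class and AZ names are purely illustrative, and the sketch assumes the cluster and its writer already exist:

# add a replica in the writer's AZ with the highest failover priority (tier 0)
aws rds create-db-instance \
  --db-instance-identifier my-aurora-replica-1 \
  --db-cluster-identifier my-aurora-cluster \
  --engine aurora-mysql \
  --db-instance-class db.r5.large \
  --availability-zone us-east-1a \
  --promotion-tier 0

# add a replica in a second AZ with a lower failover priority
aws rds create-db-instance \
  --db-instance-identifier my-aurora-replica-2 \
  --db-cluster-identifier my-aurora-cluster \
  --engine aurora-mysql \
  --db-instance-class db.r5.large \
  --availability-zone us-east-1b \
  --promotion-tier 1

Lower promotion-tier values are promoted first during a failover, which is how the "local replicas first, other AZs next" priority described above could be expressed.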

Cross-Region Options

The three configuration types we’ve discussed up to this point represent configuration options available within an AZ or metro area. There are also options available for cross-region replication in the form of both logical and physical replication.

Logical Replication

Aurora supports replication to up to five additional regions with logical replication.  It is important to note that, depending on the workload, logical replication across regions can be notably susceptible to replication lag.

Physical Replication

Durability, High Availability and Disaster Recovery with Amazon Aurora

One of the many announcements to come out of re:Invent 2018 is a product called Aurora Global Database. This is Aurora’s implementation of cross-region physical replication. Amazon’s published details on the solution indicate that it is storage level replication implemented on dedicated cross-region infrastructure with sub-second latency. In general terms, the idea behind a cross-region architecture is that the second region could be an exact duplicate of the primary region. This means that the primary region can have up to 15 Aurora Replicas and the secondary region can also have up to 15 Aurora Replicas. There is one database instance in the secondary region in the role of writer for that region. This instance can be configured to take over as the master for both regions in the case of a regional failure. In this scenario the secondary region becomes primary, and the writer in that region becomes the primary database writer. This configuration provides protection in the case of a regional disaster. It’s going to take some time to test this, but at the moment this architecture appears to provide the most comprehensive combination of Durability, HA, and DR. The trade-offs have yet to be thoroughly explored.

Multi-Master Options

Amazon is in the process of building out a new capability called Aurora Multi-Master. Currently, this feature is in preview phase and has not been released for general availability. While there were a lot of talks at re:Invent 2018 which highlighted some of the components of this feature, there is still no affirmative date for release. Early analysis points to the feature being localized to the AZ. It is not known if cross-region Multi-Master will be supported, but it seems unlikely.

Summary

As a post re:Invent takeaway, what I learned was that there is an Aurora configuration to fit almost any workload that requires strong performance behind it. Not all heavy workloads also demand HA and DR. If this describes one of your workloads, then there is an Aurora configuration that fits your needs. On the flip side, it is also important to remember that while data durability is an intrinsic quality of Aurora, HA and DR are not. These are completely configurable. This means that the Aurora architect in your organization must put thought and due diligence into the way they design your Aurora deployment. While we all need to be conscious of costs, don’t let cost consciousness become a blinder to reality. Just because your environment is running in Aurora does not mean you automatically have HA and DR for your database. In Aurora, HA and DR are configuration options, and just like the on-premise world, viable HA and DR have additional costs associated with them.

