NoSQL DB or relational DB for log data

About 2 years ago

Original poster (北美华人网)

I'm working on a project that needs to record every action a user takes on a page, such as which link they clicked, which video they watched, and for how long. The goal is to analyze the user log data later and build charts. Does anyone have experience with this to share?
Because every user action is stored, there will be a lot of write operations. Some people suggest NoSQL because the structure of the data may change; others suggest PostgreSQL because it can store JSON data and can do complex queries. My concern: if the application's own data is also stored in PostgreSQL, and we need to pull data from the DB to show to users while also writing heavily to the log, will performance suffer?
Also, we may move to AWS later, and I see it has a NoSQL DB, DynamoDB. Is it suitable for storing user logs? I see it is key-value based. If we need other kinds of queries (for example, searching on a non-key field or aggregating data), will that be a problem?
If we store the application data in a relational DB and the log data in DynamoDB, can we still query DynamoDB, aggregate data, and generate reports, or do we need to pull the data out of DynamoDB and store it in a relational DB for analysis? For example, we may need to join the log data with the actual application data to find information that's not available in the URL. Should we just store the extra information in the log so that we don't have to join with the application data?
In addition, which is better for the relational DB, MySQL or PostgreSQL?
I have no experience with this at all, so I'd appreciate advice from anyone who does.

Sleepy3824

About 2 years ago

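Isn't the standard approach for this kind of OBS work Splunk, Datadog, Prometheus, or Elasticsearch/Kibana? With an RDBMS the I/O gets expensive. With DynamoDB you would have to build the ingress, pipeline, aggregation and visualization yourself and reinvent all the wheels.
Besides, if you're already on AWS, isn't there CloudWatch?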

moonbag

About 2 years ago

In reply to Sleepy3824 (posted 2023-05-18 18:48):

Can CloudWatch customize what to log?

psyentistc

About 2 years ago

Google is your friend...

Sleepy3824

About 2 years ago

Start by finding some Splunk videos on YouTube.
You can also push customized application logs into CloudWatch: https://devopscube.com/how-to-setup-and-push-serverapplication-logs-to-aws-cloudwatch/
Whatever you do, do NOT store your log data in the same RDBMS as your application.
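
To make the CloudWatch suggestion concrete, here is a minimal Python sketch of sending custom user-action events to CloudWatch Logs. It assumes boto3 is installed, AWS credentials are configured, and the log group and stream already exist. The field names (`user_id`, `action`, `target`, `duration_s`) and both helper functions are hypothetical, chosen only for illustration.

```python
import json
import time


def build_log_event(user_id, action, target, duration_s=None, now=None):
    """Build one CloudWatch Logs event: a millisecond timestamp plus a string message."""
    ts = time.time() if now is None else now
    payload = {"user_id": user_id, "action": action, "target": target}
    if duration_s is not None:
        payload["duration_s"] = duration_s
    # CloudWatch Logs expects the message as a string, so the payload is JSON-encoded.
    return {"timestamp": int(ts * 1000), "message": json.dumps(payload)}


def push_events(events, log_group, log_stream):
    """Send events to an existing log group and stream (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_log_event works without AWS access

    client = boto3.client("logs")
    client.put_log_events(
        logGroupName=log_group,
        logStreamName=log_stream,
        # Events in one batch must be in chronological order.
        logEvents=sorted(events, key=lambda e: e["timestamp"]),
    )
```

Because each message is a JSON string, what gets logged is entirely up to the application.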

sea101

About 2 years ago

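The question in the first post is whether to store the needed data in log files, an RDBMS, or NoSQL. It sounds like the system and services don't already produce this log data, so you need to write code to create it.
Whether the log data is written to files or to a database, tools like Splunk and Datadog can aggregate and analyze it.
The advantage of writing to a relational database is that there are many BI tools and it is easy to query and analyze with SQL. If you use Java, Log4j can write to a log file and a database at the same time. Write important operations to the database and routine records to log files. If you generate fewer than a million or so log records per day, the performance impact of writing to the database is small.
One limitation of a database is that a connection must be established before you can write. Some system logs are generated before the database connection exists, and nothing can be written if the connection drops. DynamoDB usually does not have this problem; it offers high concurrency and high availability.
Try not to use CloudWatch for permanent storage of important data. All log files are like CloudWatch in this respect: they are good for inspecting a particular event or analyzing a particular time period. If important data isn't moved to a dedicated database, log files can be lost or accidentally deleted, so they are not suitable as the standard for long-term storage and sharing.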
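
The "important operations to the database, routine records to the log file" split can be sketched with Python's standard `logging` module. This is only an illustration, not the Log4j setup mentioned above: SQLite stands in for the relational database, "important" is assumed to mean WARNING level and above, and the `SQLiteHandler` class, `make_logger` function, and `app_log` table are hypothetical names.

```python
import logging
import sqlite3


class SQLiteHandler(logging.Handler):
    """Logging handler that inserts each record into a SQLite table."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS app_log (created REAL, level TEXT, message TEXT)"
        )

    def emit(self, record):
        try:
            self.conn.execute(
                "INSERT INTO app_log VALUES (?, ?, ?)",
                (record.created, record.levelname, record.getMessage()),
            )
            self.conn.commit()
        except Exception:
            # A failed DB write is reported by logging but does not crash the app.
            self.handleError(record)


def make_logger(log_path, conn):
    """Every record goes to the file; only WARNING and above also go to the database."""
    logger = logging.getLogger("user_activity_demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    logger.addHandler(logging.FileHandler(log_path))
    db_handler = SQLiteHandler(conn)
    db_handler.setLevel(logging.WARNING)
    logger.addHandler(db_handler)
    return logger
```

With this setup, `logger.info(...)` writes only to the file, while `logger.warning(...)` writes to both the file and the `app_log` table.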

huar

About 2 years ago

In reply to psyentistc (posted 2023-05-18 19:12):

ChatGPT is your friend too.

moonbag

About 2 years ago

In reply to psyentistc (posted 2023-05-18 19:12):

Google only turns up scattered fragments of information; it can't compare with people who have actually done this.

moonbag

About 2 years ago

Replying to post #5 by Sleepy3824
Very useful information!

darksky01

About 2 years ago

In reply to Sleepy3824 (posted 2023-05-18 19:20):

Splunk costs money, doesn't it? If they don't have the budget, they can't use Splunk.

Sleepy3824

About 2 years ago

In reply to darksky01 (posted 2023-05-18 22:00):

Good point. Then go with ELK; it's all open source. https://logz.io/learn/complete-guide-elk-stack/amp/
A long time ago, when I was using Solr, I even submitted a PR to Lucene. It should still be in the ES build today.

dalianyin

About 2 years ago

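The reads aren't real time, so the writes don't need any database. Just pick a file storage; all that matters is that the log file data can be read back according to a schema.
Then use batch processing to load those files into a DB. Which DB you use isn't that important; the key point is separating reads from writes.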
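
A minimal Python sketch of this file-first approach, under some assumptions: events are appended as JSON lines to a local file (standing in for whatever file storage is chosen), and a separate batch step loads them into SQLite (standing in for the reporting DB). The function names, the `user_events` table, and the event fields are hypothetical.

```python
import json
import sqlite3


def append_event(log_path, event):
    """Write path: append one event as a JSON line; no database is involved."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def load_batch(log_path, conn):
    """Batch path: read the log file according to its schema and bulk-insert into the DB."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_events "
        "(user_id TEXT, action TEXT, target TEXT, duration_s REAL, ts REAL)"
    )
    rows = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            e = json.loads(line)
            rows.append(
                (e["user_id"], e["action"], e.get("target"), e.get("duration_s"), e["ts"])
            )
    conn.executemany("INSERT INTO user_events VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def watch_time_per_user(conn):
    """Read path: an example report query, total video watch time per user."""
    return conn.execute(
        "SELECT user_id, SUM(duration_s) FROM user_events "
        "WHERE action = 'watch_video' GROUP BY user_id ORDER BY user_id"
    ).fetchall()
```

A real batch job would also need to track which files it has already loaded so that a rerun does not insert the same events twice; that bookkeeping is left out here.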