
Major issues with connecting to cloud using TLS


yitzchak@...
 

I am currently trying to get a setup working with OCF devices connecting to the cloud using TLS. I found the following issues:
1. There is an easy-to-reproduce deadlock (reproducible via the airconditioner examples). There is a lock on the SSL resource (g_sslContextMutex in ca_adapter_net_ssl.c) and a lock on the TCP resource (g_mutexObjectList in catcpserver.c). When a message is sent (SSL is locked, then TCP) and another is received (TCP is locked, then SSL) at around the same time, the two threads deadlock.
2. I tried breaking the deadlock by naively unifying the locks. This causes a major slowdown, especially if connections fail, because the SSL resource lock is held around connection-related events, which can take a long time. Threads should not normally be waiting on locks held across network events!
3. I then tried to modify the SSL resource lock so that it is not held around connection-related events. That still didn't help, because the SSL handshake is implemented in a seemingly strange way: it is pieced together by jumping between different parts of the code, which happen to work only if the locks are implemented as they were.
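
The inversion in item 1 can be illustrated outside of IoTivity. The following is a minimal sketch, assuming two plain pthread mutexes standing in for g_sslContextMutex and g_mutexObjectList; the names ssl_lock, tcp_lock, send_path, recv_path, and run_send_and_receive are hypothetical and are not the stack's code. It shows only that when both paths take the locks in the same order, a concurrent send and receive run to completion.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for g_sslContextMutex and g_mutexObjectList. */
static pthread_mutex_t ssl_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t tcp_lock = PTHREAD_MUTEX_INITIALIZER;

static int sent_count = 0;     /* protected by holding both locks */
static int received_count = 0; /* protected by holding both locks */

/* Every path takes ssl_lock first, then tcp_lock. The deadlock needs one
 * path to take them in the opposite order; with a single global order,
 * no cycle of waiting threads can form. */
static void lock_both(void)
{
    pthread_mutex_lock(&ssl_lock);
    pthread_mutex_lock(&tcp_lock);
}

static void unlock_both(void)
{
    pthread_mutex_unlock(&tcp_lock);
    pthread_mutex_unlock(&ssl_lock);
}

/* Simulated send path: the argument points to the iteration count. */
static void *send_path(void *arg)
{
    assert(arg != NULL);
    int iterations = *(int *)arg;
    for (int i = 0; i < iterations; i++) {
        lock_both();
        sent_count++;
        unlock_both();
    }
    return NULL;
}

/* Simulated receive path: same lock order as the send path. */
static void *recv_path(void *arg)
{
    assert(arg != NULL);
    int iterations = *(int *)arg;
    for (int i = 0; i < iterations; i++) {
        lock_both();
        received_count++;
        unlock_both();
    }
    return NULL;
}

/* Runs both paths concurrently. Returns 0 once both threads have finished,
 * -1 if a thread could not be created. */
static int run_send_and_receive(int iterations)
{
    pthread_t sender, receiver;
    if (pthread_create(&sender, NULL, send_path, &iterations) != 0) {
        return -1;
    }
    if (pthread_create(&receiver, NULL, recv_path, &iterations) != 0) {
        pthread_join(sender, NULL);
        return -1;
    }
    pthread_join(sender, NULL);
    pthread_join(receiver, NULL);
    return 0;
}
```

A consistent order only removes the cycle. As item 2 notes, the critical sections still need to stay short and must not span blocking network operations.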

Another issue that I found: when I run an OCF server more than once, the function "OCSaveTrustCertChain", used to register a certificate, just keeps adding the certificate to the secure DB, causing the file to grow and, worse, causing resource discovery on an OCF server to stop working.
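
One way to make such a registration idempotent is to check whether an identical certificate is already stored before appending it. The sketch below uses a hypothetical in-memory store (cert_store_t and cert_store_add_unique are invented for illustration); it does not use or reflect the real OCSaveTrustCertChain implementation or the secure DB format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CERTS 8

/* Hypothetical certificate store. It keeps pointers only, so the caller
 * must keep each certificate buffer alive while the store is in use. */
typedef struct {
    const unsigned char *data[MAX_CERTS];
    size_t len[MAX_CERTS];
    size_t count;
} cert_store_t;

/* Adds a certificate unless a byte-identical one is already stored.
 * Returns 1 if added, 0 if it was already present, -1 if the store is full. */
static int cert_store_add_unique(cert_store_t *store,
                                 const unsigned char *cert, size_t cert_len)
{
    assert(store != NULL && cert != NULL);

    for (size_t i = 0; i < store->count; i++) {
        if (store->len[i] == cert_len &&
            memcmp(store->data[i], cert, cert_len) == 0) {
            return 0; /* duplicate: leave the store unchanged */
        }
    }
    if (store->count == MAX_CERTS) {
        return -1;
    }
    store->data[store->count] = cert;
    store->len[store->count] = cert_len;
    store->count++;
    return 1;
}
```

With a guard like this, running the same server twice would leave one copy of the certificate instead of two.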

My conclusion from this is that TLS connection with the cloud is utterly broken. I would like to hear some input on this issue, hopefully proving me wrong.
I can provide more details as needed.


Mats Wichmann
 

On 08/21/2018 02:50 AM, yitzchak@coapp.co.il wrote:

> Another issue that I found: when I run an OCF server more than once, the function "OCSaveTrustCertChain", used to register a certificate, just keeps adding the certificate to the secure DB, causing the file to grow and, worse, causing resource discovery on an OCF server to stop working.
the examples are toys, so this isn't really a surprise: the "secure
resource" is just an (insecure) file which it wouldn't be on a real
device. I believe the expectation is you'll throw it away and start over
frequently as you test things. Still, if the stack itself is doing
something inappropriate, we should get a bug written up. Maybe the
examples could have a little script that resets the dat files to their
initial state as part of starting up an example?
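
The reset suggested above could be as small as copying a pristine copy of each dat file over the working copy before an example starts. Below is a minimal sketch, written as a C helper rather than a shell script; the function name reset_dat_file and the file names in the usage note are hypothetical, and nothing like this is known to exist in the examples.

```c
#include <assert.h>
#include <stdio.h>

/* Overwrites working_path with the contents of pristine_path.
 * Returns 0 on success, -1 on any I/O error. */
static int reset_dat_file(const char *pristine_path, const char *working_path)
{
    assert(pristine_path != NULL && working_path != NULL);

    FILE *in = fopen(pristine_path, "rb");
    if (in == NULL) {
        return -1;
    }
    FILE *out = fopen(working_path, "wb"); /* "wb" truncates the old file */
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    char buf[4096];
    size_t n;
    int status = 0;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            status = -1;
            break;
        }
    }
    if (ferror(in)) {
        status = -1;
    }
    fclose(in);
    if (fclose(out) != 0) {
        status = -1;
    }
    return status;
}
```

An example could call something like reset_dat_file("server_pristine.dat", "server.dat") at startup, where both file names are placeholders.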


Gregg Reynolds
 

Have you been following the work Ondrej has been doing? OCF Cloud is up in the air afaik. Pun intended.




Nathan Heldt-Sheller
 

Hi folks,

 

On vacation so just a brief response: TLS deadlock is likely caused by a known issue being actively worked (see Jira https://jira.iotivity.org/browse/IOT-3059).  Aleksey Volkov (cc’d, Security Maintainer) can hopefully provide more details (or correct me if this is a different issue).


Thanks,
Nathan

 

From: iotivity-dev@... [mailto:iotivity-dev@...] On Behalf Of Gregg Reynolds
Sent: Tuesday, August 21, 2018 12:20 PM
To: yitzchak@...
Cc: iotivity-dev <iotivity-dev@...>
Subject: Re: [dev] Major issues with connecting to cloud using TLS

 

Have you been following the work Ondrej has been doing? OCF Cloud is up in the air afaik. Pun intended.



yitzchak@...
 

> the "secure resource" is just an (insecure) file which it wouldn't be on a real device
What should be used on a real device?


Mats Wichmann
 

On 08/22/2018 02:39 AM, yitzchak@coapp.co.il wrote:
> the "secure resource" is just an (insecure) file which it wouldn't be on a real device
> What should be used on a real device?
a real device has real secure storage. iotivity uses a "virtual"
resource, which can be backed by a file like in the examples, but in
implementing a device you would have it update the actual storage.


yitzchak@...
 

> in implementing a device you would have it update the actual storage.
Are there any examples for doing this on Android?


Aleksey Volkov
 

Hi,

Yes, it is a known issue: at cloud configuration time, incoming and outgoing TLS connections occur simultaneously, and g_sslContextMutex and g_mutexObjectList are locked in a different order for send and receive operations. This is an architectural issue and the network stack should be refactored... I made a workaround, https://gerrit.iotivity.org/gerrit/25341. It solves the deadlock, but I don't think it is quite the proper solution, and it should remain only a workaround. IMHO, the simplest solution is to use a read/write lock for g_mutexObjectList in the network stack layer, since the deadlock happens on access to the network connection list.
Regarding the slowdown from item 2: it's another known issue. The network layer operates only with synchronous socket calls. For UDP this doesn't cause problems, but it is critical for outgoing TCP connections, since the whole network stack hangs and waits for the socket timeout.
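
As an illustration of the read/write lock idea, here is a minimal sketch assuming a POSIX pthread_rwlock_t guarding a hypothetical connection list. The names connection_fds, list_lock, connection_add, and connection_exists are invented and are not the stack's code. Lookups share the lock, so a lookup on the send path does not block a lookup on the receive path; whether that removes the inversion in the real stack depends on how the actual call paths nest, which this sketch does not model.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_CONNECTIONS 16

/* Hypothetical connection list, standing in for the list that
 * g_mutexObjectList protects. */
static int connection_fds[MAX_CONNECTIONS];
static size_t connection_count = 0;
static pthread_rwlock_t list_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Writers (for example, a newly accepted connection) take the lock
 * exclusively. Returns 0 on success, -1 if the list is full. */
static int connection_add(int fd)
{
    assert(fd >= 0);
    int status = -1;
    pthread_rwlock_wrlock(&list_lock);
    if (connection_count < MAX_CONNECTIONS) {
        connection_fds[connection_count++] = fd;
        status = 0;
    }
    pthread_rwlock_unlock(&list_lock);
    return status;
}

/* Readers (send and receive lookups) take the lock in shared mode, so
 * several lookups can proceed at the same time.
 * Returns 1 if fd is in the list, 0 otherwise. */
static int connection_exists(int fd)
{
    int found = 0;
    pthread_rwlock_rdlock(&list_lock);
    for (size_t i = 0; i < connection_count; i++) {
        if (connection_fds[i] == fd) {
            found = 1;
            break;
        }
    }
    pthread_rwlock_unlock(&list_lock);
    return found;
}
```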
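
For comparison, a common way to keep an outgoing TCP connect from stalling a thread for the full socket timeout is a non-blocking connect followed by a bounded poll. The sketch below uses only standard POSIX calls; connect_with_timeout and try_connect_loopback are hypothetical helpers, not part of the IoTivity network layer, and the sketch says nothing about how the stack's own connect path is structured.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Starts a connect on fd without blocking, then waits at most timeout_ms
 * for it to finish. Returns 0 on success, -1 on error or timeout (errno is
 * ETIMEDOUT on timeout). The socket is left in non-blocking mode. */
static int connect_with_timeout(int fd, const struct sockaddr *addr,
                                socklen_t addr_len, int timeout_ms)
{
    assert(addr != NULL);

    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        return -1;
    }
    if (connect(fd, addr, addr_len) == 0) {
        return 0; /* connected immediately */
    }
    if (errno != EINPROGRESS) {
        return -1;
    }

    struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
    int ready;
    do {
        /* Simplification: an interrupted wait restarts the full timeout. */
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);
    if (ready == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    if (ready < 0) {
        return -1;
    }

    /* The socket became writable: SO_ERROR holds the connect result. */
    int err = 0;
    socklen_t err_len = sizeof err;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &err_len) < 0) {
        return -1;
    }
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}

/* Usage helper: tries to connect to 127.0.0.1:port within timeout_ms and
 * closes the socket again. Returns 0 if the connection was established. */
static int try_connect_loopback(uint16_t port, int timeout_ms)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    int status = connect_with_timeout(fd, (const struct sockaddr *)&addr,
                                      sizeof addr, timeout_ms);
    close(fd);
    return status;
}
```

The caller chooses the bound, so an unreachable peer costs at most timeout_ms rather than the operating system's default connect timeout.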